I2I-Galip: Unsupervised Medical Image Translation Using Generative Adversarial CLIP

Yilmaz Korkmaz, Vishal M. Patel

Abstract

Unpaired image-to-image translation is a challenging task due to the absence of paired examples, which complicates learning the complex mappings between the distinct distributions of the source and target domains. One of the most commonly used approach for this task is CycleGAN which requires the training of a new pair of generator-discriminator networks for each domain pair. In this paper, we propose a new image-to-image translation framework named Image-to-Image-Generative-Adversarial-CLIP (I2I-Galip) where we utilize a pre-trained multi-model foundation model (i.e., CLIP) to mitigate the need of separate generator-discriminator pairs for each source-target mapping while achieving better and more efficient multi-domain translation. By utilizing the massive knowledge gathered during pre-training a foundation model, our approach makes use of a single lightweight generator network with $\approx$ 13M parameters for the multi-domain image translation task. Comprehensive experiments on translation performance in public MRI and CT datasets show the superior performance of the proposed framework over the existing approaches. Code will be available here.

1 Introduction

Medical image translation is a challenging task because of significantly different domain distributions, necessitating the learning of very complex mappings between different imaging modalities [28]. Many supervised deep learning-based image translation methods have been proposed to address this problem [5, 12, 1]. However, these methods are limited due to the requirement of paired training data which might be challenging to acquire in real case scenarios. To overcome this constraint, various unsupervised image translation methods have been introduced for both general computer vision and medical imaging tasks [4, 17, 11, 21, 9, 34, 31]. CycleGAN [38] was one of the first approaches that proposed unpaired image translation which loosened the requirement for paired datasets by enforcing cycle-consistency among inverse translations. However, in the case of multiple modalities, CycleGAN introduces significant computational requirements as separate generator-discriminator pairs are required for each new modality. To mitigate the need of separate network pairs several multi-domain translation frameworks have been proposed [2, 3, 11, 16]. Nonetheless, these methods generally lag in performance compared to uni-modal approaches.

More recently, a couple of text-driven diffusion based image-to-image translation frameworks have been proposed that integrate large vision-language pre-trained models as guidance [33, 27, 10, 15], enabling robust translation across multiple domains. While these models provide zero-shot editing capabilities for various text conditions, they are limited in delivering fidelity necessary for the medical tasks. Moreover, these methods impose a significant computational burden due to the requirement for large denoiser backbones and extended inference times in their backward diffusion processes.

In this paper, we propose a cycle-consistent generative adversarial model to address the aforementioned limitations. Our model integrates BiomedCLIP (see Sec. 3.2), a pre-trained multi-modal vision-language model specifically trained in the medical domain, within a cycle-consistent feed-forward framework. By leveraging contrastive information from this large pre-trained network, we eliminate the need to train a new generator network for each translation task and reduce the requirement for large discriminators in feature extraction. Furthermore, our model enhances overall translation performance compared to existing unsupervised approaches in both single and multi-domain translation tasks.

Our main contributions can be summarized as follows:

•

We introduce an adversarial framework for language-driven image translation for medical images.
•

Our framework employs a novel CLIP driven cycle-consistent image translation model.
•

Our model outperforms existing unsupervised baselines with a relatively lightweight backbone. Extensive experiments demonstrate its superior performance across various publicly available datasets from different modalities.

2 Related Works

Cycle-consistent image translation.

Zhu et al. revolutionized the field of unsupervised image translation with their proposal of CycleGAN [38]. Yi et al. proposed DualGAN [34] which is a concurrent work with CycleGAN offering the same cycle-consistency loss. Various studies followed the cycle-consistency constraint for more faithful translation in the unsupervised setting. Liu et al. proposed UNIT [17] for uni-modal translation where a shared latent space is assumed between source and target modalities. Huang et al. proposed MUNIT [11] where UNIT’s assumption of shared latent space is divided into content and style for multi-domain translation. Lee et al. [16] introduced DRIT, which shares a similar approach to MUNIT by using disentangled content and attribute latents for multi-domain translation. Choi et al. proposed StarGANv1 [2] and StarGANv2 [3] where they utilized a separate style encoder network to generate distinct style codes to be used in generator for multi-domain translation. Perera et al. [25] proposed an alternative method where they utilize multi-domain input modalities with a latent-consistency loss. Kim et al. proposed U-GAT-IT [14] with an advanced generator equipped with adaptive layer instance normalization layers and attention. Torbunov et al. proposed UVCGan [31] employing a pre-trained vision transformer as generator in a cycle-consistent framework for improved translation performance.

Text-guided image translation.

Following the advancements in vision-language models [26] several text-guided unsupervised image translation methods proposed with or without cycle-consistency constraint. Park et al. proposed LANIT [22] where they use CLIP to generate pseudo labels for unlabeled images with a similar approach in Starganv2. Gal et al. proposed StyleGAN-NADA [7] for CLIP driven adaptation of Stylegan2 generator [13]. Patashnik et al. proposed StyleCLIP [24] where they invert source image to find its latent code for CLIP guided feature manipulation.

Diffusion-based image translation.

More recently, building on the success of diffusion models in image generation, various unsupervised image translation methods utilizing diffusion-based backbones have been proposed. Zhao et al. proposed EGSDE [37] where they utilize energy-guided translation between diversely trained diffusion models. Özbey et al. proposed SynDiff [21], where they use multiple cycle-consistent diffusive and non-diffusive generators for improved translation performance. Kwon et al. proposed DiffuseIT [15] and used pre-trained vision transformers as guidance in image manipulation. Tumanyan et al. [33] offered a plug and play framework to adapt pre-trained text-to-image diffusion models in image translation. Zhan et al. proposed MedM2G [35], where they proposed a unified multi-modal diffusive framework for text to image, image to text synthesis and image translation tasks. Parmar et al. [23] proposed a one step diffusion model for unsupervised image translation adapting the pre-trained latent diffusion weights. Liu et al. [18] offered an adversarial network utilizing diffusion supervision in latent space.

Our approach shares similarities with MedM2G [35] in employing a multi-modal text-guided framework for image translation. However, our model is over an order of magnitude smaller, leveraging a feed-forward generative adversarial network architecture and enforcing cycle-consistency across translations. We also incorporate common loss terms with DiffuseIT [15], utilizing CLS tokens from pre-trained vision transformers for semantically meaningful information extraction. Nonetheless, our approach differs in its use of cycle-consistency and the feed forward generative adversarial methodology adopted. We named our method in reference to the text-to-image generative adversarial model Galip [30]. However, apart from the CLIP based feature extraction utilized for the Discriminator, our method does not share further similarities with Galip in terms of architecture or training methodology.

3 Background

3.1 Cycle-Consistent Generative Adversarial Networks (CycleGAN)

CycleGAN [38] models the unpaired image translation problem between domain A and B using two translators. First, two translators (G : $A$ → $B$ ) and (F : $B$ → $A$ ) are defined. Then G and F are forced to be inverses of each other, thus making both mappings to be approximately bijections. CycleGAN achieves remarkable performance using this cycle-consistency combined with the adversarial loss which encourages $F(G(X_{A}))\approx X_{A}$ and $G(F(X_{B}))\approx X_{B}$ .

3.2 BiomedCLIP

In this paper, we utilize BiomedCLIP [36] as our pre-trained vision-language model. BiomedCLIP is trained on PMC-15M dataset using pairs of figures and captions from biomedical research articles in PubMed Central and outperforms other medical vision-language models in various tasks [36]. BiomedCLIP utilizes a ViT-B [6] based image encoder while utilizing PubMedBERT [8] for the text embeddings.

4 Methodology

4.1 I2I-Galip

Refer to caption — Figure 1: Training scheme and overall model architecture of I2I-Galip is illustrated when input image is from domain B. Part a illustrates $L_{clip}$ , ${L}_{cls}$ and ${L}_{adv}$ losses along with U-Net based generator, discriminator head and BiomedCLIP’s ViT-B. Part b and c shows the definition of ${L}_{cycle}$ and ${L}_{identity}$ losses respectively. BiomedCLIP’s ViT-B and text encoder parameters are frozen during training. ”This MRI Image is T₂-weighted” corresponds to a sample caption used in T₁ to T₂ translation.

We design a lightweight generator network which is a very thin variant of the latent diffusion U-Net [27] (with only $\approx 13$ M parameters). Our discriminator network is using the projections of intermediate Vision Transformer (ViT) features as input, adapted from text-to-image model Stylegan-T [29]. This discriminator design allow us to utilize the output of different layers in BiomedCLIP’s ViT, capturing different level of details. We modified this design by dividing the discriminator heads into distinct sets, tailored specifically for a target translation domain. We also utilize BiomedCLIP’s text encoder to generate target text embeddings using captions for each modality (see Fig. 1a), which controls the generated image features via cross-attention transformers while serving as a regularizer in the training (shown in Fig. 1a). Overall training objective for the generator can be expressed as follows

$\displaystyle\mathcal{L}_{total}=$	$\displaystyle\lambda_{cycle}\cdot\mathcal{L}_{cycle}+\lambda_{adv}\cdot% \mathcal{L}_{adv_{G}}$	(1)
	$\displaystyle+\lambda_{clip}\cdot\mathcal{L}_{clip}+\lambda_{cls}\cdot\mathcal% {L}_{cls}$
	$\displaystyle+\lambda_{identity}\cdot\mathcal{L}_{identity},$

where $\lambda_{cycle},\lambda_{adv},\lambda_{clip},\lambda_{cls},\lambda_{identity}$ are coefficients to control the contribution from each loss. We denote the loss associated with the discriminator as $\mathcal{L}_{adv_{D}}$ , apart from that other loss terms are utilized in generator training. In what follows, we describe each of these loss terms in detail.

Adversarial Loss: By leveraging intermediate features from the ViT, direct feature extraction from images becomes unnecessary, enabling the use of lightweight discriminator heads for each feature level. We utilize the least squares GAN loss [19] to enhance the stability of training instead of Hinge loss used in StyleGAN-T, which can be defined as follows

	$\displaystyle\mathcal{L}_{adv_{G}}=$	$\displaystyle\mathbb{E}[\left(Head_{A}(E_{X_{A}}^{out},E_{T_{A}})-1\right)^{2}]$		(2)
		$\displaystyle+\mathbb{E}[\left(Head_{B}(E_{X_{B}}^{out},E_{T_{B}})-1\right)^{2% }],$		(2)

$\displaystyle\mathcal{L}_{adv_{D}}=$	$\displaystyle\mathbb{E}\left[\left(Head_{A}(E_{X_{A}}^{input},E_{T_{A}})-1% \right)^{2}\right.$	(3)
	$\displaystyle\left.+\left(Head_{A}(E_{X_{A}}^{out},E_{T_{A}})\right)^{2}\right]$
	$\displaystyle+\mathbb{E}\left[\left(Head_{B}(E_{X_{B}}^{input},E_{T_{B}})-1% \right)^{2}\right.$
	$\displaystyle\left.+\left(Head_{B}(E_{X_{B}}^{out},E_{T_{B}})\right)^{2}\right]$

where $Head_{A}$ and $Head_{B}$ corresponds to the discriminator heads allocated for the specific target domain, $E_{T_{A}}$ and $E_{T_{B}}$ are corresponding text encodings (i.e., text encodings of captions for each domain), $E_{X_{A}}^{input}$ , $E_{X_{B}}^{input}$ , $E_{X_{A}}^{out}$ and $E_{X_{B}}^{out}$ are feature maps from ViT for input and generated images from domain A and B, respectively. We describe how these terms are generated as follows

	$\displaystyle E_{X_{A}}^{input}=ViT_{int}(X_{A}^{input})$		(4)
	$\displaystyle E_{X_{B}}^{input}=ViT_{int}(X_{B}^{input})$		(5)
	$\displaystyle E_{X_{A}}^{out}=ViT_{int}(G(X_{B}^{input},{T_{A}},Y_{A}))$		(6)
	$\displaystyle E_{X_{B}}^{out}=ViT_{int}(G(X_{A}^{input},{T_{B}},Y_{B}))$		(7)

where $ViT_{int}$ corresponds to intermediate layers of ViT, $X_{A}^{input}$ and $X_{B}^{input}$ are unpaired input images, $Y_{A}$ and $Y_{B}$ are binary class labels for domain A and B, respectively.

Cycle Loss: We enforce cycle-consistency, shown in Fig. 1b, to encourage more faithful translation between source and target domains for each pair which is defined as follows

	$\displaystyle\mathcal{L}_{\text{cycle}}=$	$\displaystyle\mathbb{E}\left[\\|X_{B}^{input}-G(X_{A}^{input},E_{T_{B}},Y_{B})% \\|_{1}\right]$		(9)
		$\displaystyle+\mathbb{E}\left[\\|X_{A}^{input}-G(X_{B}^{input},E_{T_{A}},Y_{A})% \\|_{1}\right].$		(9)

CLIP Guidance Loss: We minimize cosine distance between the text encoding corresponds to the caption of target domain (e.g., ”This MRI image is T₁-Weighted” or ”This is pelvic CT”) and the encoding from ViT for the generated images to enable the utilization of CLIP’s joint embedding space similarly with [24], which can be defined as follows

\mathcal{L}_{\text{clip}}=-\frac{\langle E_{T_{A}},E_{X_{A}}^{out^{last}}% \rangle}{\|E_{T_{A}}\|\cdot\|E_{X_{A}}^{out^{last}}\|}-\frac{\langle E_{T_{B}}% ,E_{X_{B}}^{out^{last}}\rangle}{\|E_{T_{B}}\|\cdot\|E_{X_{B}}^{out^{last}}\|},

(10)

where $E_{X_{A}}^{out^{last}}$ , $E_{X_{B}}^{out^{last}}$ are the image encodings from the last layer of ViT for the generated images from domain A and B, respectively. Generally, $\mathcal{L}_{clip}$ is dominated by cycle and adversarial losses giving comparingly small benefits (see Sec. 5.1 for details).

CLIP Encoding Loss: The CLS tokens in the final layers of vision transformers are recognized for containing semantically rich information, as highlighted by [32], which is typically leveraged for downstream classification tasks and shown to be beneficial in image translation [15]. Therefore, we enforce cosine similarity between the CLS tokens in the ViT for the generated and target domain’s images to enforce semantic similarity among these images, which can be written as follows

\mathcal{L}_{\text{cls}}=-\frac{\langle cls_{X_{A}}^{input},cls_{X_{A}}^{out}% \rangle}{\|cls_{X_{A}}^{input}\|\cdot\|cls_{X_{A}}^{out}\|}-\frac{\langle cls_% {X_{B}}^{input},cls_{X_{B}}^{out}\rangle}{\|cls_{X_{B}}^{input}\|\cdot\|cls_{X% _{B}}^{out}\|},

(11)

where $cls_{X_{A}}^{input}$ and $cls_{X_{B}}^{input}$ are the CLS tokens in the last layers of ViT for input images from domain A and B respectively. $cls_{X_{A}}^{out}$ and $cls_{X_{B}}^{out}$ are corresponding CLS tokens for the generated images.

Identity Loss: Identity loss is found to be beneficial to maintain source image structure in translation by enforcing the pixel-level equality when target and source domains match [38]. We enforce it via using same labels and text embeddings corresponding to the input image domain (see Fig. 1c). It can be defined using our framework as follows

	$\displaystyle\mathcal{L}_{\text{identity}}=$	$\displaystyle\mathbb{E}\left[\\|X_{A}^{input}-G(X_{A}^{input},E_{T_{A}},Y_{A}))% \\|_{1}\right]$		(12)
		$\displaystyle+\mathbb{E}\left[\\|X_{B}^{input}-G(X_{B}^{input},E_{T_{B}},Y_{B})% \\|_{1}\right].$		(12)

Table 1: Model complexities are illustrated using total number of parameters for each competing method. The third row indicates the number of required generator and discriminator networks, given the specified number of domain.

T

and P(.) represents the number of domains for a multi-modal translation problem and permutation operator respectively. Total parameters are calculated for a representative case where

T=4

Network/Model	I2I-Galip	CycleGAN	U-GAT-IT	StarGAN	UNIT	EGSDE
Generator (G)	13.2M	11.3M	278.9M	8.4M	5.4M + 5.4M	164M
Discriminator (D)	23.9M	2.7M	56.4M	44.8M	2.8M	0
Times (G, D)	1, T	P(T,2), P(T,2)	P(T,2), P(T,2)	1, 1	P(T,2), P(T,2)	T, 0
Total	108.8M	169.5M	4023.6M	53.2M	162.2M	657.2M

Table 2: Multi-domain image translation results in IXI dataset. T₁-, T₂- and PD-weighted images are considered.

One-to-one task	T₁->T₂	T₂->T₁	T₂->PD	PD->T₂	T₁->PD	PD->T₁
IXI	PSNR $\|$ SSIM	PSNR $\|$ SSIM	PSNR $\|$ SSIM	PSNR $\|$ SSIM	PSNR $\|$ SSIM	PSNR $\|$ SSIM
I2I-Galip-M	27.22 $\|$ 90.18	27.30 $\|$ 90.86	32.34 $\|$ 95.74	33.12 $\|$ 95.39	26.76 $\|$ 90.75	27.70 $\|$ 91.20
I2I-Galip-S	27.47 $\|$ 90.54	27.33 $\|$ 91.06	32.11 $\|$ 95.65	32.87 $\|$ 95.62	26.99 $\|$ 90.80	27.75 $\|$ 91.07
CycleGAN	26.10 $\|$ 87.36	26.31 $\|$ 88.51	27.43 $\|$ 93.68	31.07 $\|$ 93.81	24.56 $\|$ 88.26	25.91 $\|$ 89.47
U-GAT-IT	24.44 $\|$ 86.19	24.51 $\|$ 86.85	26.81 $\|$ 91.39	29.03 $\|$ 92.11	22.98 $\|$ 85.16	24.83 $\|$ 87.44
StarGAN	20.96 $\|$ 71.40	21.00 $\|$ 71.44	26.18 $\|$ 91.80	27.33 $\|$ 91.52	21.69 $\|$ 72.92	21.51 $\|$ 72.89
UNIT	23.59 $\|$ 84.40	24.76 $\|$ 86.63	25.22 $\|$ 91.42	29.10 $\|$ 93.30	23.20 $\|$ 86.00	23.50 $\|$ 80.05
EGSDE	16.93 $\|$ 53.32	17.44 $\|$ 57.54	17.98 $\|$ 75.93	16.40 $\|$ 57.55	19.70 $\|$ 71.21	19.71 $\|$ 59.73

4.2 Datasets

We conduct experiments on the following datasets to demonstrate the performance of our approach.

1.

IXI: Translation performance demonstrated in a single-coil brain MRI dataset from (http://brain-development.org/ixi-dataset/). T₁-, T₂- and PD-weighted acquisitions are considered. In IXI, 25 subjects are used for training, 5 for validation and 10 for testing.
2.

CT-MRI: Translation performance demonstrated in pelvic T₁- and T₂-weighted MRI and CT data from [20]. In CT-MRI dataset, 9 subjects are used for training, 1 for validation and 4 for testing.

We consider the IXI dataset in both multi-domain and single-domain translation contexts. In the multi-domain scenario, we use a single network for all translation tasks, whereas in the single-domain scenario, we utilize distinct networks for each individual task. On the other hand, CT-MRI dataset only allows us to use single-domain translation context.

4.3 Competing Methods

We illustrate the model complexities using the number of parameters in each competing method in the Tab. 1. A single NVIDIA RTX A5000 GPU with PyTorch framework is utilized in all experiments.

1.

I2I-Galip: Our model is trained with Adam optimizer with an initial learning rate set at 0.0002, which is linearly decreased to 0 after the $50th$ epoch. Number of discriminator head sets are determined according to the number of domains in the translation problem, where for IXI it is 3, and 2 for CT-MRI. We utilize hyperparameters 10, 1, 1, 1, 1 for $\lambda_{cycle}$ , $\lambda_{adv}$ , $\lambda_{cls}$ , $\lambda_{clip}$ , and $\lambda_{identity}$ respectively.
2.

CycleGAN: Cycle-consistent generative adversarial model is considered [38]. The Adam optimizer is utilized for training with an initial learning rate set at 0.0002, which linearly decreased to 0 after the $50th$ epoch. The training process spans a total of 100 epochs. Weights for adversarial, cycle, identity losses are selected as 1, 10, 0.5 respectively.
3.

U-GAT-IT: An attention guided GAN model with adaptive layer-instance normalization designed for unsupervised image translation is considered [14]. Adam optimizer is utilized for training with a learning rate of 0.0001. Training lasts for 100 epochs. Weights for adversarial, cycle, identity and CAM losses are selected as 1, 10, 10, and 1000 respectively.
4.

StarGAN: A unified unsupervised image translation GAN model is considered [2]. Adam optimizer is used for training with a learning rate of 0.0001. Training length is 100 epochs. Weights for domain classification loss, reconstruction loss and gradient penalty are selected as 1, 10, 10 respectively. A single StarGAN model is trained and tested for all domain pairs in each dataset.
5.

UNIT: An unsupervised GAN model designed for unsupervised image translation is considered [17]. Adam optimizer is utilized for training with a learning rate of 0.0001 for 100 epochs. Weights for adversarial, image, style, and content reconstruction losses are selected as 1, 10, 1, 1 respectively.
6.

EGSDE: A diffusion based unpaired image translation model is considered [37]. Seperate DDPM models are trained for each translation domain to be utilized in EGSDE model. 500,000 diffusion steps are used for training of the DDPMs and T is selected as 150 to maintain source structure, and cross-validated weight parameters $\lambda_{s}$ and $\lambda_{i}$ are selected as $1\times 10^{-7}$ and 10.

5 Results

Table 3: Single-domain image translation results in CT-MRI dataset for T₁- and T₂-weighted images.

	T₁->CT		T₂->CT
	PSNR	SSIM	PSNR	SSIM
I2I-Galip	26.13	90.86	27.08	91.30
CycleGAN	24.55	78.63	27.39	89.92
U-GAT-IT	25.79	89.34	26.01	87.48
StarGAN	25.00	79.96	24.67	78.37
UNIT	26.02	79.22	25.15	75.30
EGSDE	19.03	74.63	14.74	66.67

We used Peak-Signal-to-Noise-Ratio (PSNR, dB) and Structural Similarity Index Measure (SSIM, %) to compare the translation performances of competing methods. Results are presented for both single- and multi-domain case in IXI for I2I-Galip to show the effectiveness of the proposed approach for both cases. CT-MRI results are presented as the single-domain translation. I2I-Galip-S (Single), CycleGAN, U-GAT-IT, UNIT are separately trained for all possible domain pairs while I2I-Galip-M (Multi) and StarGAN are trained once per dataset. EGSDE is a training free method for image translation but it requires separately trained diffusion models for each target domain. Tab. 2 and Tab. 3 shows the translation performance in IXI and CT-MRI datasets respectively. We show the corresponding translated images for each competing methods for distinct translation tasks in Fig. 2, Fig. 3, Fig. 4 and Fig. 5. Best performances are highlighted as bold in each table for each metric. Overall, I2I-Galip-M yields 2.17dB better PSNR and over 2% better SSIM than the second best competing method on average in IXI. I2I-Galip-S yields 0.10 dB better PSNR and 1.52% better SSIM in T₁ to CT, while providing 1.38% better SSIM in T₂ to CT task. Our method excels in capturing high-frequency details more effectively than competing approaches, providing superior image clarity and precision. Unlike other methods, it does not suffer from the noise artifacts that often degrade the quality of the output, leading to more accurate and visually appealing results even with its low computational budget as shown in Tab. 1.

5.1 Ablation Studies

We illustrate the effect of individual loss components in the proposed model for single- and multi-domain cases in Tab. 4. We observe that the most of the performance gain comes from adversarial and cycle losses although clip guidance, identity and clip encoding losses are beneficial especially in the multi-domain setting. In a single-domain setting, the adversarial loss tends to dominate, thereby reducing the influence of other loss terms. We discuss the reasons behind these results in Sec. 6.

Table 4: Single- and multi-domain ablation results in IXI dataset. PSNR and SSIM values are averaged across the whole test set.

	I2I-Galip-S		I2I-Galip-M
	IXI		IXI
	PSNR	SSIM	PSNR	SSIM
Proposed	29.09	92.48	29.07	92.35
$\lambda_{adv}=0$	19.80	60.80	18.62	49.05
$\lambda_{cls}=0$	29.00	92.26	28.88	91.99
$\lambda_{cycle}=0$	27.91	90.93	28.08	90.88
$\lambda_{clip}=0$	28.90	92.23	28.99	92.18
$\lambda_{identity}=0$	28.74	91.12	28.76	92.01

6 Discussion and Limitations

We observed only minimal benefits from using CLIP guidance, identity, and CLIP encoding losses in our experiments. Despite experimenting with various metrics to define these losses, such as Cosine, L2, and Contrastive losses, we observed similar behavior. This outcome is likely due to the dominating effect of the adversarial loss. Our discriminator, leveraging the powerful feature extraction capabilities of the pre-trained BiomedCLIP’s ViT and MSE loss, can distinguish fake images early in the training process, effectively acting as a regularizer. This regularization may limit the efficacy of CLIP guidance, which can sometimes provide incorrect directions in translation, as shown in earlier studies [29, 15]. Consequently, the impact of CLIP guidance is constrained. Additionally, our model incorporates BiomedCLIP as the multi-modal foundation model to utilize the extensive domain knowledge it gathered during pre-training. Thus, the translation performance of our method is also limited by the contrastive pre-training strategy, where the ViT, focusing on semantically meaningful feature extraction, may suboptimally extract low-level image features. Furthermore, our model may show sensitivity to the captions chosen for the target domain description. Although we tested a range of caption styles, we did not see notable benefits. Therefore, we opted for the simplest combinations employed in BiomedCLIP [36], such as ”This is a photo of XX-weighted MRI” and ”This is pelvic MRI” or ”This is pelvic CT”. We leave exploring this aspect further as a future research direction.

Our model employs a generative adversarial network thus, could be fragile to known problems in GAN training like mode collapse. Additionally, GAN training can be unstable and sensitive to hyper-parameter choices, further complicating the training process and potentially hindering the model’s effectiveness.

Our method can be readily adapted to non-medical domains with an expanding range of translations. The primary constraints would be the inherent capacity of the lightweight generator and the domain knowledge embedded in the utilized pre-trained vision-language model.

7 Conclusion

We proposed an unsupervised multi-modal image translation framework employing a generative adversarial network which is empowered with a pre-trained vision-language model. Our framework improves upon the cycle-consistent translation models while enhancing the multi-domain translation performance with a reduced computational budget.

References

[1] Karim Armanious, Chenming Jiang, Marc Fischer, Thomas Küstner, Tobias Hepp, Konstantin Nikolaou, Sergios Gatidis, and Bin Yang. Medgan: Medical image translation using gans. Computerized medical imaging and graphics, 79:101684, 2020.
[2] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018.
[3] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020.
[4] Xianjin Dai, Yang Lei, Yabo Fu, Walter J Curran, Tian Liu, Hui Mao, and Xiaofeng Yang. Multimodal mri synthesis using unified generative adversarial networks. Medical physics, 47(12):6343–6354, 2020.
[5] Salman UH Dar, Mahmut Yurt, Levent Karacan, Aykut Erdem, Erkut Erdem, and Tolga Cukur. Image synthesis in multi-contrast mri with conditional generative adversarial networks. IEEE transactions on medical imaging, 38(10):2375–2388, 2019.
[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[7] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
[8] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
[9] Junlin Han, Mehrdad Shoeiby, Lars Petersson, and Mohammad Ali Armin. Dual contrastive learning for unsupervised image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 746–755, 2021.
[10] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[11] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pages 172–189, 2018.
[12] Lan Jiang, Ye Mao, Xiangfeng Wang, Xi Chen, and Chao Li. Cola-diff: Conditional latent diffusion model for multi-modal mri synthesis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 398–408. Springer, 2023.
[13] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
[14] Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwanghee Lee. U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv preprint arXiv:1907.10830, 2019.
[15] Gihyun Kwon and Jong Chul Ye. Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264, 2022.
[16] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pages 35–51, 2018.
[17] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. Advances in neural information processing systems, 30, 2017.
[18] Xuhui Liu, Bohan Zeng, Sicheng Gao, Shanglin Li, Yutang Feng, Hong Li, Boyu Liu, Jianzhuang Liu, and Baochang Zhang. Ladiffgan: Training gans with diffusion supervision in latent spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1115–1125, 2024.
[19] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2017.
[20] Tufve Nyholm, Stina Svensson, Sebastian Andersson, Joakim Jonsson, Maja Sohlin, Christian Gustafsson, Elisabeth Kjellén, Karin Söderström, Per Albertsson, Lennart Blomqvist, et al. Mr and ct data with multiobserver delineations of organs in the pelvic area—part of the gold atlas project. Medical physics, 45(3):1295–1300, 2018.
[21] Muzaffer Özbey, Onat Dalmaz, Salman UH Dar, Hasan A Bedel, Şaban Özturk, Alper Güngör, and Tolga Çukur. Unsupervised medical image translation with adversarial diffusion models. IEEE Transactions on Medical Imaging, 2023.
[22] Jihye Park, Sunwoo Kim, Soohyun Kim, Seokju Cho, Jaejun Yoo, Youngjung Uh, and Seungryong Kim. Lanit: Language-driven image-to-image translation for unlabeled data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23401–23411, 2023.
[23] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024.
[24] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2085–2094, 2021.
[25] Pramuditha Perera, Mahdi Abavisani, and Vishal M Patel. In2i: Unsupervised multi-image-to-image translation using generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 140–146. IEEE, 2018.
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[28] Snehashis Roy, Amod Jog, Aaron Carass, and Jerry L Prince. Atlas based intensity transformation of brain mr images. In Multimodal Brain Image Analysis: Third International Workshop, MBIA 2013, Held in Conjunction with MICCAI 2013, Nagoya, Japan, September 22, 2013, Proceedings 3, pages 51–62. Springer, 2013.
[29] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
[30] Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. Galip: Generative adversarial clips for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14214–14223, 2023.
[31] Dmitrii Torbunov, Yi Huang, Haiwang Yu, Jin Huang, Shinjae Yoo, Meifeng Lin, Brett Viren, and Yihui Ren. Uvcgan: Unet vision transformer cycle-consistent gan for unpaired image-to-image translation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 702–712, 2023.
[32] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Splicing vit features for semantic appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10748–10757, 2022.
[33] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
[34] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision, pages 2849–2857, 2017.
[35] Chenlu Zhan, Yu Lin, Gaoang Wang, Hongwei Wang, and Jian Wu. Medm2g: Unifying medical multi-modal generation via cross-guided diffusion with visual invariant. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11502–11512, 2024.
[36] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915, 2023.
[37] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Advances in Neural Information Processing Systems, 35:3609–3623, 2022.
[38] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.

One-to-one task	T₁->T₂	T₂->T₁	T₂->PD	PD->T₂	T₁->PD	PD->T₁
IXI	PSNR $\|$ SSIM	PSNR $\|$ SSIM	PSNR $\|$ SSIM	PSNR $\|$ SSIM	PSNR $\|$ SSIM	PSNR $\|$ SSIM
I2I-Galip-M	27.22 $\|$ 90.18	27.30 $\|$ 90.86	32.34 $\|$ 95.74	33.12 $\|$ 95.39	26.76 $\|$ 90.75	27.70 $\|$ 91.20
I2I-Galip-S	27.47 $\|$ 90.54	27.33 $\|$ 91.06	32.11 $\|$ 95.65	32.87 $\|$ 95.62	26.99 $\|$ 90.80	27.75 $\|$ 91.07
CycleGAN	26.10 $\|$ 87.36	26.31 $\|$ 88.51	27.43 $\|$ 93.68	31.07 $\|$ 93.81	24.56 $\|$ 88.26	25.91 $\|$ 89.47
U-GAT-IT	24.44 $\|$ 86.19	24.51 $\|$ 86.85	26.81 $\|$ 91.39	29.03 $\|$ 92.11	22.98 $\|$ 85.16	24.83 $\|$ 87.44
StarGAN	20.96 $\|$ 71.40	21.00 $\|$ 71.44	26.18 $\|$ 91.80	27.33 $\|$ 91.52	21.69 $\|$ 72.92	21.51 $\|$ 72.89
UNIT	23.59 $\|$ 84.40	24.76 $\|$ 86.63	25.22 $\|$ 91.42	29.10 $\|$ 93.30	23.20 $\|$ 86.00	23.50 $\|$ 80.05
EGSDE	16.93 $\|$ 53.32	17.44 $\|$ 57.54	17.98 $\|$ 75.93	16.40 $\|$ 57.55	19.70 $\|$ 71.21	19.71 $\|$ 59.73