I2I-Galip: Unsupervised Medical Image Translation Using Generative Adversarial CLIP

Yilmaz Korkmaz, Vishal M. Patel
Abstract

Unpaired image-to-image translation is a challenging task due to the absence of paired examples, which complicates learning the complex mappings between the distinct distributions of the source and target domains. One of the most commonly used approach for this task is CycleGAN which requires the training of a new pair of generator-discriminator networks for each domain pair. In this paper, we propose a new image-to-image translation framework named Image-to-Image-Generative-Adversarial-CLIP (I2I-Galip) where we utilize a pre-trained multi-model foundation model (i.e., CLIP) to mitigate the need of separate generator-discriminator pairs for each source-target mapping while achieving better and more efficient multi-domain translation. By utilizing the massive knowledge gathered during pre-training a foundation model, our approach makes use of a single lightweight generator network with \approx13M parameters for the multi-domain image translation task. Comprehensive experiments on translation performance in public MRI and CT datasets show the superior performance of the proposed framework over the existing approaches. Code will be available here.

1   Introduction

Medical image translation is a challenging task because of significantly different domain distributions, necessitating the learning of very complex mappings between different imaging modalities [28]. Many supervised deep learning-based image translation methods have been proposed to address this problem [5, 12, 1]. However, these methods are limited due to the requirement of paired training data which might be challenging to acquire in real case scenarios. To overcome this constraint, various unsupervised image translation methods have been introduced for both general computer vision and medical imaging tasks [4, 17, 11, 21, 9, 34, 31]. CycleGAN [38] was one of the first approaches that proposed unpaired image translation which loosened the requirement for paired datasets by enforcing cycle-consistency among inverse translations. However, in the case of multiple modalities, CycleGAN introduces significant computational requirements as separate generator-discriminator pairs are required for each new modality. To mitigate the need of separate network pairs several multi-domain translation frameworks have been proposed [2, 3, 11, 16]. Nonetheless, these methods generally lag in performance compared to uni-modal approaches.

More recently, a couple of text-driven diffusion based image-to-image translation frameworks have been proposed that integrate large vision-language pre-trained models as guidance [33, 27, 10, 15], enabling robust translation across multiple domains. While these models provide zero-shot editing capabilities for various text conditions, they are limited in delivering fidelity necessary for the medical tasks. Moreover, these methods impose a significant computational burden due to the requirement for large denoiser backbones and extended inference times in their backward diffusion processes.

In this paper, we propose a cycle-consistent generative adversarial model to address the aforementioned limitations. Our model integrates BiomedCLIP (see Sec. 3.2), a pre-trained multi-modal vision-language model specifically trained in the medical domain, within a cycle-consistent feed-forward framework. By leveraging contrastive information from this large pre-trained network, we eliminate the need to train a new generator network for each translation task and reduce the requirement for large discriminators in feature extraction. Furthermore, our model enhances overall translation performance compared to existing unsupervised approaches in both single and multi-domain translation tasks.

Our main contributions can be summarized as follows:

  • We introduce an adversarial framework for language-driven image translation for medical images.

  • Our framework employs a novel CLIP driven cycle-consistent image translation model.

  • Our model outperforms existing unsupervised baselines with a relatively lightweight backbone. Extensive experiments demonstrate its superior performance across various publicly available datasets from different modalities.

2   Related Works

Cycle-consistent image translation.

Zhu et al. revolutionized the field of unsupervised image translation with their proposal of CycleGAN [38]. Yi et al. proposed DualGAN [34] which is a concurrent work with CycleGAN offering the same cycle-consistency loss. Various studies followed the cycle-consistency constraint for more faithful translation in the unsupervised setting. Liu et al. proposed UNIT [17] for uni-modal translation where a shared latent space is assumed between source and target modalities. Huang et al. proposed MUNIT [11] where UNIT’s assumption of shared latent space is divided into content and style for multi-domain translation. Lee et al. [16] introduced DRIT, which shares a similar approach to MUNIT by using disentangled content and attribute latents for multi-domain translation. Choi et al. proposed StarGANv1 [2] and StarGANv2 [3] where they utilized a separate style encoder network to generate distinct style codes to be used in generator for multi-domain translation. Perera et al. [25] proposed an alternative method where they utilize multi-domain input modalities with a latent-consistency loss. Kim et al. proposed U-GAT-IT [14] with an advanced generator equipped with adaptive layer instance normalization layers and attention. Torbunov et al. proposed UVCGan [31] employing a pre-trained vision transformer as generator in a cycle-consistent framework for improved translation performance.

Text-guided image translation.

Following the advancements in vision-language models [26] several text-guided unsupervised image translation methods proposed with or without cycle-consistency constraint. Park et al. proposed LANIT [22] where they use CLIP to generate pseudo labels for unlabeled images with a similar approach in Starganv2. Gal et al. proposed StyleGAN-NADA [7] for CLIP driven adaptation of Stylegan2 generator [13]. Patashnik et al. proposed StyleCLIP [24] where they invert source image to find its latent code for CLIP guided feature manipulation.

Diffusion-based image translation.

More recently, building on the success of diffusion models in image generation, various unsupervised image translation methods utilizing diffusion-based backbones have been proposed. Zhao et al. proposed EGSDE [37] where they utilize energy-guided translation between diversely trained diffusion models. Özbey et al. proposed SynDiff [21], where they use multiple cycle-consistent diffusive and non-diffusive generators for improved translation performance. Kwon et al. proposed DiffuseIT [15] and used pre-trained vision transformers as guidance in image manipulation. Tumanyan et al. [33] offered a plug and play framework to adapt pre-trained text-to-image diffusion models in image translation. Zhan et al. proposed MedM2G [35], where they proposed a unified multi-modal diffusive framework for text to image, image to text synthesis and image translation tasks. Parmar et al. [23] proposed a one step diffusion model for unsupervised image translation adapting the pre-trained latent diffusion weights. Liu et al. [18] offered an adversarial network utilizing diffusion supervision in latent space.

Our approach shares similarities with MedM2G [35] in employing a multi-modal text-guided framework for image translation. However, our model is over an order of magnitude smaller, leveraging a feed-forward generative adversarial network architecture and enforcing cycle-consistency across translations. We also incorporate common loss terms with DiffuseIT [15], utilizing CLS tokens from pre-trained vision transformers for semantically meaningful information extraction. Nonetheless, our approach differs in its use of cycle-consistency and the feed forward generative adversarial methodology adopted. We named our method in reference to the text-to-image generative adversarial model Galip [30]. However, apart from the CLIP based feature extraction utilized for the Discriminator, our method does not share further similarities with Galip in terms of architecture or training methodology.

3   Background

3.1   Cycle-Consistent Generative Adversarial Networks (CycleGAN)

CycleGAN [38] models the unpaired image translation problem between domain A and B using two translators. First, two translators (G : A𝐴Aitalic_AB𝐵Bitalic_B) and (F : B𝐵Bitalic_BA𝐴Aitalic_A) are defined. Then G and F are forced to be inverses of each other, thus making both mappings to be approximately bijections. CycleGAN achieves remarkable performance using this cycle-consistency combined with the adversarial loss which encourages F(G(XA))XA𝐹𝐺subscript𝑋𝐴subscript𝑋𝐴F(G(X_{A}))\approx X_{A}italic_F ( italic_G ( italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) ) ≈ italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT and G(F(XB))XB𝐺𝐹subscript𝑋𝐵subscript𝑋𝐵G(F(X_{B}))\approx X_{B}italic_G ( italic_F ( italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) ) ≈ italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT.

3.2   BiomedCLIP

In this paper, we utilize BiomedCLIP [36] as our pre-trained vision-language model. BiomedCLIP is trained on PMC-15M dataset using pairs of figures and captions from biomedical research articles in PubMed Central and outperforms other medical vision-language models in various tasks [36]. BiomedCLIP utilizes a ViT-B [6] based image encoder while utilizing PubMedBERT [8] for the text embeddings.

4   Methodology

4.1   I2I-Galip

Refer to caption
Figure 1: Training scheme and overall model architecture of I2I-Galip is illustrated when input image is from domain B. Part a illustrates Lclipsubscript𝐿𝑐𝑙𝑖𝑝L_{clip}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT, Lclssubscript𝐿𝑐𝑙𝑠{L}_{cls}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT and Ladvsubscript𝐿𝑎𝑑𝑣{L}_{adv}italic_L start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT losses along with U-Net based generator, discriminator head and BiomedCLIP’s ViT-B. Part b and c shows the definition of Lcyclesubscript𝐿𝑐𝑦𝑐𝑙𝑒{L}_{cycle}italic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT and Lidentitysubscript𝐿𝑖𝑑𝑒𝑛𝑡𝑖𝑡𝑦{L}_{identity}italic_L start_POSTSUBSCRIPT italic_i italic_d italic_e italic_n italic_t italic_i italic_t italic_y end_POSTSUBSCRIPT losses respectively. BiomedCLIP’s ViT-B and text encoder parameters are frozen during training. ”This MRI Image is T2-weighted” corresponds to a sample caption used in T1 to T2 translation.

We design a lightweight generator network which is a very thin variant of the latent diffusion U-Net [27] (with only 13absent13\approx 13≈ 13M parameters). Our discriminator network is using the projections of intermediate Vision Transformer (ViT) features as input, adapted from text-to-image model Stylegan-T [29]. This discriminator design allow us to utilize the output of different layers in BiomedCLIP’s ViT, capturing different level of details. We modified this design by dividing the discriminator heads into distinct sets, tailored specifically for a target translation domain. We also utilize BiomedCLIP’s text encoder to generate target text embeddings using captions for each modality (see Fig. 1a), which controls the generated image features via cross-attention transformers while serving as a regularizer in the training (shown in Fig. 1a). Overall training objective for the generator can be expressed as follows

total=subscript𝑡𝑜𝑡𝑎𝑙absent\displaystyle\mathcal{L}_{total}=caligraphic_L start_POSTSUBSCRIPT italic_t italic_o italic_t italic_a italic_l end_POSTSUBSCRIPT = λcyclecycle+λadvadvGsubscript𝜆𝑐𝑦𝑐𝑙𝑒subscript𝑐𝑦𝑐𝑙𝑒subscript𝜆𝑎𝑑𝑣subscript𝑎𝑑subscript𝑣𝐺\displaystyle\lambda_{cycle}\cdot\mathcal{L}_{cycle}+\lambda_{adv}\cdot% \mathcal{L}_{adv_{G}}italic_λ start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT ⋅ caligraphic_L start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT ⋅ caligraphic_L start_POSTSUBSCRIPT italic_a italic_d italic_v start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT end_POSTSUBSCRIPT (1)
+λclipclip+λclsclssubscript𝜆𝑐𝑙𝑖𝑝subscript𝑐𝑙𝑖𝑝subscript𝜆𝑐𝑙𝑠subscript𝑐𝑙𝑠\displaystyle+\lambda_{clip}\cdot\mathcal{L}_{clip}+\lambda_{cls}\cdot\mathcal% {L}_{cls}+ italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT ⋅ caligraphic_L start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT ⋅ caligraphic_L start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT
+λidentityidentity,subscript𝜆𝑖𝑑𝑒𝑛𝑡𝑖𝑡𝑦subscript𝑖𝑑𝑒𝑛𝑡𝑖𝑡𝑦\displaystyle+\lambda_{identity}\cdot\mathcal{L}_{identity},+ italic_λ start_POSTSUBSCRIPT italic_i italic_d italic_e italic_n italic_t italic_i italic_t italic_y end_POSTSUBSCRIPT ⋅ caligraphic_L start_POSTSUBSCRIPT italic_i italic_d italic_e italic_n italic_t italic_i italic_t italic_y end_POSTSUBSCRIPT ,

where λcycle,λadv,λclip,λcls,λidentitysubscript𝜆𝑐𝑦𝑐𝑙𝑒subscript𝜆𝑎𝑑𝑣subscript𝜆𝑐𝑙𝑖𝑝subscript𝜆𝑐𝑙𝑠subscript𝜆𝑖𝑑𝑒𝑛𝑡𝑖𝑡𝑦\lambda_{cycle},\lambda_{adv},\lambda_{clip},\lambda_{cls},\lambda_{identity}italic_λ start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT italic_i italic_d italic_e italic_n italic_t italic_i italic_t italic_y end_POSTSUBSCRIPT are coefficients to control the contribution from each loss. We denote the loss associated with the discriminator as advDsubscript𝑎𝑑subscript𝑣𝐷\mathcal{L}_{adv_{D}}caligraphic_L start_POSTSUBSCRIPT italic_a italic_d italic_v start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT end_POSTSUBSCRIPT, apart from that other loss terms are utilized in generator training. In what follows, we describe each of these loss terms in detail.

  1. 1.

    Adversarial Loss: By leveraging intermediate features from the ViT, direct feature extraction from images becomes unnecessary, enabling the use of lightweight discriminator heads for each feature level. We utilize the least squares GAN loss [19] to enhance the stability of training instead of Hinge loss used in StyleGAN-T, which can be defined as follows

    advG=subscript𝑎𝑑subscript𝑣𝐺absent\displaystyle\mathcal{L}_{adv_{G}}=caligraphic_L start_POSTSUBSCRIPT italic_a italic_d italic_v start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 𝔼[(HeadA(EXAout,ETA)1)2]𝔼delimited-[]superscript𝐻𝑒𝑎subscript𝑑𝐴superscriptsubscript𝐸subscript𝑋𝐴𝑜𝑢𝑡subscript𝐸subscript𝑇𝐴12\displaystyle\mathbb{E}[\left(Head_{A}(E_{X_{A}}^{out},E_{T_{A}})-1\right)^{2}]blackboard_E [ ( italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (2)
    +𝔼[(HeadB(EXBout,ETB)1)2],𝔼delimited-[]superscript𝐻𝑒𝑎subscript𝑑𝐵superscriptsubscript𝐸subscript𝑋𝐵𝑜𝑢𝑡subscript𝐸subscript𝑇𝐵12\displaystyle+\mathbb{E}[\left(Head_{B}(E_{X_{B}}^{out},E_{T_{B}})-1\right)^{2% }],+ blackboard_E [ ( italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ,
    advD=subscript𝑎𝑑subscript𝑣𝐷absent\displaystyle\mathcal{L}_{adv_{D}}=caligraphic_L start_POSTSUBSCRIPT italic_a italic_d italic_v start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 𝔼[(HeadA(EXAinput,ETA)1)2\displaystyle\mathbb{E}\left[\left(Head_{A}(E_{X_{A}}^{input},E_{T_{A}})-1% \right)^{2}\right.blackboard_E [ ( italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (3)
    +(HeadA(EXAout,ETA))2]\displaystyle\left.+\left(Head_{A}(E_{X_{A}}^{out},E_{T_{A}})\right)^{2}\right]+ ( italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
    +𝔼[(HeadB(EXBinput,ETB)1)2\displaystyle+\mathbb{E}\left[\left(Head_{B}(E_{X_{B}}^{input},E_{T_{B}})-1% \right)^{2}\right.+ blackboard_E [ ( italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
    +(HeadB(EXBout,ETB))2]\displaystyle\left.+\left(Head_{B}(E_{X_{B}}^{out},E_{T_{B}})\right)^{2}\right]+ ( italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]

    where HeadA𝐻𝑒𝑎subscript𝑑𝐴Head_{A}italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT and HeadB𝐻𝑒𝑎subscript𝑑𝐵Head_{B}italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT corresponds to the discriminator heads allocated for the specific target domain, ETAsubscript𝐸subscript𝑇𝐴E_{T_{A}}italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT and ETBsubscript𝐸subscript𝑇𝐵E_{T_{B}}italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT are corresponding text encodings (i.e., text encodings of captions for each domain), EXAinputsuperscriptsubscript𝐸subscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡E_{X_{A}}^{input}italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT, EXBinputsuperscriptsubscript𝐸subscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡E_{X_{B}}^{input}italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT, EXAoutsuperscriptsubscript𝐸subscript𝑋𝐴𝑜𝑢𝑡E_{X_{A}}^{out}italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT and EXBoutsuperscriptsubscript𝐸subscript𝑋𝐵𝑜𝑢𝑡E_{X_{B}}^{out}italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT are feature maps from ViT for input and generated images from domain A and B, respectively. We describe how these terms are generated as follows

    EXAinput=ViTint(XAinput)superscriptsubscript𝐸subscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡𝑉𝑖subscript𝑇𝑖𝑛𝑡superscriptsubscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡\displaystyle E_{X_{A}}^{input}=ViT_{int}(X_{A}^{input})italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT = italic_V italic_i italic_T start_POSTSUBSCRIPT italic_i italic_n italic_t end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT ) (4)
    EXBinput=ViTint(XBinput)superscriptsubscript𝐸subscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡𝑉𝑖subscript𝑇𝑖𝑛𝑡superscriptsubscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡\displaystyle E_{X_{B}}^{input}=ViT_{int}(X_{B}^{input})italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT = italic_V italic_i italic_T start_POSTSUBSCRIPT italic_i italic_n italic_t end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT ) (5)
    EXAout=ViTint(G(XBinput,TA,YA))superscriptsubscript𝐸subscript𝑋𝐴𝑜𝑢𝑡𝑉𝑖subscript𝑇𝑖𝑛𝑡𝐺superscriptsubscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡subscript𝑇𝐴subscript𝑌𝐴\displaystyle E_{X_{A}}^{out}=ViT_{int}(G(X_{B}^{input},{T_{A}},Y_{A}))italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT = italic_V italic_i italic_T start_POSTSUBSCRIPT italic_i italic_n italic_t end_POSTSUBSCRIPT ( italic_G ( italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) ) (6)
    EXBout=ViTint(G(XAinput,TB,YB))superscriptsubscript𝐸subscript𝑋𝐵𝑜𝑢𝑡𝑉𝑖subscript𝑇𝑖𝑛𝑡𝐺superscriptsubscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡subscript𝑇𝐵subscript𝑌𝐵\displaystyle E_{X_{B}}^{out}=ViT_{int}(G(X_{A}^{input},{T_{B}},Y_{B}))italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT = italic_V italic_i italic_T start_POSTSUBSCRIPT italic_i italic_n italic_t end_POSTSUBSCRIPT ( italic_G ( italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) ) (7)

    where ViTint𝑉𝑖subscript𝑇𝑖𝑛𝑡ViT_{int}italic_V italic_i italic_T start_POSTSUBSCRIPT italic_i italic_n italic_t end_POSTSUBSCRIPT corresponds to intermediate layers of ViT, XAinputsuperscriptsubscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡X_{A}^{input}italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT and XBinputsuperscriptsubscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡X_{B}^{input}italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT are unpaired input images, YAsubscript𝑌𝐴Y_{A}italic_Y start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT and YBsubscript𝑌𝐵Y_{B}italic_Y start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT are binary class labels for domain A and B, respectively.

  2. 2.

    Cycle Loss: We enforce cycle-consistency, shown in Fig. 1b, to encourage more faithful translation between source and target domains for each pair which is defined as follows

    cycle=subscriptcycleabsent\displaystyle\mathcal{L}_{\text{cycle}}=caligraphic_L start_POSTSUBSCRIPT cycle end_POSTSUBSCRIPT = 𝔼[XBinputG(XAinput,ETB,YB)1]𝔼delimited-[]subscriptnormsuperscriptsubscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡𝐺superscriptsubscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡subscript𝐸subscript𝑇𝐵subscript𝑌𝐵1\displaystyle\mathbb{E}\left[\|X_{B}^{input}-G(X_{A}^{input},E_{T_{B}},Y_{B})% \|_{1}\right]blackboard_E [ ∥ italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT - italic_G ( italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ] (9)
    +𝔼[XAinputG(XBinput,ETA,YA)1].𝔼delimited-[]subscriptnormsuperscriptsubscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡𝐺superscriptsubscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡subscript𝐸subscript𝑇𝐴subscript𝑌𝐴1\displaystyle+\mathbb{E}\left[\|X_{A}^{input}-G(X_{B}^{input},E_{T_{A}},Y_{A})% \|_{1}\right].+ blackboard_E [ ∥ italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT - italic_G ( italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ] .
  3. 3.

    CLIP Guidance Loss: We minimize cosine distance between the text encoding corresponds to the caption of target domain (e.g., ”This MRI image is T1-Weighted” or ”This is pelvic CT”) and the encoding from ViT for the generated images to enable the utilization of CLIP’s joint embedding space similarly with [24], which can be defined as follows

    clip=ETA,EXAoutlastETAEXAoutlastETB,EXBoutlastETBEXBoutlast,subscriptclipsubscript𝐸subscript𝑇𝐴superscriptsubscript𝐸subscript𝑋𝐴𝑜𝑢superscript𝑡𝑙𝑎𝑠𝑡normsubscript𝐸subscript𝑇𝐴normsuperscriptsubscript𝐸subscript𝑋𝐴𝑜𝑢superscript𝑡𝑙𝑎𝑠𝑡subscript𝐸subscript𝑇𝐵superscriptsubscript𝐸subscript𝑋𝐵𝑜𝑢superscript𝑡𝑙𝑎𝑠𝑡normsubscript𝐸subscript𝑇𝐵normsuperscriptsubscript𝐸subscript𝑋𝐵𝑜𝑢superscript𝑡𝑙𝑎𝑠𝑡\mathcal{L}_{\text{clip}}=-\frac{\langle E_{T_{A}},E_{X_{A}}^{out^{last}}% \rangle}{\|E_{T_{A}}\|\cdot\|E_{X_{A}}^{out^{last}}\|}-\frac{\langle E_{T_{B}}% ,E_{X_{B}}^{out^{last}}\rangle}{\|E_{T_{B}}\|\cdot\|E_{X_{B}}^{out^{last}}\|},caligraphic_L start_POSTSUBSCRIPT clip end_POSTSUBSCRIPT = - divide start_ARG ⟨ italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t start_POSTSUPERSCRIPT italic_l italic_a italic_s italic_t end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ⟩ end_ARG start_ARG ∥ italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ ⋅ ∥ italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t start_POSTSUPERSCRIPT italic_l italic_a italic_s italic_t end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ∥ end_ARG - divide start_ARG ⟨ italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t start_POSTSUPERSCRIPT italic_l italic_a italic_s italic_t end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ⟩ end_ARG start_ARG ∥ italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ ⋅ ∥ italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t start_POSTSUPERSCRIPT italic_l italic_a italic_s italic_t end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ∥ end_ARG , (10)

    where EXAoutlastsuperscriptsubscript𝐸subscript𝑋𝐴𝑜𝑢superscript𝑡𝑙𝑎𝑠𝑡E_{X_{A}}^{out^{last}}italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t start_POSTSUPERSCRIPT italic_l italic_a italic_s italic_t end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT, EXBoutlastsuperscriptsubscript𝐸subscript𝑋𝐵𝑜𝑢superscript𝑡𝑙𝑎𝑠𝑡E_{X_{B}}^{out^{last}}italic_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t start_POSTSUPERSCRIPT italic_l italic_a italic_s italic_t end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT are the image encodings from the last layer of ViT for the generated images from domain A and B, respectively. Generally, clipsubscript𝑐𝑙𝑖𝑝\mathcal{L}_{clip}caligraphic_L start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT is dominated by cycle and adversarial losses giving comparingly small benefits (see Sec. 5.1 for details).

  4. 4.

    CLIP Encoding Loss: The CLS tokens in the final layers of vision transformers are recognized for containing semantically rich information, as highlighted by [32], which is typically leveraged for downstream classification tasks and shown to be beneficial in image translation [15]. Therefore, we enforce cosine similarity between the CLS tokens in the ViT for the generated and target domain’s images to enforce semantic similarity among these images, which can be written as follows

    cls=clsXAinput,clsXAoutclsXAinputclsXAoutclsXBinput,clsXBoutclsXBinputclsXBout,subscriptcls𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐴𝑜𝑢𝑡norm𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡norm𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐴𝑜𝑢𝑡𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐵𝑜𝑢𝑡norm𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡norm𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐵𝑜𝑢𝑡\mathcal{L}_{\text{cls}}=-\frac{\langle cls_{X_{A}}^{input},cls_{X_{A}}^{out}% \rangle}{\|cls_{X_{A}}^{input}\|\cdot\|cls_{X_{A}}^{out}\|}-\frac{\langle cls_% {X_{B}}^{input},cls_{X_{B}}^{out}\rangle}{\|cls_{X_{B}}^{input}\|\cdot\|cls_{X% _{B}}^{out}\|},caligraphic_L start_POSTSUBSCRIPT cls end_POSTSUBSCRIPT = - divide start_ARG ⟨ italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT ⟩ end_ARG start_ARG ∥ italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT ∥ ⋅ ∥ italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT ∥ end_ARG - divide start_ARG ⟨ italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT ⟩ end_ARG start_ARG ∥ italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT ∥ ⋅ ∥ italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT ∥ end_ARG , (11)

    where clsXAinput𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐴𝑖𝑛𝑝𝑢𝑡cls_{X_{A}}^{input}italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT and clsXBinput𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡cls_{X_{B}}^{input}italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT are the CLS tokens in the last layers of ViT for input images from domain A and B respectively. clsXAout𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐴𝑜𝑢𝑡cls_{X_{A}}^{out}italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT and clsXBout𝑐𝑙superscriptsubscript𝑠subscript𝑋𝐵𝑜𝑢𝑡cls_{X_{B}}^{out}italic_c italic_l italic_s start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT are corresponding CLS tokens for the generated images.

  5. 5.

    Identity Loss: Identity loss is found to be beneficial to maintain source image structure in translation by enforcing the pixel-level equality when target and source domains match [38]. We enforce it via using same labels and text embeddings corresponding to the input image domain (see Fig. 1c). It can be defined using our framework as follows

    identity=subscriptidentityabsent\displaystyle\mathcal{L}_{\text{identity}}=caligraphic_L start_POSTSUBSCRIPT identity end_POSTSUBSCRIPT = 𝔼[XAinputG(XAinput,ETA,YA))1]\displaystyle\mathbb{E}\left[\|X_{A}^{input}-G(X_{A}^{input},E_{T_{A}},Y_{A}))% \|_{1}\right]blackboard_E [ ∥ italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT - italic_G ( italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ] (12)
    +𝔼[XBinputG(XBinput,ETB,YB)1].𝔼delimited-[]subscriptnormsuperscriptsubscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡𝐺superscriptsubscript𝑋𝐵𝑖𝑛𝑝𝑢𝑡subscript𝐸subscript𝑇𝐵subscript𝑌𝐵1\displaystyle+\mathbb{E}\left[\|X_{B}^{input}-G(X_{B}^{input},E_{T_{B}},Y_{B})% \|_{1}\right].+ blackboard_E [ ∥ italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT - italic_G ( italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n italic_p italic_u italic_t end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ] .
Refer to caption
Figure 2: Multi-domain translation illustrations from PD to T1-weighted image in IXI dataset. Accompanying this are error maps and magnified sections, positioned below and above each translation, respectively.
Refer to caption
Figure 3: Single-domain translation from T1-weighted Pelvic MRI to CT images. Accompanying this are error maps and magnified sections, positioned below and above each translation, respectively.
Table 1: Model complexities are illustrated using total number of parameters for each competing method. The third row indicates the number of required generator and discriminator networks, given the specified number of domain. T𝑇Titalic_T and P(.) represents the number of domains for a multi-modal translation problem and permutation operator respectively. Total parameters are calculated for a representative case where T=4𝑇4T=4italic_T = 4.
Network/Model I2I-Galip CycleGAN U-GAT-IT StarGAN UNIT EGSDE
Generator (G) 13.2M 11.3M 278.9M 8.4M 5.4M + 5.4M 164M
Discriminator (D) 23.9M 2.7M 56.4M 44.8M 2.8M 0
Times (G, D) 1, T P(T,2), P(T,2) P(T,2), P(T,2) 1, 1 P(T,2), P(T,2) T, 0
Total 108.8M 169.5M 4023.6M 53.2M 162.2M 657.2M
Table 2: Multi-domain image translation results in IXI dataset. T1-, T2- and PD-weighted images are considered.
One-to-one task T1->T2 T2->T1 T2->PD PD->T2 T1->PD PD->T1
IXI PSNR |||| SSIM PSNR |||| SSIM PSNR |||| SSIM PSNR |||| SSIM PSNR |||| SSIM PSNR |||| SSIM
I2I-Galip-M 27.22 |||| 90.18 27.30 |||| 90.86 32.34 |||| 95.74 33.12 |||| 95.39 26.76 |||| 90.75 27.70 |||| 91.20
I2I-Galip-S 27.47|||| 90.54 27.33 |||| 91.06 32.11 |||| 95.65 32.87 |||| 95.62 26.99 |||| 90.80 27.75|||| 91.07
CycleGAN 26.10 |||| 87.36 26.31 |||| 88.51 27.43 |||| 93.68 31.07 |||| 93.81 24.56 |||| 88.26 25.91 |||| 89.47
U-GAT-IT 24.44 |||| 86.19 24.51 |||| 86.85 26.81 |||| 91.39 29.03 |||| 92.11 22.98 |||| 85.16 24.83 |||| 87.44
StarGAN 20.96 |||| 71.40 21.00 |||| 71.44 26.18 |||| 91.80 27.33 |||| 91.52 21.69 |||| 72.92 21.51 |||| 72.89
UNIT 23.59 |||| 84.40 24.76 |||| 86.63 25.22 |||| 91.42 29.10 |||| 93.30 23.20 |||| 86.00 23.50 |||| 80.05
EGSDE 16.93 |||| 53.32 17.44 |||| 57.54 17.98 |||| 75.93 16.40 |||| 57.55 19.70 |||| 71.21 19.71 |||| 59.73

4.2   Datasets

We conduct experiments on the following datasets to demonstrate the performance of our approach.

  1. 1.

    IXI: Translation performance demonstrated in a single-coil brain MRI dataset from (http://brain-development.org/ixi-dataset/). T1-, T2- and PD-weighted acquisitions are considered. In IXI, 25 subjects are used for training, 5 for validation and 10 for testing.

  2. 2.

    CT-MRI: Translation performance demonstrated in pelvic T1- and T2-weighted MRI and CT data from [20]. In CT-MRI dataset, 9 subjects are used for training, 1 for validation and 4 for testing.

We consider the IXI dataset in both multi-domain and single-domain translation contexts. In the multi-domain scenario, we use a single network for all translation tasks, whereas in the single-domain scenario, we utilize distinct networks for each individual task. On the other hand, CT-MRI dataset only allows us to use single-domain translation context.

Refer to caption
Figure 4: Multi-domain translation illustrations from T1-weighted to T2-weighted image in IXI dataset. Accompanying this are error maps and magnified sections, positioned below and above each translation, respectively.

4.3   Competing Methods

We illustrate the model complexities using the number of parameters in each competing method in the Tab. 1. A single NVIDIA RTX A5000 GPU with PyTorch framework is utilized in all experiments.

  1. 1.

    I2I-Galip: Our model is trained with Adam optimizer with an initial learning rate set at 0.0002, which is linearly decreased to 0 after the 50th50𝑡50th50 italic_t italic_h epoch. Number of discriminator head sets are determined according to the number of domains in the translation problem, where for IXI it is 3, and 2 for CT-MRI. We utilize hyperparameters 10, 1, 1, 1, 1 for λcyclesubscript𝜆𝑐𝑦𝑐𝑙𝑒\lambda_{cycle}italic_λ start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT, λadvsubscript𝜆𝑎𝑑𝑣\lambda_{adv}italic_λ start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT, λclssubscript𝜆𝑐𝑙𝑠\lambda_{cls}italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT, λclipsubscript𝜆𝑐𝑙𝑖𝑝\lambda_{clip}italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT, and λidentitysubscript𝜆𝑖𝑑𝑒𝑛𝑡𝑖𝑡𝑦\lambda_{identity}italic_λ start_POSTSUBSCRIPT italic_i italic_d italic_e italic_n italic_t italic_i italic_t italic_y end_POSTSUBSCRIPT respectively.

  2. 2.

    CycleGAN: Cycle-consistent generative adversarial model is considered [38]. The Adam optimizer is utilized for training with an initial learning rate set at 0.0002, which linearly decreased to 0 after the 50th50𝑡50th50 italic_t italic_h epoch. The training process spans a total of 100 epochs. Weights for adversarial, cycle, identity losses are selected as 1, 10, 0.5 respectively.

  3. 3.

    U-GAT-IT: An attention guided GAN model with adaptive layer-instance normalization designed for unsupervised image translation is considered [14]. Adam optimizer is utilized for training with a learning rate of 0.0001. Training lasts for 100 epochs. Weights for adversarial, cycle, identity and CAM losses are selected as 1, 10, 10, and 1000 respectively.

  4. 4.

    StarGAN: A unified unsupervised image translation GAN model is considered [2]. Adam optimizer is used for training with a learning rate of 0.0001. Training length is 100 epochs. Weights for domain classification loss, reconstruction loss and gradient penalty are selected as 1, 10, 10 respectively. A single StarGAN model is trained and tested for all domain pairs in each dataset.

  5. 5.

    UNIT: An unsupervised GAN model designed for unsupervised image translation is considered [17]. Adam optimizer is utilized for training with a learning rate of 0.0001 for 100 epochs. Weights for adversarial, image, style, and content reconstruction losses are selected as 1, 10, 1, 1 respectively.

  6. 6.

    EGSDE: A diffusion based unpaired image translation model is considered [37]. Seperate DDPM models are trained for each translation domain to be utilized in EGSDE model. 500,000 diffusion steps are used for training of the DDPMs and T is selected as 150 to maintain source structure, and cross-validated weight parameters λssubscript𝜆𝑠\lambda_{s}italic_λ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT and λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are selected as 1×1071superscript1071\times 10^{-7}1 × 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT and 10.

Refer to caption
Figure 5: Multi-domain translation illustrations from T2-weighted to PD image in IXI dataset. Accompanying this are error maps and magnified sections, positioned below and above each translation, respectively.

5   Results

Table 3: Single-domain image translation results in CT-MRI dataset for T1- and T2-weighted images.
T1->CT T2->CT
PSNR SSIM PSNR SSIM
I2I-Galip 26.13 90.86 27.08 91.30
CycleGAN 24.55 78.63 27.39 89.92
U-GAT-IT 25.79 89.34 26.01 87.48
StarGAN 25.00 79.96 24.67 78.37
UNIT 26.02 79.22 25.15 75.30
EGSDE 19.03 74.63 14.74 66.67

We used Peak-Signal-to-Noise-Ratio (PSNR, dB) and Structural Similarity Index Measure (SSIM, %) to compare the translation performances of competing methods. Results are presented for both single- and multi-domain case in IXI for I2I-Galip to show the effectiveness of the proposed approach for both cases. CT-MRI results are presented as the single-domain translation. I2I-Galip-S (Single), CycleGAN, U-GAT-IT, UNIT are separately trained for all possible domain pairs while I2I-Galip-M (Multi) and StarGAN are trained once per dataset. EGSDE is a training free method for image translation but it requires separately trained diffusion models for each target domain. Tab. 2 and Tab. 3 shows the translation performance in IXI and CT-MRI datasets respectively. We show the corresponding translated images for each competing methods for distinct translation tasks in Fig. 2, Fig. 3, Fig. 4 and Fig. 5. Best performances are highlighted as bold in each table for each metric. Overall, I2I-Galip-M yields 2.17dB better PSNR and over 2% better SSIM than the second best competing method on average in IXI. I2I-Galip-S yields 0.10 dB better PSNR and 1.52% better SSIM in T1 to CT, while providing 1.38% better SSIM in T2 to CT task. Our method excels in capturing high-frequency details more effectively than competing approaches, providing superior image clarity and precision. Unlike other methods, it does not suffer from the noise artifacts that often degrade the quality of the output, leading to more accurate and visually appealing results even with its low computational budget as shown in Tab. 1.

5.1   Ablation Studies

We illustrate the effect of individual loss components in the proposed model for single- and multi-domain cases in Tab. 4. We observe that the most of the performance gain comes from adversarial and cycle losses although clip guidance, identity and clip encoding losses are beneficial especially in the multi-domain setting. In a single-domain setting, the adversarial loss tends to dominate, thereby reducing the influence of other loss terms. We discuss the reasons behind these results in Sec. 6.

Table 4: Single- and multi-domain ablation results in IXI dataset. PSNR and SSIM values are averaged across the whole test set.
I2I-Galip-S I2I-Galip-M
IXI IXI
PSNR SSIM PSNR SSIM
Proposed 29.09 92.48 29.07 92.35
λadv=0subscript𝜆𝑎𝑑𝑣0\lambda_{adv}=0italic_λ start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT = 0 19.80 60.80 18.62 49.05
λcls=0subscript𝜆𝑐𝑙𝑠0\lambda_{cls}=0italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT = 0 29.00 92.26 28.88 91.99
λcycle=0subscript𝜆𝑐𝑦𝑐𝑙𝑒0\lambda_{cycle}=0italic_λ start_POSTSUBSCRIPT italic_c italic_y italic_c italic_l italic_e end_POSTSUBSCRIPT = 0 27.91 90.93 28.08 90.88
λclip=0subscript𝜆𝑐𝑙𝑖𝑝0\lambda_{clip}=0italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT = 0 28.90 92.23 28.99 92.18
λidentity=0subscript𝜆𝑖𝑑𝑒𝑛𝑡𝑖𝑡𝑦0\lambda_{identity}=0italic_λ start_POSTSUBSCRIPT italic_i italic_d italic_e italic_n italic_t italic_i italic_t italic_y end_POSTSUBSCRIPT = 0 28.74 91.12 28.76 92.01

6   Discussion and Limitations

We observed only minimal benefits from using CLIP guidance, identity, and CLIP encoding losses in our experiments. Despite experimenting with various metrics to define these losses, such as Cosine, L2, and Contrastive losses, we observed similar behavior. This outcome is likely due to the dominating effect of the adversarial loss. Our discriminator, leveraging the powerful feature extraction capabilities of the pre-trained BiomedCLIP’s ViT and MSE loss, can distinguish fake images early in the training process, effectively acting as a regularizer. This regularization may limit the efficacy of CLIP guidance, which can sometimes provide incorrect directions in translation, as shown in earlier studies [29, 15]. Consequently, the impact of CLIP guidance is constrained. Additionally, our model incorporates BiomedCLIP as the multi-modal foundation model to utilize the extensive domain knowledge it gathered during pre-training. Thus, the translation performance of our method is also limited by the contrastive pre-training strategy, where the ViT, focusing on semantically meaningful feature extraction, may suboptimally extract low-level image features. Furthermore, our model may show sensitivity to the captions chosen for the target domain description. Although we tested a range of caption styles, we did not see notable benefits. Therefore, we opted for the simplest combinations employed in BiomedCLIP [36], such as ”This is a photo of XX-weighted MRI” and ”This is pelvic MRI” or ”This is pelvic CT”. We leave exploring this aspect further as a future research direction.

Our model employs a generative adversarial network thus, could be fragile to known problems in GAN training like mode collapse. Additionally, GAN training can be unstable and sensitive to hyper-parameter choices, further complicating the training process and potentially hindering the model’s effectiveness.

Our method can be readily adapted to non-medical domains with an expanding range of translations. The primary constraints would be the inherent capacity of the lightweight generator and the domain knowledge embedded in the utilized pre-trained vision-language model.

7   Conclusion

We proposed an unsupervised multi-modal image translation framework employing a generative adversarial network which is empowered with a pre-trained vision-language model. Our framework improves upon the cycle-consistent translation models while enhancing the multi-domain translation performance with a reduced computational budget.

References

  • [1] Karim Armanious, Chenming Jiang, Marc Fischer, Thomas Küstner, Tobias Hepp, Konstantin Nikolaou, Sergios Gatidis, and Bin Yang. Medgan: Medical image translation using gans. Computerized medical imaging and graphics, 79:101684, 2020.
  • [2] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018.
  • [3] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020.
  • [4] Xianjin Dai, Yang Lei, Yabo Fu, Walter J Curran, Tian Liu, Hui Mao, and Xiaofeng Yang. Multimodal mri synthesis using unified generative adversarial networks. Medical physics, 47(12):6343–6354, 2020.
  • [5] Salman UH Dar, Mahmut Yurt, Levent Karacan, Aykut Erdem, Erkut Erdem, and Tolga Cukur. Image synthesis in multi-contrast mri with conditional generative adversarial networks. IEEE transactions on medical imaging, 38(10):2375–2388, 2019.
  • [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [7] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
  • [8] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
  • [9] Junlin Han, Mehrdad Shoeiby, Lars Petersson, and Mohammad Ali Armin. Dual contrastive learning for unsupervised image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 746–755, 2021.
  • [10] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • [11] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pages 172–189, 2018.
  • [12] Lan Jiang, Ye Mao, Xiangfeng Wang, Xi Chen, and Chao Li. Cola-diff: Conditional latent diffusion model for multi-modal mri synthesis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 398–408. Springer, 2023.
  • [13] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • [14] Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwanghee Lee. U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv preprint arXiv:1907.10830, 2019.
  • [15] Gihyun Kwon and Jong Chul Ye. Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264, 2022.
  • [16] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pages 35–51, 2018.
  • [17] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. Advances in neural information processing systems, 30, 2017.
  • [18] Xuhui Liu, Bohan Zeng, Sicheng Gao, Shanglin Li, Yutang Feng, Hong Li, Boyu Liu, Jianzhuang Liu, and Baochang Zhang. Ladiffgan: Training gans with diffusion supervision in latent spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1115–1125, 2024.
  • [19] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2017.
  • [20] Tufve Nyholm, Stina Svensson, Sebastian Andersson, Joakim Jonsson, Maja Sohlin, Christian Gustafsson, Elisabeth Kjellén, Karin Söderström, Per Albertsson, Lennart Blomqvist, et al. Mr and ct data with multiobserver delineations of organs in the pelvic area—part of the gold atlas project. Medical physics, 45(3):1295–1300, 2018.
  • [21] Muzaffer Özbey, Onat Dalmaz, Salman UH Dar, Hasan A Bedel, Şaban Özturk, Alper Güngör, and Tolga Çukur. Unsupervised medical image translation with adversarial diffusion models. IEEE Transactions on Medical Imaging, 2023.
  • [22] Jihye Park, Sunwoo Kim, Soohyun Kim, Seokju Cho, Jaejun Yoo, Youngjung Uh, and Seungryong Kim. Lanit: Language-driven image-to-image translation for unlabeled data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23401–23411, 2023.
  • [23] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024.
  • [24] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2085–2094, 2021.
  • [25] Pramuditha Perera, Mahdi Abavisani, and Vishal M Patel. In2i: Unsupervised multi-image-to-image translation using generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 140–146. IEEE, 2018.
  • [26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [28] Snehashis Roy, Amod Jog, Aaron Carass, and Jerry L Prince. Atlas based intensity transformation of brain mr images. In Multimodal Brain Image Analysis: Third International Workshop, MBIA 2013, Held in Conjunction with MICCAI 2013, Nagoya, Japan, September 22, 2013, Proceedings 3, pages 51–62. Springer, 2013.
  • [29] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
  • [30] Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. Galip: Generative adversarial clips for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14214–14223, 2023.
  • [31] Dmitrii Torbunov, Yi Huang, Haiwang Yu, Jin Huang, Shinjae Yoo, Meifeng Lin, Brett Viren, and Yihui Ren. Uvcgan: Unet vision transformer cycle-consistent gan for unpaired image-to-image translation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 702–712, 2023.
  • [32] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Splicing vit features for semantic appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10748–10757, 2022.
  • [33] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
  • [34] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision, pages 2849–2857, 2017.
  • [35] Chenlu Zhan, Yu Lin, Gaoang Wang, Hongwei Wang, and Jian Wu. Medm2g: Unifying medical multi-modal generation via cross-guided diffusion with visual invariant. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11502–11512, 2024.
  • [36] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915, 2023.
  • [37] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Advances in Neural Information Processing Systems, 35:3609–3623, 2022.
  • [38] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.