Mixture of Prompt Learning for Vision Language Models

Yu Du, Tong Niu, Rong Zhao
Center for Brain-Inspired Computing Research
Department of Precision Instrument, Tsinghua University
{duyu20,nt20}@mails.tsinghua.edu.cn, r_zhao@tsinghua.edu.cn
† Equal contribution, * Corresponding author
Abstract

As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to apply VLMs to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of trainable parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture of soft prompt learning method incorporating a routing module. This module captures a dataset’s varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text feature and the text feature encoded from the hard prompts. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at https://anonymous.4open.science/r/mocoop-6387


1 Introduction

Recently, as pre-trained vision-language models like CLIP have become increasingly prominent, numerous studies have explored their application in various downstream tasks such as image classification Zhou et al. (2022b), visual question answering (VQA) Eslami et al. (2021), and cross-modal generation Crowson et al. (2022). Prompt learning has emerged as an effective method: by optimizing the prompts fed into the model, it significantly improves performance on new downstream tasks without requiring large-scale fine-tuning of the entire model.

Taking image classification as an example downstream task, the prompt essentially serves as a template that can be positioned before, after, or around the class name. Traditionally, manually designed text templates were used during the training of CLIP to guide the model in associating textual descriptions with visual content. These manually designed prompts are called hard prompts. Prompt learning takes this a step further by replacing these fixed text templates with learnable continuous vectors. By fine-tuning these vectors with a small number of samples, the performance on downstream tasks can be significantly improved. These vector-based prompts are called soft prompts to distinguish them from hard prompts.

We focus on two challenges of soft prompt learning in this work.

Figure 1: For a dataset, the existing hard templates can be divided into different sets based on the different styles and patterns they describe in the images (such as different contents within the different colored blocks). Furthermore, one image can simultaneously possess multiple different styles. Traditionally, only one soft prompt is used to fit all images, but we use multiple soft prompts. Each soft prompt represents a style, and a router selects the best matches. This approach better bridges the gap between visual and text features by taking different styles into consideration.

1) Dataset style variations. As seen in Figure 1, a single soft prompt may not be sufficient to capture the diverse styles present in a dataset. Different instances in the same dataset may be compatible with different prompts. Therefore, it is more natural to use multiple prompts to represent these variations adequately. 2) Overfitting. Improper fine-tuning of the soft prompts may result in performance that even lags behind the zero-shot capabilities of the original VLMs Radford et al. (2021); Zhou et al. (2022b). This is related to over-training on base classes and the catastrophic forgetting of domain-general knowledge Zhu et al. (2023).

To address these challenges, we propose a mixture of soft prompt learning method. This method incorporates a routing module that selects the most suitable prompts for each instance. The selected prompts are then encoded by a text encoder to obtain several sets of class text features. These features are weighted and averaged to produce the final set of class text features, which are then compared with image features to calculate similarities. Conceptually, this process can be deemed as selecting the most compatible style prompts for each instance, thereby enhancing the system’s adaptability and performance.

For the router, we also propose a hard prompt guided gating loss to ensure it selects the soft prompts initialized from the hard prompt templates whose text features are the most similar to the image feature. This mechanism distills the knowledge of hard prompt templates into the router and encourages it to make more accurate and relevant selections.

Additionally, to mitigate the overfitting issue, we introduce semantically grouped text-level supervision. Each soft prompt corresponds to a set of manually designed templates (hard prompts), where the semantics within each set are relatively close. We use the token embeddings of one of the templates from each set as the initialization for each soft prompt. During training, the text features obtained by the text encoder for each soft prompt are constrained to stay close to the text features obtained from their corresponding hard prompts. This ensures that the initial knowledge from the manual text templates is preserved and integrated into the soft prompts.

We validate our method on 11 datasets under three settings: few-shot learning, domain generalization, and base-to-new generalization. Our method achieves improvements over existing baselines. We also design ablation experiments to verify the contribution of each module to the performance improvement.

In summary, our contributions are as follows:

  • We propose a mixture of soft prompt learning method that incorporates a routing module to select the most suitable prompts for each instance.

  • We introduce a hard prompt guided gating loss to ensure the router selects prompts based on their similarity to hard prompt templates, thus improving selection accuracy.

  • We implement semantically grouped text-level supervision to maintain the initial knowledge from manual text templates and mitigate over-fitting.

  • We validate our method on 11 datasets, demonstrating improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines.

2 Related Works

Prompt Learning. In the realm of vision-language models, prompt learning aims to bridge the gap between visual and textual representations more effectively. A pioneering work in this area is the CoOp (Context Optimization) model Zhou et al. (2022b), which optimizes the context of prompts to enhance the performance of models like CLIP Radford et al. (2021) in few-shot learning scenarios.

Researchers have also introduced the concept of a vision prompt Zang et al. (2022); Khattak et al. (2023), which involves appending learnable vectors to the inputs of a vision encoder, similar to text prompts. This approach can significantly enhance performance, although it also increases computational demands. In this paper, we focus exclusively on text-based prompts. In the future, our methodology could potentially be extended to include vision prompts.

Despite their success, most prompt learning methods trade off classification accuracy against robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. A variety of methods have been developed to constrain the update of soft prompts using features from the original manual templates. These methods either directly restrict the gradient update direction or employ knowledge distillation. Among them, ProGrad Zhu et al. (2023) prevents prompt tuning from forgetting general knowledge in VLMs by updating prompts only when their gradients align with the "general direction" represented by the KL loss gradient of a predefined prompt. LASP Bulat and Tzimiropoulos (2022) uses the encoded features of grouped manual templates as supervision to regularize the learning of the prompt. KgCoOp Yao et al. (2023) reduces the difference between the textual embeddings generated by learned prompts and those from hand-crafted prompts. We also incorporate this technique by distilling the knowledge from original text features into each expert soft prompt. Additionally, we apply gating regularization to distill prior knowledge from discrete text into the router.

PLOT Chen et al. first explored learning multiple comprehensive prompts to describe diverse characteristics of categories, using optimal transport to align visual and textual features. This method improves few-shot recognition by applying a two-stage optimization strategy, demonstrating superior performance across various datasets compared to conventional prompt learning approaches. In contrast, we use multiple prompts to capture the diverse styles in a dataset and learn them in a sparse mixture-of-experts manner.

Mixture of Experts. The mixture of experts (MoE) framework Zhou et al. (2022c); Masoudnia and Ebrahimpour (2014), initially introduced decades ago, has brought significant advancements for AI, especially with the advent of sparsely-gated MoE in transformer-based large language models Sukhbaatar et al. (2024); Liu et al. (2024). This framework allows different parts of a model, known as experts, to specialize in various tasks, engaging only relevant experts for a given input to maintain computational efficiency while leveraging specialized knowledge. A major issue of MoE is effectively balancing the load among different expert models, as poor load distribution can result in inefficiencies and unstable model performance Masoudnia and Ebrahimpour (2014).

3 Method

3.1 Overview

As illustrated in Figure 2, during inference, an image is first processed by the CLIP image encoder to obtain an image feature. This feature is passed to the router, which selects the k soft prompts with the highest probabilities. These selected prompts are concatenated with the available classes and fed into the CLIP text encoder, resulting in k sets of class text features. These k sets are then combined in an average weighted by the router’s gating distribution (after the softmax layer) to produce a single set of class text features. The final feature set is compared with the image feature to produce the classification logits. In this way, only k soft prompts are activated at a time, keeping the inference cost comparable to using a single prompt.

Figure 2: Overview of MoCoOp. The orange lines signify the extra flow for training while the black lines are shared by training and inference. During inference, two soft prompts with the highest probabilities are selected and combined with the available classes for text encoding. The resulting text features are averaged and used for classification. During training, the hard prompt guided routing and semantically grouped text level supervision are introduced to supervise the router and soft prompts respectively. In our experiments, we set k to 2.

During training, there are three parts of gradient flow. First, we apply a cross entropy loss to the final classification probabilities with the ground truth label. Second, for the router, we calculate the similarity between the image feature and the text features from each hard prompt template set (using the average feature of all classes and all templates in the set). These similarities serve as a reference distribution. Then, a KL divergence objective function is used to align the router’s gating distribution with this reference distribution. Finally, for the soft prompts, we use another cross entropy loss to ensure that each class’s text feature from each soft prompt closely matches the corresponding class’s feature from the associated hard prompt.

3.2 Preliminary of CoOp

Here we give a brief introduction of CoOp Zhou et al. (2022b), the pioneering work in prompt learning of VLMs.

Notation:

First, here are some notations used in prompt learning of VLMs.

  • $\mathbf{x}$: Input image

  • $\mathbf{p}$: Text prompt

  • $f_{\text{img}}$: CLIP image encoder

  • $f_{\text{txt}}$: CLIP text encoder

  • $\mathbf{h}_x = f_{\text{img}}(\mathbf{x})$: Encoded image feature

  • $\mathbf{h}_p = f_{\text{txt}}(\mathbf{p})$: Encoded text feature

  • $\mathbf{C}$: Context vectors (learnable parameters)

Prompt Representation. The text prompt $\mathbf{p}$ is represented as a sequence of tokens, including learnable context tokens and a class token.

\[ \mathbf{p} = [\mathbf{C}, \text{CLASS}] \]

The context tokens can also be placed after or around the class token.

Context:

  • The context is a set of learnable vectors $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_M]$, where $\mathbf{c}_i \in \mathbb{R}^d$ and $M$ is the number of context tokens.

  • All classes share the same context $\mathbf{C}$ or each class $c$ has its own context $\mathbf{C}_c$.
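The prompt layout described above can be sketched in a few lines of plain Python. This is only an illustration: the sizes `M` and `d`, the random stand-in embeddings, and the helper `build_prompt` are assumptions made for the example and are not taken from the paper or from the CoOp code base.

```python
import random

random.seed(0)

M, d = 4, 8  # assumed toy sizes: M context tokens of width d

def random_vectors(n, dim):
    """Stand-in for learnable context vectors or pre-trained token embeddings."""
    return [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n)]

def build_prompt(context, class_tokens, class_position="end"):
    """Concatenate context vectors with the class-name token embeddings.

    class_position="end":    [C, CLASS]
    class_position="front":  [CLASS, C]
    class_position="middle": class tokens surrounded by the two halves of C
    """
    if class_position == "end":
        return context + class_tokens
    if class_position == "front":
        return class_tokens + context
    half = len(context) // 2
    return context[:half] + class_tokens + context[half:]

class_names = ["cat", "dog", "airplane"]
# One hypothetical token embedding per class name, for simplicity.
class_tokens = {name: random_vectors(1, d) for name in class_names}

# Shared context C: the same M vectors are reused for every class.
shared_context = random_vectors(M, d)
prompts_shared = {n: build_prompt(shared_context, class_tokens[n]) for n in class_names}

# Class-specific context C_c: each class owns its own M vectors.
per_class_context = {n: random_vectors(M, d) for n in class_names}
prompts_specific = {n: build_prompt(per_class_context[n], class_tokens[n]) for n in class_names}
```

With a shared context the number of learnable values is `M * d`, while class-specific contexts need `M * d` values per class.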

Training Objective. Given a dataset with images $\{\mathbf{x}_i\}$ and corresponding labels $\{y_i\}$, the goal is to find the optimal context vectors $\mathbf{C}$ (or $\mathbf{C}_c$ for class-specific context) by minimizing the cross-entropy loss:

\[ \mathcal{L} = -\sum_i \log \frac{\exp(\text{sim}(\mathbf{h}_x^i, \mathbf{h}_p^{y_i})/\tau)}{\sum_c \exp(\text{sim}(\mathbf{h}_x^i, \mathbf{h}_p^c)/\tau)} \]

where

  • $\mathbf{h}_x^i = f_{\text{img}}(\mathbf{x}_i)$ is the image feature for image $i$.

  • $\mathbf{h}_p^c = f_{\text{txt}}([\mathbf{C}, \text{CLASS}_c])$ is the text feature for class $c$.

  • $\text{sim}(\cdot,\cdot)$ denotes a similarity function, such as cosine similarity.

  • $\tau$ is the temperature.

Optimization. The context vectors $\mathbf{C}$ are updated through backpropagation to minimize the loss $\mathcal{L}$, while keeping the pre-trained parameters of $f_{\text{img}}$ and $f_{\text{txt}}$ fixed.
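As a minimal sketch of the objective above, the loss can be evaluated on toy features with the standard library alone. The feature values and the temperature `tau=0.07` are assumptions chosen for illustration; in practice the text features would come from the frozen text encoder applied to the learnable prompts, and gradients would flow only into $\mathbf{C}$.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coop_loss(image_features, text_features, labels, tau=0.07):
    """Sum over images of -log softmax_c(sim(h_x^i, h_p^c) / tau) at c = y_i."""
    total = 0.0
    for h_x, y in zip(image_features, labels):
        logits = [cosine(h_x, h_p) / tau for h_p in text_features]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[y] - log_z)
    return total

# Toy data: 3 classes and 2 images with 2-D features (illustrative values).
text_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
image_features = [[0.9, 0.1], [0.1, 0.9]]

loss_correct = coop_loss(image_features, text_features, labels=[0, 1])
loss_wrong = coop_loss(image_features, text_features, labels=[2, 2])
```

The loss is small when each image feature is closest to the text feature of its own class and grows when the labels point at dissimilar text features.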

In summary, CoOp involves learning optimal context vectors $\mathbf{C}$ for text prompts, which are used to synthesize classification weights for downstream tasks. This process automates prompt engineering and enhances the adaptability and performance of vision-language models like CLIP on various image recognition tasks.

3.3 Mixture of Prompt Learning

The essential idea of this work is to learn prompts in a mixture-of-experts fashion. In LLMs, the router selects the top $K$ experts for each input token. Similarly, we use a router to select the top $K$ contexts. The selected contexts are then concatenated with the class names and encoded by the text encoder to obtain several sets of class features:

\[ \mathbf{h}_{p_i} = f_{\text{txt}}([\mathbf{C}_i, \text{CLASS}]) \tag{1} \]

for $i = 1, 2, \ldots, K$, where $\mathbf{C}_i$ are the context vectors for the $i$-th selected prompt.

The features are then weighted and averaged to produce the final set of class features:

\[ \mathbf{h}_p = \sum_{i=1}^{K} w_{\text{router}}^{i} \mathbf{h}_{p_i} \tag{2} \]

where $w_{\text{router}}^{i}$ is the weight the router assigns to the $i$-th selected prompt feature. A cross-entropy loss is used to optimize these prompts:

\[ \mathcal{L}_{\text{cls}} = -\sum_i \log \frac{\exp(\cos(\mathbf{h}_x^i, \mathbf{h}_p^{y_i})/\tau)}{\sum_{c \in \mathcal{C}} \exp(\cos(\mathbf{h}_x^i, \mathbf{h}_p^c)/\tau)} \tag{3} \]
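The routing and mixing steps of Eqs. (1) and (2) can be sketched as follows. Several pieces are assumptions made only for this illustration: the router is taken to be a single linear layer followed by a softmax, `toy_text_encoder` is a hypothetical stand-in for the frozen CLIP text encoder, and the gate values of the selected prompts are used as weights without renormalization, a detail this section does not specify.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_text_encoder(context, class_embedding):
    """Hypothetical stand-in for f_txt: mean context vector plus class embedding."""
    dim = len(class_embedding)
    mean_ctx = [sum(c[j] for c in context) / len(context) for j in range(dim)]
    return [mean_ctx[j] + class_embedding[j] for j in range(dim)]

def mixture_text_features(image_feature, router_weights, contexts, class_embeddings, k=2):
    """Select the top-k soft prompts for one image and mix their class features."""
    # Assumed router: one linear score per soft prompt, then a softmax.
    scores = [sum(w * x for w, x in zip(row, image_feature)) for row in router_weights]
    gate = softmax(scores)
    top_k = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)[:k]
    dim = len(class_embeddings[0])
    mixed = [[0.0] * dim for _ in class_embeddings]
    for i in top_k:
        for c, cls_emb in enumerate(class_embeddings):
            h_pi = toy_text_encoder(contexts[i], cls_emb)                  # Eq. (1)
            mixed[c] = [m + gate[i] * v for m, v in zip(mixed[c], h_pi)]   # Eq. (2)
    return mixed, gate, top_k

# Toy setup: 3 soft prompts, 2 classes, 2-D features (illustrative values).
image_feature = [2.0, 1.0]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
contexts = [[[0.1, 0.0]], [[0.0, 0.1]], [[5.0, 5.0]]]
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]

mixed, gate, top_k = mixture_text_features(
    image_feature, router_weights, contexts, class_embeddings, k=2)
logits = [cosine(image_feature, h_c) for h_c in mixed]
predicted = max(range(len(logits)), key=lambda c: logits[c])
```

Because the classification logits in Eq. (3) use cosine similarity, rescaling the mixed feature of a class by a positive constant does not change the prediction, so renormalizing the top-k weights would give the same logits in this sketch.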

3.4 Hard Prompt Guided Routing

Given $G$ sets of hard prompts ($I_1, I_2, \ldots, I_G$), each concatenated with every class and encoded through the CLIP text encoder, we obtain $G$ sets of hard text features for all classes. Specifically, for a hard prompt concatenated with a specific $\text{CLASS}_c$, the corresponding hard text features can be similarly obtained using the CLIP text encoder, resulting in:

\[ \mathbf{h}_c = f_{\text{txt}}([\text{hard\_prompt}, \text{CLASS}_c]) \tag{4} \]

where $c$ denotes the specific class.

These hard text features are then averaged to generate $G$ group text features, each representing one of the $G$ groups. Specifically, the group text feature $\mathbf{h}_g$ for the $g$-th group is computed by averaging the hard text features for all classes and all templates within that group as:

\[ \mathbf{h}_g = \frac{1}{|I_g|} \sum_{i \in I_g} \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \mathbf{h}_{i,c} \tag{5} \]

where $\mathcal{C}$ represents the set of all classes, and $\mathbf{h}_{i,c}$ represents the $i$-th hard text feature for class $c$ in the $g$-th group.

The cosine similarity between the image feature $\mathbf{v}$ and each group’s text feature is calculated. The hard prompt guided gating distribution $W_{\text{hard}}$ is then derived by applying the softmax function to these similarity scores, expressed as:

\[ W_{\text{hard}} = \text{Softmax}\left(\begin{array}{c} \cos(\mathbf{h}_1, \mathbf{v}) \\ \cos(\mathbf{h}_2, \mathbf{v}) \\ \vdots \\ \cos(\mathbf{h}_G, \mathbf{v}) \end{array}\right) \tag{10} \]

The router’s output gating distribution is denoted by $W_{\text{router}}$. To ensure coherence between the two distributions, KL divergence is employed as a constraint, with the loss function defined as:

\[ \mathcal{L}_{\text{router}} = D_{\text{KL}}(W_{\text{router}} \parallel W_{\text{hard}}) \tag{11} \]
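A small sketch of Eqs. (5), (10), and (11) on toy data is given below. The hard text features, the image feature, and the router output `w_router` are made-up values for illustration; in the method they would come from the CLIP encoders and the learned router.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mean_vectors(vectors):
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

def group_feature(group):
    """Eq. (5): average over classes within each template, then over templates.
    group[i][c] is the hard text feature of template i for class c."""
    return mean_vectors([mean_vectors(per_template) for per_template in group])

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# hard_features[g][i][c]: group g, template i, class c (illustrative 2-D values).
hard_features = [
    [[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1], [0.7, 0.3]]],  # group 1
    [[[0.0, 1.0], [0.2, 0.8]], [[0.1, 0.9], [0.3, 0.7]]],  # group 2
]
image_feature = [0.9, 0.2]  # the image feature v

group_features = [group_feature(g) for g in hard_features]
w_hard = softmax([cosine(h_g, image_feature) for h_g in group_features])  # Eq. (10)

w_router = [0.3, 0.7]  # assumed router output that disagrees with w_hard
router_loss = kl_divergence(w_router, w_hard)  # Eq. (11)
```

The loss is zero when the router reproduces the reference distribution exactly and positive otherwise, so minimizing it pulls the router toward the prompt groups whose hard templates best match the image.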

3.5 Semantically Grouped Text Level Supervision

To mitigate overfitting, we introduce semantically grouped text-level supervision.

Figure 3: The few-shot learning results on 11 datasets. We plot the results across 1,2,4,8,16 shots. It can be seen that our MoCoOp consistently and significantly surpasses CoOp, LASP, and the Linear Probe approach across most datasets. This is evident in the average accuracy displayed in the top left corner. For LASP Bulat and Tzimiropoulos (2022), we use our reproduced results.

The hard prompts are semantically grouped into $G$ sets $I_1, I_2, \ldots, I_G$ (see Appendix A for details). For each learnable soft prompt $\mathbf{t}_g^s$ and its corresponding hard prompt group $I_g$, the probability of a class $y$ filled in this prompt being classified as its proper class $y$ is given by:

\[
\begin{aligned}
P(y \mid \mathbf{t}_g^s) &= \frac{1}{|I_g|} \sum_{i \in I_g} P_i(y \mid \mathbf{t}_g^s) \\
P_i(y \mid \mathbf{t}_g^s) &= \frac{\exp\left(\cos\left(\mathbf{h}_{i,y}, f_{\text{txt}}([\mathbf{t}_g^s, y])\right)/\tau\right)}{\sum_{c \in \mathcal{C}} \exp\left(\cos\left(\mathbf{h}_{i,c}, f_{\text{txt}}([\mathbf{t}_g^s, c])\right)/\tau\right)}
\end{aligned}
\tag{12}
\]

where $P_i(y \mid \mathbf{t}_g^s)$ is the probability that $\mathbf{t}_g^s$ applied to class $y$ is classified as the $i$-th hard template in $I_g$ applied to class $y$, $\cos(\cdot,\cdot)$ denotes the cosine similarity, $\tau$ is a temperature parameter, and $\mathcal{C}$ is the class set.

Next, we use the cross-entropy loss to minimize the distance between the encoded learnable soft prompts and the manually defined text prompts in the encoded space. The loss function can be expressed as:

\[ \mathcal{L}_{\text{text}} = -\frac{1}{G} \sum_{g=1}^{G} \sum_{c \in \mathcal{C}} \frac{1}{|\mathcal{C}|} \log P(c \mid \mathbf{t}_g^s) \tag{13} \]

The overall training objective is

\[ \mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_1 \mathcal{L}_{\text{router}} + \lambda_2 \mathcal{L}_{\text{text}} \tag{14} \]

where $\lambda_1$ and $\lambda_2$ are weights that balance the importance of each loss term.
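The text-level supervision and the combined objective can be sketched as below. The sketch follows the prose reading of the supervision: the soft-prompt feature for class $y$ is scored against the hard-template features of every class, and the softmax entry for $y$ is taken as the probability. That choice of normalizer is an assumption of this sketch, as are the feature values, `tau=0.07`, and the default values of `lambda1` and `lambda2`.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def class_prob(hard_feats, soft_feat, y, tau):
    """Probability that the soft-prompt feature for class y matches the hard
    feature of class y among the hard features of all classes (assumed reading)."""
    logits = [cosine(h, soft_feat) / tau for h in hard_feats]
    m = max(logits)
    z = sum(math.exp(l - m) for l in logits)
    return math.exp(logits[y] - m) / z

def text_loss(hard_features, soft_features, tau=0.07):
    """hard_features[g][i][c]: group g, template i, class c.
    soft_features[g][c]: encoded soft prompt g for class c."""
    total = 0.0
    for group, soft in zip(hard_features, soft_features):
        n_classes = len(soft)
        for c in range(n_classes):
            # Average the per-template probabilities within the group.
            p = sum(class_prob(h, soft[c], c, tau) for h in group) / len(group)
            total += -math.log(p) / n_classes
    return total / len(hard_features)

def total_loss(l_cls, l_router, l_text, lambda1=1.0, lambda2=1.0):
    """Weighted sum of the three losses; the default weights are placeholders."""
    return l_cls + lambda1 * l_router + lambda2 * l_text

# One group with two hard templates and two classes (illustrative values).
hard_features = [[[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]]
soft_aligned = [[[1.0, 0.0], [0.0, 1.0]]]   # soft features close to the hard ones
soft_swapped = [[[0.0, 1.0], [1.0, 0.0]]]   # class features swapped

loss_aligned = text_loss(hard_features, soft_aligned)
loss_swapped = text_loss(hard_features, soft_swapped)
combined = total_loss(1.0, 0.5, 0.25, lambda1=2.0, lambda2=4.0)
```

Under this reading the loss is near zero when each soft-prompt class feature stays close to the hard-prompt feature of the same class, and large when the features drift toward other classes.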

4 Experiment

| Dataset | CLIP Base | CLIP New | CLIP H | CoOp Base | CoOp New | CoOp H | CoCoOp Base | CoCoOp New | CoCoOp H | LASP Base | LASP New | LASP H | KgCoOp Base | KgCoOp New | KgCoOp H | MoCoOp (Ours) Base | MoCoOp (Ours) New | MoCoOp (Ours) H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average | 70.25 | 74.22 | 71.57 | 82.64 | 68.00 | 74.02 | 80.47 | 71.69 | 75.42 | 83.18 | 76.11 | 79.48 | 73.63 | 76.90 | 83.22 | 83.32 | 77.34 | 80.17 |
| ImageNet | 72.43 | 68.14 | 70.22 | 76.46 | 66.31 | 71.00 | 75.98 | 70.43 | 73.11 | 76.25 | 71.17 | 73.62 | 75.83 | 69.96 | 71.89 | 76.52 | 69.2 | 72.67 |
| Caltech101 | 96.84 | 94.00 | 95.40 | 98.11 | 93.52 | 95.76 | 97.96 | 93.81 | 95.84 | 98.17 | 94.33 | 96.21 | 97.72 | 94.39 | 95.55 | 98.43 | 94.87 | 96.61 |
| OxfordPets | 91.17 | 97.26 | 94.11 | 94.24 | 96.66 | 95.44 | 95.20 | 97.69 | 96.43 | 95.73 | 97.87 | 96.79 | 94.65 | 97.76 | 96.18 | 95.59 | 96.64 | 96.11 |
| StanfordCars | 63.37 | 74.89 | 68.61 | 76.2 | 69.14 | 72.51 | 70.49 | 73.59 | 72.01 | 75.23 | 71.77 | 73.46 | 71.76 | 75.04 | 73.36 | 76.34 | 73.26 | 74.77 |
| Flowers102 | 72.08 | 77.80 | 74.82 | 97.63 | 69.55 | 81.35 | 94.87 | 71.75 | 81.64 | 97.17 | 73.53 | 83.71 | 95.00 | 74.73 | 83.65 | 97.18 | 77.21 | 86.05 |
| Food101 | 90.10 | 91.22 | 90.66 | 89.44 | 87.5 | 88.46 | 90.70 | 91.29 | 90.99 | 91.20 | 91.90 | 91.54 | 90.50 | 91.70 | 91.10 | 90.25 | 91.57 | 90.90 |
| FGVCAircraft | 27.19 | 36.29 | 31.07 | 39.24 | 30.49 | 34.23 | 33.41 | 23.71 | 27.73 | 38.05 | 33.20 | 35.46 | 36.21 | 33.55 | 34.83 | 38.78 | 38.09 | 38.43 |
| SUN397 | 69.36 | 75.35 | 72.22 | 80.85 | 68.34 | 74.06 | 79.74 | 76.86 | 78.27 | 80.70 | 79.30 | 80.00 | 80.29 | 76.53 | 78.36 | 81.43 | 77.45 | 79.39 |
| DTD | 53.24 | 59.90 | 56.36 | 80.17 | 47.54 | 59.67 | 77.01 | 56.00 | 64.85 | 81.10 | 62.57 | 70.64 | 77.55 | 54.99 | 64.35 | 81.94 | 60.99 | 69.93 |
| EuroSAT | 56.48 | 64.05 | 60.03 | 91.54 | 54.44 | 68.15 | 87.49 | 60.04 | 71.11 | 95.00 | 83.37 | 88.86 | 95.64 | 64.34 | 76.93 | 94.79 | 85.18 | 89.73 |
| UCF101 | 70.53 | 77.50 | 73.82 | 85.14 | 64.47 | 73.57 | 82.33 | 73.45 | 77.63 | 85.53 | 78.20 | 81.70 | 82.89 | 76.67 | 79.66 | 85.28 | 79.31 | 82.17 |

Table 1: The comparison with baselines on novel class prediction. H is the harmonic mean of the test accuracy on base and new classes. The best results are marked in bold font.
| Method | Caltech101 Base | Caltech101 New | Caltech101 H | EuroSAT Base | EuroSAT New | EuroSAT H | UCF101 Base | UCF101 New | UCF101 H | Flowers102 Base | Flowers102 New | Flowers102 H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 95.40 | 98.11 | 93.52 | 91.54 | 54.44 | 68.15 | 85.14 | 64.47 | 73.57 | 97.63 | 69.55 | 81.35 |
| + MoE | 98.38 | 92.03 | 95.10 | 94.90 | 58.79 | 72.60 | 85.78 | 69.50 | 76.79 | 97.63 | 70.64 | 81.97 |
| + $L_{\text{router}}$ | 98.39 | 92.47 | 95.34 | 95.17 | 57.05 | 71.34 | 86.97 | 73.88 | 79.89 | 97.34 | 72.77 | 83.28 |
| + $L_{\text{text}}$ | 98.43 | 94.87 | 96.61 | 94.79 | 85.18 | 89.73 | 85.28 | 79.31 | 82.17 | 97.18 | 77.21 | 86.05 |

Table 2: Component analysis. We sequentially add the components MoE, $L_{\text{router}}$, and $L_{\text{text}}$. Our baseline is CoOp Zhou et al. (2022b).

Settings: We conduct experiments under three settings: base-to-new generalization, few-shot learning, and domain generalization. For base-to-new generalization, we train on the base classes and test on both the base and new classes. For few-shot learning, we train and test on all classes. For domain generalization, we train on ImageNet and test on other datasets. The few-shot capability reflects the method’s fitting ability, while base-to-new generalization and domain generalization measure the model’s robustness.

Implementation Details: We build our framework on LASP Bulat and Tzimiropoulos (2022). For each expert, we use different context positions depending on the handcrafted template used to initialize it. We use 4 to 20 experts; the number of experts and the corresponding templates vary across datasets. For example, for FGVC_Aircraft, we use the template "a photo of a {}, a type of aircraft." For the Oxford_Flowers dataset, we use "a photo of a {}, a type of flower." Generally, a dataset-specific template is combined with some general templates such as "a photo of a {}." Since ImageNet covers a wide range of categories, we use 20 groups of templates. The specific templates can be found in Appendix A. Following existing studies, we use ViT-B/16 as the backbone. Specifically, we use the publicly available CLIP-ViT-B/16 model (https://github.com/openai/CLIP). The resolution of CLIP’s feature map is 14 × 14 for CLIP-ViT-B/16. The weights $\lambda_{1}$ and $\lambda_{2}$ are set to 1 and 5, respectively. The $\tau$ in Eq. 3 and Eq. 12 is set to 0.07. For base-to-new generalization, we follow LASP Bulat and Tzimiropoulos (2022) and use virtual classes during training, incorporating new classes as text-level supervision. This approach helps mitigate overfitting to some extent.

Evaluation Metrics: For few-shot experiments, we use top-1 accuracy. For base-to-new generalization, we report base-class accuracy, new-class accuracy, and the harmonic mean of the two.
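For reference, the harmonic mean H of base and new accuracy can be computed as in the short sketch below. The function name is illustrative. The example uses the MoCoOp Caltech101 entries from Table 1; the result may differ from the reported H in the last digit because the table values are already rounded.

```python
def harmonic_mean(base_acc, new_acc):
    """H = 2 * Base * New / (Base + New); returns 0.0 when both inputs are zero."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2.0 * base_acc * new_acc / (base_acc + new_acc)


# MoCoOp on Caltech101 (Table 1): Base = 98.43, New = 94.87, reported H = 96.61.
h_caltech = harmonic_mean(98.43, 94.87)
```

The harmonic mean penalizes a large gap between the two accuracies, so a method cannot obtain a high H by fitting the base classes while failing on the new ones.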

Training: Our training schedule is consistent with LASP Bulat and Tzimiropoulos (2022), and both training and testing are conducted on four NVIDIA GeForce RTX 3090 GPUs.

Baselines: In the few-shot experiment, we compare with Linear Probe, CoOp Zhou et al. (2022b), PLOT Chen et al., and LASP Bulat and Tzimiropoulos (2022). In the base-to-new generalization experiment, we compare with CoOp Zhou et al. (2022b), CoCoOp Zhou et al. (2022a), KgCoOp Yao et al. (2023), and LASP Bulat and Tzimiropoulos (2022). Note that CoOp Zhou et al. (2022b), KgCoOp Yao et al. (2023), LASP Bulat and Tzimiropoulos (2022), and PLOT Chen et al. are textual-only methods, while CoCoOp Zhou et al. (2022a) is instance-conditioned. Textual-only methods typically generalize worse to unseen classes within the same task, even lagging behind the original CLIP on some datasets. Instance-conditioned methods improve generalization by generating a different context for each image from its visual features and then passing that context through the CLIP text encoder to obtain image-specific text features. Therefore, they require significant computational resources. Our method, MoCoOp, also partially relies on visual information but does not generate new contexts. Instead, it combines different text features for different images, thus eliminating the heavy computational cost of the text encoder during inference.
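The following sketch illustrates why combining cached text features is cheap at inference time. It assumes the per-expert class text features have been encoded once and stored, that the router produces one logit per expert for each image, and that the top-K experts are weighted by a softmax over their logits. These details, the function names, and the nested-list layout are assumptions for illustration only and may differ from the actual implementation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid division by zero
    return [x / n for x in v]


def mix_text_features(router_logits, text_feats, k=2):
    """text_feats[g][c] is the cached text feature of class c under expert g."""
    # Pick the k experts with the largest router logits for this image.
    top = sorted(range(len(router_logits)),
                 key=lambda g: router_logits[g], reverse=True)[:k]
    weights = softmax([router_logits[g] for g in top])
    num_classes, dim = len(text_feats[0]), len(text_feats[0][0])
    mixed = []
    for c in range(num_classes):
        feat = [sum(w * text_feats[g][c][d] for w, g in zip(weights, top))
                for d in range(dim)]
        mixed.append(l2_normalize(feat))
    return mixed


def class_probs(image_feat, router_logits, text_feats, k=2, tau=0.07):
    """Cosine-similarity logits scaled by 1/tau, followed by a softmax."""
    img = l2_normalize(image_feat)
    mixed = mix_text_features(router_logits, text_feats, k)
    logits = [sum(a * b for a, b in zip(img, t)) / tau for t in mixed]
    return softmax(logits)
```

In this sketch no text encoder call happens per image: only a weighted sum over cached features is needed, in contrast to instance-conditioned methods that re-encode a new context for every image.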

Datasets: Following previous studies Zhou et al. (2022b, a); Chen et al.; Yao et al. (2023); Bulat and Tzimiropoulos (2022), we evaluate the accuracy of our approach on a total of 11 datasets: ImageNet Deng et al. (2009), Caltech101 Fei-Fei et al. (2004), Oxford-Pets Parkhi et al. (2012), Stanford Cars Krause et al. (2013), Flowers102 Nilsback and Zisserman (2008), Food101 Bossard et al. (2014), FGVC Aircraft Maji et al. (2013), SUN397 Xiao et al. (2010), DTD Cimpoi et al. (2014), EuroSAT Helber et al. (2019), and UCF-101 Soomro et al. (2012).

4.1 Main Results

Here we show the results of the few-shot and base-to-new generalization experiments. The domain generalization results can be found in Appendix B.

4.1.1 Results of Few-shot experiment

In Figure 3, we plot the performance curves of MoCoOp and the baselines across 11 datasets for various numbers of shots, along with the average accuracy over all datasets. Our method achieves the best results in most cases. Its performance on ImageNet is only average, possibly because other methods use all 39 hand-crafted templates, whereas we need to limit the number of groups and therefore select only a subset. Since ImageNet contains images with diverse styles, a subset of templates may not be sufficient.

4.1.2 Results of Base-to-New Generalization

Table 1 compares MoCoOp with several baselines. Our method surpasses the baselines in generalization ability on most datasets, especially compared to LASP Bulat and Tzimiropoulos (2022). We attribute the improvement to the use of multiple prompts and the semantically grouped text supervision.

4.2 Ablations

4.2.1 Component Analysis

Table 2 presents the performance as we progressively add components. Our baseline is CoOp Zhou et al. (2022b). As Table 2 shows, adding MoE alone already yields a clear improvement (e.g., H on EuroSAT rises from 68.15 to 72.60). Adding hard prompt guided routing provides a further slight improvement on most datasets, while incorporating semantically grouped text supervision brings the largest gain (e.g., new-class accuracy on EuroSAT rises from 57.05 to 85.18).

5 Conclusion

In this work, we introduce a novel mixture of prompt learning method for vision-language models, addressing key challenges such as dataset style variations and overfitting. Our approach employs a routing module to dynamically select the most suitable prompts for each instance, enhancing adaptability and performance. We also propose a hard prompt guided gating loss and semantically grouped text-level supervision, which help maintain initial knowledge and mitigate overfitting. Our method demonstrates improvements across multiple datasets in few-shot learning, domain generalization, and base-to-new generalization scenarios. Future work could extend this methodology to vision prompts or instance-conditioned contexts for further enhancements. Another direction is to use ChatGPT to generate and group hard prompt templates.

6 Limitations

While our MoCoOp demonstrates improvements across various tasks, there are two limitations. First, despite the sparse gating of soft prompts, the training cost and memory usage remain high compared to single prompt methods. This can be a constraint in resource-limited environments, especially when dealing with large-scale datasets. Second, templates require manual grouping based on their semantics, potentially introducing human bias that could affect the model’s performance. To enhance efficiency and accuracy, developing automated grouping algorithms may be necessary in the future.

References

  • Bossard et al. (2014) Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101–mining discriminative components with random forests. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part VI 13, pages 446–461. Springer.
  • Bulat and Tzimiropoulos (2022) Adrian Bulat and Georgios Tzimiropoulos. 2022. Lasp: Text-to-text optimization for language-aware soft prompting of vision & language models. arXiv preprint arXiv:2210.01115.
  • Chen et al. Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. PLOT: Prompt learning with optimal transport for vision-language models. In The Eleventh International Conference on Learning Representations.
  • Cimpoi et al. (2014) Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613.
  • Crowson et al. (2022) Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. 2022. Vqgan-clip: Open domain image generation and editing with natural language guidance. In European Conference on Computer Vision, pages 88–105. Springer.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE.
  • Eslami et al. (2021) Sedigheh Eslami, Gerard de Melo, and Christoph Meinel. 2021. Does clip benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906.
  • Fei-Fei et al. (2004) Li Fei-Fei, Rob Fergus, and Pietro Perona. 2004. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE.
  • Helber et al. (2019) Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. 2019. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226.
  • Hendrycks et al. (2021a) Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. 2021a. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349.
  • Hendrycks et al. (2021b) Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. 2021b. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15262–15271.
  • Khattak et al. (2023) Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122.
  • Krause et al. (2013) Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561.
  • Liu et al. (2024) Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
  • Maji et al. (2013) Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
  • Masoudnia and Ebrahimpour (2014) Saeed Masoudnia and Reza Ebrahimpour. 2014. Mixture of experts: a literature survey. Artificial Intelligence Review, 42:275–293.
  • Nilsback and Zisserman (2008) Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE.
  • Parkhi et al. (2012) Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. 2012. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  • Recht et al. (2019) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pages 5389–5400. PMLR.
  • Soomro et al. (2012) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
  • Sukhbaatar et al. (2024) Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, et al. 2024. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm. arXiv preprint arXiv:2403.07816.
  • Wang et al. (2019) Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. 2019. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32.
  • Xiao et al. (2010) Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. 2010. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE.
  • Yao et al. (2023) Hantao Yao, Rui Zhang, and Changsheng Xu. 2023. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6757–6767.
  • Zang et al. (2022) Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. 2022. Unified vision and language prompt learning. arXiv preprint arXiv:2210.07225.
  • Zhou et al. (2022a) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022a. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825.
  • Zhou et al. (2022b) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022b. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348.
  • Zhou et al. (2022c) Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. 2022c. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114.
  • Zhu et al. (2023) Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. 2023. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15659–15669.

Appendix A Groups of Hard Prompt Templates

Here are the groups of hard prompt templates.

[
# Photos of flowers
"a photo of a {}, a type of flower.",
# Photos of people doing activities
"a photo of a person doing {}.",
# Satellite photos
"a centered satellite photo of {}.",
# Photos of aircraft
"a photo of a {}, a type of aircraft.",
# "Itap" (I took a picture) photos
"itap of a {}.",
"itap of the {}.",
# Photos of large objects
"a photo of the large {}.",
"a photo of a large {}.",
# Art and renditions
"art of the {}.",
"a rendering of a {}.",
"a rendering of the {}.",
"a rendition of the {}.",
# Photos of small objects
"a photo of the small {}.",
# General photo prompts
"a photo of a {}.",
"a photo of the {}.",
"a photo of many {}.",
# Low resolution and pixelated photos
"a low resolution photo of the {}.",
"a low resolution photo of a {}.",
"a pixelated photo of the {}.",
"a pixelated photo of a {}.",
"a jpeg corrupted photo of the {}.",
"a blurry photo of a {}.",
"a bad photo of the {}.",
# Cropped photos
"a cropped photo of the {}.",
"a cropped photo of a {}.",
# Bright photos
"a bright photo of the {}.",
# Good quality photos
"a good photo of the {}.",
"a good photo of a {}.",
# Close-up photos
"a close-up photo of the {}.",
# Jpeg corrupted photos
# Blurry photos
# Clean objects
"a photo of the clean {}.",
# Video game screenshots
"a {} in a video game.",
# Hard to see objects
"a photo of the hard to see {}.",
# Bad quality photos
# Origami photos
"a origami {}.",
# Texture photos
"{} texture.",
]
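As a small illustration of how such grouped templates could be turned into hard prompts, the sketch below fills the `{}` placeholder with class names, one group at a time. Only two groups from the list above are copied; the dictionary layout, the group keys, and the function name are hypothetical and not taken from the released code.

```python
# Two groups copied from the list above; the dict layout is illustrative.
TEMPLATE_GROUPS = {
    "general": ["a photo of a {}.", "a photo of the {}.", "a photo of many {}."],
    "cropped": ["a cropped photo of the {}.", "a cropped photo of a {}."],
}


def build_prompts(class_names, groups=TEMPLATE_GROUPS):
    """Return {group: {class_name: [filled prompts]}}."""
    return {
        name: {c: [t.format(c) for t in templates] for c in class_names}
        for name, templates in groups.items()
    }


prompts = build_prompts(["dog", "daisy"])
print(prompts["cropped"]["daisy"][0])  # a cropped photo of the daisy.
```

Each group would then supply the hard-prompt text features for one soft prompt, matching the one-group-per-expert design described in the main text.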

Appendix B Results of Domain Generalization

| Method | ImageNet (source) | ImageNet-R (target) | ImageNet-A (target) | ImageNet-Sketch (target) | ImageNet-V2 (target) |
|---|---|---|---|---|---|
| CLIP | 66.73 | 73.96 | 47.77 | 46.15 | 60.83 |
| CoOp | 71.51 | 75.21 | 49.71 | 47.99 | 64.20 |
| CoCoOp | 71.02 | 76.18 | 50.63 | 48.75 | 64.07 |
| ProGrad | 72.24 | 74.58 | 49.39 | 47.63 | 64.73 |
| KgCoOp | 71.20 | 76.70 | 50.69 | 48.97 | 64.10 |
| LASP | 69.49 | 75.54 | 47.08 | 47.59 | 62.52 |
| MoCoOp (Ours) | 70.08 | 75.88 | 48.97 | 46.50 | 61.31 |

Table 3: Comparison of robustness to domain shift. All methods are trained on 16 shots per class of ImageNet and tested on ImageNet-R, ImageNet-A, ImageNet-Sketch, and ImageNet-V2. For LASP Bulat and Tzimiropoulos (2022), we use our reproduced results.
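| Dataset | K=2 Base | K=2 New | K=2 H | K=3 Base | K=3 New | K=3 H | K=4 Base | K=4 New | K=4 H |
|---|---|---|---|---|---|---|---|---|---|
| Caltech101 | 98.43 | 94.87 | 96.61 | 98.39 | 94.43 | 96.37 | 98.00 | 94.87 | 96.41 |
| EuroSAT | 94.48 | 77.02 | 84.75 | 94.38 | 75.00 | 83.58 | 94.02 | 74.36 | 83.04 |
| UCF101 | 85.28 | 79.31 | 82.17 | 84.23 | 68.63 | 75.63 | 86.40 | 79.23 | 82.66 |
| Flowers102 | 97.18 | 77.21 | 86.05 | 98.10 | 71.42 | 82.66 | 97.34 | 76.67 | 85.78 |

Table 4: Ablations of the number of selected experts.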

We evaluate MoCoOp in the domain generalization setting, which measures robustness to domain shift. We train on 16 shots per class of ImageNet and test on ImageNet-R Hendrycks et al. (2021a), ImageNet-A Hendrycks et al. (2021b), ImageNet-Sketch Wang et al. (2019), and ImageNet-V2 Recht et al. (2019). As seen in Table 3, the results are comparable to those of LASP Bulat and Tzimiropoulos (2022).

Appendix C Additional Ablations

C.0.1 The number of experts selected by the router

We also study how the number of experts selected by the router affects performance. As seen in Table 4, selecting the top-2 prompts performs best in most cases.