Knowledge-Based Domain-Oriented Data Augmentation for Enhancing Unsupervised Sentence Embedding

Peichao Lai
School of Computer Science
Peking University
lpc@pku.edu.cn
Zhengfeng Zhang
College of Computer and Data Science
Fuzhou University
elizzf624@gmail.com
Bin Cui
School of Computer Science
Peking University
bin.cui@pku.edu.cn

Abstract

Recently, unsupervised sentence embedding models have received significant attention in downstream natural language processing tasks. Using large language models (LLMs) for data augmentation has led to considerable improvements in previous studies. Nevertheless, these strategies emphasize data augmentation with extensive generic corpora and neglect few-shot domain data. The synthesized data lacks fine-grained information and may introduce negative sample noise. This study introduces a novel pipeline-based data augmentation method that leverages an LLM to synthesize a domain-specific dataset. It produces both positive and negative samples through entity- and quantity-aware augmentation, utilizing an entity knowledge graph to synthesize samples with fine-grained semantic distinctions, which increases training sample diversity and relevance. We then present a Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model that reduces synthetic data noise, in particular negative sample noise, and improves model discrimination. Experimental results demonstrate that our approach achieves state-of-the-art semantic textual similarity performance with fewer synthetic data samples and fewer LLM parameters, demonstrating its efficiency and robustness across varied backbones.

1 Introduction

Sentence representation learning, a fundamental task in natural language processing (NLP), has benefited various downstream tasks such as semantic inference (Reimers and Gurevych, 2019), retrieval (Thakur et al., 2021; Wang et al., 2022a) and question answering (Sen et al., 2020). With the recent emergence of large language models (LLMs) (OpenAI, 2023; Bai et al., 2023; Touvron et al., 2023), sentence embeddings have become crucial in retrieval-augmented generation (RAG) approaches (Mitra et al., 2023; Ryu et al., 2023), since they enhance the precision of LLM generation and mitigate hallucination issues. To reduce the costs of manual labeling, unsupervised contrastive learning approaches such as SimCSE (Gao et al., 2021) have emerged as the most efficient baseline methods. The model performance in contrastive learning heavily relies on the quality of sentence samples (Chen et al., 2022). Prior works primarily concentrated on enhancing the diversity of samples through rule-based word modification (Wang and Dou, 2023; Wu et al., 2022a) or feature sampling and perturbation (Xu et al., 2023; Chuang et al., 2022a). Nevertheless, these data augmentation techniques increase sample diversity only to a limited extent. Recently, several studies (Zhang et al., 2023; Wang et al., 2024) have suggested utilizing LLMs to generate samples from original sentences, thereby significantly enhancing sample diversity and improving the efficacy of related methodologies.

Although existing works have achieved commendable performance, several issues remain to be addressed. First, existing approaches commonly struggle to distinguish fine-grained semantic information, such as distinctions in entities and quantities. The absence of this discriminative ability leads to a higher proportion of false positive samples than false negative samples during inference, which in turn leads to the selection of inaccurate samples with similar surface semantics in retrieval tasks (Miao et al., 2023). To enhance the model’s capacity for these fine-grained distinctions, it is necessary to implement meticulous data augmentation techniques that specifically target entities and quantities. Second, synthetic samples generated by LLMs may be inconsistent with the target domain label because evaluation criteria differ across domains. Consequently, the model may acquire erroneous or inefficient representations, leading to a decline in its overall performance. Finally, the variation of domains within the corpus also presents a difficulty. Although general-domain corpora are typically abundant and readily available, domain-specific data samples tend to be scarce. In such cases, making the most of few-shot samples to improve the quality and effectiveness of sentence representations becomes a critical research priority.

To overcome the above challenges, we use an LLM to construct a pipeline-based data augmentation method for synthesizing domain-specific datasets, which aims to improve the performance of unsupervised sentence embedding models. Our framework utilizes domain data and partial general data to synthesize samples that balance domain-specific relevance and general-domain applicability. Initially, we extract entities and quantities from the source data samples and then construct a knowledge graph (KG) by organizing entities and their properties. With the entity KG, the LLM can effectively leverage the fine-grained knowledge within each sample to construct more diverse samples. Next, we create a sentence construction prompt from the extracted sample knowledge to instruct the LLM to generate positive samples with more diversity. To generate negative samples, we design an algorithm that searches for neighboring nodes to accurately replace the entity information in the original knowledge. This approach lets the LLM generate negative samples with similar surface semantics. Compared to related LLM-based data synthesis methods, our methodology enhances the model’s ability to learn fine-grained sentence representation information. In addition, our data selection strategy focuses on domain-specific samples, utilizing fewer samples for synthesis while achieving better performance.

To mitigate the impact of noise in the generated data, we propose a Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model. The training process of GCSE is divided into two stages. In the first stage, we train an evaluation model using both the unlabeled domain data and general data. This evaluation model serves as the backbone for GCSE and is used to annotate the synthesized data, thereby initially filtering out false positive and false negative samples. Increasing the similarity threshold filters false positive samples out of the synthetic data. However, to reduce the occurrence of false positive samples in real data, we want to train the model on high-quality negative samples, specifically those with similar surface semantics. This requires a higher filtering threshold when initially filtering negative samples, which may admit potential false negative samples. Thus, in the second stage we duplicate the evaluation model to guide the spatial distance of in-batch hard negative samples for GCSE. To mitigate the impact of hard negative samples on the model during the initial steps, we employ a Gaussian-decayed function computed from the difference between the predictions of GCSE and the evaluation model. This keeps the model from being affected by hard negative samples in the initial steps, so that the other in-batch negative sample losses drive the gradient instead. Experimental results demonstrate the efficiency of our model, and our method achieves state-of-the-art results on semantic textual similarity (STS) tasks.

In summary, we highlight the major contributions as follows:

1. We propose a pipeline-based data augmentation method via LLM to extract entities and quantities from samples and generate diverse positive and negative samples, enhancing fine-grained sentence representation learning models.

2. We propose a Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model that reduces noise in generated data to improve fine-grained semantic discrimination.

3. Experimental results show that our method achieves better performance on STS tasks while utilizing fewer samples for data synthesis and an LLM with fewer parameters.

2 Related Work

Early work on sentence embeddings builds on the distributional hypothesis, predicting surrounding sentences (Kiros et al., 2015; Logeswaran and Lee, 2018; Hill et al., 2016) or extending the word2vec framework (Mikolov et al., 2013) with n-gram embeddings (Pagliardini et al., 2018). Post-processing techniques like BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021) address the anisotropy issue in pre-trained language models (PLMs), and more recent methods focus on generative approaches (Wang et al., 2021; Wu and Zhao, 2022) and regularizing embeddings to prevent representation degeneration (Huang et al., 2021). Recently, contrastive learning approaches have become prominent, using various augmentation methods to derive different views of the same sentence (Zhang et al., 2020; Giorgi et al., 2021; Kim et al., 2021; Gao et al., 2021). Among these, SimCSE uses dropout as a simple augmentation and achieves strong results in unsupervised STS tasks, inspiring further approaches like ArcCSE (Zhang et al., 2022), DiffCSE (Chuang et al., 2022a) and RankCSE (Liu et al., 2023).

With the advent of LLMs, some works attempt to utilize them for sentence representation learning. For example, Ni et al. (2022) use T5 with mean pooling to obtain a sentence embedding model by fine-tuning on a large-scale NLI corpus; Cheng et al. (2023) use prompt learning to measure the semantic similarity of sentence pairs; and Springer et al. (2024) employ sentence repetition to enhance the capacity for sentence representation. Nevertheless, LLMs in unsupervised settings do not substantially outperform encoder-based approaches, while their computational cost is much higher. Thus, the objective of this study is to utilize the generative capability of LLMs to enhance the performance of encoder-based models.

3 Methodology

In this section, we present the LLM-based data synthesis pipeline and the structure of GCSE. We first introduce the two procedures of the pipeline in detail. The synthesized data is sampled from both domain and general data, and the overall structure is shown in Figure 1.

Figure 1: The pipeline of knowledge extraction and data synthesis, where the solid black arrows in the Entity KG are hard edges, and dotted yellow lines are soft edges.

3.1 Knowledge Extraction and Integration

The variety and interrelationship of samples directly affect the performance of models in sentence embedding learning tasks. An important obstacle in synthesizing data via LLM is the limited variation obtainable from a single short text instance. To trade off the diversity of the generated samples against their relevance to the domain semantic space, we first design an extraction prompt to obtain entity and quantity information from the given data. Formally, we denote the extraction prompt as $\mathcal{P}_{e}$ and the LLM as $\mathcal{L}$. Suppose we extract knowledge from $d$ instances; the knowledge set of each instance $x_{i}$ is $\mathcal{K}_{i}=\left\{k_{i1},\dots,k_{in_{i}}\right\}$, and the overall knowledge set $\mathcal{K}$ is computed as:

$$\mathcal{K}=\bigcup_{i=1}^{d}\mathcal{F}([\mathcal{P}_{e};x_{i}],\mathcal{L})=\bigcup_{i=1}^{d}\{\langle t_{ij},c_{ij},q_{ij}\rangle\mid j\in[1,n_{i}]\}, \tag{1}$$

where $t_{ij}$, $c_{ij}$ and $q_{ij}$ represent the entity text, entity type and quantity of $k_{ij}$, $n_{i}$ is the size of $\mathcal{K}_{i}$, and $\mathcal{F}(\cdot)$ is the formatting function that converts the generated text into triplets. Next, we integrate all knowledge by establishing an entity knowledge graph $\mathcal{G}=\langle V,E\rangle$, where the node set $V$ contains all elements of the triplets $\langle t,c,q\rangle$ in $\mathcal{K}$:

$$V=\{t_{ij},c_{ij},q_{ij}\mid i\in[1,d];\ j\in[1,n_{i}]\}. \tag{2}$$

The edges $E$ consist of hard edges $E_{r}$ and soft edges $E_{s}$ , where $E_{r}$ represents the relationship between the entity text, type and quantity of each $k\in\mathcal{K}$ :

$$E_{r}=\{(t_{ij},c_{ij}),(t_{ij},q_{ij})\mid i\in[1,d];\ j\in[1,n_{i}]\}, \tag{3}$$

and $E_{s}$ indicates the relationship between entity text in $k_{ij}$ and other entity text or type in the same instance $x_{i}$ :

$$E_{s}=\bigcup_{i=1}^{d}\{(t_{ij},t_{ik}),(t_{ij},c_{il})\mid k,l\neq j;\ j,k,l\in[1,n_{i}]\}. \tag{4}$$

By defining the hard and soft edges, we may more effectively acquire and substitute entity nodes that are in proximity to the current node, hence enhancing the correlation between the instance generated by the LLM and the source instance.
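The graph construction in Eqs. (2) to (4) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_entity_kg`, the tagged-tuple node encoding (`"t"`, `"c"`, `"q"`), and the example triplets are all hypothetical, and the triplets are assumed to have already been parsed from the LLM output.

```python
from itertools import permutations

def build_entity_kg(knowledge):
    """Build (V, E_r, E_s) from per-instance lists of (text, type, quantity)."""
    nodes, hard_edges, soft_edges = set(), set(), set()
    for triplets in knowledge:                      # one list per instance x_i
        for text, ctype, qty in triplets:
            # Tag each node with its role so a type and a quantity never collide.
            t, c, q = ("t", text), ("c", ctype), ("q", qty)
            nodes.update((t, c, q))                 # Eq. (2)
            hard_edges.update({(t, c), (t, q)})     # Eq. (3)
        for a, b in permutations(triplets, 2):      # ordered pairs with j != k
            soft_edges.add((("t", a[0]), ("t", b[0])))  # text -> other entity text
            soft_edges.add((("t", a[0]), ("c", b[1])))  # text -> other entity type
    return nodes, hard_edges, soft_edges

# Hypothetical extraction results for two source instances.
KNOWLEDGE = [
    [("man", "person", "1"), ("guitar", "instrument", "2")],
    [("woman", "person", "1"), ("piano", "instrument", "1")],
]
V, E_r, E_s = build_entity_kg(KNOWLEDGE)
```

In this toy graph, "man" and "guitar" are joined by a soft edge because they occur in the same instance, whereas "man" and "piano" are not.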

3.2 Data Synthesis via LLM

According to empirical evidence and the performance of sentence embedding models on various standard datasets, models face a greater challenge in accurately distinguishing negative samples compared to positive ones. In the contrastive learning methods, the model acquires sentence embedding representation by calculating the distance between sentence-pairs. It aims to minimize the spatial distance between positive pairs and increase the spatial distance between negative pairs. Thus, it is essential to obtain negative samples that closely resemble the source instance in terms of surface similarity, while positive samples should exhibit a diverse range of representations while still conveying the same meaning as the source instance.

In this study, we employ an LLM to generate positive samples using a rewriting prompt. In addition, we account for the impact of variations in the entities and their quantities within a sample, so negative samples are synthesized by the LLM at both the syntactic level and the fine-grained knowledge level. The data synthesis prompts fall into three main types: (1) rewriting prompt, (2) syntactic antisense prompt, and (3) entity revision prompt. The first type is used to create positive samples, while the second and third types are used to create negative samples at the syntactic and knowledge levels, respectively.

The “rewriting prompt” takes three forms: directly requesting the LLM to generate a new sentence instance with the “rewrite” instruction, generating from the preceding part of the sentence instance, and generating from the knowledge set of the instance. As the diversity of synthetic samples generated by these prompts increases, the likelihood of false positive samples also increases. Thus, the subsequent steps in Section 3.3 involve scoring the generated samples with an evaluation model. The “syntactic antisense prompt” aims to modify the semantics to create a contradiction at the syntactic level, for example by transforming the sentence into a positive or negative statement using explicit positive or negative words, or by expressing a contrary sentiment. This is an initial approach to synthesizing negative samples that preserves strong coherence with the source instance in terms of sequence structure. However, it is deficient in generation diversity. To alleviate this issue, the “entity revision prompt” aims to enhance text diversity by replacing the entity text and quantity of the source instance. At the same time, to ensure the semantic relevance between the synthetic samples and the source instance, replacement entities are selected by searching for neighboring nodes in the entity KG.

We define $\mathcal{T}(\cdot)$ as the search function, and the replacement candidates for entity $t_{ij}$ are computed as:

$$\mathcal{T}_{r}(t_{ij})=\{t_{ip}\mid(t_{ij},c_{ik})\in E_{r}\land(t_{ip},c_{ik})\in E_{r}\}, \tag{5}$$

$$\mathcal{T}_{s}(t_{ij})=\{t_{ip}\mid(t_{ij},t_{ip})\in E_{s}\}, \tag{6}$$

$$\mathcal{T}_{p}(t_{ij})=\{t_{ip}\mid t_{ik}\in\mathcal{T}_{s}(t_{ij})\cap\mathcal{T}_{s}(t_{ip})\land t_{ip}\in\mathcal{T}_{r}(t_{ij})\}, \tag{7}$$

$$\mathcal{T}(t_{ij})=\mathcal{T}_{r}(t_{ij})\cup\mathcal{T}_{p}(t_{ij}), \tag{8}$$

where $\mathcal{T}_{r}(\cdot)$ searches for entities that are connected by hard edges to the same type node as the current entity, and $\mathcal{T}_{s}(\cdot)$ searches for entities that have a soft edge with the current entity. $\mathcal{T}_{p}(\cdot)$ searches for an entity $t_{ip}$ that is of the same type as $t_{ij}$ and that, like $t_{ij}$, has a soft edge with another in-context entity $t_{ik}$. Finally, the replacement entity is randomly selected from the result of the search function $\mathcal{T}(t_{ij})$. Compared to randomly replacing entities, our strategy enhances the semantic relevance between the generated sample and the source instance.
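A literal reading of the search functions in Eqs. (5) to (8) can be sketched as follows. The sketch makes several simplifying assumptions that are not part of the paper: hard edges are stored as a hypothetical `TYPE_OF` mapping from entity text to type, soft edges are restricted to entity-to-entity pairs, the current entity is excluded from its own candidates, and the toy entities are invented for illustration.

```python
import random

def t_r(entity, type_of):
    """Eq. (5): entities attached by a hard edge to the same type node."""
    return {t for t, c in type_of.items() if c == type_of[entity] and t != entity}

def t_s(entity, soft_edges):
    """Eq. (6): entities that share a soft edge with `entity`."""
    return {b for a, b in soft_edges if a == entity}

def t_p(entity, type_of, soft_edges):
    """Eq. (7): same-type entities sharing a soft-edge neighbour with `entity`."""
    context = t_s(entity, soft_edges)
    return {t for t in t_r(entity, type_of) if context & t_s(t, soft_edges)}

def candidates(entity, type_of, soft_edges):
    """Eq. (8): union of T_r and T_p (by construction, T_p is a subset of T_r)."""
    return t_r(entity, type_of) | t_p(entity, type_of, soft_edges)

# Hypothetical toy graph: hard edges as text -> type, soft edges as text pairs.
TYPE_OF = {"guitar": "instrument", "piano": "instrument", "violin": "instrument",
           "man": "person", "woman": "person"}
SOFT = {("man", "guitar"), ("guitar", "man"), ("man", "piano"),
        ("piano", "man"), ("woman", "violin"), ("violin", "woman")}

pool = candidates("guitar", TYPE_OF, SOFT)
replacement = random.Random(0).choice(sorted(pool))  # random pick, as in the text
```

Here "piano" falls in $\mathcal{T}_{p}$ for "guitar" because both co-occur with "man", while "violin" is reachable only through the shared type node.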

3.3 General Contrastive Learning

The training process of GCSE is divided into two stages. In the first stage, we combine all general and domain data to train an evaluation model with a standard unsupervised contrastive learning method. This enhances the uniformity of the sentence embedding representation in the general scenario and mitigates the impact that the limited semantic spatial distribution of synthesized data has on model robustness. Next, the evaluation model is frozen and used to filter synthetic data and to help eliminate false hard negative sample noise in GCSE. We follow the formulation of SimCSE (Gao et al., 2021) to train the evaluation model. Formally, we define the encoder of the evaluation model as $E^{\prime}$, each unlabeled sentence instance as $x_{i}$, and its positive sample as $x_{i}^{+}=x_{i}$. The representation of an instance is denoted as $\mathbf{h}^{\prime}=\mathcal{F}_{E^{\prime}}(x)$, and the representations of $x_{i}$ and $x_{i}^{+}$ are denoted as $\mathbf{h}^{\prime}_{i}$ and $\mathbf{h}^{\prime+}_{i}$, respectively. Since the dropout mask in $E^{\prime}$ is random, $\mathbf{h}^{\prime}_{i}$ and $\mathbf{h}^{\prime+}_{i}$ are computed from the same input but differ slightly. The loss of the evaluation model is then defined as:

$$-\log\frac{e^{\text{sim}(\mathbf{h}^{\prime}_{i},\mathbf{h}^{\prime+}_{i})/\tau}}{\sum_{j=1}^{N}e^{\text{sim}(\mathbf{h}^{\prime}_{i},\mathbf{h}^{\prime+}_{j})/\tau}}, \tag{9}$$

where $N$ represents the size of each mini-batch, $\tau$ is a temperature hyperparameter, and $\text{sim}(\cdot)$ is the cosine similarity function.
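To make Eq. (9) concrete, the following sketch evaluates the loss on a toy batch in plain Python. It assumes the two dropout views are already available as lists of floats; the function names and the example vectors are hypothetical, and a real implementation would use a tensor library and an actual encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def simcse_loss(h, h_pos, tau=0.05):
    """Mean of Eq. (9) over a mini-batch of N anchors and their positives."""
    losses = []
    for i, h_i in enumerate(h):
        logits = [cosine(h_i, h_j) / tau for h_j in h_pos]
        peak = max(logits)  # subtract the max before exp for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two anchors with slightly perturbed positive views.
H = [[1.0, 0.0], [0.0, 1.0]]
H_POS = [[0.9, 0.1], [0.1, 0.9]]
aligned = simcse_loss(H, H_POS)
swapped = simcse_loss(H, H_POS[::-1])  # positives paired with the wrong anchors
```

The loss is close to zero when each anchor is most similar to its own positive and grows sharply when the pairing is wrong, which is the behaviour the in-batch softmax is meant to enforce.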

3.4 Gaussian-decayed Training on Synthesized Data

In the second stage, we adopt a copy of the evaluation model as the backbone of GCSE and continue training on synthesized data. In this stage, each input is a triplet $(x_{i},x_{i}^{+},x_{i}^{-})$, where $x_{i}^{+}$ and $x_{i}^{-}$ stand for the positive and negative samples of $x_{i}$, respectively. However, the synthesized data contains many potential false positive and false negative samples, which necessitates a filtering process. We use the frozen evaluation encoder to initially correct these inaccurate samples and build the final triplet dataset. Let $\mathcal{S}(x_{i})=\left\{\hat{x}_{i1},\dots,\hat{x}_{im}\right\}$ denote the synthetic data set of $x_{i}$, where $m$ is the size of the set. Then $x_{i}^{+}$ and $x_{i}^{-}$ are calculated as:

$$x_{i}^{+}=\begin{cases}\hat{x}_{ij},&\text{sim}(\mathbf{h}^{\prime}_{i},\hat{\mathbf{h}}^{\prime}_{ij})\geq\alpha,\ j\in[1,m]\\ x_{i},&\text{else}\end{cases}, \tag{10}$$

$$x_{i}^{-}=\begin{cases}\hat{x}_{ij},&\text{sim}(\mathbf{h}^{\prime}_{i},\hat{\mathbf{h}}^{\prime}_{ij})\leq\beta,\ j\in[1,m]\\ x_{k},&k\in[1,N],\ k\neq i\end{cases}, \tag{11}$$

where $\alpha$ and $\beta$ are the thresholds for positive and negative samples, respectively, and $x_{k}$ denotes a randomly selected instance from the in-batch data. To minimize the occurrence of false positive samples, we can assign a high value to $\alpha$. However, filtering false samples out of the synthetic negative samples is a greater challenge. Theoretically, a smaller $\beta$ reduces the number of false negative samples. Nevertheless, samples with low similarity to the source instance differ significantly from it in surface semantics, which makes them easy to distinguish. Consequently, training on these samples cannot effectively improve the model's ability to distinguish fine-grained false positive samples. Thus, we choose a higher value for $\beta$. During training, we use a Gaussian-decayed function to align the distances of hard negative samples between the GCSE encoder $E$ and the frozen encoder $E^{\prime}$. As shown in Figure 2, given each mini-batch of triplet inputs, both $E$ and $E^{\prime}$ compute similarity scores between the negative samples and their corresponding source instances, and the loss of each instance in GCSE is defined as:

$$-\log\frac{e^{\text{sim}(\mathbf{h}_{i},\mathbf{h}_{i}^{+})/\tau}}{\sum_{j=1}^{N}e^{\text{sim}(\mathbf{h}_{i},\mathbf{h}_{j}^{+})/\tau}+\sum_{\substack{j=1\\ j\neq i}}^{N}e^{\text{sim}(\mathbf{h}_{i},\mathbf{h}_{j}^{-})/\tau}+G(s_{i},s^{\prime}_{i},\tau,\sigma)}, \tag{12}$$

$$G(s_{i},s^{\prime}_{i},\tau,\sigma)=s_{i}\left(1-e^{-\frac{(s_{i}-s^{\prime}_{i})^{2}\tau^{2}}{2\,\sigma^{2}}}\right), \tag{13}$$

where $s_{i}=\text{sim}(\mathbf{h}_{i},\mathbf{h}^{-}_{i})$ and $s^{\prime}_{i}=\text{sim}(\mathbf{h}^{\prime}_{i},\mathbf{h}^{\prime-}_{i})$. $G(\cdot)$ is the Gaussian-decayed function, in which the loss attenuation of the hard negative sample grows as the distance between $s_{i}$ and $s^{\prime}_{i}$ decreases, and $\sigma$ is a hyperparameter that controls the width of $G(\cdot)$. This implies that when $E$ initially processes a hard negative sample, it follows the spatial distribution of $E^{\prime}$ as the “established guideline” and relies on the other in-batch negative samples to increase the spatial distance of negative samples, which better eliminates the influence of false negative samples. As training proceeds, the deviation between the spatial distributions of true hard negatives under $E$ and $E^{\prime}$ progressively increases, and their gradient is restored.
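The behaviour of Eqs. (12) and (13) can be checked numerically with a small sketch. It transcribes the two formulas as written, takes precomputed cosine similarities as inputs, and uses default values $\tau=0.05$ and $\sigma=0.01$; the function names and the toy similarity values are hypothetical, and the sketch assumes the similarities are such that the denominator stays positive.

```python
import math

def gaussian_decay(s, s_ref, tau=0.05, sigma=0.01):
    """Eq. (13): close to 0 when s matches s_ref, approaching s as they diverge."""
    return s * (1.0 - math.exp(-((s - s_ref) ** 2) * tau ** 2 / (2.0 * sigma ** 2)))

def gcse_loss(i, sim_pos, sim_neg, s_ref, tau=0.05, sigma=0.01):
    """Eq. (12) for anchor i.

    sim_pos[j] = sim(h_i, h_j^+) and sim_neg[j] = sim(h_i, h_j^-) over the batch;
    s_ref = sim(h'_i, h'_i^-) computed by the frozen evaluation encoder.
    """
    denominator = sum(math.exp(s / tau) for s in sim_pos)
    denominator += sum(math.exp(s / tau) for j, s in enumerate(sim_neg) if j != i)
    denominator += gaussian_decay(sim_neg[i], s_ref, tau, sigma)
    return math.log(denominator) - sim_pos[i] / tau

# Toy similarities for a batch of two triplets, viewed from anchor 0.
SIM_POS = [0.9, 0.2]
SIM_NEG = [0.7, 0.1]
agree = gcse_loss(0, SIM_POS, SIM_NEG, s_ref=0.7)     # E and E' agree on the hard negative
disagree = gcse_loss(0, SIM_POS, SIM_NEG, s_ref=0.3)  # E has drifted away from E'
```

When the two encoders assign the same similarity to the hard negative, its term vanishes and only the other in-batch samples contribute; once the encoders disagree, the term reappears in the denominator and the loss increases.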

4 Experiment

4.1 Experiment Setup

Training: We utilize the subset of the NLI dataset from Gao et al. (2021) as the general data, and use the training sets of STS-Benchmark (STS-B) (Cer et al., 2017) with 5.7k samples and SICK (Marelli et al., 2014) with 4.5k samples as the domain data for a fair comparison with related approaches. To simulate the unsupervised scenario, we exclusively include unlabeled samples from the datasets. In this experiment, the ratio of sample numbers between domain data and general data is 1:3. We adopt ChatGLM-3 (6B) (GLM et al., 2024) as the LLM for data synthesis, and we choose BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as the backbone models of GCSE. In the stage of Gaussian-decayed training on synthesized data, the filtering thresholds $\alpha$ and $\beta$ are set to 0.9 and 0.75, respectively. The temperature $\tau$ is set to 0.05, and the $\sigma$ of $G(\cdot)$ is set to 0.01. In the first training stage, the evaluation model is first trained on the unlabeled dataset of all general data and domain data. One copy of the evaluation model is then used as the pre-trained model for GCSE, while the original instance is frozen to filter synthesized data and provide guidance for GCSE. In the second stage, GCSE is trained on the filtered synthesized data, and the sentence embedding is obtained from the last-layer hidden state of the first token.
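The threshold-based selection of Eqs. (10) and (11), with $\alpha=0.9$ and $\beta=0.75$, can be sketched as below. Several details are assumptions made only for illustration: candidates are split by the prompt type that produced them, the highest-scoring candidate that passes each threshold is kept (the equations do not specify which $j$ is chosen), and the sentences and similarity scores are invented rather than produced by an actual evaluation encoder.

```python
import random

def select_triplet(i, batch, pos_cands, neg_cands, alpha=0.9, beta=0.75, rng=random):
    """Pick (x_i, x_i^+, x_i^-) from synthetic candidates scored by the frozen encoder.

    pos_cands / neg_cands: lists of (sentence, similarity to x_i).
    """
    kept_pos = [c for c in pos_cands if c[1] >= alpha]   # Eq. (10), first case
    kept_neg = [c for c in neg_cands if c[1] <= beta]    # Eq. (11), first case
    # Fall back to the instance itself when no positive candidate survives.
    positive = max(kept_pos, key=lambda c: c[1])[0] if kept_pos else batch[i]
    if kept_neg:
        # Assumption: keep the hardest surviving negative (highest similarity).
        negative = max(kept_neg, key=lambda c: c[1])[0]
    else:
        # Fall back to a random in-batch instance x_k with k != i.
        negative = rng.choice([x for k, x in enumerate(batch) if k != i])
    return batch[i], positive, negative

BATCH = ["A man plays two guitars.", "A dog runs in the park."]
POS = [("A man is playing a pair of guitars.", 0.93), ("Someone makes music.", 0.61)]
NEG = [("A man plays two pianos.", 0.72), ("A man plays two guitars loudly.", 0.95)]

triplet = select_triplet(0, BATCH, POS, NEG)
fallback = select_triplet(0, BATCH, [], [])
```

In the example, the loosely related rewrite is rejected as a positive, and the near-paraphrase with similarity 0.95 is rejected as a negative because it is likely a false negative.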

Evaluation: We follow the standard evaluation methods to evaluate our model on the semantic textual similarity (STS) tasks. We use Spearman’s correlation to measure the model performance, and we adopt SentEval (Conneau and Kiela, 2018; https://github.com/facebookresearch/SentEval) as the evaluation tool, which contains seven STS subsets: STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS-Benchmark (Cer et al., 2017) and the SICK Relatedness (Marelli et al., 2014). To compare the ranking performance of our method on retrieval tasks, we further evaluate the model on four reranking datasets: AskUbuntuDupQuestions (Lei et al., 2016), MindSmallReranking (Wu et al., 2020), SciDocsRR (Cohan et al., 2020) and StackOverflowDupQuestions (Liu et al., 2018), and follow the same settings as Zhang et al. (2023) by using Mean Average Precision (MAP) as the metric. In addition, we compare our model with other methods on the transfer tasks in SentEval to evaluate the applicability of our method, following the settings of Gao et al. (2021).
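For reference, Spearman's correlation between predicted similarities and gold scores can be computed as in the sketch below. It is a plain-Python illustration of the metric (average ranks for ties, then Pearson correlation on the ranks), not the SentEval code path; the function names and the example scores are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical cosine similarities from a model and gold annotations (0 to 5).
predicted = [0.91, 0.35, 0.60, 0.12]
gold = [4.8, 2.0, 3.1, 0.5]
rho = spearman(predicted, gold)
```

Because only the ordering matters, a model is rewarded for ranking sentence pairs consistently with the annotators even if its raw similarity values are on a different scale.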

Model	Method	STS-12	STS-13	STS-14	STS-15	STS-16	STS-B	SICK-R	Avg.
BERT-base	whitening†	57.83	66.90	60.90	75.08	71.31	68.24	63.73	66.28
	SimCSE†	68.40	82.41	74.38	80.91	78.56	76.85	72.23	76.25
	DiffCSE†	72.28	84.43	76.47	83.90	80.54	80.59	71.23	78.49
	PromptBERT $\clubsuit$	71.56	84.58	76.98	84.47	80.60	81.60	69.87	78.54
	PCL $\spadesuit$	72.84	83.81	76.52	83.06	79.32	80.01	73.38	78.42
	DebCSE†	\underline{76.15}	84.67	78.91	85.41	80.55	82.99	73.60	80.33
	RankCSE $\spadesuit$	75.66	86.27	77.81	84.74	81.10	81.80	75.13	80.36
	SynCSE (ChatGPT)*	75.86	82.19	78.71	85.63	\underline{81.11}	82.35	\underline{78.79}	80.66
	MultiCSR (ChatGPT) $\clubsuit$	74.86	84.19	\underline{79.46}	84.70	80.34	83.59	79.37	\underline{80.93}
	GCSE	76.91	\underline{86.23}	80.49	\underline{85.16}	81.45	\underline{82.54}	75.71	81.21
BERT-large	SimCSE†	70.88	84.16	76.43	84.50	79.76	79.26	73.88	78.41
	PCL $\spadesuit$	74.87	86.11	78.29	85.65	80.52	81.62	73.94	80.14
	DebCSE†	\underline{76.82}	86.36	\underline{79.81}	85.80	80.83	83.45	74.67	81.11
	RankCSE $\spadesuit$	75.48	\underline{86.50}	78.60	85.45	\underline{81.09}	81.58	75.53	80.60
	SynCSE (ChatGPT)*	74.24	85.31	79.41	\underline{85.71}	81.76	82.61	79.25	\underline{81.18}
	GCSE	76.99	87.34	80.88	85.47	80.55	\underline{82.97}	\underline{75.68}	81.41
RoBERTa-base	whitening†	46.99	63.24	57.23	71.36	68.99	61.36	62.91	61.73
	SimCSE†	70.16	81.77	73.24	81.36	80.65	80.22	68.56	76.57
	DiffCSE†	70.05	83.43	75.49	82.81	82.12	82.38	71.19	78.21
	PromptRoBERTa $\clubsuit$	73.94	84.74	77.28	84.99	81.74	81.88	69.50	79.15
	PCL $\spadesuit$	71.13	82.38	75.40	83.07	81.98	81.63	69.72	77.90
	DebCSE†	74.29	\underline{85.54}	79.46	85.68	81.20	83.96	74.04	80.60
	RankCSE $\spadesuit$	73.20	85.95	77.17	84.82	\underline{82.58}	83.08	71.88	79.81
	SynCSE (ChatGPT) $\diamondsuit$	74.61	83.76	77.89	85.09	82.28	82.71	\underline{78.88}	80.75
	MultiCSR (ChatGPT) $\clubsuit$	\underline{75.61}	84.33	\underline{80.10}	84.98	82.13	84.54	79.67	81.62
	GCSE	76.06	85.30	80.38	\underline{85.28}	83.26	\underline{84.07}	74.55	\underline{81.27}
RoBERTa-large	SimCSE†	72.86	83.99	75.62	84.77	81.80	81.98	71.26	78.90
	PCL $\spadesuit$	74.08	84.36	76.42	85.49	81.76	82.79	71.51	79.49
	DebCSE†	\underline{77.68}	\underline{87.17}	\underline{80.53}	85.90	\underline{83.57}	85.36	73.89	82.01
	RankCSE $\spadesuit$	73.20	85.83	78.00	85.63	82.67	84.19	73.64	80.45
	SynCSE (ChatGPT) $\diamondsuit$	75.45	85.01	80.28	\underline{86.55}	83.95	84.49	80.61	\underline{82.33}
	GCSE	78.24	87.24	81.93	86.80	83.52	\underline{85.08}	\underline{76.70}	82.79

Table 1: Comparison of Spearman’s correlation results on STS tasks, where the value highlighted in bold is the best value, and the value underlined is the second-best value. “†”: results from Miao et al. (2023), “$\clubsuit$”: results from Wang et al. (2024), “$\spadesuit$”: results from Liu et al. (2023), “$\diamondsuit$”: results from Zhang et al. (2023). “*”: we reproduce the results with the officially released corpus from Zhang et al. (2023).

Baselines: We compare our method with mainstream unsupervised sentence embedding baselines: BERT-whitening (Su et al., 2021), SimCSE (Gao et al., 2021), DiffCSE (Chuang et al., 2022b), PromptBERT (Jiang et al., 2022), PCL (Wu et al., 2022b), CARDS (Wang et al., 2022b), DebCSE (Miao et al., 2023) and RankCSE (Liu et al., 2023). In addition, we compare against two further baselines, SynCSE (Zhang et al., 2023) and MultiCSR (Wang et al., 2024), which use LLMs for data synthesis on the whole NLI dataset. To verify the effectiveness of our data synthesis method, we choose their results obtained with ChatGPT for comparison.

4.2 Main Results

STS Tasks: The overall results of STS tasks are shown in Table 1. Our approach achieves state-of-the-art results with the BERT-base, BERT-large, and RoBERTa-large backbones when compared to other unsupervised baselines. This demonstrates the versatility of our approach, since it can be effectively utilized across multiple models. Compared to the standard unsupervised SimCSE, the Spearman’s correlation of our approach improves by an average of 4.83% on the base models and 3.44% on the large models. Over the strong baseline RankCSE, our approach achieves a 1.36% improvement in average performance, demonstrating the effectiveness of the LLM data synthesis process. Furthermore, we compare against two baseline models, SynCSE and MultiCSR, both of which utilize an LLM as the data synthesis model. We specifically analyze the results of using ChatGPT for both models. Our approach outperforms both models in most cases. With RoBERTa-base, our method is slightly behind MultiCSR by 0.35% and still achieves the second-best result. It should be noted that the ChatGLM-3 (6B) we use is much more lightweight than ChatGPT (about 175B) in parameters. Additionally, our method utilizes only 14% of the sample size of the other two methods, which employ the entire NLI dataset. This demonstrates the effectiveness of our data synthesis strategy and domain-oriented sample selection strategy.

Model	Method	AskU.	Mindsmall	SciDocsRR	StackO.	Avg.
BERT-base	SimCSE	51.89	28.68	67.88	\underline{39.60}	47.01
	PCL	52.46	28.72	68.03	41.30	\underline{47.63}
	SynCSE (ChatGPT)*	\underline{52.61}	29.17	\underline{68.46}	38.60	47.21
	GCSE	52.62	\underline{28.79}	70.67	39.53	47.90
BERT-large	SimCSE	53.10	\underline{29.59}	\underline{71.94}	\underline{40.68}	\underline{48.83}
	PCL	52.03	29.11	70.30	42.33	48.44
	SynCSE (ChatGPT)*	\underline{53.24}	30.09	71.45	39.24	48.50
	GCSE	53.40	29.43	73.04	39.68	48.89
RoBERTa-base	SimCSE $\diamondsuit$	52.78	\underline{29.91}	65.96	39.25	46.95
	CARDS $\diamondsuit$	52.94	27.92	64.62	41.51	46.75
	PCL $\diamondsuit$	51.85	27.92	64.70	\underline{41.18}	46.41
	SynCSE (ChatGPT) $\diamondsuit$	\underline{53.27}	30.29	\underline{67.55}	39.39	\underline{47.63}
	GCSE	53.44	29.35	67.89	41.13	47.95
RoBERTa-large	SimCSE $\diamondsuit$	\underline{55.10}	29.23	68.54	\underline{42.56}	48.86
	CARDS $\diamondsuit$	53.83	29.07	68.26	43.24	48.60
	PCL $\diamondsuit$	53.43	28.56	66.06	41.54	47.40
	SynCSE (ChatGPT) $\diamondsuit$	55.48	\underline{30.27}	\underline{70.85}	40.00	\underline{49.15}
	GCSE	54.05	30.30	71.23	41.65	49.31

Table 2: Comparison of Mean Average Precision (MAP) results on reranking tasks, where the value highlighted in bold is the best value, and the value underlined is the second-best value. “$\diamondsuit$”: results from Zhang et al. (2023). “*”: we reproduce the results with the officially released corpus from Zhang et al. (2023).

Reranking Tasks: We evaluate the model’s reranking performance with the MTEB benchmark (Muennighoff et al., 2023) to verify the efficacy of the model when applied to retrieval tasks. Table 2 presents the MAP results of our approach and related baselines on the reranking benchmark. All models are evaluated on the test sets of the reranking benchmark without using the training sets. The results indicate that the approaches perform differently across datasets, which can be attributed to the distinct semantic distribution and evaluation scale of each dataset. Our GCSE outperforms SynCSE by 0.39% in average MAP score and achieves the best average result on every backbone model, demonstrating the efficacy of our approach in enhancing the precision of unsupervised ranking tasks.
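As a reminder of what the MAP metric measures, the sketch below computes it from scored candidate lists using the standard definition. It is an illustration only: the function names and toy scores are hypothetical, and details of the MTEB implementation (for example tie handling) may differ.

```python
def average_precision(ranked_relevance):
    """Average precision for one query, given relevance flags in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries):
    """queries: one list per query of (model_score, is_relevant) candidates."""
    scores = []
    for cands in queries:
        ranked = sorted(cands, key=lambda c: c[0], reverse=True)
        scores.append(average_precision([relevant for _, relevant in ranked]))
    return sum(scores) / len(scores)

# Hypothetical similarity scores between each query and its candidates.
QUERIES = [
    [(0.9, True), (0.8, False), (0.7, True)],   # AP = (1/1 + 2/3) / 2
    [(0.2, True), (0.6, False)],                # AP = 1/2
]
map_score = mean_average_precision(QUERIES)
```

A sentence embedding model raises MAP by assigning higher similarity to relevant candidates than to irrelevant ones, so the metric directly reflects how well fine-grained distinctions are captured in ranking.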

4.3 Analysis

Method	STS-12	STS-13	STS-14	STS-15	STS-16	STS-B	SICK-R	Avg.
GCSE	76.91	86.23	80.49	85.16	81.45	82.54	75.71	81.21
w/o stage-2	71.85	83.65	76.84	83.37	78.74	79.10	71.69	77.89
w randomly	71.94	84.03	76.99	83.65	79.11	78.66	69.28	77.67
w/o filtering	74.65	83.54	77.39	83.27	79.97	79.66	74.27	78.96
w/o decay	76.15	85.83	79.77	85.19	80.72	82.59	75.55	80.83
w/o general	75.44	85.55	79.19	84.91	80.23	81.57	74.14	80.15
w/o domain	75.59	85.66	78.93	84.09	80.87	82.29	76.00	80.49

Table 3: Ablation studies of STS tasks on BERT-base. Other PLMs yield similar patterns to BERT-base.

Ablation Studies: We analyze the impact of each module or strategy in our GCSE and report the results in Table 3. First, “w/o stage-2” refers to the results obtained without training in the second stage. This variant corresponds to the evaluation model, whose performance is similar to that of the conventional unsupervised SimCSE, and it performs significantly worse than the default model. Then, “w randomly” refers to directly using each instance itself as its positive sample in the combined dataset of domain and general data, while randomly selecting a negative instance from the dataset. On average, its performance is even worse than that of the evaluation model, which demonstrates that the diversity of positive samples and the quality of negative samples significantly impact the performance of the model. “w/o filtering” indicates the results of training directly on the data synthesized by the LLM, skipping the filtering by the evaluation model. The results show that the performance of the model is significantly affected when false positive and false negative samples are introduced without filtering. We investigate the impact of the Gaussian-decayed function by removing it, and the results are shown in “w/o decay”. The default model performs better overall than the variant without the Gaussian-decayed function, indicating that the function can filter out potential false negative sample noise. Finally, we analyze the necessity of including general data and domain data in “w/o general” and “w/o domain”, respectively. Removing either of them results in a decline in average performance, which indicates that both the domain data and the general data are essential to our method.
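
To make the size of each effect explicit, the short sketch below computes the drop in the average Spearman’s correlation of every ablation relative to the full GCSE model. The numbers are copied from the “Avg.” column of Table 3; the dictionary and function names are illustrative.

```python
# "Avg." column of Table 3 (BERT-base).
TABLE3_AVG = {
    "GCSE": 81.21,
    "w/o stage-2": 77.89,
    "w randomly": 77.67,
    "w/o filtering": 78.96,
    "w/o decay": 80.83,
    "w/o general": 80.15,
    "w/o domain": 80.49,
}


def average_drops(averages, reference="GCSE"):
    """Drop of each variant's average score relative to the reference model."""
    return {
        name: round(averages[reference] - score, 2)
        for name, score in averages.items()
        if name != reference
    }


drops = average_drops(TABLE3_AVG)
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:14s} -{drop:.2f}")
```

The ordering of the drops mirrors the discussion above: the variants without the second stage or with random negatives lose the most, while removing the Gaussian-decayed function costs the least on average.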

Method	Spearman’s
unsup-SimCSE	75.59
RankCSE	79.74
SynCSE (ChatGPT)	91.58
GCSE	93.77

Table 4: Comparison of Spearman’s correlation results on the synthetic data of the STS-Benchmark development set.

Analysis of entity and quantity awareness: We analyze GCSE’s awareness of entities and quantities by constructing a dataset with the data synthesis method in Section 3.2 on the STS-Benchmark development set. The similarity scores of each triplet in the dataset are then annotated by two supervised pre-trained models: “sup-simcse-bert-large” and “sup-simcse-roberta-large”. The final label is the average of the similarity scores calculated by the two models. We evaluate Spearman’s correlation scores of GCSE and three strong baselines on the BERT-base backbone, and the results are shown in Table 4. Our GCSE achieves the best result and outperforms RankCSE by 14.03%. Both SynCSE and GCSE achieve significant improvements over the methods without LLM. This might be due to the similarity of the semantic representation space between the training set and the development set, both of which are synthesized via LLM. Nevertheless, GCSE improves on SynCSE by 2.19%, demonstrating that its understanding of the entities and quantities in sentences has improved to a certain degree.
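
The labeling and scoring procedure described above can be sketched as follows, assuming the similarity scores of the two annotator models and of the evaluated model are already available as plain lists. The helper names and toy scores are hypothetical; Spearman’s correlation is implemented directly as the Pearson correlation of average ranks so that the sketch has no external dependencies.

```python
import math


def averaged_labels(scores_a, scores_b):
    """Final label per pair: mean of the two annotator models' similarity scores."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores from the two annotator models and from the evaluated model.
labels = averaged_labels([0.9, 0.2, 0.5, 0.7], [0.7, 0.4, 0.5, 0.9])
predictions = [0.85, 0.10, 0.40, 0.95]
score = spearman(predictions, labels)
```

Because Spearman’s correlation depends only on rank order, the evaluated model is rewarded for ordering sentence pairs in the same way as the averaged labels, regardless of the absolute scale of its similarity scores.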

Impact of the ratio between domain and general data: Figure 4 presents the trend of the GCSE Spearman’s correlation result as the proportion of general data increases, where “d” denotes using only the domain data. The results show that introducing a certain proportion of general data can improve the performance on the STS tasks. Nevertheless, if the sample size of general data surpasses three times that of the domain data, the model’s performance starts to decrease. This suggests that incorporating a suitable quantity of external corpus can enhance the uniformity of the sentence embedding representation. However, as the number of out-of-domain samples grows, the impact of domain data on model training diminishes. Thus, the results indicate that the domain data enables the model to represent the target domain sentences more effectively, while the general data enhances the uniformity of the sentence representation.
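
A minimal sketch of how such a mixture could be assembled is given below, assuming the general sentences are sampled uniformly at random and that the ratio is expressed as a multiple of the domain data size. The function name, the sampling strategy, and the toy sentences are assumptions made for illustration and are not taken from the released implementation.

```python
import random


def mix_domain_and_general(domain, general, ratio, seed=0):
    """Combine all domain sentences with ratio * len(domain) general sentences.

    ratio = 0 corresponds to the domain-only setting ("d"); the number of
    general sentences is capped by the size of the general corpus.
    """
    rng = random.Random(seed)
    n_general = min(len(general), int(ratio * len(domain)))
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


# Hypothetical toy corpora.
domain_sentences = [f"domain sentence {i}" for i in range(4)]
general_sentences = [f"general sentence {i}" for i in range(20)]

for ratio in (0, 1, 3, 5):
    mixed = mix_domain_and_general(domain_sentences, general_sentences, ratio)
    print(ratio, len(mixed))
```

Fixing the random seed keeps the sampled general subset identical across runs, so that differences between ratios are not confounded by different samples.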

Impact of the Gaussian-decayed function: To further investigate the effectiveness of the Gaussian-decayed function, we analyze the GCSE performance against the weight of $\sigma$ on the synthesized data, both with and without filtering. We first use the synthesized data without filtering to evaluate the efficacy of the Gaussian-decayed function in eliminating false negative samples, and the results are presented in Figure 4 (b). The model’s performance clearly improves as the weight of $\sigma$ grows, which suggests that a greater $\sigma$ weight enhances the model’s effectiveness in mitigating the impact of false negative samples. However, a higher $\sigma$ does not necessarily indicate better performance. As shown in Figure 4 (a), an increase in $\sigma$ at the initial stage contributes to enhancing the model’s performance. Nevertheless, as the weight of $\sigma$ increases further, the performance of the backbones generally declines, because the model adheres too strictly to the “established guidelines”, which impairs its ability to learn from the hard negative samples. We further use density plots to visualize the predictions on the STS-Benchmark development set in Figure 5. These models are trained on the synthesized data without filtering. In Figure 5 (a), the distribution of prediction results for labels $\geq 4$ is significantly shifted to the left. In Figure 5 (b), this issue is effectively alleviated, demonstrating the effectiveness of the Gaussian-decayed function in reducing the influence of false negative samples.
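
To give an intuition for the role of $\sigma$, the sketch below evaluates a standard Gaussian kernel over a generic gap between two similarity scores. This is only an illustration of how $\sigma$ controls the width of a Gaussian decay: the function name and the `gap` argument are hypothetical, and the exact way the Gaussian-decayed function enters the GCSE training objective is defined in the method section and is not reproduced here.

```python
import math


def gaussian_decay(gap, sigma):
    """Standard Gaussian kernel exp(-gap^2 / (2 * sigma^2)).

    gap   : hypothetical difference between two similarity scores
    sigma : width of the kernel; larger values decay more slowly with the gap
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-(gap ** 2) / (2.0 * sigma ** 2))


# Illustrative values only: how the kernel responds to the same gaps
# under different choices of sigma.
for sigma in (0.1, 0.5, 1.0):
    row = [round(gaussian_decay(gap, sigma), 3) for gap in (0.0, 0.2, 0.5)]
    print(sigma, row)
```

With a small $\sigma$ the kernel falls off sharply even for small gaps, whereas a large $\sigma$ keeps it close to 1 over a wide range, which is one way to read the trade-off discussed above between suppressing false negatives and still learning from hard negatives.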

5 Conclusion

In this paper, we present a pipeline-based data augmentation method that uses an LLM to improve unsupervised sentence embedding models. Our data synthesis strategy for few-shot domain-specific data emphasizes entity and quantity information, improving the model’s capacity to capture fine-grained semantic distinctions. The proposed Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model further mitigates the impact of noise in generated data. Extensive experiments on STS and reranking tasks confirm that our approach achieves state-of-the-art results while requiring fewer synthesized data samples and a more lightweight LLM, demonstrating its effectiveness and efficiency.

References

Reimers and Gurevych [2019] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3973–3983, 2019.
Thakur et al. [2021] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=wCu6T5xFjeJ.
Wang et al. [2022a] Bin Wang, C.-C. Jay Kuo, and Haizhou Li. Just rank: Rethinking evaluation with word and sentence similarities. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6060–6077, Dublin, Ireland, May 2022a. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.419. URL https://aclanthology.org/2022.acl-long.419.
Sen et al. [2020] Jaydeep Sen, Chuan Lei, Abdul Quamar, Fatma Özcan, Vasilis Efthymiou, Ayushi Dalmia, Greg Stager, Ashish R. Mittal, Diptikalyan Saha, and Karthik Sankaranarayanan. ATHENA++: natural language querying for complex nested SQL queries. Proc. VLDB Endow., 13(11):2747–2759, 2020. URL http://www.vldb.org/pvldb/vol13/p2747-sen.pdf.
OpenAI [2023] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774:1–100, 2023. doi:10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
Bai et al. [2023] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. CoRR, abs/2309.16609:1–59, 2023. doi:10.48550/ARXIV.2309.16609. URL https://doi.org/10.48550/arXiv.2309.16609.
Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971:1–27, 2023. doi:10.48550/ARXIV.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971.
Mitra et al. [2023] Kathakali Mitra, Aditha Venkata Santosh Ashish, Soumya Teotia, and Aruna Malapati. Effect of pivot language and segment-based few-shot prompting for cross-domain multi-intent identification in low resource languages. In Jyoti D. Pawar and Sobha Lalitha Devi, editors, Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 349–356, Goa University, Goa, India, December 2023. NLP Association of India (NLPAI). URL https://aclanthology.org/2023.icon-1.27.
Ryu et al. [2023] Cheol Ryu, Seolhwa Lee, Subeen Pang, Chanyeol Choi, Hojun Choi, Myeonggee Min, and Jy-Yong Sohn. Retrieval-based evaluation for LLMs: A case study in Korean legal QA. In Daniel Preoțiuc-Pietro, Catalina Goanta, Ilias Chalkidis, Leslie Barrett, Gerasimos Spanakis, and Nikolaos Aletras, editors, Proceedings of the Natural Legal Language Processing Workshop 2023, pages 132–137, Singapore, December 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.nllp-1.13. URL https://aclanthology.org/2023.nllp-1.13.
Gao et al. [2021] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910. Association for Computational Linguistics, November 2021. doi:10.18653/v1/2021.emnlp-main.552. URL https://aclanthology.org/2021.emnlp-main.552.
Chen et al. [2022] Yiming Chen, Yan Zhang, Bin Wang, Zuozhu Liu, and Haizhou Li. Generate, discriminate and contrast: A semi-supervised sentence representation learning framework. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8150–8161, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.558. URL https://aclanthology.org/2022.emnlp-main.558.
Wang and Dou [2023] Hao Wang and Yong Dou. Sncse: Contrastive learning for unsupervised sentence embedding with soft negative samples. In International Conference on Intelligent Computing, pages 419–431. Springer, 2023.
Wu et al. [2022a] Xing Wu, Chaochen Gao, Liangjun Zang, Jizhong Han, Zhongyuan Wang, and Songlin Hu. ESimCSE: Enhanced sample building method for contrastive learning of unsupervised sentence embedding. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na, editors, Proceedings of the 29th International Conference on Computational Linguistics, pages 3898–3907, Gyeongju, Republic of Korea, October 2022a. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.342.
Xu et al. [2023] Bo Xu, Shouang Wei, Luyi Cheng, Shizhou Huang, Hui Song, Ming Du, and Hongya Wang. Hsimcse: Improving contrastive learning of unsupervised sentence representation with adversarial hard positives and dual hard negatives. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2023. doi:10.1109/IJCNN54540.2023.10191335.
Chuang et al. [2022a] Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic, Shang-Wen Li, Scott Yih, Yoon Kim, and James Glass. DiffCSE: Difference-based contrastive learning for sentence embeddings. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4207–4218, Seattle, United States, July 2022a. Association for Computational Linguistics. doi:10.18653/v1/2022.naacl-main.311. URL https://aclanthology.org/2022.naacl-main.311.
Zhang et al. [2023] Junlei Zhang, Zhenzhong Lan, and Junxian He. Contrastive learning of sentence embeddings from scratch. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 3916–3932. Association for Computational Linguistics, 2023. doi:10.18653/V1/2023.EMNLP-MAIN.238. URL https://doi.org/10.18653/v1/2023.emnlp-main.238.
Wang et al. [2024] Huiming Wang, Zhaodonghui Li, Liying Cheng, De Wen Soh, and Lidong Bing. Large language models can contrastively refine their generation for better sentence representation learning. In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 7874–7891. Association for Computational Linguistics, 2024. doi:10.18653/V1/2024.NAACL-LONG.436. URL https://doi.org/10.18653/v1/2024.naacl-long.436.
Miao et al. [2023] Pu Miao, Zeyao Du, and Junlin Zhang. Debcse: Rethinking unsupervised contrastive sentence embedding learning in the debiasing perspective. In Ingo Frommholz, Frank Hopfgartner, Mark Lee, Michael Oakes, Mounia Lalmas, Min Zhang, and Rodrygo L. T. Santos, editors, Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, pages 1847–1856. ACM, 2023. doi:10.1145/3583780.3614833. URL https://doi.org/10.1145/3583780.3614833.
Kiros et al. [2015] Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf.
Logeswaran and Lee [2018] Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJvJXZb0W.
Hill et al. [2016] Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377, San Diego, California, June 2016. Association for Computational Linguistics. doi:10.18653/v1/N16-1162. URL https://aclanthology.org/N16-1162.
Mikolov et al. [2013] Tomás Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013. URL https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.
Pagliardini et al. [2018] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:10.18653/v1/N18-1049. URL https://aclanthology.org/N18-1049.
Li et al. [2020] Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. On the sentence embeddings from pre-trained language models. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9119–9130, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.733. URL https://aclanthology.org/2020.emnlp-main.733.
Su et al. [2021] Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. Whitening sentence representations for better semantics and faster retrieval. CoRR, abs/2103.15316, 2021. URL https://arxiv.org/abs/2103.15316.
Wang et al. [2021] Kexin Wang, Nils Reimers, and Iryna Gurevych. TSDAE: Using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 671–688, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.findings-emnlp.59. URL https://aclanthology.org/2021.findings-emnlp.59.
Wu and Zhao [2022] Bohong Wu and Hai Zhao. Sentence representation learning with generative objective rather than contrastive objective. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3356–3368, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.221. URL https://aclanthology.org/2022.emnlp-main.221.
Huang et al. [2021] Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong, Daxin Jiang, and Nan Duan. Whiteningbert: An easy unsupervised sentence embedding approach. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, pages 238–244. Association for Computational Linguistics, 2021. doi:10.18653/V1/2021.FINDINGS-EMNLP.23. URL https://doi.org/10.18653/v1/2021.findings-emnlp.23.
Zhang et al. [2020] Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. An unsupervised sentence embedding method by mutual information maximization. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.124. URL https://aclanthology.org/2020.emnlp-main.124.
Giorgi et al. [2021] John Giorgi, Osvald Nitski, Bo Wang, and Gary Bader. DeCLUTR: Deep contrastive learning for unsupervised textual representations. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 879–895, Online, August 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.72. URL https://aclanthology.org/2021.acl-long.72.
Kim et al. [2021] Taeuk Kim, Kang Min Yoo, and Sang-goo Lee. Self-guided contrastive learning for BERT sentence representations. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2528–2540, Online, August 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.197. URL https://aclanthology.org/2021.acl-long.197.
Zhang et al. [2022] Yuhao Zhang, Hongji Zhu, Yongliang Wang, Nan Xu, Xiaobo Li, and Binqiang Zhao. A contrastive framework for learning sentence representations from pairwise and triple-wise perspective in angular space. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4892–4903, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.336. URL https://aclanthology.org/2022.acl-long.336.
Liu et al. [2023] Jiduan Liu, Jiahao Liu, Qifan Wang, Jingang Wang, Wei Wu, Yunsen Xian, Dongyan Zhao, Kai Chen, and Rui Yan. Rankcse: Unsupervised sentence representations learning via learning to rank. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13785–13802. Association for Computational Linguistics, 2023. doi:10.18653/V1/2023.ACL-LONG.771. URL https://doi.org/10.18653/v1/2023.acl-long.771.
Ni et al. [2022] Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, and Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 1864–1874. Association for Computational Linguistics, 2022. doi:10.18653/V1/2022.FINDINGS-ACL.146. URL https://doi.org/10.18653/v1/2022.findings-acl.146.
Cheng et al. [2023] Qinyuan Cheng, Xiaogui Yang, Tianxiang Sun, Linyang Li, and Xipeng Qiu. Improving contrastive learning of sentence embeddings from AI feedback. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 11122–11138, Toronto, Canada, July 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.findings-acl.707. URL https://aclanthology.org/2023.findings-acl.707.
Springer et al. [2024] Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, and Aditi Raghunathan. Repetition improves language model embeddings. CoRR, abs/2402.15449, 2024. doi:10.48550/ARXIV.2402.15449. URL https://doi.org/10.48550/arXiv.2402.15449.
Cer et al. [2017] Daniel M. Cer, Mona T. Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. CoRR, abs/1708.00055, 2017. URL http://arxiv.org/abs/1708.00055.
Marelli et al. [2014] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. A SICK cure for the evaluation of compositional distributional semantic models. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, May 26-31, 2014, pages 216–223. European Language Resources Association (ELRA), 2014. URL http://www.lrec-conf.org/proceedings/lrec2014/summaries/363.html.
GLM et al. [2024] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186. Association for Computational Linguistics, June 2019. doi:10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692:1–13, 2019. URL http://arxiv.org/abs/1907.11692.
Conneau and Kiela [2018] Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Kôiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA), 2018. URL http://www.lrec-conf.org/proceedings/lrec2018/summaries/757.html.
Agirre et al. [2012] Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. SemEval-2012 task 6: A pilot on semantic textual similarity. In Eneko Agirre, Johan Bos, Mona Diab, Suresh Manandhar, Yuval Marton, and Deniz Yuret, editors, *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393, Montréal, Canada, 7-8 June 2012. Association for Computational Linguistics. URL https://aclanthology.org/S12-1051.
Agirre et al. [2013] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *SEM 2013 shared task: Semantic textual similarity. In Mona Diab, Tim Baldwin, and Marco Baroni, editors, Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/S13-1004.
Agirre et al. [2014] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. SemEval-2014 task 10: Multilingual semantic textual similarity. In Preslav Nakov and Torsten Zesch, editors, Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91, Dublin, Ireland, August 2014. Association for Computational Linguistics. doi:10.3115/v1/S14-2010. URL https://aclanthology.org/S14-2010.
Agirre et al. [2015] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Preslav Nakov, Torsten Zesch, Daniel Cer, and David Jurgens, editors, Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263, Denver, Colorado, June 2015. Association for Computational Linguistics. doi:10.18653/v1/S15-2045. URL https://aclanthology.org/S15-2045.
Agirre et al. [2016] Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Steven Bethard, Marine Carpuat, Daniel Cer, David Jurgens, Preslav Nakov, and Torsten Zesch, editors, Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California, June 2016. Association for Computational Linguistics. doi:10.18653/v1/S16-1081. URL https://aclanthology.org/S16-1081.
Lei et al. [2016] Tao Lei, Hrishikesh Joshi, Regina Barzilay, Tommi Jaakkola, Kateryna Tymoshenko, Alessandro Moschitti, and Lluís Màrquez. Semi-supervised question retrieval with gated convolutions. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1279–1289, San Diego, California, June 2016. Association for Computational Linguistics. doi:10.18653/v1/N16-1153. URL https://aclanthology.org/N16-1153.
Wu et al. [2020] Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, and Ming Zhou. MIND: A large-scale dataset for news recommendation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3597–3606, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.331. URL https://aclanthology.org/2020.acl-main.331.
Cohan et al. [2020] Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. SPECTER: Document-level representation learning using citation-informed transformers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.207. URL https://aclanthology.org/2020.acl-main.207.
Liu et al. [2018] Xueqing Liu, Chi Wang, Yue Leng, and ChengXiang Zhai. Linkso: a dataset for learning to retrieve similar question answer pairs on software development forums. In Yijun Yu, Erik M. Fredericks, and Premkumar T. Devanbu, editors, Proceedings of the 4th ACM SIGSOFT International Workshop on NLP for Software Engineering, NL4SE@ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 4, 2018, pages 2–5. ACM, 2018. doi:10.1145/3283812.3283815. URL https://doi.org/10.1145/3283812.3283815.
Chuang et al. [2022b] Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic, Shang-Wen Li, Scott Yih, Yoon Kim, and James R. Glass. Diffcse: Difference-based contrastive learning for sentence embeddings. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 4207–4218. Association for Computational Linguistics, 2022b. doi:10.18653/V1/2022.NAACL-MAIN.311. URL https://doi.org/10.18653/v1/2022.naacl-main.311.
Jiang et al. [2022] Ting Jiang, Jian Jiao, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Denvy Deng, and Qi Zhang. PromptBERT: Improving BERT sentence embeddings with prompts. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8826–8837, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.603. URL https://aclanthology.org/2022.emnlp-main.603.
Wu et al. [2022b] Qiyu Wu, Chongyang Tao, Tao Shen, Can Xu, Xiubo Geng, and Daxin Jiang. PCL: Peer-contrastive learning with diverse augmentations for unsupervised sentence embeddings. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 12052–12066, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.826. URL https://aclanthology.org/2022.emnlp-main.826.
Wang et al. [2022b] Wei Wang, Liangzhu Ge, Jingqiao Zhang, and Cheng Yang. Improving contrastive learning of sentence embeddings with case-augmented positives and retrieved negatives. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 2159–2165, New York, NY, USA, 2022b. Association for Computing Machinery. ISBN 9781450387323. doi:10.1145/3477495.3531823. URL https://doi.org/10.1145/3477495.3531823.
Muennighoff et al. [2023] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148.
Pang and Lee [2005] Bo Pang and Lillian Lee. Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, page 115–124, USA, 2005. Association for Computational Linguistics. doi:10.3115/1219840.1219855. URL https://doi.org/10.3115/1219840.1219855.
Hu and Liu [2004] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Won Kim, Ron Kohavi, Johannes Gehrke, and William DuMouchel, editors, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, pages 168–177. ACM, 2004. doi:10.1145/1014052.1014073. URL https://doi.org/10.1145/1014052.1014073.
Pang and Lee [2004] Bo Pang and Lillian Lee. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL ’04, page 271–es, USA, 2004. Association for Computational Linguistics. doi:10.3115/1218955.1218990. URL https://doi.org/10.3115/1218955.1218990.
Wiebe et al. [2005] Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Lang. Resour. Evaluation, 39(2-3):165–210, 2005. doi:10.1007/S10579-005-7880-9. URL https://doi.org/10.1007/s10579-005-7880-9.
Socher et al. [2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard, editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.
Voorhees and Tice [2000] Ellen M. Voorhees and Dawn M. Tice. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’00, page 200–207, New York, NY, USA, 2000. Association for Computing Machinery. ISBN 1581132263. doi:10.1145/345508.345577. URL https://doi.org/10.1145/345508.345577.

Appendix

Appendix A Data Synthesis Prompts

Knowledge Extraction Prompt	Instruction: Predicts the subject categories, contained entities, and quantified information of the following text Rules: The category is an item in [ $\left\{categories\_name\right\}$ , …], quantified information refers to information contained in the text with numerical values or units, such as ‘2GB’, ‘three cups’, ‘two dogs’, etc Output format: json format data, the data format is: { cls: [], // category entities: [{text: “", type: “"}], // entities, ‘text’ must be subsequences in the Input text quantities: [{text: “", type: “", quantity: 0}] // To quantify the information, ‘text’ must be a subsequence in the Input text } Input: $\left\{x\right\}$
Rewriting Prompt 1	Instruction: You are an excellent storyteller; rewrite the input sentence in a different way. Please try to recreate the sentence using different expressions, including varied tones, synonyms, and sentence patterns, while ensuring that the new sentence has the same meaning as the original sentence. Input: $\left\{x\right\}$
Rewriting Prompt 2	Instruction: You are a great storyteller; I would be grateful if you could employ your creativity to devise an illustration of the preceding segment of the sentence. The preceding statement must not exceed $\left\{number\right\}$ words, and it follows the original text. Input: $\left\{x\right\}$
Rewriting Prompt 3	Instruction: You are a great rewriter, and I want you to generate new sentence according to the classification, entities and quantities info provided by the json. Rules: You should aware that the new text in “quantities" should be rewrite follows the “quantity" value. e.g. “text": “A man", “quantity": 5 should rewrite as “five men". Metadata: { “cls": “ $\left\{categories\_name\right\}$ ", “entities": [{ “text": “ $\left\{entity\_text\right\}$ ", “type": “ $\left\{entity\_type\right\}$ " }, …], “quantities": [{ “text": “ $\left\{entity\_text\right\}$ ", "quantity": $\left\{entity\_quantity\right\}$ }, …] } Input: $\left\{x\right\}$
Syntactic Antisense Prompt	Instruction: You are dishonest; you ought to reformulate the input sentence so that the NLI model perceives it as an opposing sample. Rules: 1. If the statement asserts negation, you should affirm; conversely, if the statement asserts affirmation, you should negate. 2. If an individual loves something, one should assert that it does not reciprocate that affection. 3. If an individual is engaged in one activity, state that they are performing a different activity. 4. If the statement is affirmative/negative, express it as negative/affirmative. Input: $\left\{x\right\}$
Entity Revision Prompt	Instruction: You are a great story teller, rewrites the input sentence, and change the entity ‘ $\left\{original\_entity\_text\right\}$ ’ to another $\left\{entity\_type\right\}$ ‘ $\left\{new\_entity\_text\right\}$ ’. Input: $\left\{x\right\}$
Quantity Revision Prompt	Instruction: You are a great story teller, rewrites the input sentence, and change the quantity $\left\{original\_quantity\_value\right\}$ of ‘ $\left\{original\_quantity\_text\right\}$ ’ to $\left\{random\_quantity\_value\right\}$ . Input: $\left\{x\right\}$

Table 5: Examples of data synthesis prompts, where $\left\{variable\ name\right\}$ refers to a variable.

In this section, we provide the details of our prompts for knowledge extraction and integration and for data synthesis. The prompts are presented in Table 5.
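As an illustration of how such templates might be used, the following minimal sketch fills the Entity Revision Prompt from Table 5 and parses a reply to the Knowledge Extraction Prompt. The function names, the substring filter, and the example reply are assumptions made for illustration; they are not taken from our released pipeline, and the call to the LLM itself is omitted.

```python
import json

# Template text follows the "Entity Revision Prompt" row of Table 5.
# The Python placeholder names mirror the variables in that row.
ENTITY_REVISION_TEMPLATE = (
    "Instruction: You are a great story teller, rewrites the input sentence, "
    "and change the entity ‘{original_entity_text}’ to another "
    "{entity_type} ‘{new_entity_text}’. Input: {x}"
)


def parse_knowledge(raw_json: str, sentence: str) -> dict:
    """Parse a knowledge-extraction reply (hypothetical helper).

    Assumption: items whose 'text' does not occur verbatim in the input
    sentence are dropped, following the rule stated in the prompt.
    """
    data = json.loads(raw_json)

    def keep(items):
        return [it for it in items if it.get("text") and it["text"] in sentence]

    return {
        "cls": data.get("cls", []),
        "entities": keep(data.get("entities", [])),
        "quantities": keep(data.get("quantities", [])),
    }


def build_entity_revision_prompt(sentence: str, entity: dict, new_entity_text: str) -> str:
    """Fill the entity-revision template for one extracted entity."""
    return ENTITY_REVISION_TEMPLATE.format(
        original_entity_text=entity["text"],
        entity_type=entity["type"],
        new_entity_text=new_entity_text,
        x=sentence,
    )


if __name__ == "__main__":
    sentence = "A woman is cooking eggs."
    # Hypothetical LLM reply; "pasta" is not in the sentence and is filtered out.
    reply = (
        '{"cls": ["food"], "entities": [{"text": "eggs", "type": "food"}, '
        '{"text": "pasta", "type": "food"}], "quantities": []}'
    )
    knowledge = parse_knowledge(reply, sentence)
    print(build_entity_revision_prompt(sentence, knowledge["entities"][0], "rice"))
```

The replacement entity ("rice" here) would in practice be drawn from the entity knowledge graph; the sketch takes it as a plain argument.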

Appendix B Performance on Transfer Tasks

We also evaluate GCSE under the same settings as SimCSE on seven transfer tasks: MR [Pang and Lee, 2005], CR [Hu and Liu, 2004], SUBJ [Pang and Lee, 2004], MPQA [Wiebe et al., 2005], SST2 [Socher et al., 2013], TREC [Voorhees and Tice, 2000], and MRPC [Voorhees and Tice, 2000]. The results are shown in Table 6. GCSE achieves the best average performance on BERT-base and RoBERTa-large, and the second-best on BERT-large and RoBERTa-base, where it trails the best method by at most 0.30 points, demonstrating its potential in downstream tasks.

Model	Method	MR	CR	SUBJ	MPQA	SST2	TREC	MRPC	Avg.
BERT-base	SimCSE $\spadesuit$	68.40	82.41	74.38	80.91	78.56	76.85	72.23	76.25
	DiffCSE $\spadesuit$	72.28	84.43	76.47	83.90	80.54	80.59	71.23	78.49
	PCL $\spadesuit$	72.84	83.81	76.52	83.06	79.32	80.01	73.38	78.42
	RankCSE $\spadesuit$	75.66	86.27	77.81	84.74	81.10	81.80	75.13	80.36
	MultiCSR (ChatGPT) $\clubsuit$	\ul{82.70}	88.15	94.97	90.08	86.87	\ul{87.70}	75.46	\ul{86.56}
	SynCSE (ChatGPT)*	83.34	88.80	93.88	90.39	88.96	83.60	\ul{75.94}	86.42
	GCSE	82.22	\ul{88.43}	\ul{94.59}	\ul{90.09}	\ul{86.88}	89.40	76.06	86.81
BERT-large	SimCSE $\spadesuit$	70.88	84.16	76.43	84.50	79.76	79.26	73.88	78.41
	PCL $\spadesuit$	74.87	86.11	78.29	85.65	80.52	81.62	73.94	80.14
	RankCSE $\spadesuit$	75.48	86.50	78.60	85.45	81.09	81.58	75.53	80.60
	SynCSE (ChatGPT)*	85.78	90.47	\ul{94.77}	90.41	90.50	\ul{89.00}	75.77	88.10
	GCSE	\ul{83.97}	\ul{89.38}	95.13	\ul{90.22}	\ul{89.57}	90.60	\ul{75.71}	\ul{87.80}
RoBERTa-base	SimCSE $\spadesuit$	70.16	81.77	73.24	81.36	80.65	80.22	68.56	76.57
	DiffCSE $\spadesuit$	70.05	83.43	75.49	82.81	82.12	82.38	71.19	78.21
	PCL $\spadesuit$	71.13	82.38	75.40	83.07	81.98	81.63	69.72	77.90
	RankCSE $\spadesuit$	73.20	85.95	77.17	84.82	82.58	83.08	71.88	79.81
	MultiCSR (ChatGPT) $\clubsuit$	\ul{84.70}	90.69	94.40	\ul{89.38}	89.42	89.62	77.01	87.89
	SynCSE (ChatGPT) $\diamondsuit$	85.47	91.44	92.53	89.67	\ul{90.94}	81.60	76.06	86.82
	GCSE	84.39	\ul{90.81}	\ul{94.02}	88.90	91.05	\ul{89.40}	\ul{76.12}	\ul{87.81}
RoBERTa-large	SimCSE $\spadesuit$	72.86	83.99	75.62	84.77	81.80	81.98	71.26	78.90
	PCL $\spadesuit$	74.08	84.36	76.42	85.49	81.76	82.79	71.51	79.49
	RankCSE $\spadesuit$	73.20	85.83	78.00	85.63	82.67	\ul{84.19}	73.64	80.45
	SynCSE (ChatGPT) $\diamondsuit$	87.24	92.16	\ul{93.75}	90.81	91.87	84.00	76.29	\ul{88.02}
	GCSE	\ul{85.65}	\ul{90.78}	94.16	\ul{90.08}	\ul{90.44}	92.80	\ul{73.74}	88.24

Table 6: Accuracy comparison of different sentence embedding models on transfer tasks. “$\spadesuit$”: results from Liu et al. [2023], “$\clubsuit$”: results from Wang et al. [2024], “$\diamondsuit$”: results from Zhang et al. [2023]. “*”: we reproduce the results with the officially released corpus from Zhang et al. [2023].

Appendix C Case Studies

Premise	Hypothesis	Gold	SimCSE	RankCSE	SynCSE	GCSE
A woman is cooking eggs.	A woman is cooking something.	3.00	4.37 (1.372)	4.23 (1.320)	\ul{3.66 (0.662)}	3.24 (0.236)
Two little girls are talking on the phone.	A little girl is walking down the street.	0.50	3.38 (2.881)	3.64 (3.139)	\ul{1.97 (1.468)}	1.85 (1.351)
A chef is preparing some food.	A chef prepared a meal.	4.00	4.27 (0.270)	4.59 (0.588)	4.56 (0.561)	\ul{4.41 (0.408)}
Five kittens are eating out of five dishes.	Kittens are eating food on trays.	2.75	3.81 (1.056)	3.71 (0.957)	\ul{3.28 (0.535)}	3.12 (0.373)
A woman is cutting some herbs.	A woman is chopping cilantro.	2.80	3.58 (0.777)	3.58 (0.967)	\ul{3.11 (0.313)}	2.61 (0.185)

Table 7: Case studies on model prediction similarity with gold labels in the STS-Benchmark development set, where Gold represents the label score of the sentence pair (ranging from zero to five). The similarity scores of all models are multiplied by a coefficient of five for better comparison, and the value in parentheses denotes the RMS error between the predicted score and the label. Words highlighted in blue denote the entity alteration in the sentence-pair, whereas words in yellow indicate the quantities that change inside the sentence-pair.

To further verify the improvement in our method’s awareness of entities and quantities, we selected five sentence pairs from the STS-Benchmark development set that explicitly contain an alteration in entity or quantity, and present the predicted cosine-similarity scores of GCSE and related methods with the BERT-base backbone in Table 7. The results show that the prediction of our model achieves the minimum root-mean-square error with respect to the label in most cases, which indicates that our model has a stronger capacity to distinguish fine-grained information.
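The scores in Table 7 can be reproduced in form by the following minimal sketch, which scales a cosine similarity by five and measures its distance from the gold label. For a single sentence pair the root-mean-square error reduces to the absolute difference, which is what the sketch computes. The function names and the toy vectors are assumptions for illustration; real inputs would be the sentence embeddings produced by each model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def scaled_score_and_error(u, v, gold, scale=5.0):
    """Return (scaled similarity, error against the gold label).

    The similarity is multiplied by `scale` so that it is comparable with
    STS labels in [0, 5]. For one pair, the RMS error equals |pred - gold|.
    """
    predicted = scale * cosine_similarity(u, v)
    return predicted, abs(predicted - gold)


if __name__ == "__main__":
    # Toy 2-d "embeddings"; not taken from any model in Table 7.
    premise, hypothesis = [1.0, 0.0], [1.0, 1.0]
    score, error = scaled_score_and_error(premise, hypothesis, gold=3.0)
    print(f"{score:.3f} ({error:.3f})")
```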

Appendix D Ablation Studies of Gaussian-decayed and Few-shot Samples

Method	STS-12	STS-13	STS-14	STS-15	STS-16	STS-B	SICK-R	Avg.
SynCSE (ChatGPT)*	75.86	82.19	78.71	85.63	81.11	82.35	78.79	80.66
w sampled	75.48	85.60	78.76	84.78	80.38	82.12	76.46	80.51
w sampled & G.D.	75.71	85.24	79.09	85.15	80.82	82.68	77.54	80.89
w G.D.	75.89	85.26	79.24	85.67	80.79	82.63	78.19	81.10
w sampled & domain & G.D.	75.88	86.02	79.46	86.10	80.27	82.87	76.91	81.07

Table 8: Ablation studies of sample size and the Gaussian-decayed function by utilizing SynCSE. “*”: we reproduce the results with the officially released corpus from Zhang et al. [2023].

To evaluate the efficacy of the proposed Gaussian-decayed function and our domain-oriented selection strategy, we apply the Gaussian-decayed function to SynCSE and subsample the SynCSE training data to the same size as our synthetic data. The sample size is 64k, and $\sigma$ in $G(\cdot)$ is assigned the same value as specified in Section 4.1. The results of the various configurations applied to SynCSE are presented in Table 8. “w sampled” denotes training on the sampled SynCSE data only; performance decreases when training on fewer samples without any other change. “w sampled & G.D.” additionally incorporates $G(\cdot)$ on top of “w sampled”. “w G.D.” denotes training on the full dataset with $G(\cdot)$. In both configurations, the average performance exceeds that of the vanilla model, illustrating the module’s efficacy. “w sampled & domain & G.D.” denotes the combined use of sampled data, domain data, and $G(\cdot)$, with 48k samples from the SynCSE dataset and 16k from the synthesized domain dataset. This configuration attains the second-best average performance, suggesting that incorporating domain data can reduce the number of training samples required while enhancing model efficacy.
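The data mix of the “w sampled & domain & G.D.” configuration can be sketched as follows: a fixed number of examples is drawn without replacement from the general corpus and from the domain corpus, and the two draws are shuffled together. The function name, the fixed seed, and the use of uniform sampling are assumptions for illustration; the sketch does not implement $G(\cdot)$, whose definition is given in the main text.

```python
import random


def mix_training_data(general, domain, n_general, n_domain, seed=0):
    """Draw n_general and n_domain examples without replacement and shuffle.

    Assumption: uniform random sampling; any more selective strategy would
    replace the two `sample` calls below.
    """
    rng = random.Random(seed)
    mixed = rng.sample(general, n_general) + rng.sample(domain, n_domain)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy corpora standing in for the SynCSE data and the synthesized domain data.
    general_corpus = [f"general-{i}" for i in range(100_000)]
    domain_corpus = [f"domain-{i}" for i in range(20_000)]
    train = mix_training_data(general_corpus, domain_corpus, 48_000, 16_000)
    print(len(train))  # 48k + 16k = 64k examples
```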

Appendix E Unsupervised Sentence Embedding on LLM

Model	Avg.
SimCSE ${\dagger}$	76.57
Flan-T5-XL $\clubsuit$	68.76
ChatGPT $\clubsuit$	76.19
ChatGLM-3 LoRA	71.17

Table 9: Performance comparison of different LLMs on STS tasks. “†”: results from Miao et al. [2023], “$\clubsuit$”: results from Wang et al. [2024].

In this section, we apply contrastive learning to ChatGLM-3 with a low-rank adapter (LoRA) layer, and we collect sentence embedding results obtained from other LLMs via in-context learning (ICL), to evaluate how well LLM-generated similarities align with the gold labels, as presented in Table 9. The results indicate that even ChatGPT with ICL cannot surpass contrastive learning methods based on encoder models. Existing approaches to generating sentence embeddings with LLMs are therefore unsatisfactory. Although some works use prompts without extra training, the quality of the resulting sentence embeddings depends primarily on prompt engineering. Furthermore, contrastive fine-tuning of LLMs brings no substantial benefit in semantic textual similarity tasks and underutilizes the capabilities of LLMs. To reduce costs, we argue that fully leveraging LLMs to distill smaller models is the better option.
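For reference, the contrastive objective used in this kind of fine-tuning can be sketched as an in-batch InfoNCE loss in the style of SimCSE: each anchor embedding should be closer to its own positive than to the positives of the other sentences in the batch. The sketch below is written in plain Python for readability. The function name and the temperature of 0.05 are assumptions for illustration, and it does not reproduce the exact training setup used for ChatGLM-3 with LoRA.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss over paired embeddings.

    anchors[i] and positives[i] form a positive pair; positives[j], j != i,
    act as in-batch negatives for anchors[i].
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cos(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    aligned = info_nce_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)  # correctly paired positives give the lower loss
```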