Data Augmentation for Code Translation with Comparable Corpora and Multiple References

Xie, Yiqing; Naik, Atharva; Fried, Daniel; Rose, Carolyn

arXiv:2311.00317 (cs)

[Submitted on 1 Nov 2023 (v1), last revised 4 Oct 2024 (this version, v2)]

Abstract: One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques: one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% in Computational Accuracy (CA@1), a metric that verifies the correctness of translations by execution. The code is available at this https URL.
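
The abstract mentions two execution-based steps: filtering generated reference translations by unit tests, and scoring translations with CA@1. The following is a minimal Python sketch of both ideas, not the authors' implementation. The function names (`passes_tests`, `filter_references`, `computational_accuracy_at_1`), the use of assert-based test strings, and the toy `add` example are all assumptions made for illustration. It also assumes the target language is Python and that candidates are trusted; real pipelines would run untrusted code in a sandboxed subprocess with a timeout.

```python
from typing import Dict, List


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate runs and satisfies every assertion in test_src."""
    namespace: Dict[str, object] = {}
    try:
        exec(candidate_src, namespace)  # define the translated function(s)
        exec(test_src, namespace)       # run the assert-based unit tests
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def filter_references(candidates: List[str], test_src: str) -> List[str]:
    """Keep distinct candidates that pass the tests, preserving their order."""
    kept: List[str] = []
    seen = set()
    for src in candidates:
        key = src.strip()
        if key in seen:
            continue  # skip exact duplicates so references stay varied
        seen.add(key)
        if passes_tests(src, test_src):
            kept.append(src)
    return kept


def computational_accuracy_at_1(top1: List[str], tests: List[str]) -> float:
    """Fraction of problems whose single top-ranked translation passes its tests."""
    if not top1:
        return 0.0
    passed = sum(passes_tests(src, t) for src, t in zip(top1, tests))
    return passed / len(top1)


if __name__ == "__main__":
    # Hypothetical candidate translations of one function, plus its unit tests.
    candidates = [
        "def add(a, b):\n    return a + b\n",
        "def add(a, b):\n    return sum((a, b))\n",
        "def add(a, b):\n    return a - b\n",   # incorrect translation
        "def add(a, b):\n    return a + b\n",   # exact duplicate of the first
    ]
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

    references = filter_references(candidates, tests)
    print(len(references))  # prints 2
```

In this sketch the two surviving candidates would both be kept as target references for the same source program, which is the sense in which unit-test filtering can increase variation in target translations.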

Comments:	EMNLP 2023 Findings (with minor updates on the flowcharts)
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as:	arXiv:2311.00317 [cs.CL]
	(or arXiv:2311.00317v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.00317

Submission history

From: Yiqing Xie
[v1] Wed, 1 Nov 2023 06:01:22 UTC (761 KB)
[v2] Fri, 4 Oct 2024 04:16:21 UTC (851 KB)
