Data Augmentation Approaches for Source Code Models

If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.

We group the papers by code authorship attribution, clone detection, defect detection and repair, code summarization, code search, code completion, code translation, code question answering, code classification, method name prediction, and type prediction.

This repository is based on our paper, Source Code Data Augmentation for Deep Learning: A Survey. You can cite it as follows:

@article{zhuo2023source,
      title={Source Code Data Augmentation for Deep Learning: A Survey}, 
      author={Terry Yue Zhuo and Zhou Yang and Zhensu Sun and Yufei Wang and Li Li and Xiaoning Du and Zhenchang Xing and David Lo},
      year={2023},
      eprint={2305.19915},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Authors: Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, David Lo

Note: WIP. More papers from our survey paper will be added to this repo soon. Direct inquiries to terry.zhuo@monash.edu or open an issue here.

Code Authorship Attribution

Paper	Datasets
Natural Attack for Pre-trained Models of Code (ICSE'22)	GCJ
RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding Style Transformation (ICSE'22)	GCJ, GitHub
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23)	GCJ
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23)	GCJ

Clone Detection

Paper	Datasets
Contrastive Code Representation Learning (EMNLP'21)	JavaScript (paper-specific)
Data Augmentation by Program Transformation (JSS'22)	BCB
Natural Attack for Pre-trained Models of Code (ICSE'22)	BigCloneBench
Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings (ICSE'22)	POJ-104, GCJ
Heloc: Hierarchical contrastive learning of source code representation (ICPC'22)	GCJ, OJClone
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22)	BinaryCorp-3M
Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection (ArXiv'22)	POJ-104, Codeforces
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts (ACL'22)	POJ-104, BigCloneBench
ReACC: A retrieval-augmented code completion framework (ACL'22)	CodeNet
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22)	POJ-104
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23)	BigCloneBench
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23)	---
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23)	POJ-104
Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection (ICPC'23)	CLCDSA
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23)	BigCloneBench
A Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning (JoS'23)	POJ-104, BigCloneBench
CONCORD: Clone-aware Contrastive Learning for Source Code (ISSTA'23)	CodeNet (Java), POJ-104
Neuro-symbolic Zero-Shot Code Cloning with Cross-Language Intermediate Representation (ArXiv'23)	CodeNet (C, COBOL)
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23)	BCB

Defect Detection and Repair

Paper	Datasets
Adversarial Examples for Models of Code (OOPSLA'20)	VarMisuse
Self-Supervised Bug Detection and Repair (NeurIPS'21)	RANDOMBUGS, PYPIBUGS
Semantic-Preserving Adversarial Code Comprehension (COLING'22)	Defects4J
Path-sensitive code embedding via contrastive learning for software vulnerability detection (ISSTA'22)	D2A, Fan, Devign
Natural Attack for Pre-trained Models of Code (ICSE'22)	Devign
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22)	SySeVR
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts (ACL'22)	REVEAL, CodeXGLUE
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23)	Refactory, CodRep1
MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation (SANER'23)	Refactory, CodRep1
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23)	Devign
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23)	Devign, CodeChef
MUFIN: Improving Neural Repair Models with Back-Translation (ArXiv'23)	Defects4J (paper-specific), QuixBugs (paper-specific)
Leveraging Causal Inference for Explainable Automatic Program Repair (IJCNN'22)	Defects4J, QuixBugs, BugAID
Deepdebug: Fixing python bugs using stack traces, backtranslation, and code skeletons (ArXiv'21)	paper-specific
Break-It-Fix-It: Unsupervised Learning for Program Repair (ArXiv'21)	paper-specific, DeepFix
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23)	Devign, Bug2Fix
InferFix: End-to-End Program Repair with LLMs over Retrieval-Augmented Prompts (ArXiv'23)	InferredBugs
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair (FSE'23)	TFix, Bug2Fix, Defects4J
Too Few Bug Reports? Exploring Data Augmentation for Improved Changeset-based Bug Localization (ArXiv'23)	Locus data

Code Summarization

Paper	Datasets
Training Deep Code Comment Generation Models via Data Augmentation (Internetware'20)	TL-CodeSum
Retrieval-Based Neural Source Code Summarization (ICSE'20)	PCSD, JCSD
Generating adversarial computer programs using optimized obfuscations (ICLR'21)	Python-150K, Code2Seq Data
Contrastive code representation learning (EMNLP'21)	JavaScript (paper-specific)
A search-based testing framework for deep neural networks of source code embedding (ICST'21)	paper-specific
Retrieval-Augmented Generation for Code Summarization via Hybrid GNN (ICLR'21)	CCSD (paper-specific)
BASHEXPLAINER: Retrieval-Augmented Bash Code Comment Generation based on Fine-tuned CodeBERT (ICSME'22)	BASHEXPLAINER Data
Data Augmentation by Program Transformation (JSS'22)	DeepCom
Adversarial robustness of deep code comment generation (TOSEM'22)	CCSD (paper-specific)
Do Not Have Enough Data? An Easy Data Augmentation for Code Summarization (PAAP'22)	---
Semantic robustness of models of source code (SANER'22)	Python-150K, Code2Seq Data
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22)	CodeSearchNet (Python, Java)
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23)	---
Exploring Data Augmentation for Code Generation Tasks (EACL'23)	CodeSearchNet (CodeXGLUE)
Bash Comment Generation Via Data Augmentation and Semantic-Aware Codebert (ArXiv'23)	BASHEXPLAINER Data
READSUM: Retrieval-Augmented Adaptive Transformer for Source Code Summarization (Access'23)	PCSD
Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization (ArXiv'23)	PCSD, CCSD, DeepCom
Two Birds with One Stone: Boosting Code Generation and Code Search via a Generative Adversarial Network (OOPSLA'23)	CodeSearchNet (Python, Java)
Better Language Models of Code through Self-Improvement (ACL'23)	CodeSearchNet

Code Search

Paper	Datasets
AugmentedCode: Examining the Effects of Natural Language Resources in Code Retrieval Models (ArXiv'21)	CodeSearchNet
CoSQA: 20,000+ web queries for code search and question answering (ACL'21)	CoSQA
A search-based testing framework for deep neural networks of source code embedding (ICST'21)	paper-specific
Semantic-Preserving Adversarial Code Comprehension (COLING'22)	CodeSearchNet
Exploring Representation-Level Augmentation for Code Search (EMNLP'22)	CodeSearchNet
Cross-Modal Contrastive Learning for Code Search (ICSME'22)	AdvTest, CoSQA
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22)	CodeSearchNet
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22)	CodeSearchNet (Python, Java)
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23)	AdvTest, WebQueryTest
CoCoSoDa: Effective Contrastive Learning for Code Search (ICSE'23)	CodeSearchNet
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering (EACL'23)	WebQueryTest
A Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning (JoS'23)	CodeSearchNet
Rethinking Negative Pairs in Code Search (EMNLP'23)	CodeSearchNet
Towards Better Multilingual Code Search through Cross-Lingual Contrastive Learning (Internetware'23)	XLCoST
MCodeSearcher: Multi-View Contrastive Learning for Code Search (Internetware'23)	CodeSearchNet (Python, Java), CoSQA, StaQC, WebQuery
MulCS: Towards a Unified Deep Representation for Multilingual Code Search (SANER'23)	CodeSearchNet (Python, Java), paper-specific
Two Birds with One Stone: Boosting Code Generation and Code Search via a Generative Adversarial Network (OOPSLA'23)	CodeSearchNet (Python, Java)

Code Completion

Paper	Datasets
Generative Code Modeling with Graphs (ICLR'19)	ExprGen Data (paper-specific)
Adversarial Robustness of Program Synthesis Models (AIPLANS'21)	ALGOLISP
ReACC: A retrieval-augmented code completion framework (ACL'22)	PY150 (CodeXGLUE), GitHub Java (CodeXGLUE)
Test-Driven Multi-Task Learning with Functionally Equivalent Code Transformation for Neural Code Generation (ASE'22)	MBPP
How Important are Good Method Names in Neural Code Generation? A Model Robustness Perspective (ArXiv'22)	refined CONCODE, refined PyTorrent
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22)	CodeSearchNet (Python, Java)
ReCode: Robustness Evaluation of Code Generation Models (ACL'23)	HumanEval, MBPP
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23)	---
Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning (ICSE'23)	ATLAS, TFIX
RustGen: An Augmentation Approach for Generating Compilable Rust Code with Large Language Models (DeployableGenerativeAI'23)	paper-specific
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23)	GitHub Java (CodeXGLUE)
Domain Adaptive Code Completion via Language Models and Decoupled Domain Databases (ASE'23)	paper-specific
APICom: Automatic API Completion via Prompt Learning and Adversarial Training-based Data Augmentation (Internetware'23)	paper-specific
Better Language Models of Code through Self-Improvement (ACL'23)	CONCODE

Code Translation

Paper	Datasets
Leveraging Automated Unit Tests for Unsupervised Code Translation (ICLR'23)	paper-specific
Exploring Data Augmentation for Code Generation Tasks (EACL'23)	CodeTrans (CodeXGLUE)
Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages (EACL'23)	Transcoder Data
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23)	CodeTrans (CodeXGLUE)
Code Translation with Compiler Representations (ICLR'23)	Transcoder Data
Data Augmentation for Code Translation with Comparable Corpora and Multiple References (EMNLP'23)	Transcoder Data
Assessing and Improving Syntactic Adversarial Robustness of Pre-trained Models for Code Translation (ArXiv'23)	AVATAR
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23)	Transcoder Data

Code Question Answering

Paper	Datasets
CoSQA: 20,000+ web queries for code search and question answering (ACL'21)	CoSQA
Semantic-Preserving Adversarial Code Comprehension (COLING'22)	CodeQA
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering (EACL'23)	CoSQA
MCodeSearcher: Multi-View Contrastive Learning for Code Search (Internetware'23)	WebQuery (paper-specific)

Code Classification

Paper	Datasets
Generating Adversarial Examples for Holding Robustness of Source Code Processing Models (AAAI'20)	OJ
Generating Adversarial Examples of Source Code Classification Models via Q-Learning-Based Markov Decision Process (QRS'21)	OJ
Heloc: Hierarchical contrastive learning of source code representation (ICPC'22)	GCJ, OJ
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22)	POJ-104 (CodeXGLUE)
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22)	POJ-104
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23)	Java250, Python800
MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation (SANER'23)	Java250, Python800
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23)	GCJ
An Enhanced Data Augmentation Approach to Support Multi-Class Code Readability Classification (SEKE'22)	paper-specific
Improving Multi-Class Code Readability Classification with An Enhanced Data Augmentation Approach (130) (International Journal of Software Engineering and Knowledge Engineering)	paper-specific

Method Name Prediction

Paper	Datasets
Adversarial Examples for Models of Code (OOPSLA'20)	Code2vec
A search-based testing framework for deep neural networks of source code embedding (ICST'21)	paper-specific
On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations (IST'21)	Code2Seq
Data Augmentation by Program Transformation (JSS'22)	Code2vec
Discrete Adversarial Attack to Models of Code (PLDI'23)	Code2vec

Type Prediction

Paper	Datasets
Adversarial Robustness for Code (ICML'21)	DeepTyper
Contrastive code representation learning (EMNLP'21)	DeepTyper
Cross-Lingual Transfer Learning for Statistical Type Inference (ISSTA'22)	DeepTyper, Typilus (Python), CodeSearchNet (Java)

Acknowledgement

We thank Steven Y. Feng et al. for their open-source paper list on DataAug4NLP.
