Data Augmentation Approaches for Source Code Models
If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.
We group the papers by code authorship attribution, clone detection, defect detection and repair, code summarization, code search, code completion, code translation, code question answering, problem classification, method name prediction, and type prediction.
This repository is based on our paper, Source Code Data Augmentation for Deep Learning: A Survey. You can cite it as follows:
@article{zhuo2023source,
title={Source Code Data Augmentation for Deep Learning: A Survey},
author={Terry Yue Zhuo and Zhou Yang and Zhensu Sun and Yufei Wang and Li Li and Xiaoning Du and Zhenchang Xing and David Lo},
year={2023},
eprint={2305.19915},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Authors: Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, David Lo
Note: WIP. More papers will be added from our survey paper to this repo soon.
Direct inquiries to terry.zhuo@monash.edu or open an issue here.
Code Authorship Attribution
Paper
Evaluation Datasets
Natural Attack for Pre-trained Models of Code (ICSE'22 )
GCJ
RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding Style Transformation (ICSE'22 )
GCJ, GitHub
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23)
GCJ
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23)
GCJ
Clone Detection
Paper
Datasets
Contrastive Code Representation Learning (EMNLP'21)
JavaScript (paper-specific)
Data Augmentation by Program Transformation (JSS'22 )
BigCloneBench
Natural Attack for Pre-trained Models of Code (ICSE'22 )
BigCloneBench
Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings (ICSE'22 )
POJ-104, GCJ
HELoC: Hierarchical Contrastive Learning of Source Code Representation (ICPC'22)
GCJ, OJClone
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22 )
BinaryCorp-3M
Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection (ArXiv'22 )
POJ-104, Codeforces
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts (ACL'22 )
POJ-104, BigCloneBench
ReACC: A retrieval-augmented code completion framework (ACL'22 )
CodeNet
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22 )
POJ-104
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23 )
BigCloneBench
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23)
---
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23)
POJ-104
Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection (ICPC'23 )
CLCDSA
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23)
BigCloneBench
A Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning (JoS'23 )
POJ-104, BigCloneBench
CONCORD: Clone-aware Contrastive Learning for Source Code (ISSTA'23 )
CodeNet (Java), POJ-104
Neuro-symbolic Zero-Shot Code Cloning with Cross-Language Intermediate Representation (ArXiv'23 )
CodeNet (C, COBOL)
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23 )
BigCloneBench
Defect Detection and Repair
Paper
Datasets
Adversarial Examples for Models of Code (OOPSLA'20 )
VarMisuse
Self-Supervised Bug Detection and Repair (NeurIPS'21 )
RANDOMBUGS, PYPIBUGS
Semantic-Preserving Adversarial Code Comprehension (COLING'22 )
Defects4J
Path-sensitive code embedding via contrastive learning for software vulnerability detection (ISSTA'22 )
D2A, Fan, Devign
Natural Attack for Pre-trained Models of Code (ICSE'22 )
Devign
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22 )
SySeVR
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts (ACL'22 )
REVEAL, CodeXGLUE
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23 )
Refactory, CodRep1
MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation (SANER'23 )
Refactory, CodRep1
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23 )
Devign
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23 )
Devign, CodeChef
MUFIN: Improving Neural Repair Models with Back-Translation (ArXiv'23 )
Defects4J (paper-specific), QuixBugs (paper-specific)
Leveraging Causal Inference for Explainable Automatic Program Repair (IJCNN'22 )
Defects4J, QuixBugs, BugAID
Deepdebug: Fixing python bugs using stack traces, backtranslation, and code skeletons (ArXiv'21 )
paper-specific
Break-It-Fix-It: Unsupervised Learning for Program Repair (ArXiv'21 )
paper-specific, DeepFix
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23 )
Devign, Bug2Fix
InferFix: End-to-End Program Repair with LLMs over Retrieval-Augmented Prompts (ArXiv'23 )
InferredBugs
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair (FSE'23 )
TFix, Bug2Fix, Defects4J
Too Few Bug Reports? Exploring Data Augmentation for Improved Changeset-based Bug Localization (ArXiv'23 )
Locus data
Code Summarization
Paper
Datasets
Training Deep Code Comment Generation Models via Data Augmentation (Internetware'20 )
TL-CodeSum
Retrieval-Based Neural Source Code Summarization (ICSE'20 )
PCSD, JCSD
Generating adversarial computer programs using optimized obfuscations (ICLR'21 )
Python-150K, Code2Seq Data
Contrastive code representation learning (EMNLP'21 )
JavaScript (paper-specific)
A search-based testing framework for deep neural networks of source code embedding (ICST'21 )
paper-specific
Retrieval-Augmented Generation for Code Summarization via Hybrid GNN (ICLR'21 )
CCSD (paper-specific)
BASHEXPLAINER: Retrieval-Augmented Bash Code Comment Generation based on Fine-tuned CodeBERT (ICSME'22)
BASHEXPLAINER Data
Data Augmentation by Program Transformation (JSS'22 )
DeepCom
Adversarial robustness of deep code comment generation (TOSEM'22 )
CCSD (paper-specific)
Do Not Have Enough Data? An Easy Data Augmentation for Code Summarization (PAAP'22 )
---
Semantic robustness of models of source code (SANER'22 )
Python-150K, Code2Seq Data
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22 )
CodeSearchNet (Python, Java)
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23 )
---
Exploring Data Augmentation for Code Generation Tasks (EACL'23 )
CodeSearchNet (CodeXGLUE)
Bash Comment Generation Via Data Augmentation and Semantic-Aware Codebert (ArXiv'23 )
BASHEXPLAINER Data
READSUM: Retrieval-Augmented Adaptive Transformer for Source Code Summarization (Access'23 )
PCSD
Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization (ArXiv'23 )
PCSD, CCSD, DeepCom
Two Birds with One Stone: Boosting Code Generation and Code Search via a Generative Adversarial Network (OOPSLA'23 )
CodeSearchNet (Python, Java)
Better Language Models of Code through Self-Improvement (ACL'23 )
CodeSearchNet
Code Search
Paper
Datasets
AugmentedCode: Examining the Effects of Natural Language Resources in Code Retrieval Models (ArXiv'21 )
CodeSearchNet
CoSQA: 20,000+ Web Queries for Code Search and Question Answering (ACL'21)
CoSQA
A search-based testing framework for deep neural networks of source code embedding (ICST'21 )
paper-specific
Semantic-Preserving Adversarial Code Comprehension (COLING'22 )
CodeSearchNet
Exploring Representation-Level Augmentation for Code Search (EMNLP'22 )
CodeSearchNet
Cross-Modal Contrastive Learning for Code Search (ICSME'22 )
AdvTest, CoSQA
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22 )
CodeSearchNet
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22 )
CodeSearchNet (Python, Java)
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23 )
AdvTest, WebQueryTest
CoCoSoDa: Effective Contrastive Learning for Code Search (ICSE'23 )
CodeSearchNet
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering (EACL'23 )
WebQueryTest
A Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning (JoS'23 )
CodeSearchNet
Rethinking Negative Pairs in Code Search (EMNLP'23 )
CodeSearchNet
Towards Better Multilingual Code Search through Cross-Lingual Contrastive Learning (Internetware'23 )
XLCoST
MCodeSearcher: Multi-View Contrastive Learning for Code Search (Internetware'23 )
CodeSearchNet (Python, Java), CoSQA, StaQC, WebQuery
MulCS: Towards a Unified Deep Representation for Multilingual Code Search (SANER'23 )
CodeSearchNet (Python, Java), paper-specific
Two Birds with One Stone: Boosting Code Generation and Code Search via a Generative Adversarial Network (OOPSLA'23 )
CodeSearchNet (Python, Java)
Code Completion
Paper
Datasets
Generative Code Modeling with Graphs (ICLR'19 )
ExprGen Data (paper-specific)
Adversarial Robustness of Program Synthesis Models (AIPLANS'21 )
ALGOLISP
ReACC: A retrieval-augmented code completion framework (ACL'22 )
PY150 (CodeXGLUE), GitHub Java (CodeXGLUE)
Test-Driven Multi-Task Learning with Functionally Equivalent Code Transformation for Neural Code Generation (ASE'22 )
MBPP
How Important are Good Method Names in Neural Code Generation? A Model Robustness Perspective (ArXiv'22 )
refined CONCODE, refined PyTorrent
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22 )
CodeSearchNet (Python, Java)
ReCode: Robustness Evaluation of Code Generation Models (ACL'23 )
HumanEval, MBPP
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23 )
---
Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning (ICSE'23 )
ATLAS, TFIX
RustGen: An Augmentation Approach for Generating Compilable Rust Code with Large Language Models (DeployableGenerativeAI'23 )
paper-specific
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23 )
GitHub Java (CodeXGLUE)
Domain Adaptive Code Completion via Language Models and Decoupled Domain Databases (ASE'23 )
paper-specific
APICom: Automatic API Completion via Prompt Learning and Adversarial Training-based Data Augmentation (Internetware'23 )
paper-specific
Better Language Models of Code through Self-Improvement (ACL'23 )
CONCODE
Code Translation
Paper
Datasets
Leveraging Automated Unit Tests for Unsupervised Code Translation (ICLR'23 )
paper-specific
Exploring Data Augmentation for Code Generation Tasks (EACL'23 )
CodeTrans (CodeXGLUE)
Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages (EACL'23 )
Transcoder Data
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23 )
CodeTrans (CodeXGLUE)
Code Translation with Compiler Representations (ICLR'23 )
Transcoder Data
Data Augmentation for Code Translation with Comparable Corpora and Multiple References (EMNLP'23 )
Transcoder Data
Assessing and Improving Syntactic Adversarial Robustness of Pre-trained Models for Code Translation (ArXiv'23 )
AVATAR
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23 )
Transcoder Data
Code Question Answering
Paper
Datasets
CoSQA: 20,000+ Web Queries for Code Search and Question Answering (ACL'21)
CoSQA
Semantic-Preserving Adversarial Code Comprehension (COLING'22 )
CodeQA
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering (EACL'23 )
CoSQA
MCodeSearcher: Multi-View Contrastive Learning for Code Search (Internetware'23 )
WebQuery (paper-specific)
Problem Classification
Paper
Datasets
Generating Adversarial Examples for Holding Robustness of Source Code Processing Models (AAAI'20 )
OJ
Generating Adversarial Examples of Source Code Classification Models via Q-Learning-Based Markov Decision Process (QRS'21 )
OJ
HELoC: Hierarchical Contrastive Learning of Source Code Representation (ICPC'22)
GCJ, OJ
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22 )
POJ-104 (CodeXGLUE)
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22 )
POJ-104
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23 )
Java250, Python800
MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation (SANER'23 )
Java250, Python800
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23 )
GCJ
An Enhanced Data Augmentation Approach to Support Multi-Class Code Readability Classification (SEKE'22 )
paper-specific
Improving Multi-Class Code Readability Classification with An Enhanced Data Augmentation Approach (130) (International Journal of Software Engineering and Knowledge Engineering )
paper-specific
Method Name Prediction
Paper
Datasets
Adversarial Examples for Models of Code (OOPSLA'20 )
Code2vec
A search-based testing framework for deep neural networks of source code embedding (ICST'21 )
paper-specific
On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations (IST'21 )
Code2Seq
Data Augmentation by Program Transformation (JSS'22 )
Code2vec
Discrete Adversarial Attack to Models of Code (PLDI'23 )
Code2vec
Type Prediction
Paper
Datasets
Adversarial Robustness for Code (ICML'21 )
DeepTyper
Contrastive code representation learning (EMNLP'21 )
DeepTyper
Cross-Lingual Transfer Learning for Statistical Type Inference (ISSTA'22 )
DeepTyper, Typilus (Python), CodeSearchNet (Java)
We thank Steven Y. Feng et al. for their open-source paper list on DataAug4NLP.