CoLLaMA: A Multilingual Instruction Dataset and Large Language Model for Code

This is the repository for the CoLLaMA project, which aims to build a multilingual (Chinese and English) instruction tuning dataset and large language model for coding tasks.

Overview

Existing code instruction datasets, which are essential for instruction tuning, are often disorganized, monolingual, and limited to a single programming language, and they cover too narrow a range of tasks. Open-source datasets for instruction tuning on coding tasks are also scarce.

To this end, we propose this project, which offers the following advantages:

Multilingual dataset: Our dataset incorporates code samples from many programming languages, including Java, Python, C, C#, Go, PHP, JavaScript, and Ruby, among others. It also presents code instructions in both Chinese and English, enabling the model to learn across a variety of programming languages and natural languages, thereby enhancing its generalization ability.
Task diversity: The dataset spans a broad range of coding tasks, such as code summarization, code generation, code search, and others. It incorporates tasks with varying complexities and requirements, from beginner to advanced levels. This comprehensive approach ensures our instructions can handle different types of coding tasks and covers a broad spectrum of programming skills and knowledge.
Multi-programming paradigms: The project includes code examples from different programming paradigms, such as procedural, object-oriented, functional, and event-driven programming. This wide coverage provides the instruction-tuning model with a varied set of coding tasks to learn from and generate instructions for.
Real-world code examples: The dataset incorporates code snippets or excerpts from actual projects and forums such as Stack Overflow and GitHub, to present more realistic and practical coding tasks. This helps the instruction-tuned model generate responses applicable to real-world scenarios.
Quality assurance: We are committed to providing an accurate and high-quality dataset for each coding task. For instance, the instruction dataset for code search, extracted from programming posts on the Stack Overflow Q&A site, is rigorously filtered and cleaned to ensure its usability in real Q&A applications.

The repository contains the following:

The MulCo dataset used for fine-tuning the model
The code for fine-tuning the model
The model weights
The code for evaluation

Dataset release

data/MID_train_EN_data and data/MID_train_CN_data contain around 120k instruction-following samples used for fine-tuning the CoLLaMA model. Each file is a list of dictionaries, and each dictionary contains the following fields:

instruction: describes the task that the model should perform.
input: optional code or context for the task. For example, if the instruction is 'Please summarize this PHP code.', the input is the PHP code.
output: the answer to the instruction.

All data in our collection is formatted into the same templates, where each sample is as follows:

[
{"instruction":  `string`,
"input":  `string`, # (may be empty)
"output": `string`}
]
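
As a minimal sketch of how the template above could be checked before training, the following helper verifies that every record carries the three string fields and that only input is empty. The function name validate_records and the sample records are illustrative assumptions, not part of this repository.

```python
REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_records(records):
    """Return a list of (index, problem) tuples; an empty list means every record is well formed."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "not a dictionary"))
            continue
        # All three fields must be present and must be strings.
        for field in REQUIRED_FIELDS:
            if field not in rec:
                problems.append((i, f"missing field: {field}"))
            elif not isinstance(rec[field], str):
                problems.append((i, f"field is not a string: {field}"))
        # Only `input` is allowed to be empty.
        for field in ("instruction", "output"):
            if isinstance(rec.get(field), str) and not rec[field].strip():
                problems.append((i, f"empty field: {field}"))
    return problems


# Hypothetical records that follow the template above.
sample = [
    {
        "instruction": "Please summarize this PHP code.",
        "input": "<?php echo strlen('abc'); ?>",
        "output": "Prints the length of the string 'abc'.",
    },
    {
        "instruction": "Write a Python function that adds two numbers.",
        "input": "",
        "output": "def add(a, b):\n    return a + b",
    },
]

print(validate_records(sample))
```

Running a check like this on each data file before fine-tuning surfaces malformed samples early instead of partway through training.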

Depending on the code task, we choose which fields are generated by gpt-3.5-turbo and which are written by humans. Unlike data generated with the self-instruct technique, most of the code in our data comes from the real world, whereas most of the instructions are generated by gpt-3.5-turbo. The data processing pipeline is described in detail in the next section.

Dataset Collection & Processing

MulCo comprises 8 datasets for 8 diverse code tasks, covering the following scenarios:

code generation: Generate the corresponding code from the user's natural-language input.
code summarization: Generate concise, human-readable summaries or explanations of source code, whether code snippets, functions, or entire programs.
code search
- code-to-code
  - clone detection: Given a piece of code, find another piece of code that is semantically related to it.
  - defect detection: Given a piece of source code, identify the specific defect it contains. This includes common errors such as null pointer dereferences, out-of-bounds array accesses, memory leaks, etc.
  - code completion (line level): Complete the unfinished line given the previous context.
  - code repair: It aims to automatically fix bugs in the code.
  - code translation: Code translation refers to the process of converting source code from one programming language to another. It involves transforming the syntax, structure, and semantics of the original code while preserving its functionality and behavior.
- query-to-code: Given a natural language query and multiple code snippets, find the source code whose function matches the natural language query.
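
To illustrate how a raw example for one of these tasks could be mapped onto the instruction/input/output format, here is a minimal sketch. The helper names and the code generation instruction wording are hypothetical; only the summarization instruction follows the example given in the dataset description, and the actual instructions in MulCo are mostly generated by gpt-3.5-turbo.

```python
def make_summarization_sample(code, summary, language):
    """Wrap a (code, summary) pair as a code summarization record."""
    return {
        "instruction": f"Please summarize this {language} code.",
        "input": code,
        "output": summary,
    }


def make_generation_sample(description, code, language):
    """Wrap a (description, code) pair as a code generation record.

    The instruction wording here is an assumption made for illustration.
    """
    return {
        "instruction": f"Write {language} code for the following task: {description}",
        "input": "",  # code generation needs no extra context in this sketch
        "output": code,
    }


summ = make_summarization_sample(
    "def add(a, b):\n    return a + b", "Returns the sum of two numbers.", "Python"
)
gen = make_generation_sample(
    "reverse a string", "def rev(s):\n    return s[::-1]", "Python"
)
print(summ["instruction"])
print(gen["instruction"])
```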

A brief summary of MulCo is given below:

Task	Category	Subtask	Source dataset	Num	Lang	Programming languages
Code summarization			CodeSearchNet	10k	EN	Go,Java,JavaScript,PHP,Python,Ruby
Code summarization			CodeSearchNet	10k	CN	Go,Java,JavaScript,PHP,Python,Ruby
Code generation			CodeSearchNet	10k	EN	Go,Java,JavaScript,PHP,Python,Ruby
Code generation			codealpaca	20k	EN	C++,C,Java,JavaScript,PHP,Python,SQL etc.
Code generation			CodeGPT	20k	CN	C#,C,C++,Go,Java,JavaScript,PHP,Python,Ruby
Code generation			CodeSearchNet	5k	CN	Go,Java,JavaScript,PHP,Python,Ruby
Code search	code-to-code	Clone detection	BigCloneBench	10k	EN	Java
Code search	code-to-code	Defect detection	Devign	5k	EN	C
Code search	code-to-code	Code completion (line level)	CodeSearchNet	5k	EN	Go,Java,JavaScript,PHP,Python,Ruby
Code search	code-to-code	Code repair	Bug2Fix	5k	EN	Java
Code search	code-to-code	Code translation	CodeTrans	5k	EN	Java,C#
Code search	query-to-code		CodePro	10k	EN	Python,SQL
Code search	query-to-code		CodePro	5k	CN	Python,SQL

We mainly obtained the source data from CodeSearchNet, CodeXGLUE, CodeGPT, codealpaca, and CodePro, processed them into the datasets listed above, and consolidated them into one dataset.

Finetuning

So far, considering the interference between different task data, we use only the code generation, code summarization, code completion, and code query datasets for fine-tuning. We also added the codealpaca dataset. After filtering, this gives a total of 52K samples.
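
A minimal sketch of this kind of subset selection and filtering is shown below. The filtering criterion used for the 52K set is not stated here, so the character-length cutoff, the task keys, and the function name build_training_set are all assumptions made for illustration.

```python
def build_training_set(task_datasets, selected_tasks, max_chars):
    """Merge the selected task subsets and drop samples longer than max_chars.

    task_datasets maps a task name to a list of instruction records.
    The length criterion is an assumed stand-in for the real filter.
    """
    merged = []
    for task in selected_tasks:
        for rec in task_datasets.get(task, []):
            total = len(rec["instruction"]) + len(rec["input"]) + len(rec["output"])
            if total <= max_chars:
                merged.append(rec)
    return merged


# Toy data with hypothetical task keys.
task_datasets = {
    "code_generation": [
        {"instruction": "Write a function.", "input": "", "output": "def f(): pass"}
    ],
    "code_repair": [
        {"instruction": "Fix the bug.", "input": "x = ;", "output": "x = 0"}
    ],
    "code_summarization": [
        {"instruction": "Summarize.", "input": "a" * 5000, "output": "Long code."}
    ],
}

selected = ["code_generation", "code_summarization"]
train = build_training_set(task_datasets, selected, max_chars=2048)
print(len(train))
```

In this toy run the code repair sample is excluded because its task is not selected, and the summarization sample is dropped by the length cutoff.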

The fine-tuning process largely follows Firefly.

To reproduce a fine-tuned version of QWen, please follow the steps below.

To fine-tune the QWen-7b model efficiently, we used QLoRA and trained on a single A100 80GB GPU. You will need to adjust the training parameters according to your own GPUs and dataset.

Before fine-tuning, first make sure to install all requirements using:

pip install -r requirements.txt

Below is the command to fine-tune QWen-7B using our dataset combined with QLoRA technology on an "A100 80G" GPU machine.

torchrun --nproc_per_node=1 --master_port='29502' train_qlora.py --train_args_file QWen-7b-sft-lora.json

The main fine-tuning parameters are as follows:

"train_file": "/data/MID_train_512_EN_52K.jsonl",
"num_train_epochs": 1,
"per_device_train_batch_size": 6,
"gradient_accumulation_steps": 2,
"learning_rate": 1e-5,
"max_seq_length": 512,
"logging_steps": 50,
"save_steps": 500,
"save_total_limit": 1,
"lr_scheduler_type": "constant_with_warmup",
"warmup_steps": 500,
"lora_rank": 64,
"lora_alpha": 16,
"lora_dropout": 0.05,
"gradient_checkpointing": true,
"optim": "paged_adamw_32bit",
"fp16": true,
"dataloader_num_workers": 0,
"save_strategy": "steps",
"weight_decay": 0,
"max_grad_norm": 0.3

You can replace train_file with your own dataset.
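
To make the interaction of these parameters concrete, the sketch below derives the effective batch size, the approximate number of optimizer steps, and the LoRA scaling factor. The sample count of 52,000 is a round-number assumption based on the "52K" figure, the scaling factor assumes the standard LoRA convention of alpha divided by rank, and the exact step count depends on the real dataset size and on how the trainer rounds.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, num_gpus,
                      epochs, warmup_steps, lora_alpha, lora_rank):
    """Derive basic schedule figures from the fine-tuning parameters."""
    # Samples consumed per optimizer update across all devices.
    effective_batch = per_device_batch * grad_accum * num_gpus
    # Approximate number of optimizer updates over the whole run.
    optimizer_steps = math.ceil(num_samples / effective_batch) * epochs
    return {
        "effective_batch_size": effective_batch,
        "optimizer_steps": optimizer_steps,
        "warmup_fraction": warmup_steps / optimizer_steps,
        # Standard LoRA scales the low-rank update by alpha / rank.
        "lora_scaling": lora_alpha / lora_rank,
    }


schedule = training_schedule(
    num_samples=52_000,      # assumed from the "52K" figure
    per_device_batch=6,      # per_device_train_batch_size
    grad_accum=2,            # gradient_accumulation_steps
    num_gpus=1,              # --nproc_per_node=1
    epochs=1,                # num_train_epochs
    warmup_steps=500,
    lora_alpha=16,
    lora_rank=64,
)
for name, value in schedule.items():
    print(name, value)
```

Under these assumptions each optimizer update sees 12 samples, so the 500 warmup steps cover a bit more than a tenth of the single epoch. If you change the batch size or dataset size, warmup_steps and save_steps may need to be scaled accordingly.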

The fine-tuning command above saves only the adapter weights and configuration file, so the adapter weights still need to be merged with the base model. See merge_lora.py for the merge script.
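
For orientation, merging a LoRA adapter into its base model with the peft and transformers libraries typically looks like the sketch below. This is not the repository's merge_lora.py; the function name and its path arguments are placeholders, and the float16 dtype is an assumption.

```python
def merge_adapter(base_model_path, adapter_path, output_path):
    """Merge LoRA adapter weights into the base model and save the result.

    Imports are local so that this sketch can be loaded without the heavy
    dependencies installed; calling it requires torch, transformers and peft.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model in half precision (an assumed choice).
    base = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype=torch.float16,
        trust_remote_code=True,  # QWen checkpoints ship custom model code
    )
    # Attach the adapter, then fold its weights into the base weights.
    model = PeftModel.from_pretrained(base, adapter_path)
    merged = model.merge_and_unload()
    merged.save_pretrained(output_path)

    # Save the tokenizer alongside so the output directory is self-contained.
    tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
    tokenizer.save_pretrained(output_path)
```

After merging, the output directory can be loaded like any ordinary checkpoint, without peft.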

Evaluation (TODO)

After fine-tuning on 52K samples with QLoRA, we evaluated the model on HumanEval. The results are as follows:

Model	Dataset	Epoch	Max length	pass@1	pass@10
QWen-7b				0.2478	0.3836
	Summary+generation+completion+codealpaca(45k)	1	1024	0.2567	0.3414
	Summary+generation+completion+codealpaca (43k)	1	512	0.2658	0.3902
	Summary+generation+completion+codealpaca+query(52k)	1	512	0.2744	0.3902

We used about 26.6M tokens for instruction fine-tuning. In the recently released Code Llama model family, all models are initialized with Llama 2 weights and trained on 500B tokens from a code-heavy dataset; among them, CodeLlama-7b scores 29.98 on HumanEval. In contrast, we used only 26.6M tokens for training, yet our model reaches 27.44 (the pass@1 of 0.2744 above, expressed in percent). This is an encouraging result. Since we currently use only part of the data, we will release evaluation results for fine-tuning on the full dataset in the future.

Citation

Gang Hu¹ Xi Wen¹ Xin Liu¹ Jimin Huang² Qianqian Xie*³

¹School of Information Science & Engineering, Yunnan University ²ChanceFocus AMC ³School of Computer Science, Wuhan University

@misc{Hu2023CoLLaMA,
      title={CoLLaMA: A Multilingual Instruction Dataset and Large Language Model for Code}, 
      author={Gang Hu and Xi Wen and Xin Liu and Jimin Huang and Qianqian Xie},
      year={2023},
}

Apache License 2.0