WasmEdge / WasmEdge

WasmEdge is a lightweight, high-performance, and extensible WebAssembly runtime for cloud native, edge, and decentralized applications. It powers serverless apps, embedded functions, microservices, smart contracts, and IoT devices.

Home Page: https://WasmEdge.org

LFX Mentorship (Jun-Aug, 2024): Finetune LLM models for Rust coding assistance

juntao opened this issue

Summary

WasmEdge is a lightweight inference runtime for AI and LLM applications. We want to build specialized, finetuned models for the WasmEdge community. The models should be supported by WasmEdge, and their applications should benefit the WasmEdge community.

In this project, we will build and compare two finetuned models for Rust coding assistance.

  • A code review model. It aims to be a new backend for the PR review bot we currently use in the community.
  • A QA model. It should be able to answer user questions about the Rust language and provide explanations. Our goal is to provide an alternative to our Learn Rust app.

Details

Objective 1: Code review model

Create a dataset with the following two fields

We are looking for at least 200 Q-and-A pairs. The total length of each pair should be less than 3,000 words.

Q: a code segment
A: explanation / review of the code

The QA pairs could come from Rust documentation such as Rust by Example and The Rust Programming Language.

Assemble the dataset into the llama3 chat template

It is similar to the following. Each entry should be on a single line, with linebreaks denoted as \n; a sketch of this assembly step follows the template.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a reviewer of Rust source code.<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ a code segment }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ explanation / review of the code }}<|eot_id|>
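
For illustration, a minimal Rust sketch of this assembly step is below; the helper name, sample inputs, and I/O are assumptions, not part of the project spec:

```rust
/// Minimal sketch (not project code): fold one (code, review) pair into a
/// single-line llama3 chat-template entry matching the template above.
fn to_llama3_entry(code: &str, review: &str) -> String {
    let entry = format!(
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n\
         You are a reviewer of Rust source code.<|eot_id|>\
         <|start_header_id|>user<|end_header_id|>\n\n\
         {code}<|eot_id|>\
         <|start_header_id|>assistant<|end_header_id|>\n\n\
         {review}<|eot_id|>"
    );
    // Keep each training example on one line: encode real linebreaks as \n.
    entry.replace('\n', "\\n")
}

fn main() {
    let code = r#"fn main() { println!("hello"); }"#;
    let review = "Compiles fine; consider returning a Result from fallible helpers.";
    println!("{}", to_llama3_entry(code, review));
}
```

Writing one such line per pair to a plain text file yields a dataset in the single-line format described above.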

Finetune

We will finetune based on the llama3-8b-instruct model.

You are free to use any finetuning tools, but if you are unsure, we recommend llama.cpp's finetune utility. See an example. It can run on CPUs. We will provide the computing resources required for the finetuning.
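
As a rough sketch only, an invocation in the style of the llama.cpp finetune example might look like the following; the flag names are taken from that example's README and can differ across llama.cpp versions, and every file name here is a placeholder:

```sh
# Sketch: flags follow the llama.cpp finetune example README and may
# differ in your checkout; all file names are placeholders.
./finetune \
  --model-base llama-3-8b-instruct-q8_0.gguf \
  --train-data rust-review-dataset.txt \
  --lora-out lora-rust-review.bin \
  --threads 8 --ctx 3072 \
  --use-checkpointing
```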

Objective 2: Code QA model

Create a dataset with the following three fields

We are looking for at least 100 chapter + Q + A rows.

C: A chapter from a Rust book
Q: A question related to the chapter
A: Explanation / answer for the question

You could use ChatGPT to generate these questions and answers based on the chapter content.
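
As one possible (entirely optional) approach, each chapter could be sent to the OpenAI chat completions endpoint to draft candidate Q-and-A pairs. The sketch below assumes the reqwest (with blocking and json features) and serde_json crates, an OPENAI_API_KEY environment variable, and an illustrative prompt and model name:

```rust
use std::env;

/// Sketch: ask a ChatGPT model to draft Q-and-A pairs for one chapter.
/// Prompt wording, model name, and error handling are assumptions.
fn draft_qa(chapter: &str) -> Result<String, Box<dyn std::error::Error>> {
    let key = env::var("OPENAI_API_KEY")?;
    let body = serde_json::json!({
        "model": "gpt-3.5-turbo",
        "messages": [
            { "role": "system",
              "content": "Write 5 question-and-answer pairs about the given Rust book chapter." },
            { "role": "user", "content": chapter }
        ]
    });
    let resp = reqwest::blocking::Client::new()
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(key)
        .json(&body)
        .send()?
        .error_for_status()?;
    // Raw JSON response; parse out choices[0].message.content as needed.
    Ok(resp.text()?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let chapter = "Chapter 4 introduces ownership, moves, and borrowing...";
    println!("{}", draft_qa(chapter)?);
    Ok(())
}
```

Generated pairs should of course be reviewed by hand before they enter the dataset.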

Assemble the dataset into the llama3 chat template

It is similar to the following. Each entry should be on a single line, with linebreaks denoted as \n; a sketch for this three-field variant follows the template.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert of the Rust language. Please answer the question based on the context below.\n---------\n{{ book chapter }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ a question related to the chapter }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ explanation / answer for the question }}<|eot_id|>
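
Analogous to the Objective 1 sketch, a (chapter, question, answer) row could be folded into one line as below; again, names and sample inputs are illustrative only:

```rust
/// Sketch: fold one (chapter, question, answer) row into a single-line
/// llama3 entry, with the chapter embedded in the system prompt.
fn to_llama3_qa_entry(chapter: &str, question: &str, answer: &str) -> String {
    format!(
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n\
         You are an expert of the Rust language. Please answer the question \
         based on the context below.\n---------\n{chapter}<|eot_id|>\
         <|start_header_id|>user<|end_header_id|>\n\n\
         {question}<|eot_id|>\
         <|start_header_id|>assistant<|end_header_id|>\n\n\
         {answer}<|eot_id|>"
    )
    // Keep each training example on one line: encode real linebreaks as \n.
    .replace('\n', "\\n")
}

fn main() {
    let entry = to_llama3_qa_entry(
        "Ownership is Rust's central memory-management concept...",
        "What happens to a String when it is assigned to another variable?",
        "The value is moved; the original variable can no longer be used.",
    );
    println!("{entry}");
}
```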

Finetune

Due to the chapter-long context in this dataset, we will finetune based on a llama3-8b-instruct variant with a 262k context length.

You are free to use any finetuning tools, but if you are unsure, we recommend llama.cpp's finetune utility. See an example. It can run on CPUs. We will provide the computing resources required for the finetuning. The invocation sketch under Objective 1 applies here as well, with --model-base pointed at the long-context GGUF.

Objective 3: Compare the two finetuned models

Start the finetuned models using the LlamaEdge API server, and test them on commonly used scenarios.
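
For illustration, serving a finetuned GGUF with the LlamaEdge API server and probing its OpenAI-compatible endpoint might look like the sketch below; the flags follow the LlamaEdge docs but should be checked against your llama-api-server.wasm version, and the file, model, and port values are placeholders:

```sh
# Sketch: serve the finetuned model with LlamaEdge. Flags per the
# LlamaEdge docs; file and model names are placeholders.
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:llama-3-8b-rust-review.gguf \
  llama-api-server.wasm \
  --prompt-template llama-3-chat \
  --model-name rust-review

# Probe the OpenAI-compatible chat endpoint (default port 8080).
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"rust-review","messages":[{"role":"user","content":"Review this Rust: fn main() {}"}]}'
```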

LFX

Expected outcome: Two finetuned models based on Llama3-8b for Rust code review and QA.

Recommended skills:

  • Rust language
  • ChatGPT and Claude
  • LlamaEdge
  • llama.cpp

Mentor:

Appendix

Hi @juntao!
I am Dhruv, currently pursuing my undergrad in CS at IIT Mandi.
I am deeply interested in this mentorship project, and I have some experience both in creating datasets for finetuning LLMs and in finetuning them. Recently I did a project where I created a medical conversation dataset from the DDXPlus dataset using GPT-3.5, then finetuned MedLlama2 (a Llama2 finetuned on the MedQA dataset) on a single RTX 3090 using QLoRA, followed by quantization with llama.cpp.

Could you please let me know if there's a pretest or any other steps I should take to participate?

@juntao
I am Debrup, pursuing Engineering at BITS Pilani, and a contributor to Keras. I like to solve NLP problems and have worked extensively with Hugging Face (transformers, PEFT, datasets), LangChain, LlamaIndex (data ingestion and indexing), knowledge graphs (Neo4j), RAG, and Rust.
I was also selected for GSoC for a similar project: building a coding assistant using knowledge graphs that assists with QA and summarization tasks for any GitHub repo.
I find this project interesting and similar, and I want to contribute and take part in the LFX Mentorship program.

@juntao Is there any community (Discord, Slack, etc.) where we contributors can interact? What are the timelines and selection criteria for this project? I am really excited about it.
If possible, could you share a cover letter from the previous term? I wanted to get an idea of its length and content.