Futuremine97 / Llama2-finetuning

LLM finetuning


Hello World.

I'd be happy to introduce the first LLM (Large Language Model) fine-tuning project at WIZnet, which I carried out by myself.

Hi! I'm Sunghoon Park.

As an AI prompt engineer, I have fine-tuned the Llama2-7b-chat model on a conversation dataset.

But the model is limited by Colab Pro+ resources, so don't worry about "Skynet" from "Terminator".

I completed this project by myself and handed it over to Benjamin and Simon.

I followed this repository:

https://github.com/Lightning-AI/lit-gpt

Why Llama2?

This model is available for commercial use.

********** Note **********

For security reasons, the checkpoint files are not shared, and neither is the dataset,

but I will give advice on how to fine-tune the Llama2 pretrained model.


My RTX 4070 setup information is here:

Driver version: 535.98, CUDA version: 12.2

The checkpoint file is too big to run inference on my local GPU, an RTX 4070.

For this reason, I quantized the checkpoint file with the bnb.nf4 quantization method

and ran inference on my local RTX 4070, which has 12GB of VRAM (the model was fine-tuned with the LoRA tuning strategy).

bnb stands for bitsandbytes. It allows us to load the LLM in 4 bits; this roughly divides the memory used by 4 and lets us load the model on smaller devices.
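As a rough sketch, this is how the quantized LoRA inference script can be launched from Python with subprocess. The flag names (--quantize, --lora_path, --checkpoint_dir, --max_new_tokens) follow the lit-gpt version I used and may differ in yours, and all paths are placeholders:

==============================================================================

import subprocess

# Run lit-gpt's LoRA generation script with 4-bit NF4 quantization.
# All paths below are placeholders; adjust them to your own checkout.
cmd = [
    "python", "generate/lora.py",
    "--prompt", "Write a FIFO module in Verilog.",
    "--checkpoint_dir", "checkpoints/meta-llama/Llama-2-7b-chat-hf",
    "--lora_path", "out/lora/llama2-7b-chat/lit_model_lora_finetuned.pth",
    "--quantize", "bnb.nf4",
    "--max_new_tokens", "512",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)

==============================================================================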

Why does the LoRA fine-tuned model behave like the full model and still have a small checkpoint file size?

This is because LoRA decomposes the weight update matrix into two low-rank matrices and stores only those.
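To see why the low-rank decomposition is so small, here is a quick back-of-the-envelope calculation. The 4096x4096 layer size and rank r = 8 are just illustrative numbers, not the exact values I used:

==============================================================================

# Full weight update for one 4096 x 4096 projection matrix:
d = 4096
full_update = d * d                      # 16,777,216 parameters

# LoRA replaces it with B (d x r) @ A (r x d), where r << d:
r = 8
lora_update = d * r + r * d              # 65,536 parameters

print(f"full: {full_update:,}  lora: {lora_update:,}  "
      f"ratio: {full_update / lora_update:.0f}x smaller")

==============================================================================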

You should manually edit the lightning library files, referring to the Lightning GitHub repository, if you want to set up the model on a local computer.

This is important.

For the question-answering task, Meta released the "7b-chat-hf" pretrained model,

so I used this model for fine-tuning.

We could probably develop a multi-modal model with our own abilities by utilizing the mingpt repository.

GitHub - X-PLUG/mPLUG-Owl: mPLUG-Owl🦉: Modularization Empowers Large Language Models with Multimodality

Maybe this repository will help you.

Let's start

The chatbot dataset should be a list of dicts (List[dict]), where each dict has three keys.

The keys are input, instruction, and output.

You can put empty (void) data in the instruction key's value. <- this is important.
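As a minimal sketch, the format looks like this. The records below are made-up placeholders; only the three keys matter:

==============================================================================

import json

# List of dicts with the three keys described above.
# The record contents are illustrative placeholders.
dataset = [
    {
        "instruction": "Explain what the W5100S-EVB-PICO board is.",
        "input": "",
        "output": "The W5100S-EVB-PICO is a WIZnet evaluation board based on the RP2040 ...",
    },
    {
        "instruction": "Summarize the following text.",
        "input": "LoRA decomposes the weight update into two low-rank matrices ...",
        "output": "LoRA stores only small low-rank adapter matrices ...",
    },
]

with open("chat_dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)

==============================================================================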

You should manually edit the pre-processing script for your own dataset, even after asking ChatGPT for help.

Pre-processing takes a lot of time, but "Create a new Gist" (github.com) can help you make an HTTP link that serves your JSON file.
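Once the JSON is hosted on a Gist, the pre-processing step can fetch it over HTTP. A minimal sketch using only the standard library (the raw Gist URL below is a made-up placeholder, not a real link):

==============================================================================

import json
import urllib.request

# Placeholder URL: replace with the "Raw" link of your own Gist.
DATA_URL = "https://gist.githubusercontent.com/<user>/<gist_id>/raw/chat_dataset.json"

with urllib.request.urlopen(DATA_URL) as response:
    data = json.loads(response.read().decode("utf-8"))

print(f"loaded {len(data)} records, keys: {sorted(data[0].keys())}")

==============================================================================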

I trained Llama-2-7b-chat-hf with batch size one, using 25~26GB of VRAM on Colab Pro+ with an A100 GPU.

It probably takes 30~40 compute units in total on Colab Pro+.

The training consumes 12~13 compute units per hour.

I installed Thonny and put my W5100S-EVB-PICO and my local GPU machine on the same IP network.

The setup process requires WIZnet's own GitHub repos for the firmware.

It was very easy to install: just copy and paste the firmware files.

Plus, my environment is Ubuntu 20.04 LTS, the CUDA 12.2 driver, and a conda virtual environment.

Many difficulties came up while implementing the LLM on a local GPU!

The parameters are reduced by approximately 1/3 through fine-tuning.

When I type my own inference command on the W5100S side, it is transported to my local GPU correctly.

The pictures below show the fine-tuning process on Colab Pro+.

I tried setting the batch size to 2, but it is limited by a single A100 GPU's RAM, even with this code:

======================================================================

import torch
import os

torch.cuda.empty_cache()

# Set from Python (applies to this process).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1000000"

# Set from the Colab shell (note: "!export" runs in a separate shell,
# so it does not affect the already-running Python process).
!export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32

======================================================================

You may encounter an error about "lit_model.pth"; in that case you should re-download the Llama2 base model. Check this link and follow it:

lit-gpt/tutorials/download_llama_2.md at main · Lightning-AI/lit-gpt (github.com)
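From that tutorial, the steps are roughly as follows, wrapped here in Python for convenience. The script names and flags are from the version of the tutorial I followed and may have changed, and the Hugging Face token is a placeholder:

==============================================================================

import subprocess

# Download the Hugging Face weights (requires a HF token with Llama2 access;
# "hf_xxx" is a placeholder).
subprocess.run([
    "python", "scripts/download.py",
    "--repo_id", "meta-llama/Llama-2-7b-chat-hf",
    "--access_token", "hf_xxx",
], check=True)

# Convert the downloaded HF checkpoint into lit-gpt's lit_model.pth format.
subprocess.run([
    "python", "scripts/convert_hf_checkpoint.py",
    "--checkpoint_dir", "checkpoints/meta-llama/Llama-2-7b-chat-hf",
], check=True)

==============================================================================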

The pictures below are the inference outputs.

Many hallucination errors occur, so prompt engineering is important for making a reasonable generative model.

The hallucination phenomenon was observed frequently, especially when link data was included in the training set.

My GPT made FIFO modules in the Verilog language. Admittedly, they do not work, but the structure is well defined, isn't it? Please leave your comment below.

The checkpoint in the code is not the fine-tuned file, for security reasons.

For the second fine-tuning run, I added FIFO Verilog code from "BARD".

The picture above shows a hallucination error involving a link that does not exist.

To make it blink, you can use the following code:

#include <W5500.h>

#define SERIAL 2

int main(void) {
    init_comm();            // Initialize communication
    wiznet_serialize();     // Serialize data for security

    while (1) {
        loop_main();        // Loop for main function
    }
    return;
}

void loop_main() {
    byte_data[SERIAL];                              // Allocate memory for serial data
    wiznet_Serial_Send(SERIAL, byte_data, length);  // Send data to serial device
}

By defining SERIAL as 2, this code will send data to serial device on PC serial port.

I hope this helps. If you have any further questions, do not hesitate to contact us. "

=============================================================================

I selected the best answer for the prompt.

In my opinion, we need to innovate the AI learning process at our company right now, and what we need to develop right now is RLHF.

You can get better output by increasing the maximum token size (for example, 512).

You can also adopt the RLHF (Reinforcement Learning from Human Feedback) technique, which requires score data written by experts.

ChatGPT used this training method too.

=============================================================================

The picture below shows that I use the Hercules tool for communication between the server and the local computer.

If you can send a signal, the system is expandable, because you can use various I/Os to trigger the generative model.

The server (RTX 4070 desktop) receives a message from the client (W5100S-EVB-PICO) when a push button is pressed.

The server handles this data via the "subprocess" library to launch quantized model inference.

Let's see this picture. As you can see, generating a FIFO module does not work well yet.

I used the "subprocess" library to run the command.

I sent a command, and the server generated a FIFO module in Verilog.

I recommend editing the file paths in generate/lora.py to absolute paths; otherwise an error will occur.

Now we can add some I/O hardware for sending a command.

I configured Thonny like this:

Thonny--> Run --> configure interpreter --> Interpreter --> Python executable : ~/anaconda3/envs/{your_virtual_env_name}/bin/python3.11

Here is the server.py file:

import socket
import subprocess
import time

# Server settings
HOST = '192.168.0.2'   # IP address of this server
PORT = 5000            # Port number to use

# Create and bind the socket
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind((HOST, PORT))
server_socket.listen(1)  # Allow at most one connection

conda_env_name = "quant"
activate_command = f"conda activate {conda_env_name}"

print(f"Server is waiting on port {PORT}")

# Wait for a client connection
client_socket, client_addr = server_socket.accept()
print(f"Connected from {client_addr}")

while True:
    data = client_socket.recv(1024)  # Receive data from the client
    if not data:
        continue  # Nothing received; keep waiting
    command = data.decode().strip()  # Decode the received data and strip whitespace
    print(f"received command: {command}")

    # Activate the conda environment (runs in its own shell)
    subprocess.run(activate_command, shell=True)

    # Run the command and collect the result
    try:
        result = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, text=True)
        output = result.stdout + result.stderr  # Combine standard output and error output
    except Exception as e:
        output = str(e)  # Keep the error message as a string

    # Send the result back to the client
    client_socket.sendall(output.encode())  # Encode the result to bytes and send it
    print(f"received = {data}")

For model serving, you need a cloud service like Amazon SageMaker.

My goal is to make an API from our own fine-tuned model.

I'm looking forward to using various protocols, not only the SPI protocol.

Let's imagine calling "JARVIS" with our voice remotely.

Actually, the "JARVIS" AI assistant activates a command to launch particles for IRONMAN at Stark's voice in "IRONMAN 3".

This is why "subprocess" library is important.

I expect that a multi-modal AI model can be implemented in an ASIC, with processing units communicating with each other.

Regards, Sunghoon Park
