scaleapi / llm-engine

Scale LLM Engine public repository

Home Page: https://llm-engine.scale.com


Can we fine-tune with data stored in a local CSV file?

jaslatendresse opened this issue · comments

The docs state that the training_file parameter of the FineTune.create() method should be a "publicly accessible URL to a CSV file for training".

Does that mean that we can only fine-tune a model on data hosted publicly and not locally? It kind of defeats the purpose of experimenting with data we may not want to make public just yet.

If local files are not supported, it would be great if fine-tuning jobs failed earlier: right now I am testing with a local CSV file, the job simply runs for a long time, and it is unclear whether the failure is caused by bad input or not.

Does that mean that we can only fine-tune a model on data hosted publicly and not locally?

From the docs:

Support for privately sharing data with the LLM Engine API is coming shortly

I think we need to wait a bit. The project looks super interesting, and once you enable private URLs I'm happy to play with it as well! Perhaps I could contribute some code to help implement a local training_file?

commented

Hello, thanks for reaching out! We built out a Files API (see https://llm-engine.scale.com/api/python_client/#llmengine.File) so that you can upload private files. However, we still need to update the fine-tuning scripts to read from the Files API in order for this workflow to work e2e. (Separately, we're also working on open sourcing the fine-tuning scripts themselves!)
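Once the workflow works end to end, the private-file path might look something like the sketch below. This is illustrative only: it assumes the `llmengine` Python client is installed and `SCALE_API_KEY` is set, and the `File.upload` / `FineTune.create` call shapes are taken from the linked API docs but should be verified against them.

```python
# Sketch of uploading a private CSV via the Files API, then fine-tuning on it.
# Assumes the llmengine client is installed and SCALE_API_KEY is configured;
# signatures are based on the linked docs and may need adjusting.
import importlib.util

HAVE_CLIENT = importlib.util.find_spec("llmengine") is not None

def finetune_from_local_csv(csv_path: str, model: str = "llama-2-7b"):
    """Upload a local CSV via the Files API, then start a fine-tune on it."""
    if not HAVE_CLIENT:
        # Client not available (e.g. offline environment); nothing to do.
        return None
    from llmengine import File, FineTune
    with open(csv_path, "r") as f:
        uploaded = File.upload(f)  # returns a response carrying a file ID
    # Pass the file ID instead of a publicly accessible URL.
    return FineTune.create(model=model, training_file=uploaded.id)
```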

If local files are not supported, it would be great if fine-tuning jobs failed earlier: right now I am testing with a local CSV file, the job simply runs for a long time, and it is unclear whether the failure is caused by bad input or not.

Also working on this - see #213. With our current in-progress implementation it won't fail immediately per se, but it will fail before running any fine-tuning code: the job still enters the training script, but the script validates the headers right away.
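In the meantime, a local pre-flight check can catch a malformed file before a job is even submitted. A minimal sketch, assuming the training CSV is expected to have `prompt` and `response` columns (check the fine-tuning guide for the exact schema):

```python
import csv

# Assumed required columns; verify against the fine-tuning guide.
REQUIRED_COLUMNS = {"prompt", "response"}

def validate_csv_headers(path: str) -> bool:
    """Return True if the CSV's header row contains all required columns."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    return REQUIRED_COLUMNS.issubset(header)
```

Running this locally before calling FineTune.create() turns a long silent hang into an immediate, explainable failure.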

commented

Also @jaslatendresse @jmaczan I'm curious, do either of you have interest in self-hosting? Wondering what your use case(s) are.

@yixu34 Hi, thanks for the insight. I aim to self-host to cut the costs of training and inference and to keep the training data private.

@yixu34 Not right now, at least. I am trying to instruction-tune llama-2-7b for a specific use case, and I am just experimenting with different infrastructures. It is very difficult to find one that is compatible with an M1 Mac when I don't have access to other GPUs. llm-engine seemed to be the solution for me, but seeing that I cannot fine-tune with local data bummed me out. If self-hosting allows training on private data I would definitely consider it, but since I am just experimenting, I don't plan to go with it soon.

However, I'm very happy to see that this is something you're considering!

commented

Ok I think I misspoke before - fine-tuning via the Files API does work, we just need to update our docs with examples accordingly: #231 cc @saiatmakuri @squeakymouse

commented

Though I should point out #233, where we're doing some maintenance on the fine-tuning APIs this week.

@yixu34 oh gotcha! I look forward to it then. Thanks for your patience and replies :)

The docs are updated with an example. Please refer to this section in the guide: https://llm-engine.scale.com/guides/fine_tuning/#making-your-data-accessible-to-llm-engine