scaleapi / llm-engine

Scale LLM Engine public repository

Home Page: https://llm-engine.scale.com


Can we fine-tune with data stored in a local CSV file?

jaslatendresse opened this issue · comments

The docs state that the training_file parameter of the FineTune.create() method should be a "publicly accessible URL to a CSV file for training".

Does that mean that we can only fine-tune a model on data hosted publicly and not locally? It kind of defeats the purpose of experimenting with data we may not want to make public just yet.

If local files are not supported, it would be great if fine-tuning jobs failed earlier: right now I am testing with a local CSV file, the job simply runs for a long time, and it is unclear whether the failure is caused by bad input or not.

Does that mean that we can only fine-tune a model on data hosted publicly and not locally?

From the docs:

Support for privately sharing data with the LLM Engine API is coming shortly

I think we need to wait a bit. The project looks super interesting, and once you enable private URLs I'm happy to play with it as well! Perhaps I could contribute some code to help implement a local training_file?

commented

Hello, thanks for reaching out! We built out a Files API (see https://llm-engine.scale.com/api/python_client/#llmengine.File) so that you can upload private files. However, we still need to update the fine-tuning scripts to read from the Files API in order for this workflow to work e2e. (Separately, we're also working on open sourcing the fine-tuning scripts themselves!)
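Once the workflow works end to end, the private-file path might look something like the sketch below. This is illustrative only: it assumes the `llmengine` Python client is installed and `SCALE_API_KEY` is set, and the `File.upload` / `FineTune.create` call shapes are taken from the linked API docs but should be verified against them.

```python
# Sketch of uploading a private CSV via the Files API, then fine-tuning on it.
# Assumes the llmengine client is installed and SCALE_API_KEY is configured;
# signatures are based on the linked docs and may need adjusting.
import importlib.util

HAVE_CLIENT = importlib.util.find_spec("llmengine") is not None

def finetune_from_local_csv(csv_path: str, model: str = "llama-2-7b"):
    """Upload a local CSV via the Files API, then start a fine-tune on it."""
    if not HAVE_CLIENT:
        # Client not available (e.g. offline environment); nothing to do.
        return None
    from llmengine import File, FineTune
    with open(csv_path, "r") as f:
        uploaded = File.upload(f)  # returns a response carrying a file ID
    # Pass the file ID instead of a publicly accessible URL.
    return FineTune.create(model=model, training_file=uploaded.id)
```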

If local files are not supported, it would be great if fine-tuning jobs failed earlier: right now I am testing with a local CSV file, the job simply runs for a long time, and it is unclear whether the failure is caused by bad input or not.

Also working on this - see #213. With our current in-progress implementation it won't fail immediately per se, but it will fail before running any fine-tuning code: the job still enters the training script, but the script validates the headers right away.
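In the meantime, a local pre-flight check can catch a malformed file before a job is even submitted. A minimal sketch, assuming the training CSV is expected to have `prompt` and `response` columns (check the fine-tuning guide for the exact schema):

```python
import csv

# Assumed required columns; verify against the fine-tuning guide.
REQUIRED_COLUMNS = {"prompt", "response"}

def validate_csv_headers(path: str) -> bool:
    """Return True if the CSV's header row contains all required columns."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    return REQUIRED_COLUMNS.issubset(header)
```

Running this locally before calling FineTune.create() turns a long silent hang into an immediate, explainable failure.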

commented

Also @jaslatendresse @jmaczan I'm curious, do either of you have interest in self-hosting? Wondering what your use case(s) are.

@yixu34 Hi, thanks for the insight. I aim to self-host to cut the costs of training and inference and to keep the training data private.

@yixu34 Not right now, at least. I am trying to instruction-tune llama-2-7b for a specific use case, and I am just experimenting with different infrastructures. It is very difficult to find one that is compatible with an M1 Mac when I don't have access to other GPUs. llm-engine seemed to be the solution for me, but seeing that I cannot fine-tune with local data bummed me out. If self-hosting allows training on private data I would definitely consider it, but since I am just experimenting, I don't plan to go with it soon.

However, I'm very happy to see that this is something you're considering!

commented

Ok I think I misspoke before - fine-tuning via the Files API does work, we just need to update our docs with examples accordingly: #231 cc @saiatmakuri @squeakymouse

commented

Though I should point out #233, where we're doing some maintenance on the fine-tuning APIs this week.

@yixu34 oh gotcha! I look forward to it then. Thanks for your patience and replies :)

The docs are updated with an example. Please refer to this section in the guide: https://llm-engine.scale.com/guides/fine_tuning/#making-your-data-accessible-to-llm-engine