ngruver / llmtime

Home Page: https://arxiv.org/abs/2310.07820


Fine-tuning with tabular data

sudokhan112 opened this issue · comments

Could you publish code/instructions on how to fine-tune with personal data?

Hi there! We don't have any fine-tuning code to share since our method is zero-shot (it directly runs inference on pre-trained LLMs). To fine-tune on your own data, you can perform the usual fine-tuning with any LLM of your choice once you convert the time series into strings in our format (see https://github.com/ngruver/llmtime/blob/main/models/llmtime.py#L209).

I looked into the HF SFTTrainer and the lit-gpt library. Both appear to target instruction fine-tuning, where the dataset is in a "question: answer" format. Can you explain, or point me to, detailed instructions on how I can fine-tune a model like LLaMA with a dataset like "titanic"?

The process would be identical to how you fine-tune LLaMA with a language modeling objective on any text data, once you convert your time series into strings with our format. You can find how to do this conversion here (https://github.com/ngruver/llmtime/blob/main/models/llmtime.py#L202-L209). Each entry in your dataset would simply be a string representing the time series (rather than a question/answer format) and you would train the model to do next-token prediction with that string.
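To make the conversion step concrete, here is a minimal sketch of the kind of serialization described above. The exact precision, scaling, and separator choices used in the paper live in the linked llmtime code; the settings below (fixed decimal precision, space-separated digits, comma between time steps) are illustrative assumptions, not the repo's exact implementation.

```python
def serialize_series(values, precision=2, step_sep=" ,", digit_sep=" "):
    """Convert a 1-D numeric series into a plain-text training string.

    Each value is rendered with fixed precision, the decimal point is
    dropped, digits are separated so the tokenizer sees one token per
    digit, and time steps are joined by a separator. Illustrative only;
    see models/llmtime.py in the repo for the paper's exact scheme.
    """
    tokens = []
    for v in values:
        s = f"{v:.{precision}f}".replace(".", "")  # e.g. 1.53 -> "153"
        tokens.append(digit_sep.join(s))           # "153" -> "1 5 3"
    return step_sep.join(tokens)

print(serialize_series([0.12, 1.53, 2.87]))
# -> 0 1 2 ,1 5 3 ,2 8 7
```

Each such string becomes one training example, and the model is trained with the standard causal (next-token) language modeling loss on it.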

I can't provide more detailed instructions because our experiments don't involve fine-tuning, so you may need to experiment to get the details right (e.g. preprocessing hyperparameters, how much history to condition on, etc.).

Is this only applicable to datasets with a single time-series column, like 'time'-'value'? What if the dataset has multiple columns for each time step? How would that affect dataset creation and the fine-tuning process?

I'd say that how best to handle multivariate series in this framework is an open question. We mainly explored univariate time series (a single column) in the paper. For multivariate series, you could first try simply modeling each column independently, which is what we did for the Informer datasets, and it worked well enough. Alternatively, you could include all columns with a format like x1[0] & x2[0] & x3[0], x1[1] & x2[1] & x3[1], ... where & is some special separator token delimiting the columns. But we haven't explored this yet.
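The interleaved format suggested above can be sketched as follows. This is an illustration of the (unexplored) '&'-separated idea, not something from the paper's codebase; the separator strings are arbitrary choices.

```python
def serialize_multivariate(columns, col_sep=" & ", step_sep=", "):
    """Interleave several aligned columns into one string: at each time
    step, the columns' values are joined by '&', and the time steps are
    joined by ','. Illustrative sketch of the format suggested above.
    """
    steps = zip(*columns)  # transpose: one tuple of values per time step
    return step_sep.join(
        col_sep.join(str(v) for v in values) for values in steps
    )

print(serialize_multivariate([[1, 2], [3, 4], [5, 6]]))
# -> 1 & 3 & 5, 2 & 4 & 6
```

In practice you would also want & (or whatever separator you pick) to map to a single token under your model's tokenizer, so it cleanly delimits the columns.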