ngruver / llmtime

Home Page: https://arxiv.org/abs/2310.07820


Fine-tuning with tabular data

sudokhan112 opened this issue · comments

Could you publish code/instructions on how to fine-tune with personal data?

Hi there! We don't have any fine-tuning code to share since our method is zero-shot (it directly runs inference on pre-trained LLMs). To fine-tune on your own data, you can perform the usual fine-tuning with any LLM of your choice once you convert the time series into strings in our format (see https://github.com/ngruver/llmtime/blob/main/models/llmtime.py#L209).

I looked into the HF SFTTrainer and the lit-gpt library. Both appear to target instruction fine-tuning, where the dataset is in a "question: answer" format. Can you explain, or point me to, detailed instructions on how I can fine-tune a model like LLaMA with a dataset like "titanic"?

The process would be identical to how you fine-tune LLaMA with a language modeling objective on any text data, once you convert your time series into strings with our format. You can find how to do this conversion here (https://github.com/ngruver/llmtime/blob/main/models/llmtime.py#L202-L209). Each entry in your dataset would simply be a string representing the time series (rather than a question/answer format) and you would train the model to do next-token prediction with that string.
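To make the conversion step concrete, here is a minimal sketch of the kind of serialization described above. The exact precision, scaling, and separator choices used in the paper live in the linked llmtime code; the settings below (fixed decimal precision, space-separated digits, comma between time steps) are illustrative assumptions, not the repo's exact implementation.

```python
def serialize_series(values, precision=2, step_sep=" ,", digit_sep=" "):
    """Convert a 1-D numeric series into a plain-text training string.

    Each value is rendered with fixed precision, the decimal point is
    dropped, digits are separated so the tokenizer sees one token per
    digit, and time steps are joined by a separator. Illustrative only;
    see models/llmtime.py in the repo for the paper's exact scheme.
    """
    tokens = []
    for v in values:
        s = f"{v:.{precision}f}".replace(".", "")  # e.g. 1.53 -> "153"
        tokens.append(digit_sep.join(s))           # "153" -> "1 5 3"
    return step_sep.join(tokens)

print(serialize_series([0.12, 1.53, 2.87]))
# -> 0 1 2 ,1 5 3 ,2 8 7
```

Each such string becomes one training example, and the model is trained with the standard causal (next-token) language modeling loss on it.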

I can't provide more detailed instructions because our experiments don't involve fine-tuning, so you may need to experiment to get the details right (e.g. preprocessing hyperparameters, how much history to condition on, etc.).

Is this only applicable to datasets with a single time-series column, like 'time'-'value'? What if the dataset has multiple columns for each time step? How would that affect dataset creation and the fine-tuning process?

I'd say that how best to handle multivariate series in this framework is an open question. We mainly explored univariate time series (a single column) in the paper. For multivariate series, you could first try simply modeling each column independently, which is what we did for the Informer datasets, and it worked well enough. Alternatively, you could include all columns with a format like x1[0] & x2[0] & x3[0], x1[1] & x2[1] & x3[1], ... where & is some special separator token delimiting the columns. But we haven't explored this yet.
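The interleaved format suggested above can be sketched as follows. This is an illustration of the (unexplored) '&'-separated idea, not something from the paper's codebase; the separator strings are arbitrary choices.

```python
def serialize_multivariate(columns, col_sep=" & ", step_sep=", "):
    """Interleave several aligned columns into one string: at each time
    step, the columns' values are joined by '&', and the time steps are
    joined by ','. Illustrative sketch of the format suggested above.
    """
    steps = zip(*columns)  # transpose: one tuple of values per time step
    return step_sep.join(
        col_sep.join(str(v) for v in values) for values in steps
    )

print(serialize_multivariate([[1, 2], [3, 4], [5, 6]]))
# -> 1 & 3 & 5, 2 & 4 & 6
```

In practice you would also want & (or whatever separator you pick) to map to a single token under your model's tokenizer, so it cleanly delimits the columns.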