salesforce / progen

Official release of the ProGen models

Fine-Tuning the Model

cv277 opened this issue · comments

commented

I want to fine-tune ProGen2-small on my own dataset.
See this Google Colab notebook for an annotated version of the code and the error:
https://colab.research.google.com/drive/1_R0xgf6Kw0K88PYF7-ZOCIh9WRSmXN8C?usp=sharing

First I load the model like this:

import torch
from tokenizers import Tokenizer
from progen.progen2.models.progen.modeling_progen import ProGenForCausalLM

# Select the device once so the model and inputs end up in the same place.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = ProGenForCausalLM.from_pretrained(
    '/content/drive/MyDrive/progen2-small',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device)

I am using the Hugging Face Trainer to fine-tune the model with DataCollatorForLanguageModeling (training setup sketched below). I load the tokenizer like this:

def create_tokenizer_custom(file):
    with open(file, 'r') as f:
        return Tokenizer.from_str(f.read())

tokenizer = create_tokenizer_custom(file='/content/progen/progen2/tokenizer.json')

And then convert it to a PreTrainedTokenizerFast, as suggested in huggingface/tokenizers#325:

from transformers import PreTrainedTokenizerFast

# Serialize the tokenizer to disk, then reload it through the fast wrapper.
tokenizer.save("my-tokenizer.json")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="my-tokenizer.json")
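For context, my training setup looks roughly like this (a simplified sketch, not the exact notebook code: the example sequences, hyperparameters, and the "<|pad|>" pad-token name are placeholder assumptions):

from datasets import Dataset
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder sequences; the real dataset is loaded from Drive.
train_ds = Dataset.from_dict({"text": ["1GRGLAAL2", "1MKVLILA2"]})

def tokenize(batch):
    return fast_tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True, remove_columns=["text"])

# Register a pad token on the fast wrapper so the collator can pad batches
# (assumes id 0 is the pad token, matching pad_token_id=0 used in generate).
fast_tokenizer.pad_token = "<|pad|>"

collator = DataCollatorForLanguageModeling(tokenizer=fast_tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="progen2-small-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, data_collator=collator)
trainer.train()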

During fine-tuning, the training loss becomes 0.0000. After training, I attempt to produce new samples:

with torch.no_grad():
  input_ids = torch.tensor(fast_tokenizer.encode("1GRGL")).view([1, -1]).to(device)
  tokens_batch = model.generate(input_ids, do_sample=True, temperature=0.7,
                                max_length=50, top_p=10,
                                num_return_sequences=1, pad_token_id=0)
  # Convert each generated row to a plain Python list for decoding.
  as_lists = lambda batch: [batch[i, ...].detach().cpu().numpy().tolist() for i in range(batch.shape[0])]
  print(tokenizer.decode_batch(as_lists(tokens_batch))[0])

However, I get this error: RuntimeError: probability tensor contains either inf, nan or element < 0
Please see the Google Colab notebook above for the complete code.

top_p=10 might need to be < 1?
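Nucleus sampling keeps the smallest set of tokens whose cumulative probability exceeds top_p, so it should be in (0, 1]. For example, the same generate call with everything else unchanged:

tokens_batch = model.generate(input_ids, do_sample=True, temperature=0.7,
                              max_length=50, top_p=0.95,
                              num_return_sequences=1, pad_token_id=0)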

commented

> top_p=10 might need to be < 1?

Unfortunately I still get the error after setting top_p to a value less than one. Thank you though!

I am getting a warning and an error which are as follows:
Warning: You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Error: RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'.

@cv277 were you able to resolve the issue?

To fix this, you should use torch_dtype=torch.float32 instead; half-precision LayerNorm is not implemented on CPU.
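For example, the loading line from above becomes:

model = ProGenForCausalLM.from_pretrained(
    '/content/drive/MyDrive/progen2-small',
    torch_dtype=torch.float32,  # CPU LayerNorm has no float16 kernel
    low_cpu_mem_usage=True,
).to(device)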

I would like to know what the dataset format looks like. Could you provide it?

I've switched to torch_dtype=torch.float32 but am still getting this issue for progen-base and the larger models, though not for progen-small, when I call:

model = ProGenForCausalLM.from_pretrained('/content/drive/MyDrive/progen2-small', torch_dtype=torch.float32, low_cpu_mem_usage=True).to(device)

Has anyone experienced similar issues, or is there somewhere else I need to change the dtype?

@oliverfleetwood that works for me; I tried loading the progen2-large model and it loads fine. What error are you encountering?

At first I only ran on CPU. After upgrading CUDA and reinstalling torch, I was able to run the larger models on a GPU with the same setup.
I still get the same error when I try to run the larger models (i.e. all except progen-small) on CPU.
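For reference, here is how I select the dtype now (a sketch; the checkpoint path is a placeholder), though the larger checkpoints still fail for me on CPU:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# float16 LayerNorm is only implemented on CUDA, so fall back to float32 on CPU.
dtype = torch.float16 if device.type == 'cuda' else torch.float32
model = ProGenForCausalLM.from_pretrained(
    '/content/drive/MyDrive/progen2-small',  # placeholder checkpoint path
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
).to(device)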