python train.py → ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Youngmi-Park opened this issue
Hi! I get an error when I run python train.py
How can I fix this?
$ python train.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
10%|███▍ | 43979/445813 [02:17<19:16, 347.51it/s]
Traceback (most recent call last):
File "train.py", line 73, in <module>
model.train_model(train_df, eval_data=eval_df, Rouge=getListRouge)
File "/home/gpuadmin/home/ComFormer/bart_model.py", line 176, in train_model
train_dataset = self.load_and_cache_examples(train_data, verbose=verbose)
File "/home/gpuadmin/home/ComFormer/bart_model.py", line 868, in load_and_cache_examples
dataset = SimpleSummarizationDataset(encoder_tokenizer, self.args, data, mode)
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 425, in __init__
preprocess_fn(d) for d in tqdm(data, disable=args.silent)
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 425, in <listcomp>
preprocess_fn(d) for d in tqdm(data, disable=args.silent)
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 333, in preprocess_data_bart
truncation=True,
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2651, in batch_encode_plus
**kwargs,
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 731, in _batch_encode_plus
first_ids = get_input_ids(ids)
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 712, in get_input_ids
"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Thanks!
Maybe you need to modify the code in train.py:
train_df = pd.read_csv('data/train.csv')
eval_df = pd.read_csv('data/valid.csv')
test_df = pd.read_csv('data/test.csv')
to
train_df = pd.read_csv('data/train.csv').dropna()
eval_df = pd.read_csv('data/valid.csv').dropna()
test_df = pd.read_csv('data/test.csv').dropna()
I downloaded the dataset and found that there is one NaN in train.csv. The code I used to check is as follows:
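import pandas as pd

# Load without dropna() first so the missing value is still present.
df = pd.read_csv("train.csv")
# df = pd.read_csv("train.csv").dropna()  # with this line instead, nothing is printed
input_text, target_text = df['input_text'].tolist(), df['target_text'].tolist()
for i, text in enumerate(input_text):
    # A missing cell is loaded as a float NaN, not a str.
    if not isinstance(text, str):
        print(i, text)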
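A related check, in case it helps others hitting the same ValueError: by default pandas parses not only empty cells but also literal strings such as `NA`, `NaN` and `null` as missing values, so a row can become NaN even when the cell is not empty. The sketch below is only an illustration under assumptions: the in-memory CSV is made up, and restricting `dropna` to the two text columns is my own choice, not something train.py does.

```python
import io

import pandas as pd

# Made-up stand-in for train.csv; only the column names come from the thread.
csv_text = (
    "input_text,target_text\n"
    "return a + b;,adds two numbers\n"
    "run();,\n"          # empty target cell, parsed as NaN
    "getName();,null\n"  # the literal string "null" is also parsed as NaN by default
)

df = pd.read_csv(io.StringIO(csv_text))

# Columns the seq2seq model reads (assumption: same names as in the check above).
text_cols = ["input_text", "target_text"]

# Show every row where one of the text columns is missing.
bad_rows = df[df[text_cols].isna().any(axis=1)]
print(bad_rows)

# Drop only the rows that are missing a value in those columns.
clean_df = df.dropna(subset=text_cols).reset_index(drop=True)
print(len(df), "rows before,", len(clean_df), "rows after")
```

If literal strings like `null` are valid text in your data, `pd.read_csv(..., keep_default_na=False)` keeps them as strings instead; whether that is appropriate for this dataset is something I have not verified.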
I suggest fine-tuning my pre-trained model directly, which will significantly reduce your training time. If you get an out-of-memory (OOM) error, you can freeze some of the model's parameters by adding the following code at line 116 in bart_model.py.
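# Despite the variable name, the layers listed here are the ones that get frozen.
unfreeze_layers = ['layers.0', 'layers.1', 'layers.2', 'layers.3', 'layers.4', 'layers.5', 'layers.6',
                   'layers.7', 'layers.8']
for name, param in self.model.named_parameters():
    for ele in unfreeze_layers:
        if ele in name:
            # Exclude this parameter from gradient updates.
            param.requires_grad = False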
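One thing to watch with the substring test `ele in name`: `'layers.1'` is also a substring of `'layers.10'` and `'layers.11'`, so on a model with ten or more layers per stack those layers would be frozen as well. Whether that matters depends on the checkpoint you load, which I am not assuming here. A minimal sketch of matching the layer index exactly; `should_freeze` and the parameter names are hypothetical and only follow the usual `...layers.<i>...` naming pattern:

```python
import re

# Layer indices to freeze; mirrors the list above (layers 0 through 8).
FROZEN_LAYER_IDS = set(range(9))

# Matches "layers.<number>" as a whole dotted path component.
_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)(?:\.|$)")


def should_freeze(param_name, frozen_ids=FROZEN_LAYER_IDS):
    """Return True if param_name belongs to a layer whose index is in frozen_ids."""
    match = _LAYER_RE.search(param_name)
    return match is not None and int(match.group(1)) in frozen_ids


# Made-up parameter names, used only to compare the two matching rules.
names = [
    "model.encoder.layers.1.self_attn.k_proj.weight",
    "model.encoder.layers.10.self_attn.k_proj.weight",
    "model.decoder.layers.8.fc1.bias",
    "model.shared.weight",
]
for name in names:
    print(name, "| substring:", "layers.1" in name, "| exact:", should_freeze(name))
```

Inside the loop over `self.model.named_parameters()`, the inner `for ele in unfreeze_layers` check could then be replaced by `if should_freeze(name): param.requires_grad = False`.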
Thanks for your help!
It can create the features now, but another error occurs😥
$ python train.py
INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
100%|██████████████████████████████████| 445782/445782 [20:00<00:00, 371.47it/s]
/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use thePyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
INFO:bart_model: Training started
Epoch 1 of 30: 0%| | 0/30 [00:00<?, ?it/sINFO:bart_model:Saving model into result/checkpoint-200082 [14:53<49:54:35, 2.47
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
100%|████████████████████████████████████| 19999/19999 [00:49<00:00, 401.96it/s]
Epochs 0/30. Running Loss: 8.9371: 0%| | 1999/445782 [19:07<70:46:25, 1.74
Epoch 1 of 30: 0%| | 0/30 [19:07<?, ?it/s]
Traceback (most recent call last):
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'input_text_a'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train.py", line 77, in <module>
model.train_model(train_df, eval_data=eval_df, Rouge=getListRouge)
File "/home/gpuadmin/home/ComFormer/bart_model.py", line 186, in train_model
**kwargs,
File "/home/gpuadmin/home/ComFormer/bart_model.py", line 493, in train
**kwargs,
File "/home/gpuadmin/home/ComFormer/bart_model.py", line 697, in eval_model
to_predict_a = eval_data["input_text_a"].tolist()
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'input_text_a'
It's fixed, just re-clone bart_model.py.
You need to modify model_args in train.py first...
I forgot to mention this tip in the README...
It works now! Thanks :)