NTDXYG / ComFormer

Code and data for the paper "ComFormer: Code Comment Generation via Transformer and Fusion Method-based Hybrid Code Representation", accepted at DSA2021

python train.py → ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Youngmi-Park opened this issue

Hi! I get an error when I run python train.py.
How can I fix this?

$ python train.py

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
 10%|███▍                               | 43979/445813 [02:17<19:16, 347.51it/s]Traceback (most recent call last):
  File "train.py", line 73, in <module>
    model.train_model(train_df, eval_data=eval_df, Rouge=getListRouge)
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 176, in train_model
    train_dataset = self.load_and_cache_examples(train_data, verbose=verbose)
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 868, in load_and_cache_examples
    dataset = SimpleSummarizationDataset(encoder_tokenizer, self.args, data, mode)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 425, in __init__
    preprocess_fn(d) for d in tqdm(data, disable=args.silent)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 425, in <listcomp>
    preprocess_fn(d) for d in tqdm(data, disable=args.silent)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 333, in preprocess_data_bart
    truncation=True,
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2651, in batch_encode_plus
    **kwargs,
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 731, in _batch_encode_plus
    first_ids = get_input_ids(ids)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 712, in get_input_ids
    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Thanks!

Maybe you need to modify the code in train.py. Change:

train_df = pd.read_csv('data/train.csv')
eval_df = pd.read_csv('data/valid.csv')
test_df = pd.read_csv('data/test.csv')

to

train_df = pd.read_csv('data/train.csv').dropna()
eval_df = pd.read_csv('data/valid.csv').dropna()
test_df = pd.read_csv('data/test.csv').dropna()

I downloaded the dataset and found that there is one NaN in train.csv. The code I used to find it is as follows:

import pandas as pd

# Load without dropna() so that the NaN row is still present
df = pd.read_csv("train.csv")
input_text, target_text = df['input_text'].tolist(), df['target_text'].tolist()

# Print the index and value of every entry that is not a string (NaN is a float)
for i, text in enumerate(input_text):
    if not isinstance(text, str):
        print(i, text)
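The reason a single NaN breaks tokenization can be sketched on a tiny in-memory CSV. This is only an illustration: the three rows below are made up and stand in for data/train.csv, with the column names input_text and target_text taken from the snippet above. pandas reads an empty field as NaN, which is a float, and the tokenizer accepts only strings or lists of strings or integers.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/train.csv: the second row has an empty input_text.
csv_text = (
    "input_text,target_text\n"
    "int add ( int a ),adds a number\n"
    ",returns the name\n"
    "void close ( ),closes the stream\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# An empty field is loaded as NaN (a float), not as an empty string.
bad_rows = df[df["input_text"].isna() | df["target_text"].isna()]
print(bad_rows)

# Drop rows where either column is missing, then confirm only strings remain.
clean_df = df.dropna(subset=["input_text", "target_text"]).reset_index(drop=True)
non_str = [t for t in clean_df["input_text"] if not isinstance(t, str)]
print(len(df), len(clean_df), non_str)
```

Passing subset= restricts the check to the two columns that are tokenized, so rows are not discarded because of a NaN in some unrelated column.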

I suggest you fine-tune my pre-trained model directly, which will significantly reduce your training time. If you get an out-of-memory (OOM) error, you can freeze some of the model's parameters by adding the following code at line 116 in bart_model.py.

# Despite the variable name, every layer listed here is frozen (requires_grad = False)
unfreeze_layers = ['layers.0', 'layers.1', 'layers.2', 'layers.3', 'layers.4', 'layers.5', 'layers.6',
                   'layers.7', 'layers.8']

for name, param in self.model.named_parameters():
    for ele in unfreeze_layers:
        if ele in name:
            param.requires_grad = False
Thanks for your help!
It can create the features now, but another error occurs 😥

$ python train.py

INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
100%|██████████████████████████████████| 445782/445782 [20:00<00:00, 371.47it/s]
/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use thePyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
INFO:bart_model: Training started
Epoch 1 of 30:   0%|                                     | 0/30 [00:00<?, ?it/sINFO:bart_model:Saving model into result/checkpoint-200082 [14:53<49:54:35,  2.47
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
100%|████████████████████████████████████| 19999/19999 [00:49<00:00, 401.96it/s]
Epochs 0/30. Running Loss:    8.9371:   0%| | 1999/445782 [19:07<70:46:25,  1.74
Epoch 1 of 30:   0%|                                     | 0/30 [19:07<?, ?it/s]
Traceback (most recent call last):
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'input_text_a'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 77, in <module>
    model.train_model(train_df, eval_data=eval_df, Rouge=getListRouge)
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 186, in train_model
    **kwargs,
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 493, in train
    **kwargs,
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 697, in eval_model
    to_predict_a = eval_data["input_text_a"].tolist()
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'input_text_a'
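The KeyError above is raised only when evaluation first runs, about 19 minutes into training according to the log. A minimal sketch of a check that could run before train_model is shown below. missing_columns is a hypothetical helper, and the required list is an assumption: 'input_text_a' is taken from the traceback, and which columns a given version of bart_model.py actually expects is not stated in this thread.

```python
def missing_columns(columns, required):
    """Return the required column names that are absent, in the order given."""
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical example: with a DataFrame, pass eval_df.columns as the first argument.
eval_columns = ["input_text", "target_text"]
required = ["input_text_a", "target_text"]  # assumed; taken from the traceback

missing = missing_columns(eval_columns, required)
if missing:
    print(f"eval data is missing columns: {missing}")
```

Because the helper accepts any iterable of names, it works on a pandas column index as well as on a plain list.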

It's fixed now. Just re-clone the repository to get the updated bart_model.py.

You need to modify model_args in train.py first.
I forgot to mention this tip in the README.

It works now! Thanks :)