deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Home Page: https://farm.deepset.ai


Losing Input Data in Classification Task

FM29 opened this issue · comments

commented

I am using FARM for a document classification task in a medical context.
The deepset German BERT model was additionally trained on German medical Wikipedia and then fine-tuned for the task at hand: given a long input string (the OCR-read content of a scanned file), predict the document class of that file.

I mostly followed this https://colab.research.google.com/drive/130_7dgVC3VdLBPhiEkGULHmqSlflhmVM#scrollTo=tPltDefXjSiJ tutorial and it worked fine in the first tests.

As we are now training on the final data, the processor (?) started to simply lose data even before training. The problem happens in the Inferencer too.

In a pre-cleaning step, the train.tsv and test.tsv are produced like this:

import pandas as pd
from sklearn.model_selection import train_test_split

# get mukl data and convert to a DataFrame
df_all_data = pd.read_csv('dataset/mukl.tsv', delimiter="\t", encoding='latin-1', names=['sentence', 'label'])
df_mukl_bert = pd.DataFrame({'text': df_all_data['sentence'], 'label': df_all_data['label']})

# convert given labels to final labels (convert_labels_mukl is our own mapping helper)
new_labels = convert_labels_mukl(df_mukl_bert['label'])
df_mukl_bert['label'] = new_labels

# produce train and test DataFrames
df_mukl_train, df_mukl_test = train_test_split(df_mukl_bert, test_size=0.1)
df_mukl_test = pd.DataFrame({'text': df_mukl_test['text'],
                             'label': df_mukl_test['label']})

# write DataFrames to files
df_mukl_train.to_csv('./dataset/train.tsv', sep='\t', index=False)
df_mukl_test.to_csv('./dataset/test.tsv', sep='\t', index=False)
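
Not part of my script, but a quick sanity check along these lines (just a sketch) re-reads the files and compares row counts, to make sure the TSV round trip keeps every entry:

# sanity check (sketch): re-read the files just written and compare row counts
check_train = pd.read_csv('./dataset/train.tsv', sep='\t')
check_test = pd.read_csv('./dataset/test.tsv', sep='\t')
print(len(df_mukl_train), len(check_train))  # should be equal
print(len(df_mukl_test), len(check_test))    # should be equal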

After this step, I have my BERT model ready to train and set up the basics:

import torch
from pathlib import Path
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = Tokenizer.load(
    pretrained_model_name_or_path=Path("./saved_models/german-bert-pretrain-med"), do_lower_case=False)

labels = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        data_dir="./dataset",
                                        train_filename="train.tsv",
                                        label_list=labels,
                                        metric="acc",
                                        label_column_name="label")

BATCH_SIZE = 32
data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

The DataSilo does its work without any errors, but at the latest here, some of the input data gets lost.
My train.tsv contains 55705 entries and my test.tsv contains 6190 entries.

Expected Behaviour:
The DataSilo logs:
Loading train set from: dataset/train.tsv
Got ya 15 parallel workers to convert 55705 dictionaries to pytorch datasets (chunksize = 741)
....Preprocessing...
Examples in train + Examples in dev = 55705
Examples in test : 6190

Actual Behaviour:
The DataSilo logs:
Loading train set from: dataset/train.tsv
Got ya 15 parallel workers to convert 55518 dictionaries to pytorch datasets (chunksize = 741)
....Preprocessing...
Examples in train: 49467
Examples in dev: 5871
Examples in test : 6171

When using the exact same code with another train/test pair generated from another CSV file, the behaviour is as expected, so I felt the reason must lie in the data. Since even the Inferencer loses the same data, I could identify the lost texts and check them.
Some of them contain special characters like !, ?, @, *, _ and some start with ', but no single character appears in all of them, and all of these characters also appear in texts that are processed correctly.
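
For reference, a rough check along these lines (just a sketch; lost_texts is a hypothetical list holding the texts I identified as missing) shows which suspicious characters each lost text contains:

# sketch: scan each lost text for characters that could confuse a TSV reader
# lost_texts is a hypothetical list of the texts that went missing
suspicious = ["\t", "'", '"', "!", "?", "@", "*", "_"]
for text in lost_texts:
    hits = [repr(ch) for ch in suspicious if ch in text]
    print(hits, text[:60])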

So now I am out of ideas on how to fix this. Any ideas?

This might be related to the new tokenizer that we use. @brandenchan could you please have a look into it? I can see that our problematic_ids are not populated yet for the TextClassificationProcessor, see here.

Hi @FM29, I can start looking at this issue for you. Is there any chance that the sample labels could be wrongly generated? Otherwise, it would be easiest for me to help debug if you are able to send me some of the samples that get lost in processing.

commented

Hi Branden, thank you for taking the time.
I double-checked the labels in the test file with this code:

import numpy as np
import pandas as pd
df_test_data = pd.read_csv('dataset/test.tsv', delimiter="\t", encoding='latin-1')
print(np.unique(df_test_data['label']))

Output: [ 2  3  4  5  6  8  9 10 11 12 13 15 16 17 18 19 20 21]

As you can see in the post above, those are exactly the labels we gave the processor to work with, so it seems the labels are set correctly. The same holds for train.tsv. Attached you can find a zip of the test.tsv; train was too big to attach. Feel free to run any checks you want.
test.zip

To identify the ones lost during inference, I used this code:

from pathlib import Path
import pandas as pd
from farm.infer import Inferencer

save_dir = Path("./saved_models/GermanBertInferMitMukl")
infer_model = Inferencer.load(save_dir, task_type="text_classification", gpu=True, return_class_probs=True)

result = infer_model.inference_from_file(file="dataset/test.tsv")

# collect the context (input text) of every prediction that came back
contexts_used = []
for i in range(len(result)):
    for j in range(len(result[i]['predictions'])):
        context = result[i]['predictions'][j]['context']
        contexts_used.append(context)

print("Texts that arrived at the Inferencer: ", len(contexts_used))

actual_df = pd.read_csv("./dataset/test.tsv", delimiter='\t')
actual_contexts = actual_df['text']
removed_contexts = list(actual_df['text'])

print("Actual length of test.tsv: ", len(actual_contexts))

# anything left in removed_contexts was never returned by the Inferencer
for i in range(len(actual_contexts)):
    if actual_contexts[i] in contexts_used:
        removed_contexts.remove(actual_contexts[i])

print("Number of texts that weren't identified: ", len(removed_contexts))

which outputs:

Texts that arrived at the Inferencer:  6168
Actual length of test.tsv:  6190
Number of texts that weren't identified:  25

I don't know if it was my way of counting, but it identifies 25 texts that were in the test.tsv but weren't among the result contexts. Since only 22 lines are missing in the results, I would assume that the other 3 got changed along the way? I put the identified removed/changed ones into this file:
removed_contexts.txt
Note that the " at the beginning and end of each text is not part of the actual text, everything else is though.

I hope this helps you fix the problem. If you prefer, we can Skype too.

Ok, so I put the texts in removed_contexts.txt through a different classification model and I still got 25 predictions returned. Could you try running inference again using the model deepset/bert-base-german-cased-hatespeech-GermEval18Coarse and see if those texts still get removed/changed?

commented

I'm sorry, I don't know how to do that. I am a beginner with BERT, and changing anything in the code leads to error messages somewhere. So you are saying it could be because of my classification model?
I doubt that, since with the exact same classification model and the exact same Inferencer, not a single text is lost with another test set.
test2.zip
Should I run inference on all of my 65000 training samples to identify lost ones, remove them and then retry?

So one suspicion is that your model interacts with your dataset in some way, causing you to lose these 25 examples. I say this because when I ran inference on these 25 examples using another model, there were no problems and all samples were retained.

To run inference on the test set using the same model as me, use these lines

infer_model = Inferencer.load(save_dir, task_type="text_classification", gpu=True, return_class_probs = True)
result = infer_model.inference_from_file(file="dataset/test.tsv")

I think it would also be good to see what happens if you run the inference just on the text in removed_contexts.txt using both models.

The number of texts that arrived at inference vs the length of test.tsv is also a little odd. Do you know if there are any duplicates in test.tsv?
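
For example, something like this (a quick sketch; the column name text is assumed from your earlier snippets) counts exact duplicates in test.tsv:

import pandas as pd

# sketch: count exact duplicate texts in test.tsv (column name assumed)
df = pd.read_csv("dataset/test.tsv", delimiter="\t")
print(df.duplicated(subset="text").sum())   # number of repeated texts
print(df["text"].value_counts().head())     # the most frequent texts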

The issue could also still be in the new preprocessing steps in FARM. But I think I would need to understand your problem a little more and also be able to replicate the issue before I am able to find a fix!

commented

Your code for inference does not differ from the one I already have. What save_dir should I use for inference with your model?

The number of texts that arrived at inference vs the length of test.tsv is also a little odd. Do you know if there are any duplicates in test.tsv?

This is the exact reason why I submitted the ticket in the first place. I checked for duplicates and there were some; I removed them, but the length of test.tsv and the number of texts arriving at the Inferencer still differ by 21. My only goal with this is that no texts are lost at inference. Updated test.tsv: test1.zip

Any other ideas why the length of the file differs from what arrives at the Inferencer?

Your code for inference does not differ from the one I already have. What save_dir should I use for inference with your model?

Oh, I'm really sorry, that's a mistake on my part. What I meant to type was the following. FARM will automatically download the model that's specified there.

infer_model = Inferencer.load(
    "deepset/bert-base-german-cased-hatespeech-GermEval18Coarse",
    task_type="text_classification", gpu=True, return_class_probs=True)
result = infer_model.inference_from_file(file="removed_contexts.txt")

You should also rerun inference on this "removed_contexts.txt" file using your saved model.

Any other ideas why the length of the file differs from what arrives at the Inferencer?

Knowing the versions and environment you're working in will also help. Are you running this in Colab or locally on your computer? Otherwise I'll be able to make a much better guess once I can replicate your problem.

commented

Hi, thanks for the update.
Neither of the models will take in a .txt, so I converted removed_contexts.txt to .tsv and added a 'label' column with content NaN (since apparently it is expected).
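
Roughly, the conversion looked like this (a sketch; the column names and encoding are assumptions, not necessarily exactly what I ran):

import pandas as pd

# sketch: turn the plain-text file into a TSV with an empty label column
with open("removed_contexts.txt", encoding="latin-1") as f:
    texts = [line.rstrip("\n") for line in f if line.strip()]
df = pd.DataFrame({"text": texts, "label": float("nan")})
df.to_csv("removed_contexts.tsv", sep="\t", index=False)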

Running with both my model and yours, there seems to be a conversion issue. I placed the whole log into a txt file.
Your model: YourModelLog.txt
My model with removed_contexts: MyModelLog.txt

Those messages gave me a hunch, so I took the old dataset's test.tsv, removed the (correct) labels from the label column and replaced them with NaN, since in real inference we wouldn't have them either. And bam, same errors. So it could have to do with the processing of the label column in test.tsv. Maybe I did something wrong there in training and need to redo it?

For my environment:
We compute on various Jupyter Notebooks on a Linux server, so locally.
Entering python -v in a cell reports Python 2.7, but the notebook's corner says Python 3 and all the code is Python 3 too (at least 3.7.9).
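
Something like this could pin down which interpreter and FARM version the notebook kernel actually uses (just a sketch; it assumes pip is installed in the same environment as the kernel):

import sys
import subprocess

print(sys.version)  # the interpreter the notebook kernel really runs on
# ask pip (from the same interpreter) which FARM version is installed
print(subprocess.run([sys.executable, "-m", "pip", "show", "farm"],
                     capture_output=True, text=True).stdout)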
I hope this helps.

One thing I just noticed from your logs is that you are using an older version of FARM. Could you update yours and rerun? You can upgrade using something like this

pip install --upgrade farm 

I am going to do a few runs on my side to see what happens when the labels are NaN

commented

Just saw your earlier comment; I updated FARM and reran the DataSilo with the train and test set I will use in the final model.
Here is the log: newLogDataSilo.txt
The lengths still do not match (length of train.tsv: 56150, length of test.tsv: 6240).

And here is the new log of the inferencer, used with test.tsv (real correct labels in label column): newLogInferencer.txt

Here too the same problem still exists:

Texts that arrived at the Inferencer: 6217
Actual length of test.tsv: 6240
Number of texts that weren't identified: 30

Updated removed contexts: removed_contexts (1).txt

Hope this helps :)

@FM29 Hey, I think I've figured out at least one part of the issue! Could you try training again and initializing the TextClassificationProcessor with quote_char='"'? This definitely had an impact on the number of samples kept before training. I was trying to train the model using the original test.tsv; after changing the quote char, the preprocessing steps retained all 6190 samples.

i.e.

processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        data_dir="./dataset" ,
                                        train_filename="train.tsv",
                                        label_list=labels,
                                        metric="acc",
                                        label_column_name="label",
                                        quote_char='"')

Quote chars are used so that text can include the TSV delimiter symbol (i.e. \t) without ruining the TSV format.
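
As a toy illustration (made-up rows, using Python's csv module rather than FARM's internal reader), an unclosed quote character makes the reader swallow the rows that follow:

import csv
import io

# two-column TSV where one text starts with a stray single quote
data = "text\tlabel\nfirst doc\t2\n'broken doc\t3\nnext doc\t4\n"

# with ' as the quote char, the unclosed quote merges the following lines into one field
with_single_quote = list(csv.reader(io.StringIO(data), delimiter="\t", quotechar="'"))
# with " as the quote char, every row survives
with_double_quote = list(csv.reader(io.StringIO(data), delimiter="\t", quotechar='"'))

print(len(with_single_quote))  # 3 rows: header, first doc, and one merged row
print(len(with_double_quote))  # 4 rows: header plus all three documents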

commented

You are amazing!

I just restarted training and with the modification the DataSilo does not lose any data at all. So the Inferencer probably won't later either.
Does that mean that somewhere in our data we have hidden tabs lying around?

Thank you so much for this super intense help with this.

Really happy that I could help! Feel free to reopen this issue if anything else comes up.

Does that mean that somewhere in our data we have hidden tabs lying around?

I believe the issue is more that you might have some stray single quotation marks (i.e. the ' character) lying around. That is the default quote_char in FARM, and if there is an opening single quote without a closing one, preprocessing may ignore one of the tabs that delimit your TSV.

Hi @brandenchan,
I have a similar problem when fine-tuning xlm-roberta-base for QA. While fine-tuning on MLQA (using test-context-de-question-de.json), it recognizes all training examples. However, when using a dataset I annotated myself (with the Haystack annotation tool), only 79 dictionaries are converted; the original train file contains 910 training examples.
I am using Haystack version 0.6.2.
Any idea what might be the problem in this case?
Thanks!

Hey @sophgit, you already created an issue about offsets on Windows.

Maybe now it is a different offset problem. Did you already try the solution outlined in deepset-ai/haystack#492 (comment)?

Hey, thank you so much for your response.
I got some things mixed up, I'm really sorry (and it's not the offsets; I tried the script, it didn't correct any offsets).
During training, I actually get the message that there are more training examples than I actually have. I'm currently trying to figure out why that is without bugging you again.

k no worries : )
Short info: the number of training samples reported in the printout is the number of passages. If you have a long input article, the text will be split into multiple passages, resulting in a higher number (e.g., one article split into three passages counts as three samples in that printout).