Tag Len DataError Occurring Regardless of Tag Len Matching Address Len

Question

joseandrejv opened this issue 2 years ago

I'm trying to retrain a Bpemb model with new address tags, and I'm using CSVDatasetContainer to load the data. I've followed all the guidelines I could find so that the data is read in without errors. The training data has two columns in the required format. None of the addresses are empty or a single whitespace, and I've verified repeatedly that the length of each address matches the length of its tag list. I did this by tokenizing the original addresses and programmatically comparing their lengths with the lengths of the tag lists from the same row (using a pandas version of the same dataframe). I also dug into the source code and tried the function you guys have listed there (_data_tags_is_same_len_then_address); when I run it on the pandas version of my df, the output is True, which is supposed to mean that everything is as it should be. I also tried PickleDatasetContainer instead, using a .p file with the data formatted as requested, and I get the same error.
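
For readers who want to reproduce the kind of length check described above, here is a minimal sketch. It is not deepparse's own validation: find_length_mismatches is a hypothetical helper, the whitespace tokenization is an assumption, and the tag names and addresses are made up for illustration.

```python
import ast


def find_length_mismatches(rows):
    """Return (row index, token count, tag count) for rows whose counts differ.

    Hypothetical helper for illustration; not part of deepparse.
    """
    mismatches = []
    for index, (address, tags) in enumerate(rows):
        # Tags read from a CSV cell arrive as a string such as "['A', 'B']",
        # so parse them into a real list before counting.
        if isinstance(tags, str):
            tags = ast.literal_eval(tags)
        # Assumption: one tag per whitespace-separated token.
        n_tokens = len(address.split())
        if n_tokens != len(tags):
            mismatches.append((index, n_tokens, len(tags)))
    return mismatches


# Made-up rows: the first is consistent, the second is one tag short.
rows = [
    ("350 rue des Lilas Ouest",
     "['StreetNumber', 'StreetName', 'StreetName', 'StreetName', 'Orientation']"),
    ("12 calle mayor", ["StreetNumber", "StreetName"]),
]

print(find_length_mismatches(rows))
```

A check like this only confirms that the rows are consistent once parsed this way. A loader that splits the tag string differently can still count a different number of tags.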

This is how I'm trying to read in the data:
CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

And this is the error I keep getting:

System Info:

OS: Windows 10
IDE: VS Code
Python Version: 3.9.12
Deepparse Version: 0.7.3
Poutyne Version: 1.9 (I pinned this version so I could use the progress bar feature. There's another issue where the code compares the Poutyne version to 1.8 as a float, and the latest version, 1.11, is technically a smaller decimal number)
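
The version comparison problem mentioned in the Poutyne line can be sketched as follows. This only illustrates the general pitfall and is not the code deepparse uses; parse_version is a hypothetical helper and only handles purely numeric versions.

```python
def parse_version(version):
    """Turn "1.11" into (1, 11) so each component compares as an integer.

    Hypothetical helper; it does not handle suffixes such as "rc1" or "dev0".
    """
    return tuple(int(part) for part in version.split("."))


# As floats, 1.11 is smaller than 1.8, so a "version >= 1.8" check fails.
print(float("1.11") >= float("1.8"))

# As integer tuples, (1, 11) is greater than (1, 8), which matches release order.
print(parse_version("1.11") >= parse_version("1.8"))
```

For real projects, packaging.version.Version from the third-party packaging library handles pre-release and development suffixes as well.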

I'm not 100% sure whether this qualifies as a bug, but it sure is perplexing and I'm not sure where else to ask for help.

I guess this boils down to:

Is there anything about my system that could be causing this?
Is it the separator I'm using? (Without ',', the function won't read in the data correctly, and it's worked with a smaller training set before.)
Is there any other potential factor I haven't considered?
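
On the separator question, one general thing worth checking is how a ',' separator interacts with the commas inside the tag list itself. The sketch below uses only Python's standard csv module, not CSVDatasetContainer, and the row is made up. It is not a claim about what caused the error in this thread.

```python
import csv
import io

tags = ["StreetNumber", "StreetName", "StreetName"]

# Written by hand without quoting: the commas inside the tag list are
# indistinguishable from the column separator.
unquoted = "Address,Tags\n12 calle mayor," + str(tags) + "\n"

# Written with csv.writer: the tag cell contains the delimiter, so it is
# quoted automatically.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")
writer.writerow(["Address", "Tags"])
writer.writerow(["12 calle mayor", str(tags)])
quoted = buffer.getvalue()


def read_rows(text):
    """Parse CSV text with a comma separator and return the rows as lists."""
    return list(csv.reader(io.StringIO(text), delimiter=","))


# The unquoted data row splits into four fields, the quoted one into two.
print(len(read_rows(unquoted)[1]))
print(len(read_rows(quoted)[1]))
```

If the file was written by pandas or csv.writer with default settings, the tag column should already be quoted, so this is mainly a concern for files assembled by hand or by string concatenation.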

Thanks in advance for your help.

David Beauchemin · Answer 1 · Wed May 11 2022 03:58:48 GMT+0800 (China Standard Time)

Is it possible for you to share your dataset with me (in private) to ease the debugging?

José Jiménez · Answer 2 · Wed May 11 2022 04:14:31 GMT+0800 (China Standard Time)

Hello Dave and thank you for replying so promptly.

I'm going to ask my boss, but chances are that I won't be allowed to, as the information is (even in sample form with limited attributes) confidential. Is there anything else I could do to help? Anything I could check, or something?

David Beauchemin · Answer 3 · Wed May 11 2022 04:20:53 GMT+0800 (China Standard Time)

No worries. I have pushed code on a branch to try to debug it.

Install the following version of the project using pip install -U git+https://github.com/GRAAL-Research/deepparse.git@bug_fix_data_tags_len. You can later return to the stable version with pip install -U deepparse.

Then try running this:

dataset = CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

I have pushed a new method that is invoked when the error occurs and prints the cases where the lengths differ. I have not tested it, however. Tell me if it shows proper details.

Then send a screenshot of the output (if it works properly).

Edits:
*I have simplified the code for you.
*Text improvement.

José Jiménez · Answer 4 · Wed May 11 2022 04:26:05 GMT+0800 (China Standard Time)

I installed the fix and ran the code. Now there's a unicode error.

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte"

I also ran the code after installing -U deepparse again and got the same error.

In pandas I would change the encoding with the encoding argument, how would I fix this with CSVDatasetContainer?

*Edit: Would I just have to create a different file?

David Beauchemin · Answer 5 · Wed May 11 2022 04:28:59 GMT+0800 (China Standard Time)

Uhm, which language are the addresses in? It is possible your CSV is not in UTF-8. Try recreating it, but specify UTF-8 encoding. It is also possible that there are special characters that look like whitespace but are not.
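
A rough sketch of both suggestions, re-saving a file as UTF-8 and looking for characters that resemble whitespace, could look like this. The helper names are hypothetical, the cp1252 source encoding is only an assumption for the demo (the thread never states the real encoding), and the file contents are made up.

```python
import os
import tempfile
import unicodedata


def reencode_to_utf8(source_path, target_path, source_encoding):
    """Read a text file in source_encoding and write it back as UTF-8."""
    with open(source_path, "r", encoding=source_encoding, newline="") as source:
        text = source.read()
    with open(target_path, "w", encoding="utf-8", newline="") as target:
        target.write(text)


def find_whitespace_lookalikes(text):
    """Return (position, Unicode name) for space-like or invisible characters
    other than the ordinary space, tab, and newline characters."""
    suspects = []
    for position, char in enumerate(text):
        if char in " \t\r\n":
            continue
        # isspace() catches e.g. NO-BREAK SPACE; category "Cf" catches
        # invisible format characters such as ZERO WIDTH SPACE.
        if char.isspace() or unicodedata.category(char) == "Cf":
            suspects.append((position, unicodedata.name(char, "UNKNOWN")))
    return suspects


# Demo with a temporary file; cp1252 is an assumed source encoding.
with tempfile.TemporaryDirectory() as folder:
    source_path = os.path.join(folder, "train_cp1252.csv")
    target_path = os.path.join(folder, "train_utf8.csv")
    with open(source_path, "w", encoding="cp1252", newline="") as handle:
        # The address contains a no-break space between "12" and "calle".
        handle.write("Address,Tags\n12\u00a0calle mayor,x\n")
    reencode_to_utf8(source_path, target_path, "cp1252")
    with open(target_path, "r", encoding="utf-8", newline="") as handle:
        converted = handle.read()

print(find_whitespace_lookalikes(converted))
```

A no-break space is not an ASCII space, but Python's str.split() with no arguments still splits on it, so a token count computed that way can differ from one computed by splitting on ' ' only.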

José Jiménez · Answer 6 · Wed May 11 2022 04:35:59 GMT+0800 (China Standard Time)

Wow, that makes sense!

The addresses are in Spanish, but part of the cleaning process involves programmatically replacing all latin characters with plain letters (as well as removing others that aren't a part of the language but have no business being in a standardized address), so I didn't think this would be an issue. I'll try recreating the file with the UTF-8 specification and let you know what happens as soon as I can.

José Jiménez · Answer 7 · Wed May 11 2022 05:21:11 GMT+0800 (China Standard Time)

Ok, I've re-saved the csv with utf-8 encoding specified, and ran the code once again. I'm back where I started, with the same DataError about the lengths of the Tag lists not matching the lengths of the addresses.

David Beauchemin · Answer 8 · Wed May 11 2022 06:55:08 GMT+0800 (China Standard Time)

Ok and does the bug_fix print something useful? (This version of deepparse pip install -U git+https://github.com/GRAAL-Research/deepparse.git@bug_fix_data_tags_len)

José Jiménez · Answer 9 · Thu May 12 2022 01:35:45 GMT+0800 (China Standard Time)

No, first it was the unicode error (which doesn't seem to be an issue now that I changed the file) and now I'm getting the same len error after reinstalling the bug fix version and trying it out again.

Would sharing the data help much more? They got back to me and I can send it to you privately.

David Beauchemin · Answer 10 · Thu May 12 2022 21:40:16 GMT+0800 (China Standard Time)

It would definitely be easier. david.beauchemin.5 at(@) ulaval.ca

David Beauchemin · Answer 11 · Fri May 13 2022 00:23:14 GMT+0800 (China Standard Time)

See #127 for details.
The problem was that the list of tags was not properly split. It is fixed in release 0.7.4.