Jumbled Characters in Dataset

Question

t03i opened this issue 2 years ago

In train 74k.fasta, the sequence 9pcyA00 contains 0 bytes.

Michael Heinzinger · Answer 1 · Mon Jan 10 2022 23:16:30 GMT+0800 (China Standard Time)

Thanks for reporting. I did not encounter this error when reading the file in with Python. However, I also saw the single malformed character in the reported sequence when opening the file in the browser. Therefore, I decided to remove this sequence from the training data to avoid further complications.
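For anyone who wants to check their own copy of the file, here is a rough sketch of how such a character could be located. It assumes the reported "0 bytes" are null bytes (0x00) and that valid residues are the 20 standard amino acids plus X; both are assumptions, not something confirmed in this thread. The names `iter_fasta_bytes`, `find_bad_bytes` and `ALLOWED` are made up for illustration. The file is read in binary mode on purpose: a null byte decodes without error in Python's text mode, which may be why reading the file normally raised nothing.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: 20 standard amino acids plus X (unknown), upper case only.
ALLOWED = frozenset(b"ACDEFGHIKLMNPQRSTVWYX")


def iter_fasta_bytes(lines: Iterable[bytes]) -> Iterator[Tuple[str, bytes]]:
    """Yield (header, sequence) pairs from FASTA lines given as raw bytes."""
    header = None
    chunks: List[bytes] = []
    for raw in lines:
        line = raw.rstrip(b"\r\n")
        if line.startswith(b">"):
            if header is not None:
                yield header, b"".join(chunks)
            header = line[1:].decode("ascii", errors="replace")
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, b"".join(chunks)


def find_bad_bytes(lines: Iterable[bytes]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each header to the (position, byte value) pairs outside ALLOWED."""
    bad: Dict[str, List[Tuple[int, int]]] = {}
    for header, seq in iter_fasta_bytes(lines):
        hits = [(i, b) for i, b in enumerate(seq) if b not in ALLOWED]
        if hits:
            bad[header] = hits
    return bad
```

To scan a file on disk, open it with `open(path, "rb")` and pass the file object to `find_bad_bytes`. Sequences written in lower case or using other ambiguity codes would also be flagged, so `ALLOWED` may need adjusting for a given dataset.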