baudm / parseq

Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)


Any workaround to retain the original form of words?

PSanni opened this issue · comments

commented

The model currently cannot retain the original form of words. For example, if the words in an image are "sunflower oil", it returns "sunfloweroil" without the space. Is there any workaround to address this?

Also, is it possible to fine-tune this model on other datasets such as XFUND (https://github.com/doc-analysis/XFUND)?

commented

Hello @PSanni,

For your first problem, namely retaining the original form of words, I do not know how to address it.

However, for your second question, I was able to use another dataset of my own (currently training). Here is the solution I came up with; I hope it can be applied to your use case.

This project uses the datasets from https://github.com/ku21fan/STR-Fewer-Labels, as mentioned in Datasets.md, with a few workarounds.
If you look into that project, you will find a section in its Readme.md named "When you need to train on your own dataset or Non-Latin language datasets." I bet the name is explicit enough.
It provides create_lmdb_dataset.py, along with the expected input format, to generate a dataset in the format consumed by that codebase, and therefore by parseq as well (a rough example of the expected ground-truth file is shown below).
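To make the expected input concrete, here is a rough sketch. I am assuming the {image_path}<TAB>{label} ground-truth format and the --inputPath/--gtFile/--outputPath flags used by the upstream script, so please double-check against create_lmdb_dataset.py itself:

```python
# Hypothetical example -- verify the exact format and flags against create_lmdb_dataset.py.
# Each line of gt.txt: <image path relative to inputPath> <TAB> <label>
samples = [
    ('images/word_1.png', 'sunflower oil'),
    ('images/word_2.png', 'another label'),
]
with open('gt.txt', 'w', encoding='utf-8') as f:
    for path, label in samples:
        f.write(f'{path}\t{label}\n')

# Then, assuming the upstream CLI:
# python create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath data_lmdb/
```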

I thoroughly followed the instructions and was able to start training parseq on my own dataset.

Edit: the training terminates, but the test shows really inconsistent results. Maybe the .mdb file is still problematic. I am exploring this issue.

@PSanni for now, you can work around it by directly editing the dataset code and commenting out this line:

label = ''.join(label.split())
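To be clear about what that line does (this is plain Python, so the behavior is exact): it strips all whitespace from the label, which is why the space disappears. Commenting it out keeps the label intact:

```python
label = 'sunflower oil'
print(''.join(label.split()))  # 'sunfloweroil' -- all whitespace removed
# With that line commented out, the label stays as 'sunflower oil'.
```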

Note that some preprocessed datasets have had the spaces within labels removed. For the datasets which I preprocessed (COCO, OpenVINO, TextOCR), the spaces within the labels should be intact.

For fine-tuning on other datasets, you have two options:

  1. Write your own Dataset subclass which follows the same public interface as LmdbDataset (a minimal sketch is shown after this list).
  2. Preprocess your dataset into an LMDB database. See one of the converter scripts in tools for reference when writing your own preprocessing script, then use create_lmdb_dataset.py to create the actual LMDB files.
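For option 1, here is a minimal sketch, assuming the training code only needs __len__ and __getitem__ returning an (image, label) pair like LmdbDataset does; the class name and the gt.txt layout are just placeholders:

```python
# Minimal sketch of option 1 -- a Dataset yielding (image, label) pairs like LmdbDataset.
# The gt.txt layout ("relative/path<TAB>label" per line) is a hypothetical example.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class MyTextRecognitionDataset(Dataset):
    def __init__(self, root, transform=None):
        self.root = Path(root)
        lines = (self.root / 'gt.txt').read_text(encoding='utf-8').splitlines()
        self.samples = [line.split('\t', 1) for line in lines if '\t' in line]
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        img = Image.open(self.root / path).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, label
```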

@PSanni since commit e8ea463, you can now disable whitespace removal and/or Unicode normalization like so:
./train.py data.remove_whitespace=false data.normalize_unicode=false

commented

I think it would be a good idea to include annotation samples and the required input format for the model.

The LMDB format used is unchanged from prior work. create_lmdb_dataset.py expects a text file with one image path and label per line. The actual format is described in the README for the TextOCR and OpenVINO archives.
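If you want to sanity-check the resulting LMDB (e.g. to confirm the spaces survived), something like the following should work. This is a sketch assuming the usual num-samples / label-%09d key layout; verify the key names against LmdbDataset in strhub/data/dataset.py:

```python
import lmdb

# Assumed key layout: 'num-samples' plus 'image-%09d' / 'label-%09d' (1-indexed).
with lmdb.open('data_lmdb/', readonly=True, lock=False) as env:
    with env.begin() as txn:
        num_samples = int(txn.get('num-samples'.encode()))
        print('samples:', num_samples)
        for i in range(1, min(num_samples, 5) + 1):
            label = txn.get(f'label-{i:09d}'.encode()).decode()
            print(i, repr(label))  # spaces in labels should still be visible here
```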

The conversion from text labels to token IDs is handled by Tokenizer.encode() (in strhub/data/utils.py).
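Roughly, its usage looks like this (a sketch only; check the actual constructor and encode() signature in strhub/data/utils.py):

```python
# Illustrative only -- confirm the exact API in strhub/data/utils.py.
from strhub.data.utils import Tokenizer

charset = '0123456789abcdefghijklmnopqrstuvwxyz '  # note the trailing space character
tokenizer = Tokenizer(charset)
token_ids = tokenizer.encode(['sunflower oil'])  # batch of one label -> tensor of token IDs
print(token_ids)
```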

@PSanni since commit e8ea463, you can now disable whitespace removal and/or Unicode normalization like so: ./train.py data.remove_whitespace=false data.normalize_unicode=false

In addition to disabling whitespace (spaces, tabs, newlines, etc.) removal, make sure you add the space character ' ' to charset_train and charset_test so it won't get removed by CharsetAdapter.
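To illustrate why the space character matters, here is a simplified stand-in for the kind of filtering CharsetAdapter performs (not the exact implementation; see strhub/data/dataset.py):

```python
import re

# Simplified illustration of charset filtering -- not the actual CharsetAdapter code.
def filter_label(label: str, charset: str) -> str:
    unsupported = re.compile(f'[^{re.escape(charset)}]')
    return unsupported.sub('', label)

print(filter_label('sunflower oil', 'abcdefghijklmnopqrstuvwxyz'))   # 'sunfloweroil'
print(filter_label('sunflower oil', 'abcdefghijklmnopqrstuvwxyz '))  # 'sunflower oil'
```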

Closing this now since all issues have been addressed already. Feel free to reopen if I missed anything.