foodoh / ocrd_menus

OCR'd menu text files for all the hotels in Bangalore. The Tesseract OCR engine was used for the purpose.

Noise in image to text data-set.

codeorbit opened this issue

@prodicus The dataset containing the text from images is not cleaned (i.e. it contains lots of special characters and numbers) and is not normalized either.
E.g. sambhar vada, vada sambar, and vada with sambhar are all the same dish, but they appear as different entries in the dataset.

@codeorbit That would be a problem.

But it would be a pain to manually clean the dataset for cases like sambhar vada. Skim through the dataset once; there are some properly OCR'd menus in there.

Does nltk (or any other module) have something which normalizes the data the way we want?

@prodicus There is no such module for that in nltk. Everyone writes their own custom normalization algorithm based on the patterns in their dataset.
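For illustration only, a custom normalizer of the kind described above might look like the sketch below. It is not code from this repo: the function name `normalize_item`, the `VARIANTS` spelling map, and the `STOPWORDS` set are all hypothetical and would have to be built from patterns actually seen in the dataset.

```python
import re

# Hypothetical spelling variants mapped to one canonical form.
# These entries are assumptions for the example, not taken from the dataset.
VARIANTS = {"sambhar": "sambar"}

# Hypothetical filler words that do not change which dish is meant.
STOPWORDS = {"with", "and"}


def normalize_item(raw: str) -> str:
    """Return a canonical key for a menu item name."""
    # Lowercase, then replace digits and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", raw.lower())
    # Map known spelling variants to their canonical form.
    tokens = [VARIANTS.get(tok, tok) for tok in text.split()]
    # Drop filler words.
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(tokens)))


for name in ["sambhar vada", "vada sambar", "vada with sambhar"]:
    print(name, "->", normalize_item(name))
```

Sorting the tokens discards word order, which is what makes the three spellings collapse into one key, but it would also merge dishes that differ only in word order, so it may not suit every menu.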

@codeorbit So would your models not work with this data?

Not just mine; no model will work with noisy data.

What is the plan now, then? OCR won't be giving us the data in the form you want. Burrp hosts text menus on their website, so we could probably use those.

@prodicus Cool .. we can go for it 👍