Noise in image to text data-set.
codeorbit opened this issue
@codeorbit That would be a problem.
But it would be a pain to manually clean the dataset given cases like sambhar vada. Skim through the dataset once; there are some properly OCR'd menus in there.
Does nltk (or any other module) have something which normalizes the data the way we want?
@prodicus There is no such module for that in nltk. Everyone makes their own custom algorithm for normalization based on some pattern in the dataset.
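To give a rough idea of what such a custom normalization could look like, here is a minimal sketch using only the Python standard library. It is not the approach used in this project: the function names (`normalize_line`, `clean_menu`) and the `min_alpha_ratio` threshold are hypothetical, and the rules assume the noise is mostly stray symbols and junk lines, which may not match the actual dataset.

```python
import re
from typing import List


def normalize_line(line: str) -> str:
    """Lowercase, replace anything that is not a-z, 0-9 or whitespace
    with a space, then collapse runs of whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z0-9\s]", " ", line)
    return re.sub(r"\s+", " ", line).strip()


def clean_menu(raw_text: str, min_alpha_ratio: float = 0.6) -> List[str]:
    """Return normalized lines, skipping blank lines and lines that
    look like OCR junk.

    min_alpha_ratio is an assumed threshold: a line is kept only if at
    least this fraction of its non-space characters are letters.
    """
    cleaned = []
    for raw_line in raw_text.splitlines():
        non_space = [c for c in raw_line if not c.isspace()]
        if not non_space:
            continue  # blank line
        alpha_count = sum(c.isalpha() for c in non_space)
        if alpha_count / len(non_space) < min_alpha_ratio:
            continue  # mostly symbols or digits, treat as noise
        normalized = normalize_line(raw_line)
        if normalized:
            cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    sample = "Sambhar  Vada!!\n#@$%^ 12\n\nMasala Dosa"
    print(clean_menu(sample))
```

The threshold and the character whitelist would have to be tuned after looking at the real OCR output; for example, prices would be dropped by the letter-ratio check as written.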
@codeorbit So would your models not work with this data?
Not just mine; no model will work well with noisy data.
What is the plan now, then? OCR won't be giving us the data in the form you want. Burrp hosts text menus on their website, so we could probably use those.
@prodicus Cool .. we can go for it 👍