Noise in image-to-text dataset

Question

codeorbit opened this issue 8 years ago · comments

@prodicus The dataset containing text extracted from images is not cleaned (i.e. it contains lots of special characters and numbers) and is not normalized either.
For example, "sambhar vada", "vada sambar", and "vada with sambhar" are all the same dish, but they appear as different entries in the dataset.

Tasdik Rahman · Answer 1 · Wed Mar 30 2016 00:25:45 GMT+0800 (China Standard Time)

@codeorbit That would be a problem.

But it would be a pain to manually clean the dataset for cases like sambhar vada. Skim through the dataset once; there are some properly OCR'd menus in there.

Does nltk (or any other module) have something which normalizes the data the way we want?

Akhil Gupta · Answer 2 · Wed Mar 30 2016 00:32:12 GMT+0800 (China Standard Time)

@prodicus There is no such module for that in nltk. Everyone makes their own custom algorithm for normalization based on some pattern in the dataset.
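
As a rough illustration of what such a custom normalizer might look like for the sambhar vada case, here is a minimal sketch. The function name `normalize_item`, the spelling-variant table, and the filler-word list are all hypothetical and would have to be built from patterns actually seen in the dataset.

```python
import re

# Hypothetical spelling variants; extend from what the dataset actually contains.
SPELLING_VARIANTS = {"sambar": "sambhar"}
# Hypothetical filler words that do not change which dish is meant.
FILLER_WORDS = {"with", "and"}


def normalize_item(raw: str) -> str:
    """Map a noisy OCR'd menu entry to a canonical key (illustrative only)."""
    # Lowercase, then turn every run of non-letters (digits, symbols, spaces)
    # into a single space.
    text = re.sub(r"[^a-z]+", " ", raw.lower())
    tokens = set()
    for token in text.split():
        # Collapse known spelling variants onto one canonical spelling.
        token = SPELLING_VARIANTS.get(token, token)
        if token not in FILLER_WORDS:
            tokens.add(token)
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(tokens))


for entry in ["sambhar vada", "vada sambar", "Vada with Sambhar!! 45"]:
    print(repr(entry), "->", repr(normalize_item(entry)))
```

With this sketch all three entries collapse to the same key. It is deliberately crude: sorting tokens would also merge dishes where word order matters, and dropping digits loses names that legitimately contain numbers, so the rules need tuning against the real data.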

Tasdik Rahman · Answer 3 · Wed Mar 30 2016 01:09:36 GMT+0800 (China Standard Time)

@codeorbit So would your models not work with this data?

Akhil Gupta · Answer 4 · Wed Mar 30 2016 01:15:21 GMT+0800 (China Standard Time)

Not just mine; no model will work well with noisy data.

Tasdik Rahman · Answer 5 · Wed Mar 30 2016 01:31:48 GMT+0800 (China Standard Time)

What is the plan now, then? OCR won't be giving us the data in the form you want. Burrp hosts text menus on their website, so we could probably use them.

Akhil Gupta · Answer 6 · Wed Mar 30 2016 02:17:24 GMT+0800 (China Standard Time)

@prodicus Cool .. we can go for it 👍