duims / fuzzySearch

Exploring Levenshtein distance in regard to OCR as a tool for detecting scanning errors

This code plays around with statistics you can calculate from words using the Levenshtein distance.

In particular, I'm interested in seeing if we can detect that any particular word is a "scanning error", as well as looking at other metrics.

I think scannos will have the following properties:

1) They'll have a low frequency (1-3 occurrences), or a low frequency relative to the "correct" word
2) They'll have a real word within Levenshtein distance 1-2
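
The two properties above could be checked roughly like this. This is only a minimal sketch, not the code in this repo: the function names, the thresholds (`max_count`, `max_ratio`, `max_distance`), and the idea of passing in a `vocabulary` set to decide what counts as a "real word" are all assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_scanno_candidates(words, vocabulary, max_count=3,
                           max_ratio=0.1, max_distance=2):
    """Map each suspicious word to the real words it may be a scanno of.

    Assumption: `vocabulary` is a set of known-good words; every threshold
    here is a hypothetical default, not a tuned value.
    """
    counts = Counter(words)
    candidates = {}
    for word, count in counts.items():
        if word in vocabulary:
            continue  # already a real word
        matches = sorted(
            real for real in vocabulary
            # property 2: a real word within a small edit distance
            if 1 <= levenshtein(word, real) <= max_distance
            # property 1: rare outright, or rare relative to the real word
            and (count <= max_count
                 or count <= max_ratio * counts.get(real, 0))
        )
        if matches:
            candidates[word] = matches
    return candidates


if __name__ == "__main__":
    text = ["the"] * 20 + ["tbe"] + ["cat"] * 5
    print(find_scanno_candidates(text, {"the", "cat"}))
```

Comparing every unknown word against the whole vocabulary is quadratic, so this is only practical for small word lists; a real run over a scanned book would probably want some kind of index or length-based pruning first.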

Languages

Language:Python 100.0%