HanxunH / EditDistanceSpellingCorrection

COMP90049KT Project1

A simple spelling correction system based on edit distance.
This is Project 1 of Knowledge Technologies (COMP90049), Semester 2, 2018, University of Melbourne.

Environment

Python Version: 2.7.10

External Packages: ngram 3.3.2 (used by the -pynGram option)

How to Use?

python coconutOrca.py

  • Arguments
      • Interactive Mode: -i
      • Compare Files Mode: -f, followed by the path of the misspelled file and the path of the correct file
      • Threshold: -t, followed by an integer threshold
  • Edit Distance
      • -l Levenshtein
      • -d Damerau-Levenshtein
      • -ged Global Edit Distance
      • -led Local Edit Distance
      • -ngram N-Gram Distance
      • -pynGram N-Gram Distance (ngram 3.3.2)
      • -n, followed by a number (for N-Gram Distance)
  • Example: compare sample_test_misspell.txt with sample_test_correct.txt using Levenshtein distance and a threshold of 1.
$ python coconutOrca.py -l -f 2018S2-90049P1-data/sample_test_misspell.txt 2018S2-90049P1-data/sample_test_correct.txt -t 1
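
To make the -l and -t options more concrete, here is a minimal sketch of a Levenshtein-based lookup. It is not the code in coconutOrca.py: the function names `levenshtein` and `candidates` are hypothetical, and it assumes the threshold is the maximum distance at which a dictionary entry is still accepted as a candidate, which may differ from how the script interprets -t.

```python
def levenshtein(source, target):
    """Return the unit-cost Levenshtein distance between two strings."""
    # previous[j] is the distance between the processed prefix of
    # `source` and target[:j]; only two rows are kept in memory.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def candidates(word, dictionary, threshold):
    """Return the closest dictionary entries within `threshold` edits.

    Hypothetical helper: the threshold semantics are an assumption,
    not taken from the repository.
    """
    scored = [(levenshtein(word, entry), entry) for entry in dictionary]
    within = [(dist, entry) for dist, entry in scored if dist <= threshold]
    if not within:
        return []
    best = min(dist for dist, _ in within)
    # Ties are all returned, in dictionary order.
    return [entry for dist, entry in within if dist == best]
```

A call such as `candidates("helo", ["hello", "help", "world"], 1)` returns every entry tied at the smallest distance, so one misspelling can map to several predictions. A linear scan like this is slow against a dictionary of roughly 370K entries, so a practical version would likely prune candidates first, for example by length.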

Data Set

  • dict.txt: This is a list of approximately 370K English entries, which should comprise the dictionary for your approximate string search method(s). This dictionary is a slightly-altered version of the data from: https://github.com/dwyl/english-words The format of this file is one entry per line, in alphabetical order. You may use a different dictionary if you wish; if so, you should state the data source and justification in your report.

  • wiki_misspell.txt: This is a list of 4453 tokens that have been identified as common errors made by Wikipedia editors. It has been scraped from the following page: https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings The format of this file is one misspelling per line, in alphabetical order.

  • wiki_correct.txt: This is a list of the truly intended spellings of the corresponding misspelled tokens from wiki_misspell.txt - again, one item per line.

  • birkbeck_misspell.txt: This is a list of 34683 misspellings, comprising the "Birkbeck spelling error corpus". This is a machine-readable transcription of (hand-written) spelling mistakes made by schoolchildren, university students, and adult literacy students. The nature of these errors will probably be quite different to the typographical errors from Wikipedia. The corpus can be accessed through the Oxford Text Archive: http://ota.ox.ac.uk/ This particular dataset is a slightly-altered version of the one hosted by Roger Mitton: https://www.dcs.bbk.ac.uk/~ROGER/corpora.html

  • birkbeck_correct.txt: These are the corresponding corrections from birkbeck_misspell.txt - the format of these files is the same as the Wikipedia files; only the textual source is different.
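
Because each misspell file is aligned line by line with its correct file, an evaluation can pair the two by line index. The sketch below shows one way to do that. The helper names `read_entries` and `accuracy`, the UTF-8 encoding, and the single-prediction accuracy metric are all assumptions for illustration; the repository may report different measures.

```python
import io


def read_entries(path):
    """Read one entry per line, keeping line order so files stay aligned."""
    # io.open behaves the same on Python 2.7 and Python 3.
    # UTF-8 is an assumption about the data files.
    with io.open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


def accuracy(misspelled, correct, correct_fn):
    """Fraction of tokens for which `correct_fn` returns the intended word.

    `correct_fn` is any callable mapping a misspelled token to a single
    predicted spelling (hypothetical interface).
    """
    if len(misspelled) != len(correct):
        raise ValueError("misspell and correct lists must be the same length")
    if not misspelled:
        return 0.0
    hits = sum(
        1 for wrong, right in zip(misspelled, correct)
        if correct_fn(wrong) == right
    )
    return float(hits) / len(misspelled)
```

If the correction method returns several candidates per token, as a threshold-based search can, precision and recall over the candidate sets would be a more informative pair of measures than this single-prediction accuracy.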
