keyword-extraction
Document Keyword Extraction
Libraries Used
- CountVectorizer – for computing the frequency of each word in the texts.
- pandas – for handling data structures.
- stopwords – English-language stop words.
- TfidfTransformer – for transforming word-frequency counts into TF-IDF (Term Frequency–Inverse Document Frequency) values.
- re – for working with regular expressions.
- Counter – for maintaining per-word counts (used while searching for keywords).
- json – for working with JSON-formatted data.

Functions
- sort_on_count – sorts the tf-idf vector by tf-idf value.
- extract_topn – extracts the top n features (keywords) from the previously sorted tf-idf vector.
- tokenize – tokenizes a given sentence (text).
- probability – returns the probability of each word in the document corpus ('train.txt').
- known – returns the set of known corpus words among a keyword's fuzzified candidates.
- edit_dist_1 – returns the set of words at an edit distance of 1 from a given word.
- edit_dist_2 – returns the set of words at an edit distance of 2 from a given word.
- find – finds a given keyword in the short text.

Overview

The code does the following:
- Keyword extraction – TF-IDF is used for keyword extraction. The texts are loaded into two variables, train_docs and test_docs. The frequency of each word is computed on the training documents, and the resulting count vectors are transformed into TF-IDF values. Using the trained count vectorizer and TF-IDF transformer, keywords are then extracted from the test documents with sort_on_count and extract_topn.
- Keyword search – Each keyword extracted in the previous step is searched for in the short text using an edit-distance algorithm.
For the handling of misspellings, see Norvig's spelling corrector (https://norvig.com/spell-correct.html).
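The misspelling-tolerant search follows the structure of Norvig's corrector linked above. Below is a hedged sketch of how probability, known, edit_dist_1, and find could fit together; the corpus string and keyword are placeholders, not the actual train.txt data:

```python
import re
from collections import Counter

# Placeholder corpus standing in for train.txt
WORDS = Counter(re.findall(r"\w+", "payment invoice payment account".lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def probability(word, total=sum(WORDS.values())):
    """Relative frequency of a word in the corpus."""
    return WORDS[word] / total

def edit_dist_1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of candidate words that occur in the corpus vocabulary."""
    return {w for w in words if w in WORDS}

def find(keyword, short_text):
    """True if the keyword or any 1-edit misspelling of it is in the text."""
    candidates = {keyword} | edit_dist_1(keyword)
    return any(tok in candidates for tok in re.findall(r"\w+", short_text.lower()))

find("payment", "please confirm the paymnt today")  # -> True (1-edit match)
```

A two-edit search (edit_dist_2) is typically built by applying edit_dist_1 to each result of edit_dist_1, filtered through known to keep the candidate set small.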