keyword-extraction
Document Keyword Extraction
Libraries Used
- CountVectorizer – for computing the frequency of each word in the texts.
- pandas – for handling data structures.
- stopwords – English-language stop words.
- TfidfTransformer – for transforming word-frequency counts into TF-IDF (Term Frequency–Inverse Document Frequency) values.
- re – for working with regular expressions.
- Counter – for maintaining per-word counts (used while searching for keywords).
- json – for working with JSON-formatted data.

Functions
- sort_on_count – sorts the tf-idf vector by tf-idf value.
- extract_topn – extracts the top n features (keywords) from the previously sorted tf-idf vector.
- tokenize – tokenizes a given sentence (text).
- probability – returns the probability of each word in the document corpus ('train.txt').
- known – returns the set of known corpus words among a keyword's fuzzified candidates.
- edit_dist_1 – returns the set of words at an edit distance of 1 from a given word.
- edit_dist_2 – returns the set of words at an edit distance of 2 from a given word.
- find – finds a given keyword in the short text.

Overview

The code does the following:
- Keyword extraction – TF-IDF is used for keyword extraction. The texts are loaded into two variables, train_docs and test_docs. The frequency of each word is computed on the training documents, and the resulting count vectors are transformed into TF-IDF values. Using the trained count vectorizer and TF-IDF transformer, keywords are then extracted from the test documents with sort_on_count and extract_topn.
- Keyword search – Each keyword extracted in the previous step is searched for in the short text using an edit-distance algorithm.
For the handling of misspellings, see Norvig's spelling corrector (https://norvig.com/spell-correct.html).
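The misspelling-tolerant search follows the structure of Norvig's corrector linked above. Below is a hedged sketch of how probability, known, edit_dist_1, and find could fit together; the corpus string and keyword are placeholders, not the actual train.txt data:

```python
import re
from collections import Counter

# Placeholder corpus standing in for train.txt
WORDS = Counter(re.findall(r"\w+", "payment invoice payment account".lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def probability(word, total=sum(WORDS.values())):
    """Relative frequency of a word in the corpus."""
    return WORDS[word] / total

def edit_dist_1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of candidate words that occur in the corpus vocabulary."""
    return {w for w in words if w in WORDS}

def find(keyword, short_text):
    """True if the keyword or any 1-edit misspelling of it is in the text."""
    candidates = {keyword} | edit_dist_1(keyword)
    return any(tok in candidates for tok in re.findall(r"\w+", short_text.lower()))

find("payment", "please confirm the paymnt today")  # -> True (1-edit match)
```

A two-edit search (edit_dist_2) is typically built by applying edit_dist_1 to each result of edit_dist_1, filtered through known to keep the candidate set small.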