Maria's repositories
BERT_monoware
BERT service implementation
BERT_NER_DIA
This repository contains a script that extracts ICD codes and diagnoses from medical reports.
BERT_classifier
Multiclass text classifier based on BERT architecture.
Law-terms-extraction-using-SpaCy-rules
Term highlighting (nouns and predicates) and topic modeling with spaCy (for Russian). TF-IDF scores, computed with scikit-learn, are used to extract the relevant terms.
SpaCy-rules-for-Key-Words-extraction
spaCy part-of-speech tagging model that identifies "Profiles", "Categories", "Goals", "Measures", and "Actions" in text data. Grammar-based rules built on POS tagging and dependency parsing are used for better accuracy.
Collocations_check
We say "make a mistake", but "do a favour"; we say "big surprise", but "great anger"; we say "highly unlikely", but "seriously wrong". Words collocate in interesting and unpredictable ways. Moreover, word collocations can tell us more about the meaning of the word. Your task is to research how verbs from the same synset collocate with adverbs. For example, we usually "love somebody dearly", "honor somebody highly", and "admire somebody greatly". The task: 1) collect more synonyms for this synset: "say", "tell", "speak", "claim", "communicate"; 2) write a function that finds a verb from the synset in the sentence and collects all adverbs that this verb governs, considering only adverbs that end with "-ly"; 3) write a program that collects all verbs and their adverbs in the blog corpus; 4) the output of the program should be the ten most frequent adverbs that collocate with the verb.
ML-Star-rate-prediction
Predicting the star rating of users' comments using a supervised ML algorithm.
Pymorphy_textAnalyzer
Analysis of Ukrainian and Russian texts using Pymorphy.
Gender_Classification
NaiveBayesClassifier
Symantic_similarity
Semantic similarity, Resnik similarity
Python-Gematria
Read about Gematria, a method for assigning numbers to words and for mapping between words having the same number (http://en.wikipedia.org/wiki/Gematria). There are different views on how to count Gematria. Your script will incorporate two different scores. Write a function count_gematria(word, option) that sums the numerical values of the letters of a word using letter_values_1 if option is 1 and letter_values_2 if option is 2:
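A minimal sketch of count_gematria follows. The letter tables are not reproduced in this description, so both tables below are hypothetical placeholders: letter_values_1 simply numbers the Latin letters 1 to 26, and letter_values_2 uses an assumed decimal-style scheme (1 to 9, 10 to 90, 100 to 800). Swap in the real tables before relying on the scores.

```python
import string

# Hypothetical table 1 (assumption): a=1, b=2, ..., z=26.
letter_values_1 = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}

# Hypothetical table 2 (assumption): a-i -> 1..9, j-r -> 10..90, s-z -> 100..800.
letter_values_2 = {}
for i, ch in enumerate(string.ascii_lowercase):
    magnitude, digit = divmod(i, 9)
    letter_values_2[ch] = (digit + 1) * 10 ** magnitude


def count_gematria(word, option):
    """Sum the letter values of `word` using table 1 or table 2."""
    if option == 1:
        table = letter_values_1
    elif option == 2:
        table = letter_values_2
    else:
        raise ValueError("option must be 1 or 2")
    # Characters missing from the table (digits, punctuation) count as 0.
    return sum(table.get(ch, 0) for ch in word.lower())
```

Lowercasing the word first is a design choice of this sketch, so that "Zen" and "zen" receive the same score.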
Python-Zen
Write a function real_zen(input_file) that reads zen.txt as input_file and prints "The Zen of Python" in the following format: the title, then "by" + the author, then the Zen itself line by line, each line starting with its line number. You should ignore the comments. Expected output:
The Zen of Python
by Tim Peters
1. Beautiful is better than ugly.
2. Explicit is better than implicit.
...
19. Namespaces are one honking great idea -- let's do more of those!
Your function should print 2 lines with the title and the author and then 19 more lines with the wisdom about Python, numbered from 1 to 19. Read all necessary information from the file.
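One possible shape for real_zen is sketched below. The layout of zen.txt is not given here, so the sketch assumes that comment lines start with "#", that blank lines can be skipped, and that the first two remaining lines hold the title and the author. Adjust the parsing if the real file differs.

```python
def real_zen(input_file):
    """Print the title, the author, and the numbered aphorisms."""
    with open(input_file, encoding="utf-8") as handle:
        # Assumption: comments start with "#"; blank lines carry no content.
        lines = [
            line.strip()
            for line in handle
            if line.strip() and not line.lstrip().startswith("#")
        ]
    # Assumption: title first, author second, aphorisms after that.
    title, author, *aphorisms = lines
    print(title)
    print("by " + author)
    for number, aphorism in enumerate(aphorisms, start=1):
        print(f"{number}. {aphorism}")
```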
Text_Exploring
1) Text segmentation, 2) Tokenization, 3) Building a concordance, 4) Stemming, 5) Lemmatization
The_Most_Popular_N-Gramms
Finding the most popular n-grams based on corpus words or corpus sentences
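As an illustration of the idea, n-gram counting can be sketched with the standard library alone. The function name most_popular_ngrams and the sample sentence are made up for this example; the repository itself may use NLTK helpers instead.

```python
from collections import Counter


def most_popular_ngrams(tokens, n, top=3):
    """Return the `top` most frequent n-grams in a token list."""
    # zip over n shifted views of the list yields consecutive n-tuples.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(top)


words = "to be or not to be that is the question".split()
print(most_popular_ngrams(words, 2, top=1))  # [(('to', 'be'), 2)]
```

Passing the tokens of a whole corpus gives word-based n-grams; calling the function per sentence and summing the counters would keep n-grams from crossing sentence boundaries.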
Head_lines_Correction
The Associated Press Stylebook is a style guide widely used among American journalists. It enforces the following rules for capitalization of news headlines: Capitalize words with 4 or more letters. Capitalize the first and the last word in the headline. Capitalize nouns, pronouns, adjectives, verbs, adverbs, numerals, and subordinating conjunctions. Lowercase all other parts of speech: articles, coordinating conjunctions, prepositions, particles, interjections.
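The rules above can be sketched as a small function over words that are already POS-tagged. This is only a sketch under assumptions: the tags are taken to be Universal POS tags, proper nouns and auxiliaries are counted as nouns and verbs, and the names capitalize_headline and CAPITALIZED_TAGS are invented for the example. In practice the tags would come from a tagger such as NLTK or spaCy.

```python
# Universal POS tags treated as "capitalize" categories (assumed mapping).
CAPITALIZED_TAGS = {"NOUN", "PROPN", "PRON", "ADJ", "VERB", "AUX", "ADV", "NUM", "SCONJ"}


def capitalize_headline(tagged_words):
    """Apply the capitalization rules to a list of (word, tag) pairs."""
    last = len(tagged_words) - 1
    result = []
    for index, (word, tag) in enumerate(tagged_words):
        must_capitalize = (
            len(word) >= 4              # words with 4 or more letters
            or index in (0, last)       # first and last word
            or tag in CAPITALIZED_TAGS  # content-word categories
        )
        # Only the first character changes when capitalizing,
        # so acronyms such as "NASA" keep their casing.
        result.append(word[:1].upper() + word[1:] if must_capitalize else word.lower())
    return " ".join(result)


headline = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("on", "ADP"),
            ("a", "DET"), ("mat", "NOUN"), ("with", "ADP"), ("it", "PRON")]
print(capitalize_headline(headline))  # The Cat Sat on a Mat With It
```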
WordNet-word-description
Retrieving all the information about a word from WordNet
Word_Frequences
Calculating word frequencies using NLTK methods
Data_Scraping
Scraping text data from websites for analysis
Generate-a-Song
Bigram-based song generator
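A bigram generator of this kind can be sketched as follows. The function names and the toy lyrics are assumptions made for the illustration and are not taken from the repository: each word is mapped to the words that followed it in the corpus, and a line is produced by repeatedly sampling a follower of the last word.

```python
import random
from collections import defaultdict


def build_bigram_model(tokens):
    """Map each word to the list of words that follow it in the corpus."""
    followers = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        followers[current].append(following)
    return followers


def generate_line(model, start, length, rng=random):
    """Walk the bigram model from `start`, emitting at most `length` words."""
    words = [start]
    # Stop early when the last word has no known follower.
    while len(words) < length and model.get(words[-1]):
        # Repeated followers in the list make frequent bigrams more likely.
        words.append(rng.choice(model[words[-1]]))
    return " ".join(words)


lyrics = "la la la love me do you know i love you".split()
model = build_bigram_model(lyrics)
print(generate_line(model, "love", 6, random.Random(0)))
```

Passing a seeded random.Random instance makes the output reproducible, which helps when comparing generated lines across runs.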
WordDistribution
Word distribution in the NLTK Gutenberg corpus
stopwords-ru
Russian stopwords collection