Text Semantic Similarity - MachineLearning

Machine Learning Project of Semester VI students(Group 3) at School of Engineering and Applied Science, Ahmedabad University.

About

Machine Learning has rapidly found its place in the technological world over the past few years. One of its applications is plagiarism checking, which in turn is an application of Text Semantic Similarity. Text Semantic Similarity is a measure of the degree of semantic equivalence between two pieces of text.

How do we know whether a document we are reading is original? Are students copying content and ideas from other sources, or producing them themselves?
In this project, we build one or more algorithms suitable for plagiarism-checking software and analyse them, applying concepts of Machine Learning.

Team

1) Aneri Sheth - 1401072

2) Himanshu Budhia - 1401039

3) Raj Shah - 1401050

4) Twinkle Vaghela - 1401106

NATURAL LANGUAGE PROCESSING

Natural Language Processing is a wide domain covering concepts of Computer Science, Artificial Intelligence and Machine Learning. It is used to analyze written text and human speech. One of the applications of NLP is Semantic Analysis (understanding the meaning of text).


CORPUS-BASED APPROACH

This approach uses semantically annotated corpora to train Machine Learning algorithms to decide which word sense applies in which context. Corpus-based methods are supervised learning approaches: the algorithms are trained on annotated data. The corpus and lexical resource used here is WordNet.



Sentence 1 - A cemetery is a place where dead people’s bodies or their ashes are buried.
Sentence 2 - A graveyard is an area of land, sometimes near a church, where dead people are buried.
  • Tokenizing:

Splitting the body of text into sentences and words. Words are delimited by whitespace, and punctuation marks are counted as separate tokens.

Tokenize.py

  • Stop Words:

Some extremely common words, which appear to be of little value in helping select documents matching a user's need, are excluded from the vocabulary entirely. These words are called stop words, and they can be filtered out of the text before processing.

StopWords.py

  • Lemmatizing:

The goal of lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. By default, the lemmatizer treats a word as a noun unless another part of speech is specified.

Lemmatizing.py

  • Synsets:

WordNet is a lexical database for the English language, and is part of the NLTK corpus. We can use WordNet alongside the NLTK module to find the meaning of words, synonyms, antonyms and more.

Wordnet.py

Results

  • Let S1 be "I was given a card by her in the garden" and S2 be "In the garden, she gave me a card."
  • For semantic analysis, two phrases/sentences are taken; the goal is to classify the pair as similar, somewhat similar, or dissimilar.
  • A set of stop words is then defined for the English language.
  • After eliminating special characters and punctuation, removing the stop words, and lemmatizing, we get S1 = {I, given, card, garden} and S2 = {In, garden, gave, card}.
  • Next, we look up the synsets (sets of synonyms) of the lemmatized words. Each word of S1 is compared with every word of S2, iteratively, and a similarity index is computed for each word against the words of S2.
  • The mean of the computed similarity indexes gives the semantic similarity score of the two sentences.
  • If the similarity index is less than 0.60, the sentences are labeled 'Not Similar'; if it is between 0.60 and 0.80, they are labeled 'Somewhat Similar'; above 0.80, they are labeled 'Similar'.

Output

  • Similarity Example 1
  • Similarity Example 2
  • Somewhat Similar Example
  • Dissimilar Example

[Final.py](https://github.com/budhiahimanshu96/Text-Semantic-Similarity-MachineLearning/blob/master/NLTK/Final.py)

Discussion and Future Work

  • Semantic similarity has been computed for sentences and phrases. Paragraphs and longer texts, however, will need more complex algorithms to separate sentences and combine their similarities.
  • Because similarity is computed word by word, we may get false positives and false negatives.
  • We would try to decrease the false positive and false negative rates by using sentence-to-sentence similarity instead of word-to-word similarity.
  • Our implementation does not account for spelling errors. To handle those, the Longest Common Subsequence (LCS) algorithm can be used.



Languages

Python: 45.9%, MATLAB: 34.9%, TeX: 19.2%