The purpose of this project is to train (and test) an algorithm for plagiarism detection based on the dataset created by Clough and Stevenson (2011).
Clough and Stevenson constructed a corpus of short answers in which plagiarism was simulated. The main benefit of this dataset is that it includes four types of plagiarism: near copy, light revision, heavy revision, and non-plagiarised.
The creation of this dataset tackled a persistent problem in the plagiarism detection literature: the lack of access to genuine examples of plagiarised work.
Plagiarism is a growing problem for educational institutions. Some tools exist to help detect it; however, testing their effectiveness is a challenge when there is no access to reliable data.
The task of building tools to detect plagiarised work is not straightforward because of the difficulty of obtaining real examples of plagiarised text. As stated by Clough and Stevenson (2011), the main problems that hamper obtaining reliably labelled plagiarism data are:
- Plagiarised text is not intended to be identified, and plagiarists are unlikely to admit their act.
- If a plagiarised text is detected, it may not be freely available because of legal and ethical issues.
The dataset used in this project is a modified version of the one created by Paul Clough and Mark Stevenson. The data generation process is fully described in their research article (Clough, P. and Stevenson, M., "Developing a corpus of plagiarised short answers", 2011).
For more details on the points mentioned above, please refer to the research article (pages 9-12).
Two measures of text similarity will be used as features to predict plagiarism: containment and longest common subsequence (LCS).
Containment is a measure of text similarity proposed by Andrei Broder in his paper "On the resemblance and containment of documents". The containment of document A within document B is a number between 0 and 1 equal to the proportion of A's unique n-grams that also appear in B.
Formally, containment is:

$$c_n(A, B) = \frac{|S(A, n) \cap S(B, n)|}{|S(A, n)|}$$

where $S(A, n)$ and $S(B, n)$ represent the sets of n-grams of size $n$ for documents A and B respectively.
The numerator is the number of unique n-grams shared by documents A and B. The denominator equals the number of unique n-grams in document A.
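As a minimal sketch of how this measure could be computed (the function names and the word-level tokenisation are my own choices, not taken from the source):

```python
def ngrams(text, n):
    """Return the set of unique word n-grams in a text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(answer, source, n=1):
    """Proportion of the answer's unique n-grams that also occur in the source."""
    a_ngrams = ngrams(answer, n)
    s_ngrams = ngrams(source, n)
    if not a_ngrams:
        return 0.0
    return len(a_ngrams & s_ngrams) / len(a_ngrams)
```

For example, `containment("the cat sat", "the cat ran", n=1)` shares two of the answer's three unigrams with the source, giving roughly 0.67.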
The LCS of two strings is the longest subsequence that is common to both. To count as a subsequence, the words need not be consecutive, but they must appear in the same relative order in both strings.
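A word-level LCS can be computed with the standard dynamic-programming recurrence. The sketch below normalises the LCS length by the number of words in the answer so it falls between 0 and 1; that normalisation choice is an assumption on my part, not taken from the source:

```python
def lcs_norm(answer, source):
    """Length of the longest common word subsequence,
    normalised by the answer's word count (assumed normalisation)."""
    a = answer.lower().split()
    s = source.lower().split()
    if not a:
        return 0.0
    # dp[i][j] = LCS length of the first i words of a and first j words of s
    dp = [[0] * (len(s) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(s) + 1):
            if a[i - 1] == s[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(s)] / len(a)
```

For example, `"the quick brown fox"` and `"the slow brown fox"` share the subsequence "the brown fox" (three of four words), even though the matching words are not all consecutive.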
The final model is trained and deployed on Amazon SageMaker.