izikeros / sentence-plagiarism

Compare sentences from input document with all sentences from reference documents - find very similar ones.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Plagiarism Checker

img

This is a command-line tool for checking the similarity between a given text and a set of reference documents. The tool uses the Jaccard similarity algorithm to compare the input text with the reference documents.

Installation

Install in an isolated environment using pipx (or normal pip):

pipx install sentence-plagiarism

CLI Usage

To run the plagiarism checker, use the following command:

sentence-plagiarism <path-to-input-file> <path-to-reference-file-1> <path-to-reference-file-2> ... [--threshold <threshold-value>] [--output_file <path-to-output-file>] [--quiet]
  • <path-to-input-file>: Path to the input file to be checked for plagiarism.
  • <path-to-reference-file-1> ...: Paths to the reference files to compare against.
  • --threshold: (optional) The minimum similarity score required to consider a sentence as plagiarized. The value should be between 0 and 1.
  • --output-file (optional): Path to the output file to save the results in JSON format.
  • --quiet (optional): Flag to suppress the display of similar sentences in the console.

Example

The following command:

sentence-plagiarism  input.txt --reference-files ref1.txt ref2.txt --similarity-threshold 0.8 --output-file results.json

can produce the following output on stdout:

Input Sentence:     The retriever and seq2seq modules commence their operations as pretrained models, and through a joint fine-tuning process, they adapt collaboratively, thus enhancing both retrieval and generation for specific downstream tasks.
Reference Sentence:  foobar  The retriever and seq2seq modules commence their operations as pretrained models, and through a joint fine-tuning process, they adapt collaboratively, thus enhancing both retrieval and generation for specific downstream tasks.
Reference Document: ref1.txt
Similarity Score: 0.9667

Input Sentence:      Closing thoughts  For a comprehensive understanding of the RAG technique, we offer an in-depth exploration, commencing with a simplified overview and progressively delving into more intricate technical facets.
Reference Sentence:  barfoo  For a comprehensive understanding of the RAG technique, we offer an in-depth exploration, commencing with a simplified overview and progressively delving into more intricate technical facets.
Reference Document: ref2.txt
Similarity Score: 0.8966

Results saved to results.json

and save results to results.json.

Programmatic Usage

from sentence_plagiarism import check

check(
    examined_file="txt/txt1.txt",
    reference_files=["txt/txt2.txt", "txt/txt3.txt"],
    similarity_threshold=0.8,
    output_file=None,
    quiet=False,
)

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Krystian Safjan - ksafjan@gmail.com

Project Link: https://github.com/izikeros/sentence-plagiarism

About

Compare sentences from input document with all sentences from reference documents - find very similar ones.

License:MIT License


Languages

Language:Python 100.0%