balthasars / DocuToads

DocuToads is an open source minimum edit distance algorithm that can handle cut-paste edit operations.

Home Page: https://arxiv.org/abs/1608.06459

DocuToads

This particular version is courtesy of Prune Truong, who has improved the comments and disabled the multicore functionality, which prevented the script from running at all on our computers.

DocuToads is an open source minimum edit distance algorithm, written in Python 2.7, that can handle cut-paste edit operations. It was created by Henrik Hermansson, who reserves some rights. This code may be modified and used by anyone, provided that this source is cited.

Instructions

  1. The texts
    1. DocuToads accepts plain-text (.txt) files encoded in UTF-8. Make sure all your texts are in this format.
    2. If you want to perform an article-by-article comparison of the two texts, you need to mark the break-points between articles.
    3. DocuToads finds the break-points using a regular expression that looks for the word DTBREAKPOINT. Insert this word between each pair of consecutive articles (not before the first or after the last).
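  2. Install Python 2.7 and make sure the following Python modules are available. Most of them ship with Python's standard library; pp, numpy and matplotlib (which provides pylab) must be installed separately: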
    1. time
    2. math
    3. string
    4. re
    5. os
    6. sys
    7. traceback
    8. pp
    9. numpy
    10. subprocess
    11. pylab
    12. collections
    13. csv
    14. matplotlib
  3. A list of cases
    1. A "case" is one pair of texts to be compared.
    2. You will need a Python list of cases in which each entry (case) is itself a list containing the following items (in string format), in this order:
      1. Path to text 1
      2. Path to text 2
      3. Short name for text 1
      4. Short name for text 2
      5. Short name for the case
      6. A list of article names in the first text, for example: ["Article 1", "Article 2", "Article 3"]. Make sure the list matches the actual number of articles separated by DTBREAKPOINT markers. Enter an empty list if you don't want to perform an article-by-article comparison of the two texts.
      7. A list of article names in the second text. Make sure the list matches the actual number of articles separated by DTBREAKPOINT markers. Enter an empty list if you don't want to perform an article-by-article comparison of the two texts.

It should look like this (a single pair, for the whole document):

caselist = [
    ['example/1751_1.txt', 'example/1751_2.txt', 'example_pre_draft', 'example_draft', 'example_case_id', 'artname1', 'artname2']
]

or for multiple text pairs:

caselist = [
    ['example/1751_1.txt', 'example/1751_2.txt', 'example_pre_draft', 'example_draft', 'example_case_id', 'artname1', 'artname2'],
    ['example/1752_1.txt', 'example/1752_2.txt', 'example_pre_draft_2', 'example_draft_2', 'example_case_id_2', 'artname1', 'artname2']
]
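
Before starting a long run it can help to sanity-check the case list. The following is a minimal sketch and not part of DocuToads itself: the helper names `count_articles` and `check_case` are hypothetical, and it assumes, following the written description above, that the last two entries of each case are lists of article names, with an empty list meaning that no article-by-article comparison is wanted.

```python
# Hypothetical helpers for checking a case list; not part of DocuToads.
import io
import re

BREAKPOINT = re.compile(r"DTBREAKPOINT")


def count_articles(text):
    """Number of articles in a text whose articles are separated by DTBREAKPOINT."""
    return len(BREAKPOINT.split(text))


def read_utf8(path):
    """Read a text file as UTF-8, the format DocuToads expects."""
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()


def check_case(case, read_text=read_utf8):
    """Return a list of problems found in one case (empty if none were found)."""
    if len(case) != 7:
        return ["expected 7 entries, found %d" % len(case)]
    problems = []
    path1, path2, _name1, _name2, _case_name, articles1, articles2 = case
    for path, articles in ((path1, articles1), (path2, articles2)):
        if not articles:
            continue  # empty list: no article-by-article comparison
        found = count_articles(read_text(path))
        if found != len(articles):
            problems.append("%s: %d article names given, %d articles found"
                            % (path, len(articles), found))
    return problems
```

Calling `check_case(case)` for every entry of `caselist` and printing any non-empty result is one way to catch a mismatch between article names and DTBREAKPOINT markers before the comparison starts.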
  4. Name this list "caselist" and enter it into DocuToads_main.py.
  5. Set the parameters
    1. Open DocuToads_main.py and enter the following parameters:
    2. outpath - The path of the desired output folder. There is no need to worry about sub-folders.
    3. plottype - Determines which kind of graph to create. "block", "bar" and "line" are available. If no plot is wanted, set it to "none".
    4. backtrace - Determines whether to save a table of the exact edit operations found, i.e. the backtrace. Set to "yes" or "no".
    5. by_article - Determines whether to split results article by article, based on the first text. Set to "article" or "noarticle".
    6. cutoff - Determines how many words in sequence there must be for the algorithm to detect a transposition. The default value is 5.
    7. ncpus - Determines how many CPUs DocuToads will use to process several cases simultaneously. The default is 1.
  6. Run DocuToads_main.py
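
As an illustration, the parameter settings described above might look as follows. This is only a sketch: it assumes the parameters are plain variables assigned in DocuToads_main.py, the output path is a placeholder, and the `check_parameters` helper is hypothetical rather than part of DocuToads.

```python
# Illustrative settings; the values are examples and the output path is a placeholder.
outpath = "output"          # desired output folder
plottype = "none"           # "block", "bar", "line" or "none"
backtrace = "no"            # "yes" saves the table of edit operations found
by_article = "noarticle"    # "article" splits results article by article
cutoff = 5                  # words in sequence needed to detect a transposition
ncpus = 1                   # CPUs used to process several cases simultaneously


def check_parameters(plottype, backtrace, by_article, cutoff, ncpus):
    """Hypothetical helper: return the names of invalid settings (empty if all are valid)."""
    errors = []
    if plottype not in ("block", "bar", "line", "none"):
        errors.append("plottype")
    if backtrace not in ("yes", "no"):
        errors.append("backtrace")
    if by_article not in ("article", "noarticle"):
        errors.append("by_article")
    if not isinstance(cutoff, int) or cutoff < 1:
        errors.append("cutoff")
    if not isinstance(ncpus, int) or ncpus < 1:
        errors.append("ncpus")
    return errors
```

Since the multicore functionality is reported as disabled in this version, leaving `ncpus` at its default of 1 is probably the safest choice.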
