CIRCSE / text-reuse-aquinas

Automatic text reuse detection in the Summa contra Gentiles with TRACER. Data and code repository for the CLiC 2018 paper submission.

Text reuse in the Summa contra Gentiles

This research seeks to automatically detect text reuse (verbatim and paraphrase) between the Summa contra Gentiles of Thomas Aquinas and a number of other works.

The detection software used is TRACER. The automatically detected reuses are evaluated against the manually annotated quotations in the Index Thomisticus, which serves as the gold standard. The objective is twofold: to evaluate TRACER as an information retrieval tool for automatic text reuse detection, and to work towards an Index fontium computatus that records true positives, false positives and false negatives, so that the limitations of retrieval methods and linguistic resources for Latin can be better understood.
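The evaluation against a gold standard described above can be sketched with plain set operations. This is a minimal illustration, not the repository's evaluation code: the function name `evaluate` and the passage identifiers are hypothetical, and it assumes that a detected reuse counts as correct only when the same (ScG passage, source passage) pair is annotated in the gold standard.

```python
def evaluate(detected, gold):
    """Split detected reuse pairs into TP, FP and FN against gold pairs."""
    detected, gold = set(detected), set(gold)
    true_positives = detected & gold    # detected and annotated
    false_positives = detected - gold   # detected but not annotated
    false_negatives = gold - detected   # annotated but missed
    return true_positives, false_positives, false_negatives


# Hypothetical identifiers, for illustration only.
gold = {("scg_1", "src_10"), ("scg_2", "src_20"), ("scg_3", "src_30")}
detected = {("scg_1", "src_10"), ("scg_2", "src_20"), ("scg_9", "src_99")}

tp, fp, fn = evaluate(detected, gold)
print(len(tp), len(fp), len(fn))  # 2 1 1
```

Keeping the three sets, rather than only their sizes, is what would allow false positives and false negatives to be listed alongside the true positives.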

TreeTagger (Brandolini)

Overview of the tagging performance of the Brandolini TreeTagger Latin parameter file across the texts under study. Punctuation is excluded from the token count.

Author               Work                        Tokens   Unknowns (%)
Thomas Aquinas       Summa contra Gentiles       378,160  4,396 (1.16%)
Aristoteles Latinus  Metaphysica                 59,314   4,232 (7.13%)
Cicero               De divinatione              28,744   2,690 (9.35%)
Boethius             Philosophiae Consolationis  24,924   2,279 (9.14%)
Apuleius             De Deo Socratis             4,633    410 (8.84%)
Boethius             De Trinitate                2,902    60 (2.06%)

Average tagging accuracy: 93.72%
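The figures above can be rechecked with a short script. Two points are inferences from the numbers rather than anything stated in the repository: the printed percentages look truncated rather than rounded, and the average accuracy appears to equal 100% minus the unweighted mean of the six printed unknown rates.

```python
# (work, tokens, unknown tokens, unknown rate in % as printed in the table)
rows = [
    ("Summa contra Gentiles", 378_160, 4_396, 1.16),
    ("Metaphysica", 59_314, 4_232, 7.13),
    ("De divinatione", 28_744, 2_690, 9.35),
    ("Philosophiae Consolationis", 24_924, 2_279, 9.14),
    ("De Deo Socratis", 4_633, 410, 8.84),
    ("De Trinitate", 2_902, 60, 2.06),
]

# Recompute each unknown rate from the raw counts and show it next to
# the printed value.
for work, tokens, unknowns, printed in rows:
    rate = 100 * unknowns / tokens
    print(f"{work}: {rate:.4f}% (table: {printed}%)")

# Assumption: the average is unweighted, so each work counts equally
# regardless of its length.
mean_unknown = sum(row[3] for row in rows) / len(rows)
average_accuracy = 100 - mean_unknown
print(f"Average tagging accuracy: {average_accuracy:.2f}%")  # 93.72%
```

An unweighted mean lets the short texts weigh as much as the Summa contra Gentiles. A token-weighted average would be dominated by the Summa and come out higher.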

Summary of results and F1 scores

Summa contra Gentiles vs. De Trinitate

  • Total number of sentences (ScG and De Trinitate combined): 19,560
  • Total number of TRACER results: 10,708
  • Total number of TRACER results without duplicates: 10,631
  • Reuses to find: 4
  • TPs: 3
  • FPs: 10,631-3 = 10,628
  • FN: 1

Precision = 3/(3+10,628) = 0.00028 | Recall = 3/(3+1) = 0.75 | F1 score = 2 · (P·R)/(P+R) = 5.64 · 10⁻⁴
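The metrics used in this and the following comparisons can be recomputed from the counts with a small helper. This is a sketch under one stated assumption: every deduplicated TRACER result that is not a true positive is counted as a false positive. The function name `precision_recall_f1` is illustrative and not taken from the repository.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts, guarding against /0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts for Summa contra Gentiles vs. De Trinitate, from the list above.
results_without_duplicates = 10_631
tp, fn = 3, 1
fp = results_without_duplicates - tp  # assumption: all non-TP results are FPs

p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"P = {p:.5f} | R = {r:.2f} | F1 = {f1:.2e}")
```

Computing F1 from the unrounded precision and recall avoids the small drift that appears when the rounded values are plugged into the formula.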

Summa contra Gentiles vs. Philosophiae Consolationis

  • Total number of sentences (ScG and Philosophiae Consolationis combined): 21,108
  • Total number of TRACER results: 1,319
  • Total number of TRACER results without duplicates: 1,306
  • Reuses to find: 7
  • TPs: 3
  • FPs: 1,306-3 = 1,303
  • FN: 4

Precision = 3/(3+1,303) = 0.0023 | Recall = 3/(3+4) = 0.43 | F1 score = 2 · (P·R)/(P+R) = 4.57 · 10⁻³

Summa contra Gentiles vs. De Deo Socratis

  • Total number of sentences (ScG and De Deo Socratis combined): 19,600
  • Total number of TRACER results: 167,075
  • Total number of TRACER results without duplicates: 155,848
  • Reuses to find: 2
  • TPs: 2
  • FPs: 155,848-2 = 155,846
  • FN: 0

Precision = 2/(2+155,846) = 0.0000128 | Recall = 2/(2+0) = 1 | F1 score = 2 · (P·R)/(P+R) = 2.57 · 10⁻⁵

Summa contra Gentiles vs. De Divinatione

  • Total number of sentences (ScG and De Divinatione combined): 20,820
  • Total number of TRACER results: 1,585,719
  • Total number of TRACER results without duplicates:
  • Reuses to find: 1

No results.

Summa contra Gentiles vs. Metaphysica

  • Total number of sentences (ScG and Metaphysica combined): 22,550
  • Total number of TRACER results: 506,418
  • Total number of TRACER results without duplicates: 502,877
  • Reuses to find: 97
  • TPs: 19
  • FPs: 502,877-19 = 502,858
  • FN: 78

Precision = 19/(19+502,858) = 0.0000378 | Recall = 19/(19+78) = 0.20 | F1 score = 2 · (P·R)/(P+R) = 7.56 · 10⁻⁵

Copyright and Acknowledgements

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License and was funded by the German Federal Ministry of Education and Research (eTRAP, No. 01UG1409).
