Extract key information from Edgar NDA documents
Extract the information from NDAs (Non-Disclosure Agreements) about the involved parties, jurisdiction, contract term, etc.
Note that this is an information extraction task: you are given keys (labels, attribute names) and you are expected to guess their respective values. It is not a NER task; we are not interested in where the information or entity is to be found, just the information itself.
The metric used is the F1 score calculated on upper-cased values. As an auxiliary metric, F1 on true-cased values is also calculated.
It should not be assumed that for each key, a corresponding value is to be extracted from a document. There might be some “decoy” keys, for which no value should be given.
There might be more than one value for a given key. In such cases, more than one value should be given. You are allowed to give more than one value even if only one is expected (e.g. if you have two options but are not sure which is right), though, of course, the metric will be lower than if you had guessed just the right value.
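To see why an extra guess costs precision, here is a simplified sketch of an F1 score computed over upper-cased key=value pairs for a single document. It is an illustration only and not GEval's implementation; the function name `f1_upper` is hypothetical and the sample pairs are made up.

```python
from collections import Counter


def f1_upper(expected, predicted):
    """Simplified F1 over key=value pairs, compared after upper-casing.

    Illustration only: GEval's own computation may differ in details.
    """
    exp = Counter(item.upper() for item in expected)
    out = Counter(item.upper() for item in predicted)
    true_pos = sum((exp & out).values())  # pairs present on both sides
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(out.values())
    recall = true_pos / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


expected = ["jurisdiction=New_York", "term=2_years"]
exact = ["jurisdiction=new_york", "term=2_years"]
hedged = ["jurisdiction=New_York", "term=2_years", "term=3_years"]

print(f1_upper(expected, exact))   # case differences are ignored
print(f1_upper(expected, hedged))  # the extra guess lowers precision
```

In this toy case the hedged answer still has full recall, but one of its three pairs is wrong, so its score drops below that of the exact answer.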
You can carry out evaluation using GEval once you have generated out.tsv files (in the same format as expected.tsv files):
    wget https://gonito.net/get/bin/geval
    chmod u+x geval
    ./geval -t dev-0
Textual and graphical features
1D (textual) and/or 2D (graphical) features can be considered, as both the generated PDF documents and the extracted text are available. PDF files were generated from the original HTML files using the Puppeteer package. We provide 4 different text outputs based on:
- pdf2djvu/djvu2hocr tools, ver.
- tesseract tool, ver. 4.1.1-rc1-7-gb36c, run with the flags --oem 2 -l eng --dpi 300 (meaning both new and old OCR engines were used simultaneously, and language and pixel density were forced for better results),
- textract tool, ver. March 1, 2020,
- combination of pdf2djvu/djvu2hocr and tesseract tools. Documents are processed with both tools; by default we take the text from pdf2djvu/djvu2hocr, unless the text returned by tesseract is 1000 characters longer.
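The selection rule for the combined output can be sketched as below. This is a guess at the logic rather than the script actually used to build the data set: the function name `combine_texts` is hypothetical, and "1000 characters longer" is read here as "at least 1000 characters longer".

```python
def combine_texts(djvu_text, tesseract_text, margin=1000):
    """Prefer the pdf2djvu/djvu2hocr text unless tesseract returned much more.

    Assumption: the tesseract text wins when it is at least `margin`
    characters longer than the pdf2djvu/djvu2hocr text.
    """
    if len(tesseract_text) - len(djvu_text) >= margin:
        return tesseract_text
    return djvu_text


# A nearly empty text layer falls back to the OCR result.
print(len(combine_texts("", "x" * 5000)))
```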
It should not be assumed that the OCR-ed text layer is perfect. You are free to use alternative OCR software.
The texts are neither tokenized nor pre-processed in any manner.
- README.md - this file
- config.txt - GEval configuration file
- in-header.tsv - one-line TSV file with column names for input data (features)
- train/ - directory with training data
- train/in.tsv.xz - input data for the train set
- train/expected.tsv - expected (reference) data for the train set
- dev-0/ - directory with dev (test) data from the same sources as the train set
- dev-0/in.tsv.xz - input data for the dev set
- dev-0/expected.tsv - expected (reference) data for the dev set
- test-A/ - directory with test data
- test-A/in.tsv.xz - input data for the test set
- test-A/expected.tsv - expected (reference) data for the test set (hidden)
- documents/ - all documents (for train, dev-0 and test-A); they are referenced in TSV files
Note that we mean TSV, not CSV files. In particular, double quotes are not considered special characters here! When reading the files in Python, set QUOTE_NONE in the csv module:
    import csv
    # newline='' lets the csv module handle line endings itself
    with open('file.tsv', 'r', newline='') as tsvfile:
        reader = csv.reader(tsvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
        for item in reader:
            ...  # item is a list of column values
The files are sorted by MD5 sum hashes.
Structure of data sets
The original dataset was split into train, dev-0 and test-A subsets in a stable pseudorandom manner using the hashes (fingerprints) of the document contents:
- the train set contains 254 items,
- the dev-0 set contains 83 items,
- the test-A set contains 203 items.
Format of the test sets
The input file (in.tsv.xz) consists of 6 TAB-separated columns:
- the file name of the document (MD5 sum of the binary contents with the right extension), to be taken from the documents/ subdirectory,
- list of keys in alphabetical order to be considered during prediction; keys are given in English with underscores in place of spaces and are separated with spaces,
- the plain text extracted by pdf2djvu/djvu2hocr tools from the document with the end-of-lines, TABs and non-printable characters replaced with spaces (so that they would not be confused with TSV special characters),
- the plain text extracted by tesseract tool from the document with the end-of-lines, TABs and non-printable characters replaced with spaces (so that they would not be confused with TSV special characters),
- the plain text extracted by textract tool from the document with the end-of-lines, TABs and non-printable characters replaced with spaces (so that they would not be confused with TSV special characters),
- the plain text extracted by combination of pdf2djvu/djvu2hocr and tesseract tools from the document with the end-of-lines, TABs and non-printable characters replaced with spaces (so that they would not be confused with TSV special characters).
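A minimal sketch of reading such a file directly from its xz-compressed form is given below. The column names in `COLUMNS` are hypothetical labels chosen for this example (the real ones are in in-header.tsv), UTF-8 encoding is assumed, and `read_input` is not part of any provided tooling.

```python
import csv
import lzma

# Hypothetical labels for the 6 columns; see in-header.tsv for the real names.
COLUMNS = ["filename", "keys", "text_djvu", "text_tesseract",
           "text_textract", "text_combined"]


def read_input(path):
    """Yield one dict per document from an xz-compressed TSV file."""
    # Whole-document text fields can exceed the csv module's default limit.
    csv.field_size_limit(2**31 - 1)
    with lzma.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            record = dict(zip(COLUMNS, row))
            # Keys are separated with single spaces.
            record["keys"] = record["keys"].split(" ")
            yield record
```

Something like `for doc in read_input("dev-0/in.tsv.xz"): ...` would then iterate over the dev set one document at a time without decompressing the file to disk.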
The expected.tsv file is just a list of key-value pairs sorted alphabetically (by keys). Pairs are separated with spaces; a value is separated from its key with the equals sign (=). The spaces and colons in values are replaced with underscores.
In case of “decoy” keys (with no expected values), they are omitted in
expected.tsv files (they are not given with empty value).
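As a sketch of how one line in this format could be parsed, the helper below collects values per key, so that repeated keys end up with several values. The name `parse_pairs` and the sample line are made up for illustration.

```python
from collections import defaultdict


def parse_pairs(line):
    """Parse one line of key=value pairs into a dict of key -> list of values."""
    pairs = defaultdict(list)
    for item in line.split():
        key, _, value = item.partition("=")  # split on the first "=" only
        pairs[key].append(value)
    return dict(pairs)


# Made-up example line, not taken from the data set.
line = "effective_date=2001-05-01 party=Acme_Inc. party=Foo_LLC term=2_years"
print(parse_pairs(line))
```

A key that has no expected value simply does not appear in the resulting dictionary.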
Escaping special characters
The following escape sequences are used for the OCR-ed text:
- \f - page break,
- \n - end of line,
- \\ - literal backslash.
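One way to decode these sequences is sketched below; `unescape` is a hypothetical helper and handles only the three sequences listed above. It decodes in a single left-to-right pass, because chained `str.replace` calls would wrongly turn an escaped backslash followed by the letter n into a newline.

```python
import re

# Maps each two-character escape sequence to the character it stands for.
_ESCAPES = {"\\f": "\f", "\\n": "\n", "\\\\": "\\"}
_PATTERN = re.compile(r"\\[fn\\]")


def unescape(text):
    # Replace every escape sequence in one pass over the text.
    return _PATTERN.sub(lambda match: _ESCAPES[match.group(0)], text)


pages = unescape("page one\\fpage two").split("\f")
print(len(pages))
```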
Information to be extracted
There are up to 6 attributes to be extracted from each document:
- effective_date - date in YYYY-MM-DD format, at which point the contract is legally binding,
- jurisdiction - the state or country under whose jurisdiction the contract is signed,
- party - party or parties of the contract,
- term - length of the legal contract as expressed in the document.
The party attribute usually occurs more than once.
The expected pieces of information were normalized to some degree:
- in attribute values, all spaces and colons (:) were replaced with underscores,
- all expected dates should be returned in YYYY-MM-DD format,
- values for attribute term are normalized with the same original units, e.g. eleven months is changed to 11_months; all of them are in the same format:
Format of the output files for test sets
The format of the output is the same as the format of
expected.tsv files. The order of key-value pairs does not matter.
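A minimal sketch of producing one output line from a dictionary of predictions follows. The name `format_line` and the sample predictions are hypothetical; sorting the keys is optional, since the order does not matter.

```python
def format_line(predictions):
    """Serialize {key: [values]} as space-separated key=value pairs."""
    items = []
    for key in sorted(predictions):
        for value in predictions[key]:
            # Spaces and colons are replaced with underscores, as in expected.tsv.
            clean = value.replace(" ", "_").replace(":", "_")
            items.append(f"{key}={clean}")
    return " ".join(items)


# Made-up predictions for one document.
predictions = {
    "party": ["Acme Inc.", "Foo LLC"],
    "jurisdiction": ["New York"],
    "term": [],  # nothing found (or a decoy key): no pair is emitted
}
print(format_line(predictions))
```

Keys with an empty list of values are left out entirely, which matches how decoy keys are omitted rather than written with an empty value.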
Format of the train set
The format of the train set is the same as the format of a test set.
Original data was gathered from the Edgar Database.