Sample output:
1 CONST-1 you can remove it .
....................................................................................................
1 SEQUE-B you can take it off .
1 SEQUE-1 you can withdraw .
====================================================================================================
3 CONST-1 but , let 's face it , underachiever , dead @-@ end life , okay ?
....................................................................................................
3 SEQUE-B let us be frank . he 's got a lousy job , he ain 't got no prospects .
^ ---- ----------- -
3 SEQUE-1 let us be frank . he has a lousy job , he no longer has any prospect .
^^ +++++++++++++++
====================================================================================================
Usage: sequence-diff.py -f FILE [FILE ...] [-ft FILE_TAG [FILE_TAG ...]]
[-c CONST [CONST ...]] [-ct CONST_TAG [CONST_TAG ...]] [-d] [-m {char,token}] [-v]
Example: python sequence-diff.py -c source_file -f reference_file hypothesis_file
Optional arguments:
-f FILE [FILE ...], --file FILE [FILE ...]
input files of sequences to be compared (the first file is the base to be compared with,
such as reference translations) (default: None)
-c CONST [CONST ...], --const CONST [CONST ...]
files of sequences not participating in the comparison,
such as source sentences to be translated (default: [])
-ft FILE_TAG [FILE_TAG ...], --file-tag FILE_TAG [FILE_TAG ...]
tags of input files (default: None)
-ct CONST_TAG [CONST_TAG ...], --const-tag CONST_TAG [CONST_TAG ...]
tags of const files (default: None)
-d, --condense condense the comparison of multiple sequences without showing diffs (default: False)
-m {char,token}, --mode {char,token}
compute diffs at character level or token level (default: char)
-v, --verbose print all sequences in the condense mode (default: False)
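The marker lines in the sample output (`^` for replaced, `-` for deleted, `+` for inserted units) can be approximated with Python's standard `difflib`. The following is a minimal sketch and not the script's actual implementation: the function name `marker_lines` and the exact alignment rules are assumptions made for illustration. It only shows how the `char` and `token` modes differ in what they compare.

```python
import difflib


def marker_lines(base, other, mode="char"):
    """Return (base_markers, other_markers) for two sequences.

    Hypothetical helper: '^' marks replaced units, '-' units only in
    `base`, '+' units only in `other`. Equal units are left blank.
    """
    # Comparison units: single characters, or whitespace-separated tokens.
    a = list(base) if mode == "char" else base.split()
    b = list(other) if mode == "char" else other.split()
    sep = "" if mode == "char" else " "

    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    base_marks, other_marks = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            base_marks += [" " * len(u) for u in a[i1:i2]]
            other_marks += [" " * len(u) for u in b[j1:j2]]
        elif tag == "delete":
            base_marks += ["-" * len(u) for u in a[i1:i2]]
        elif tag == "insert":
            other_marks += ["+" * len(u) for u in b[j1:j2]]
        else:  # "replace"
            base_marks += ["^" * len(u) for u in a[i1:i2]]
            other_marks += ["^" * len(u) for u in b[j1:j2]]
    # Markers are padded to the width of each unit so they line up
    # under the sequence when printed on the following line.
    return sep.join(base_marks).rstrip(), sep.join(other_marks).rstrip()


if __name__ == "__main__":
    base = "you can take it off ."
    other = "you can remove it ."
    for mode in ("char", "token"):
        base_m, other_m = marker_lines(base, other, mode)
        print(mode)
        print(base, base_m, other, other_m, sep="\n")
```

In `token` mode a changed word is marked as a whole, while `char` mode marks only the characters that differ, which is usually easier to read for near-identical sequences.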
Bitext Identical Pairs
Sample output:
19056 inclusion=True
FILE-1 We' re on our way , way , way , we' re on our way
FILE-2 ♪ We 're on our way , way , way ♪ ♪ We 're on our way , way , way , we 're on our way ... ♪
====================================================================================================
21584 similarity=0.68
FILE-1 We want to make a place we can learn to love , anywhere we can be proud of .
FILE-2 ♪ We wanna make a place where we can learn to love ♪ ♪ Build a world that we can be proud of ♪
====================================================================================================
27541623 bitext pairs were read
770532 pairs (2.80%) were identical with inclusion and threshold=0.50
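The two labels in the sample output, `inclusion=True` and `similarity=...`, suggest a two-step check: first whether one side is contained in the other, then whether a similarity score reaches the threshold. The sketch below illustrates that idea under stated assumptions: the normalization (lowercasing and dropping everything except letters and digits), the use of `difflib.SequenceMatcher.ratio()` as the similarity measure, and the names `normalize` and `identical_pair` are all hypothetical and may differ from what the tool does.

```python
import difflib
import re


def normalize(text):
    # Assumed normalization: lowercase, keep only letters and digits.
    return re.sub(r"[\W_]+", "", text.lower())


def identical_pair(src, tgt, threshold=0.5, inclusion=True):
    """Return a label for a (near-)identical pair, or None otherwise."""
    a, b = normalize(src), normalize(tgt)
    if not a or not b:
        return None
    # Step 1: one side is fully contained in the other.
    if inclusion and (a in b or b in a):
        return "inclusion=True"
    # Step 2: assumed similarity measure (difflib ratio in [0, 1]).
    score = difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
    if score >= threshold:
        return f"similarity={score:.2f}"
    return None


if __name__ == "__main__":
    src = "We' re on our way , way , way , we' re on our way"
    tgt = ("♪ We 're on our way , way , way ♪ "
           "♪ We 're on our way , way , way , we 're on our way ... ♪")
    print(identical_pair(src, tgt))
```

Because the similarity measure here is an assumption, the scores it produces are not expected to match the values shown in the sample output.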
Bitext Cleaning
Sample output:
3 length-ratio=2.85
FILE-1 Саvеndіѕh , mais la totalité s' élève à ... 2,343 livres et 16 cts .
FILE-2 2,343 pounds and 16 pence .
====================================================================================================
15 uppercase=True
FILE-1 ЅΑΝ FRΑΝСΙЅСΟ , 1973
FILE-2 SAN FRANCISCO , 1973
====================================================================================================
27541623 bitext pairs were read
2960667 pairs (10.75%) were filtered out
- 2944687 pairs (10.69%) were imbalanced with length-ratio >= 2.00
- 19386 pairs (0.07%) were uppercased (both source and target)
233423 pairs (0.85%) have been capitalized
Usage: bitext-cleaning.py [-f FILE [FILE ...]] [-o OUTPUT [OUTPUT ...]] [-r RATIO] [-i] [-u] [-v]
Example: python bitext-cleaning.py -f file1 file2 -o output1 output2 -r 2.0 -u -v
Optional arguments:
-f FILE [FILE ...], --file FILE [FILE ...]
input bitext file(s) (default: None)
-o OUTPUT [OUTPUT ...], --output OUTPUT [OUTPUT ...]
output bitext file(s) (default: None)
-r RATIO, --ratio RATIO
remove pairs whose length ratios are no less than a threshold (default: None)
-i, --incomplete remove pairs if they contain incomplete sentences,
i.e. no .!?" at the end (default: False)
-u, --uppercase remove pairs if both source and target are uppercased,
otherwise capitalize uppercase strings (default: False)
-v, --verbose print identified pairs (default: False)
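The three filters described by `-r`, `-i` and `-u` can be sketched as a single function applied to each pair. This is an illustration only, written from the option descriptions above: the function name `clean_pair`, measuring length in characters, and using `str.capitalize()` for the capitalization step are assumptions, and the script may measure length or capitalize differently.

```python
END_MARKS = (".", "!", "?", '"')  # sentence-final marks named under -i


def clean_pair(src, tgt, ratio=None, incomplete=False, uppercase=False):
    """Return the (possibly modified) pair, or None if it is filtered out."""
    src, tgt = src.strip(), tgt.strip()

    # -r: drop pairs whose length ratio is no less than the threshold.
    # Assumption: length is counted in characters.
    if ratio is not None:
        shorter, longer = sorted((len(src), len(tgt)))
        if shorter == 0 or longer / shorter >= ratio:
            return None

    # -i: drop pairs where either side lacks a sentence-final mark.
    if incomplete and not (src.endswith(END_MARKS) and tgt.endswith(END_MARKS)):
        return None

    # -u: drop pairs uppercased on both sides; otherwise capitalize
    # the side that is fully uppercased.
    if uppercase:
        if src.isupper() and tgt.isupper():
            return None
        if src.isupper():
            src = src.capitalize()
        if tgt.isupper():
            tgt = tgt.capitalize()

    return src, tgt


if __name__ == "__main__":
    pairs = [
        ("SAN FRANCISCO , 1973", "SAN FRANCISCO , 1973"),
        ("HELLO THERE .", "Bonjour ."),
        ("Hello .", "Bonjour ."),
    ]
    for s, t in pairs:
        print(clean_pair(s, t, ratio=2.0, uppercase=True))
```

Returning `None` for removed pairs keeps the filter easy to compose: the caller can count removals per reason by calling the function with one option at a time.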