bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal

Bifixer doesn't work with new ftfy >=6.0

lpla opened this issue

Running Bifixer through the Bitextor automatic tests showed that it does not work with last month's releases of ftfy (>=6.0). This is the error:

(log test 101) rule bifixer:
(log test 101)     input: /home/runner/work/bitextor/bitextor/transient-mto2-en-fr/en_fr/06_02.segalign/0.gz
(log test 101)     output: /home/runner/work/bitextor/bitextor/transient-mto2-en-fr/en_fr/07_01.bifixer/0
(log test 101)     jobid: 26
(log test 101)     wildcards: batch=0
(log test 101) 
(log test 101) 2021-04-13 11:05:37,021 - ERROR - Traceback (most recent call last):
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/bifixer.py", line 242, in <module>
(log test 101)     main(args)  # Running main program
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/bifixer.py", line 234, in main
(log test 101)     perform_fixing(args)
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/bifixer.py", line 218, in perform_fixing
(log test 101)     fix_sentences(args)
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/bifixer.py", line 144, in fix_sentences
(log test 101)     fixed_source = restorative_cleaning.fix(source_sentence, args.srclang, chars_slang, charsRe_slang, punctChars_slang, punctRe_slang)
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/restorative_cleaning.py", line 640, in fix
(log test 101)     ftfy_fixed_text = " ".join([ftfy.fix_text_segment(word, fix_entities=True, uncurl_quotes=False, fix_latin_ligatures=False) for word in text.split()])
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/restorative_cleaning.py", line 640, in <listcomp>
(log test 101)     ftfy_fixed_text = " ".join([ftfy.fix_text_segment(word, fix_entities=True, uncurl_quotes=False, fix_latin_ligatures=False) for word in text.split()])
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/lib/python3.8/site-packages/ftfy/__init__.py", line 537, in fix_text_segment
(log test 101)     config = config._replace(**kwargs)
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/lib/python3.8/collections/__init__.py", line 413, in _replace
(log test 101)     raise ValueError(f'Got unexpected field names: {list(kwds)!r}')
(log test 101) ValueError: Got unexpected field names: ['fix_entities']

It seems ftfy changed its heuristics and, with them, the arguments accepted by fix_text_segment. As a result, fix_entities only works with ftfy 5.9 (the version pinned in 931ba2b).
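The last frames of the traceback show the mechanism: the options are held in a named tuple and keyword arguments are applied with `_replace`, which rejects any name that is not a field. A minimal sketch of that behaviour, using a hypothetical stand-in `FakeConfig` (an assumption for illustration, not ftfy's real configuration class, whose fields are not shown in this thread):

```python
from typing import NamedTuple


class FakeConfig(NamedTuple):
    """Hypothetical stand-in for a named-tuple config; NOT ftfy's real class."""
    uncurl_quotes: bool = True
    fix_latin_ligatures: bool = True


def apply_kwargs(config, **kwargs):
    # Same pattern as the failing frame in the log: config._replace(**kwargs)
    return config._replace(**kwargs)


# A keyword that matches a field is accepted.
config = apply_kwargs(FakeConfig(), uncurl_quotes=False)

# A keyword that is not a field is rejected. The log (Python 3.8) shows
# ValueError; newer Python versions may raise TypeError instead.
caught = None
try:
    apply_kwargs(FakeConfig(), fix_entities=True)
except (ValueError, TypeError) as err:
    caught = err
```

So any keyword that the config does not define fails in the same way as `fix_entities` does in the log.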

The proper solution is to use the new ftfy calls in Bifixer. That would be future-proof, for example if an urgent version bump is ever needed for security reasons.

It's strange: the changelog says that keyword arguments still work, and fix_entities appears in the documentation, so this error should not happen?

Anyway, even though the argument has disappeared, ftfy still fixes the entities because all the options are True by default. So omitting the parameter gives the same behaviour as before. The latest commit should fix the issue.
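For code that has to run against both ftfy 5.9 and ftfy >=6.0, one option is to try the old call first and retry without `fix_entities` if it is rejected. This is only a sketch, not necessarily what the commit does. `fix_word`, `fix_sentence` and the `fix_segment` parameter are hypothetical names; `fix_segment` is meant to be `ftfy.fix_text_segment`, passed in so the helper does not depend on the installed version.

```python
def fix_word(fix_segment, word):
    """Call fix_segment with the options from the traceback,
    dropping fix_entities if the installed version rejects it."""
    options = {"uncurl_quotes": False, "fix_latin_ligatures": False}
    try:
        return fix_segment(word, fix_entities=True, **options)
    except (ValueError, TypeError):
        # fix_entities was rejected. Per the thread, entity fixing is
        # enabled by default, so the call is retried without it.
        return fix_segment(word, **options)


def fix_sentence(fix_segment, text):
    # Same word-by-word pattern as the call shown in the traceback.
    return " ".join(fix_word(fix_segment, word) for word in text.split())
```

Usage would look like `fix_sentence(ftfy.fix_text_segment, sentence)`. The broad `except` also swallows a `ValueError` or `TypeError` raised for other reasons on the first attempt, so simply omitting the parameter, as described above, is the simpler choice when only ftfy >=6.0 needs to be supported.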