bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal

Output file encoding should be set to UTF-8

rxzhangGH opened this issue

Hi,

When using the bifixer command line tool, I noticed an issue with line 53 of bifixer.py:

parser.add_argument('output', type=argparse.FileType('w'), default=sys.stdout, help="Fixed corpus")

Because no encoding is passed to argparse.FileType, the output file is opened with the platform's default (locale-dependent) encoding, and that caused problems for me. I suggest changing the line above to:

parser.add_argument('output', type=argparse.FileType('w', encoding='UTF-8'), default=sys.stdout, help="Fixed corpus")
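
For anyone who wants to check the effect of the suggested change, here is a minimal sketch. It is not the actual bifixer code: `build_parser` and `write_sample` are hypothetical helpers made up for illustration, and the parser is stripped down to the single `output` argument. It writes non-ASCII text through the parsed argument and reads the raw bytes back, so the encoding on disk can be inspected regardless of the platform locale.

```python
import argparse
import os
import tempfile


def build_parser():
    # Hypothetical, stripped-down parser; the real bifixer parser
    # defines more arguments than this one.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "output",
        # Explicit encoding, so the result does not depend on the locale.
        type=argparse.FileType("w", encoding="UTF-8"),
        help="Fixed corpus",
    )
    return parser


def write_sample(path, text):
    # Parse the path as the 'output' argument, write the text through the
    # file object argparse opened, then return the raw bytes on disk.
    args = build_parser().parse_args([path])
    with args.output as out:
        out.write(text)
    with open(path, "rb") as raw:
        return raw.read()


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "fixed.txt")
    data = write_sample(target, "señal 日本語")
    # True when the file on disk is UTF-8 encoded.
    print(data == "señal 日本語".encode("utf-8"))
```

Without the `encoding` argument, the same write could fail with a `UnicodeEncodeError` or produce differently encoded bytes on systems whose default encoding is not UTF-8.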

Could you please post the error? The OS and Python version would also be helpful.

It's fixed anyway now.