hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

Special quote filter for CCMatrix and CCAligned

XapaJIaMnu opened this issue

A reminder to myself to fix in the near future

CCMatrix and CCAligned contain a lot of sentences with quotes that are cut off arbitrarily on both sides. Sometimes one side will have quotes and the other won't, or one side will start with a number and the other won't. The same happens at the end of the sentence. Perhaps a filter can be added that will "fix up" those sentences by removing superfluous quotes from both sides when they are mismatched.
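
To make the idea concrete, here is a rough sketch of what such a fix-up could look like. It is not an existing OpusCleaner filter: the function name `fix_mismatched_quotes` and the `QUOTES` character set are made up for illustration, and it only handles quotes, not the mismatched leading numbers. Apostrophes are left out of the set on purpose, on the assumption that stripping them would damage things like possessives.

```python
# Hypothetical sketch, not part of OpusCleaner.
# QUOTES is an assumed set of quote characters; extend it as needed.
QUOTES = "\"“”„«»"


def _starts_with_quote(s: str) -> bool:
    return bool(s) and s[0] in QUOTES


def _ends_with_quote(s: str) -> bool:
    return bool(s) and s[-1] in QUOTES


def fix_mismatched_quotes(src: str, trg: str) -> tuple[str, str]:
    """Strip leading/trailing quotes that appear on only one side of a pair."""
    src, trg = src.strip(), trg.strip()

    # Leading quotes: only act when exactly one side starts with a quote.
    if _starts_with_quote(src) != _starts_with_quote(trg):
        src = src.lstrip(QUOTES).lstrip()
        trg = trg.lstrip(QUOTES).lstrip()

    # Trailing quotes: same rule for the end of the sentence.
    if _ends_with_quote(src) != _ends_with_quote(trg):
        src = src.rstrip(QUOTES).rstrip()
        trg = trg.rstrip(QUOTES).rstrip()

    return src, trg
```

Pairs where both sides are quoted consistently pass through untouched, so `fix_mismatched_quotes('"Hello"', '"Hallo"')` keeps its quotes, while `fix_mismatched_quotes('"Hello world', 'Hallo Welt')` drops the stray opening quote on the source side.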