n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" (https://arxiv.org/abs/1909.04761).

Postprocess_wikitext does not separate Wikipedia articles.

PiotrCzapla opened this issue · comments

The LM needs to be trained on long articles to be useful for downstream tasks. We solve this in wikitext-103 by splitting on article titles, detected as follows:

import re

def istitle(line):
    # a top-level wikitext-103 title looks like " = Title = " on its own line
    return len(re.findall(r'^ = [^=]* = $', line)) != 0

postprocess_wikitext needs to add similar separators to the articles so that the BOS and EOS tokens are trained correctly. Until this is fixed, training an LM on a custom Wikipedia dump will break.
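
A minimal sketch (my assumption, not the repo's actual implementation) of how postprocess_wikitext could group lines into whole articles using the istitle helper above, so that BOS/EOS end up marking article boundaries rather than arbitrary lines:

import re

def istitle(line):
    # as defined above: a top-level title line looks like " = Title = "
    return len(re.findall(r'^ = [^=]* = $', line)) != 0

def split_into_articles(lines):
    # yield one string per article, starting a new article at each title line
    article = []
    for line in lines:
        if istitle(line) and article:
            yield '\n'.join(article)
            article = []
        article.append(line)
    if article:
        yield '\n'.join(article)

A downstream tokenizer could then wrap each yielded article in a single BOS/EOS pair.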

I noticed that titles were still in the tokens, and my quick fix was to force the number of words/tokens per line in [lang].wiki.[split].tokens to be between 10 and 250 (I found sentences with 4k tokens!). Does this make sense?
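
For reference, a rough sketch of the quick fix described above, assuming plain one-sentence-per-line .tokens files (file names are only illustrative):

def filter_lines(in_path, out_path, min_words=10, max_words=250):
    # keep only lines whose whitespace-token count falls inside the window
    with open(in_path, encoding='utf-8') as fin, open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            n = len(line.split())
            if min_words <= n <= max_words:
                fout.write(line)

# e.g. filter_lines('ar.wiki.train.tokens', 'ar.wiki.train.filtered.tokens')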

@abedkhooli long sentences are fine for language modeling, as they are joined together into one super long string anyway and then chopped into bptt chunks. So 4k is fine. The problem appears when we have very short sentences: the training examples are randomized before they are joined, so our language models never learn long dependencies and they start to treat "." as EOS, which completely breaks imdb.
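
To make the failure mode concrete, here is a simplified sketch (numpy-based, not the actual fastai loader) of how the LM data is assembled: examples are shuffled, concatenated into one stream, and cut into bptt windows, so if the examples are short sentences the stream carries almost no long-range structure to learn:

import numpy as np

def make_lm_batches(token_id_lines, bptt=70):
    # shuffle the examples, then join them into a single token stream
    order = np.random.permutation(len(token_id_lines))
    stream = np.concatenate([token_id_lines[i] for i in order])
    # chop the stream into contiguous bptt-sized input/target windows
    n = (len(stream) - 1) // bptt
    for i in range(n):
        x = stream[i * bptt:(i + 1) * bptt]
        y = stream[i * bptt + 1:(i + 1) * bptt + 1]
        yield x, y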

I see you already fixed the titles and merged the refactor into master. Great effort!
My worry about extra long 'sentences' is one BOS/EOS for a full article (the average sentence length for the Arabic wiki is around 36 words, including the title effect).

My worry about extra long 'sentences' is one BOS/EOS for a full article
What's wrong with one BOS/EOS per article?

My worry about extra long 'sentences' is one BOS/EOS for a full article
What's wrong with one BOS/EOS per article?
I thought it was odd to have a sentence that long (in this case it is a full charter of 4903 words due to bad punctuation: no spaces after periods, no new lines).
I just rebuilt the wiki tokens from the new scripts and noticed there are still one-word items (before applying postprocess), and the mean words per line was lower than before (this depends on the distribution of title lengths, so I can't draw conclusions). So maybe another filtering step is needed.
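
A quick way to check that distribution (an ad-hoc script, not part of the repo; the file name is just an example):

def line_length_stats(path):
    # word counts per non-empty line of a .tokens file
    lengths = [len(line.split()) for line in open(path, encoding='utf-8') if line.strip()]
    print('lines:', len(lengths))
    print('mean words/line:', sum(lengths) / len(lengths))
    print('one-word lines:', sum(1 for n in lengths if n == 1))
    print('longest line:', max(lengths))

# line_length_stats('ar.wiki.train.tokens')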

Not sure about the effect, but I would filter out cases like this as outliers (articles that are one long sentence are not expected in real life). Basically, the article goes like this:
"I went to the shop.The shop was closed.I did not buy anything."