n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" (https://arxiv.org/abs/1909.04761).

Postprocess_wikitext does not separate Wikipedia articles.

PiotrCzapla opened this issue · comments

The LM needs to be trained on long articles to be useful for downstream tasks. We solve this in wikitext-103 by splitting on article titles, detected as follows:

import re

def istitle(line):
    # a top-level wikitext-103 title looks like " = Title = " on its own line
    return len(re.findall(r'^ = [^=]* = $', line)) != 0

postprocess_wikitext needs to add similar separators to the articles so that the BOS and EOS tokens are trained correctly. Until this is fixed, training an LM on a custom Wikipedia dump will break.
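
A minimal sketch (my assumption, not the repo's actual implementation) of how postprocess_wikitext could group lines into whole articles using the istitle helper above, so that BOS/EOS end up marking article boundaries rather than arbitrary lines:

import re

def istitle(line):
    # as defined above: a top-level title line looks like " = Title = "
    return len(re.findall(r'^ = [^=]* = $', line)) != 0

def split_into_articles(lines):
    # yield one string per article, starting a new article at each title line
    article = []
    for line in lines:
        if istitle(line) and article:
            yield '\n'.join(article)
            article = []
        article.append(line)
    if article:
        yield '\n'.join(article)

A downstream tokenizer could then wrap each yielded article in a single BOS/EOS pair.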

I noticed that titles were still in the tokens, and my quick fix was to force the number of words/tokens per line in [lang].wiki.[split].tokens to be between 10 and 250 (I found sentences with 4k tokens!). Does this make sense?
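
For reference, a rough sketch of the quick fix described above, assuming plain one-sentence-per-line .tokens files (file names are only illustrative):

def filter_lines(in_path, out_path, min_words=10, max_words=250):
    # keep only lines whose whitespace-token count falls inside the window
    with open(in_path, encoding='utf-8') as fin, open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            n = len(line.split())
            if min_words <= n <= max_words:
                fout.write(line)

# e.g. filter_lines('ar.wiki.train.tokens', 'ar.wiki.train.filtered.tokens')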

@abedkhooli long sentences are fine for language modeling, as they are joined together into one super long string anyway and then chopped into bptt chunks. So 4k is fine. The problem appears when we have very short sentences: the training examples are randomized before they are joined, so our language models never learn long dependencies and they start to treat "." as EOS, which completely breaks imdb.
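
To make the failure mode concrete, here is a simplified sketch (numpy-based, not the actual fastai loader) of how the LM data is assembled: examples are shuffled, concatenated into one stream, and cut into bptt windows, so if the examples are short sentences the stream carries almost no long-range structure to learn:

import numpy as np

def make_lm_batches(token_id_lines, bptt=70):
    # shuffle the examples, then join them into a single token stream
    order = np.random.permutation(len(token_id_lines))
    stream = np.concatenate([token_id_lines[i] for i in order])
    # chop the stream into contiguous bptt-sized input/target windows
    n = (len(stream) - 1) // bptt
    for i in range(n):
        x = stream[i * bptt:(i + 1) * bptt]
        y = stream[i * bptt + 1:(i + 1) * bptt + 1]
        yield x, y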

I see you already fixed the titles and merged the refactor into master. Great effort!
My worry about extra long 'sentences' is one BOS/EOS for a full article (the average sentence length for the Arabic wiki is around 36 words, including the title effect).

My worry about extra long 'sentences' is one BOS/EOS for a full article
What's wrong with one BOS/EOS per article?

My worry about extra long 'sentences' is one BOS/EOS for a full article
What's wrong with one BOS/EOS per article?
I thought it was odd to have a sentence that long (in this case it is a full charter of 4903 words due to bad punctuation: no spaces after periods, no new lines).
I just rebuilt the wiki tokens from the new scripts and noticed there are still one-word items (before applying postprocess), and the mean words per line was lower than before (this depends on the distribution of title lengths, so I can't draw conclusions). So maybe another filtering step is needed.
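
A quick way to check that distribution (an ad-hoc script, not part of the repo; the file name is just an example):

def line_length_stats(path):
    # word counts per non-empty line of a .tokens file
    lengths = [len(line.split()) for line in open(path, encoding='utf-8') if line.strip()]
    print('lines:', len(lengths))
    print('mean words/line:', sum(lengths) / len(lengths))
    print('one-word lines:', sum(1 for n in lengths if n == 1))
    print('longest line:', max(lengths))

# line_length_stats('ar.wiki.train.tokens')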

Not sure about the effect, but I would filter out cases like this as outliers (articles that are one long sentence are not expected in real life). Basically, the article goes like this:
"I went to the shop.The shop was closed.I did not buy anything."