valentinmace / noisy-text

Add noise to your text, can be used to improve synthetic training corpus for Neural Machine Translation

Noisy-Text

Add noise to your text, inspired by Edunov et al. (2018) "Understanding Back-Translation at scale"

Made at Qwant Research during my internship

It is often a good idea to add noise to your synthetic text data, for example when using backtranslation

Edunov et al. (2018) showed that doing so can help to provide a stronger training signal

This repository contains:

  • A script to reproduce experiments described by Edunov et al. (2018) in their noise approach
  • A simple architecture so you can play with noise parameters or implement your own noise functions

Installation

Libraries you'll need to run the project:

tqdm

Clone the repo using

git clone https://github.com/valentinmace/noisy-text.git

Usage

I've implemented the 3 noise functions described in the paper:

  1. Delete words with a given probability (default is 0.1)
  2. Replace words with a filler token with a given probability (default is 0.1)
  3. Swap words within a limited range (default range is 3)

The default parameters reproduce the Edunov et al. (2018) experiments, but you can tune them and may find better values
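To make the three noise types concrete, here is a minimal sketch of how they could be written. It is not the code from add_noise.py: the function names, signatures and the `rng` argument are assumptions made for illustration, and the swap step uses the common "add a random offset to each position, then sort" trick, which may differ from what this repository does.

```python
import random


def delete_words(tokens, probability=0.1, rng=random):
    """Drop each token independently with the given probability."""
    # Note: this sketch can return an empty list for short sentences.
    return [tok for tok in tokens if rng.random() >= probability]


def replace_words(tokens, probability=0.1, filler_token="BLANK", rng=random):
    """Replace each token by the filler token with the given probability."""
    return [filler_token if rng.random() < probability else tok for tok in tokens]


def swap_words(tokens, permutation_range=3, rng=random):
    """Locally shuffle tokens so that none moves more than permutation_range positions."""
    # Each position i gets the key i + u, with u drawn from [0, permutation_range + 1).
    # Sorting by these keys can only displace a token by permutation_range at most.
    keys = [i + rng.random() * (permutation_range + 1) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: keys[i])
    return [tokens[i] for i in order]


def add_noise(sentence, rng=random):
    """Apply the three noise types in sequence to a whitespace-tokenized sentence."""
    tokens = sentence.split()
    tokens = delete_words(tokens, rng=rng)
    tokens = replace_words(tokens, rng=rng)
    tokens = swap_words(tokens, rng=rng)
    return " ".join(tokens)


if __name__ == "__main__":
    print(add_noise("the quick brown fox jumps over the lazy dog", random.Random(0)))
```

Passing a seeded `random.Random` instance makes the noise reproducible, which is handy when comparing parameter settings.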

Example of simple usage

python add_noise.py data/example --progress

Example of complete usage

python add_noise.py data/example --delete_probability 0.9 --replace_probability 0.9 --filler_token 'MASK' --permutation_range 3

Important Note

If you run a subword tool such as SentencePiece after adding noise to your corpus, note that your replacement token ('BLANK' by default) might be segmented into something like '▁B LAN K'

I recommend making a pass over your corpus to correct it (adapt the pattern to your token and segmentation):

sed -i 's/▁B LAN K/▁BLANK/g' yourtextfile
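If you would rather not hard-code one particular segmentation, the same repair can be sketched in Python with a regular expression that accepts any split of the filler token. This is an illustration only: `build_repair_pattern` and `repair_line` are hypothetical helpers, not part of this repository, and the sketch assumes pieces are separated by single spaces and that words start with the '▁' marker.

```python
import re


def build_repair_pattern(filler_token="BLANK", word_marker="▁"):
    """Build a regex matching the filler token however it was split into pieces."""
    # Allow an optional space between any two characters of the token.
    body = " ?".join(re.escape(ch) for ch in filler_token)
    # The lookahead requires the match to end on a piece boundary, so a piece
    # such as "KET" is left alone. A separate trailing piece (e.g. "▁B LAN K ET")
    # is still merged, exactly as the sed command above would do.
    return re.compile(re.escape(word_marker) + body + r"(?= |$)")


def repair_line(line, filler_token="BLANK", word_marker="▁"):
    """Glue a segmented filler token back into a single piece."""
    pattern = build_repair_pattern(filler_token, word_marker)
    # A function is used as replacement so the token is inserted literally.
    return pattern.sub(lambda _match: word_marker + filler_token, line)


if __name__ == "__main__":
    print(repair_line("▁this ▁is ▁B LAN K ▁text"))
```

To process a whole file, apply `repair_line` to each line and write the result to a new file, then check a few lines by hand before replacing the original.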

Results

I ran NMT experiments on the WMT 2019 de-en corpus, using all available parallel data and adding the monolingual news-crawl 2018 data via backtranslation.

After translating it from German to English to obtain my synthetic data, I added noise to it using this repo, giving the following results. All results are BLEU scores.

The first table reports a Transformer model identical to the "base model" in Vaswani et al. (2017); the second table reports the "Transformer Big" model from the same paper.

Model                    newstest2017  newstest2018
baseline                 26.62         40.47
backtranslation only     27.06         40.06
backtranslation + noise  27.88         41.92

Transformer base model

Model                    newstest2017  newstest2018
baseline                 29.75         45.8
backtranslation + noise  31.33         47.4

Transformer Big model

Notes

Do not hesitate to contact me if you need help, want a feature, or find a bug

Contributions are welcome

Meta

Valentin Macé - LinkedIn, YouTube, Twitter - valentin.mace@kedgebs.com

Distributed under the MIT license. See LICENSE for more information.
