ctongfei / tape4nmt

a ducttape workflow for neural machine translation

tape4nmt

What is that?

tape4nmt is a DuctTape workflow created to replace your bash scripts for running NMT experiments. DuctTape is a workflow management system written by Jonathan Clark. In general, if you are writing bash scripts to run experiments, you may want to consider using DuctTape instead. It organizes your experiment scripts such that:

  1. you are much less likely to do stupid things, e.g. deleting your best model by accident or using the wrong data for an experiment (the list goes on);
  2. you can easily run, delete, and/or re-run batch experiments, re-using any partial results where possible, with a correctness guarantee;
  3. you can easily re-use part of your modularized experiment script (called "workflow") to create new workflows.
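
To make the second point concrete, here is a minimal Python sketch of the general idea of re-using partial results. It is not how DuctTape is implemented: the `run_workflow` function, the `.out` file naming, and the toy task names are all hypothetical and exist only for illustration. Each task stores its output in a working directory, and a task whose output already exists is skipped on the next run.

```python
import tempfile
from pathlib import Path


def run_workflow(tasks, workdir):
    """Run tasks in dependency order, skipping tasks whose output already exists.

    `tasks` maps a task name to (list of dependency names, function).
    Each function takes the outputs of its dependencies (strings) and
    returns a string. Returns the names of the tasks actually executed.
    NOTE: hypothetical illustration, not DuctTape's real mechanism.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    executed = []

    def build(name):
        out = workdir / f"{name}.out"
        if out.exists():
            # Finished in an earlier run: reuse the stored result.
            return out.read_text()
        deps, fn = tasks[name]
        inputs = [build(dep) for dep in deps]
        result = fn(*inputs)
        out.write_text(result)
        executed.append(name)
        return result

    for name in tasks:
        build(name)
    return executed


def demo():
    # Toy three-step pipeline; the task names are made up for this example.
    tasks = {
        "tokenize": ([], lambda: "tok"),
        "bpe": (["tokenize"], lambda t: t + "+bpe"),
        "train": (["bpe"], lambda b: b + "+model"),
    }
    with tempfile.TemporaryDirectory() as d:
        first = run_workflow(tasks, d)       # nothing cached: all tasks run
        (Path(d) / "train.out").unlink()     # throw away only the last result
        second = run_workflow(tasks, d)      # only "train" has to run again
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print(first)   # ['tokenize', 'bpe', 'train']
    print(second)  # ['train']
```

This sketch only checks whether an output file exists; it does not notice when a task's inputs or parameters have changed, which is part of what a real workflow manager has to track in order to offer a correctness guarantee.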

I barely got rid of EMS. Why create another workflow for MT?

Face the truth: MT researchers are dealing with pipelines, even when we have an end-to-end system. For example, this is a minimalistic pipeline for building an NMT system with BPE:

[Figure: Minimalistic Workflow]

Doing this by hand is, in my own experience from several evaluation campaigns, very error-prone and inefficient. Hence the need for an experiment management system.

tape4nmt also fixes a lot of problems EMS has. For example, EMS consolidated workflow management and workflow definition into one gigantic Perl script, with some regular expression failing about every 6 months. tape4nmt, however, uses DuctTape for workflow management and only takes care of workflow definition. It also keeps data/workflow separation in mind throughout its design. Besides, hopefully you'll also find that tape4nmt has a better structure for maintenance.

tape4nmt aims to support all the NMT toolkits on the market. By default we use fairseq, but we currently support OpenNMT-py as well, and sockeye support is planned for the near future. If your favorite NMT toolkit isn't supported yet, you should still be able to make a moderate amount of modifications and work mostly with this pipeline. (If you do, please consider contributing your effort!)

I'm sold. How do I start to use it?

Before you start, note that this workflow is tested out of the box on the sun grid manager as configured on the CLSP grid. But don't worry: you may only need to slightly tweak action_flags and resource_flags in the main.tconf file to use this on another grid that uses the sun grid manager. Or, in the worst case, you could substitute all submitters with "bash" to run everything locally, or define your own submitter!

The workflow is pre-configured to build an IWSLT German-English system with fairseq. Follow this tutorial to get started.

Hey I found your workflow does not do xxx...

This workflow is not supposed to include everything in the first place! I created this repository mainly in the hope that people could benefit from running a basic workflow to learn the basics, and then modify it to do whatever they would like to do.

In order to modify this repository properly, you need to learn how to write DuctTape workflows. DuctTape itself contains a pretty comprehensive tutorial. Nathan Schneider has a better one, although it seems incomplete.

What are the limitations I should keep in mind while using this?

While we try our best to make the workflow intuitive and easy to use, keep in mind that our upstream project DuctTape hasn't been developed for quite a while. As a consequence, there are some loose ends that haven't been properly tied, e.g., this and this, which means there is a moderate amount of work-arounds we have to make (e.g., all the dummy tasks we've made) to achieve what we aim to do. I plan to make a short list of failure cases that you can hopefully avoid should you modify the workflow to achieve your goal.

What if I want to contribute?

If you think some of your changes are really general and other people should benefit from your changes as well, pull requests are always appreciated!

The developers of this project are actively collaborating with Tongfei Chen to make improvements to DuctTape and weed out issues we've identified. If you love this project and write Scala, consider joining forces with us on this separate effort!
