ParroT

InstructMT Data & Scripts @ParroT

A collection of instruction data and scripts for machine translation.

The resulting files mainly fit the format of ParroT and partially that of Stanford-Alpaca.

Machine Translation

Data Resources

The table below lists high-quality translation data resources for instruction tuning. You can access the data through the links (the previously broken links have been fixed).

| Data | Source | Zh-En | En-Zh | De-En | En-De | Format |
|---|---|---|---|---|---|---|
| Translation | newstest17-20 | 12.2k | 12.2k | 13.3k | 13.3k | `TXT` |
| MQM-Score | newstest20 | 20.0k | n/a | n/a | 14.1k | `JSON` |
| MQM-Error | newstest20 | 124.3k | n/a | n/a | 79.0k | `TXT` |
| COMET-Score | newstest20 | n/a | 19.8k | 9.4k | n/a | `JSON` |
| Translation | wmt20 | 475.0k | 475.0k | n/a | n/a | `TXT` (filtered from 26M) |

ParroT Instructions

parrot
├── alpaca
│   └── convert_alpaca_to_hf.py
├── contrastive-instruction
│   ├── convert_cometscore_to_csi_alpaca.py
│   ├── convert_mqmscore_to_csi_alpaca.py
│   └── instruct_t2t.txt
├── error-guided-instruction
│   ├── convert_cometscore_to_egi_alpaca.py
│   ├── convert_mqmerror_to_egi_alpaca.py
│   └── instruct_e2t.txt
└── translation-instruction
    ├── convert_pair_to_alpaca.py
    └── instruct_follow.txt

1. Translation Instruction

Example usage and output:

cd ./parrot/translation-instruction

# Download the Translation data into the folder

python3 convert_pair_to_alpaca.py \
    -s brx_Deva -t eng_Latn \
    -if instruct_follow.txt \
    -sf ~/datas/eng-brx/train.brx \
    -tf ~/datas/eng-brx/train.eng \
    -of data_ti_alp.brx_Deva-eng_Latn.json
[
    {
        "instruction": "I'd appreciate it if you could present the English translation for these sentences.",
        "input": "28岁厨师被发现死于旧金山一家商场",
        "output": "28-Year-Old Chef Found Dead at San Francisco Mall"
    },
    ...
]
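For reference, the core of this conversion can be sketched as follows. This is a minimal sketch, not the actual `convert_pair_to_alpaca.py`; it assumes the instruction file (e.g. `instruct_follow.txt`) holds one instruction template per line, and that the source and target files are line-aligned:

```python
import json
import random

def pairs_to_alpaca(src_file, tgt_file, template_file, out_file, seed=0):
    """Pair source/target lines and wrap each pair in an Alpaca-style record."""
    random.seed(seed)
    # One instruction template per line; a random one is picked per pair
    with open(template_file, encoding="utf-8") as f:
        templates = [line.strip() for line in f if line.strip()]
    records = []
    with open(src_file, encoding="utf-8") as fs, open(tgt_file, encoding="utf-8") as ft:
        for src, tgt in zip(fs, ft):
            records.append({
                "instruction": random.choice(templates),  # vary the phrasing
                "input": src.strip(),
                "output": tgt.strip(),
            })
    with open(out_file, "w", encoding="utf-8") as fo:
        json.dump(records, fo, ensure_ascii=False, indent=4)
```

`ensure_ascii=False` keeps non-Latin text (e.g. Chinese or Bodo) human-readable in the output JSON.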

2. Contrastive Instruction

Example usage and output for MQM Zh-En:

cd ./parrot/contrastive-instruction

# Download the MQM-Score data into the folder

python3 convert_mqmscore_to_csi_alpaca.py \
    -s zh -t en \
    -if instruct_t2t.txt \
    -i sys_rating_mqm.zh-en.json \
    -o data_csi_alp.zh-en.json
[
    {
        "instruction": "Could you supply the English translation for the upcoming sentences?",
        "input": "国有企业和优势民营企业走进赣南革命老区。\n\n### Hint: A superior translation would be",
        "output": "<p>State-owned enterprises and advantageous private enterprises entered the old revolutionary area of Gannan.</p> rather than <p>State-owned enterprises and dominant private enterprises entered the old revolutionary area of southern Jiangxi.</p>"
    },
    ...
]

Example usage and output for COMET En-Zh:

cd ./parrot/contrastive-instruction

# Download the COMET-Score data into the folder

python3 convert_cometscore_to_csi_alpaca.py \
    -s en -t zh \
    -if instruct_t2t.txt \
    -i sys_rating_comet.en-zh.json \
    -o data_csi_alp.en-zh.json
[
    {
        "instruction": "Could you supply the Chinese translation for the upcoming sentences?",
        "input": "Michael Jackson wore tape on his nose to get front pages, former bodyguard claims\n\n### Hint: A superior translation would be",
        "output": "<p>前保镖声称迈克尔·杰克逊为登上头条新闻在鼻子上贴上胶带</p> rather than <p>前保镖称迈克尔·杰克逊为上头版在鼻子上贴胶带</p>"
    },
    ...
]
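The contrastive idea, shared by both scripts above, is to take two system translations of the same source ranked by quality score (MQM or COMET) and contrast the better one against the worse one. A hypothetical helper (the record layout mirrors the examples above, but this is not the actual script code):

```python
def make_contrastive_record(instruction, source, better, worse):
    """Build a contrastive-instruction record from two ranked translations.

    `better` and `worse` are translations of `source`, ordered by quality
    score; the hint suffix and <p>...</p> wrapping follow the examples above.
    """
    return {
        "instruction": instruction,
        "input": source + "\n\n### Hint: A superior translation would be",
        "output": f"<p>{better}</p> rather than <p>{worse}</p>",
    }
```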

3. Error-Guided Instruction

Example usage and output for MQM Zh-En:

cd ./parrot/error-guided-instruction

# Download the MQM-Error data into the folder

python3 convert_mqmerror_to_egi_alpaca.py \
    -s zh -t en \
    -if instruct_e2t.txt \
    -i mqm_newstest2020_zhen.txt \
    -o data_egi_alp.zh-en.json
[
    {
        "instruction": "Could you supply the English translation for the upcoming sentences?",
        "input": "国有企业和优势民营企业走进赣南革命老区。\n\n### Hint: A rendition having minor fluency/grammar errors is possible",
        "output": "State-owned enterprises and dominant private enterprises entered the old revolutionary area of southern Jiangxi <v>State-owned enterprises and dominant private enterprises entered the old revolutionary area of southern Jiangxi.</v> "
    },
    ...
]

Example usage and output for COMET En-Zh:

cd ./parrot/error-guided-instruction

# Download the COMET-Score data into the folder

python3 convert_cometscore_to_egi_alpaca.py \
    -s en -t zh \
    -if instruct_e2t.txt \
    -i sys_rating_comet.en-zh.json \
    -o data_egi_alp.en-zh.json
[
    {
        "instruction": "Could you supply the Chinese translation for the upcoming sentences?",
        "input": "Michael Jackson wore tape on his nose to get front pages, former bodyguard claims\n\n### Hint: A rendition having no errors is possible",
        "output": "前保镖声称迈克尔·杰克逊为登上头条新闻在鼻子上贴上胶带"
    },
    ...
]

* Alpaca Format

The above three instruction types can be used with Stanford-Alpaca directly.

Alternatively, transform them to fit the format of ParroT as follows:

cd ./parrot/translation-instruction

python3 ../alpaca/convert_alpaca_to_hf.py \
    -i data_ti_alp.zh-en.json \
    -o data_ti_hf.zh-en.json
# Each dict is saved on a single line; it is shown here across multiple lines for readability
{
    "text": "28-Year-Old Chef Found Dead at San Francisco Mall</s>",
    "prefix": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nI'd appreciate it if you could present the English translation for these sentences.\n\n### Input:\n28岁厨师被发现死于旧金山一家商场\n\n### Response:"
}
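The conversion folds each Alpaca record into the standard Alpaca prompt template, producing the `prefix`/`text` pair shown above. A sketch (the real script is `convert_alpaca_to_hf.py`; only the output format below is taken from the example above):

```python
# Standard Alpaca prompt template for records with an input field
PROMPT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)

def alpaca_to_hf(record, eos="</s>"):
    """Wrap one Alpaca record into the ParroT prefix/text format."""
    return {
        "text": record["output"] + eos,
        "prefix": PROMPT.format(instruction=record["instruction"],
                                input=record["input"]),
    }
```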

Instruction Variants (To Be Optimized)

1. Translation Instruction

【 Instruction + Source > Target 】: Input the instruction and source sentence at the same time.


【 Instruction > Response > Source > Target 】: Input the instruction only; the LLM should then prompt the user for the source sentence.


【 Source > Instruction > Target 】: Translate the last chat record.


Citation

Please cite our paper if you find these data resources helpful:

@inproceedings{jiao2023parrot,
  title     = {ParroT: Translating During Chat Using Large Language Models},
  author    = {Wenxiang Jiao and Jen-tse Huang and Wenxuan Wang and Xing Wang and Shuming Shi and Zhaopeng Tu},
  booktitle = {ArXiv},
  year      = {2023}
}
