Must-c EN-DE dataset not found

Question

EyjafjalIa opened this issue 7 months ago · comments

Hi,
Thanks for open-sourcing your code. I tried to use the FBK link given in the example's "Step 1. Download and Preprocess Dataset" to download the MuST-C dataset, but the link redirects to https://www.fbk.eu/en/research-centers/ when I click it. Is the MuST-C dataset available on another platform? Thanks!

Biao Zhang · Answer 1 · Mon Jan 22 2024 12:04:55 GMT+0800 (China Standard Time)

Hey, it seems that the original FBK link is broken. If you just need the textual translations from MuST-C, you can check out the Hugging Face datasets, e.g. https://huggingface.co/datasets/enimai/MuST-C-de

Wan Jiarui · Answer 2 · Tue Jan 23 2024 00:44:03 GMT+0800 (China Standard Time)

Thanks a lot! I found the MuST-C dataset on Hugging Face and noticed that it is distributed as CSV files. Could you describe the structure the MuST-C dataset had when it was downloaded from FBK, so that I can check whether preprocess_phoneix.sh works?

Biao Zhang · Answer 3 · Tue Jan 23 2024 22:13:56 GMT+0800 (China Standard Time)

They are in a different format. In MuST-C, the textual data is organised as plain text in two files: train.en and train.de. Each line in a file contains one sentence, and the sentences on the same line number in the two files are translations of each other. You can split the CSV data into plain text.
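For anyone doing the same conversion, here is a minimal sketch of splitting a CSV into the two line-aligned files. It is not part of the repository. The function name `csv_to_parallel` is made up for illustration, and the column names `en` and `de` are assumptions, so check the CSV header of the dataset you downloaded and adjust them.

```python
import csv


def csv_to_parallel(csv_path, out_prefix, src_col="en", tgt_col="de"):
    """Write <out_prefix>.en and <out_prefix>.de with one sentence per line.

    src_col and tgt_col are assumed column names; adjust them to match
    the header of your CSV file. Returns the number of pairs written.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as f_in, \
         open(f"{out_prefix}.en", "w", encoding="utf-8") as f_en, \
         open(f"{out_prefix}.de", "w", encoding="utf-8") as f_de:
        for row in csv.DictReader(f_in):
            # Collapse internal whitespace (including newlines) so each
            # sentence stays on one line and the two files remain aligned.
            src = " ".join((row.get(src_col) or "").split())
            tgt = " ".join((row.get(tgt_col) or "").split())
            # Skip rows where either side is empty; writing only one side
            # would shift the line alignment.
            if not src or not tgt:
                continue
            f_en.write(src + "\n")
            f_de.write(tgt + "\n")
            written += 1
    return written
```

Calling `csv_to_parallel("train.csv", "train")` would produce train.en and train.de in the current directory. Both files should end up with the same number of lines, which is worth checking before running the preprocessing script.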

Wan Jiarui · Answer 4 · Tue Jan 23 2024 22:18:27 GMT+0800 (China Standard Time)

I got it! Thank you again!