bzhangGo / sltunet

SLTUNET: A Simple Unified Model for Sign Language Translation (ICLR 2023)


Must-c EN-DE dataset not found

EyjafjalIa opened this issue · comments

Hi,
Thanks for open-sourcing your code. I tried to use the FBK link given in Step 1 of the example (Download and Preprocess Dataset) to download the MuST-C dataset, but the link redirects to https://www.fbk.eu/en/research-centers/ when I click it. Is the MuST-C dataset available from another platform? Thanks!

Hey, it seems that the original FBK link is broken. If you just need the textual translations from MuST-C, you can check out the Hugging Face datasets, e.g. https://huggingface.co/datasets/enimai/MuST-C-de

Thanks a lot! I found the MuST-C dataset on Hugging Face and noticed that it is distributed as CSV files. Could you describe the structure of the MuST-C dataset as it was originally provided by FBK, so that I can check whether preprocess_phoneix.sh works?

They are in a different format. In MuST-C, the textual data is organised as plain text in two files: train.en and train.de. Each line in a file contains one sentence, and the sentences on the same line of the two files are translations of each other. You can convert the CSV data to plain text.
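For anyone doing the same conversion, here is a minimal sketch of how the CSV could be split into the two line-aligned files. It assumes the CSV has a header row with one column per language; the column names `en` and `de` and the helper name `csv_to_parallel` are assumptions for illustration, so check the header of the file you downloaded and adjust them.

```python
import csv


def csv_to_parallel(csv_path, out_prefix, src_col="en", tgt_col="de"):
    """Write <out_prefix>.en and <out_prefix>.de with one sentence per line.

    src_col and tgt_col are assumed column names; change them to match
    the header of the CSV you actually have.
    Returns the number of sentence pairs written.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as f_in, \
            open(f"{out_prefix}.en", "w", encoding="utf-8") as f_src, \
            open(f"{out_prefix}.de", "w", encoding="utf-8") as f_tgt:
        for row in csv.DictReader(f_in):
            # Collapse embedded newlines and extra whitespace so that
            # every sentence occupies exactly one line.
            src = " ".join((row.get(src_col) or "").split())
            tgt = " ".join((row.get(tgt_col) or "").split())
            if not src or not tgt:
                # Skip pairs with an empty side so the files stay aligned.
                continue
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")
            written += 1
    return written
```

Calling `csv_to_parallel("train.csv", "train")` would produce train.en and train.de in the current directory. Both files should end up with the same number of lines, which is worth checking before running the preprocessing script.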

I got it! Thank you again!