Trouble Extracting Monolingual Datasets from SeamlessAlign

Question

nassergharbi opened this issue 7 months ago

Problem Description

Extracting Maltese data from the SeamlessAlign dataset (linked under Dataset Details below) is proving difficult. The metadata for text <-> audio alignment contains two subsets: one that appears to be sourced from Common Crawl and specifies data URLs, and another drawn from other corpora for which no URLs are specified.

Questions

Data Retrieval Without URLs:
- How can one retrieve datasets for which there are no specified URLs?
Linking Audio to Transcription for Non-Common Crawl Corpora:
- For datasets from sources other than Common Crawl, how can the link between audio and transcription be established when no URL is provided?

Objective

I am particularly interested in extracting Maltese audio datasets with corresponding transcriptions to establish a gold standard.

Dataset Details

Dataset Link: https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md

Thank you very much in advance for your support!

Nas · Answer 1 · Mon Jan 22 2024 23:58:57 GMT+0800 (China Standard Time)

Sorry, I've posted this issue in the wrong place. The correct repo is here: facebookresearch/seamless_communication#338