kpu / preprocess

Corpus preprocessing

Trouble Extracting Monolingual Datasets from SeamlessAlign

nassergharbi opened this issue

commented

Problem Description

The dataset provided at this link makes it difficult to extract the Maltese data. Specifically, the metadata for the Textual <-> Audio alignment includes one subset, seemingly sourced from Common Crawl, that specifies data URLs, and another subset, drawn from other corpora, that specifies no URLs.

Questions

  1. Data Retrieval Without URLs:

    • How can one retrieve datasets for which there are no specified URLs?
  2. Linking Audio to Transcription for Non-Common Crawl Corpora:

    • For datasets from sources other than Common Crawl, how can the link between audio and transcription be established when no URL is provided?

Objective

I am particularly interested in extracting Maltese audio data with the corresponding transcriptions in order to establish a gold standard.

Dataset Details

Thank you very much in advance for your support!

commented

Sorry, I've posted this issue in the wrong place. The correct repo is here: facebookresearch/seamless_communication#338