ViktorAlm / Nasjonalbank-converter

Converts the nasjonalbank 16 kHz dataset into LibriSpeech format

Nasjonalbank-converter

Converts Nasjonalbiblioteket's Språkbanken 16 kHz dataset into something like LibriSpeech's format:

swe_nor/{author_name}/{recording_session}/{recording}.wav
swe_nor/{author_name}/{recording_session}/texts

recording|text
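To make the layout above concrete, here is a minimal sketch of reading one `texts` file back into Python. It is not part of the converter: `read_texts` is a hypothetical helper, and it assumes the file is UTF-8 with one `recording|text` pair per line. Whether the `recording` column carries the `.wav` extension is not checked here.

```python
from pathlib import Path


def read_texts(session_dir):
    """Return {recording: transcript} for one recording-session folder.

    Hypothetical helper; assumes a UTF-8 "texts" file with one
    "recording|text" pair per line.
    """
    transcripts = {}
    texts_path = Path(session_dir) / "texts"
    with texts_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first "|" only, in case the transcript contains one.
            recording, _, text = line.partition("|")
            transcripts[recording] = text
    return transcripts
```

Calling `read_texts("swe_nor/{author_name}/{recording_session}")` with real folder names filled in would give a dictionary that can be matched against the `.wav` files in the same folder.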

This is a hack that I created with the intention of never showing it to anyone. It's JavaScript-y Python with random weirdness thrown into it. I am ashamed. It creates some empty folders, and some folders that use their recording session as the author. I just deleted those instead of fixing the code, so some manual cleanup is needed afterwards if you do not wish to fix it yourself.

The extracted folders for Swedish and Norwegian follow different naming conventions. The 0467 for Swedish is a dataset ID, just as 0463 is for Norwegian. This ID is repeated within the folder structure; there's an adb folder which has the ID in it. I grab the ID from the main folder name, but since the two sets of folders are named differently, I suggest renaming one of them to match the other and setting the default parameter of

def openFolderStations(folder, data, wavs, spls, lang="swe"):

to "swe" or "no" depending on which format you choose. I have no idea which format the Danish convention follows.

swe:

"0467 sv train 1/", "0467 sv train 2/", "0467 sv train 3/", "0468 sv test/"

no:

"no.16khz.0463-1/", "no.16khz.0463-2/", "no.16khz.0463-3/", "no.16khz.0463-4/", "no.16khz.0464-testing/"
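As an illustration of how the two conventions differ, here is a rough sketch of pulling the dataset ID out of either kind of folder name. This is not the code the script uses: `dataset_id` is a hypothetical helper, and the patterns are inferred only from the folder names listed above.

```python
import re


def dataset_id(folder, lang="swe"):
    """Extract the four-digit dataset ID from an extracted folder name.

    Hypothetical helper; the patterns are guessed from the example
    folder names, not taken from the converter itself.
    """
    name = folder.rstrip("/")
    if lang == "swe":
        # Swedish folders start with the ID, e.g. "0467 sv train 1"
        match = re.match(r"(\d{4})\b", name)
    elif lang == "no":
        # Norwegian folders carry the ID before a dash, e.g. "no.16khz.0463-1"
        match = re.search(r"\.(\d{4})-", name)
    else:
        raise ValueError(f"unknown naming convention: {lang!r}")
    if match is None:
        raise ValueError(f"no dataset ID found in {folder!r}")
    return match.group(1)
```

For example, `dataset_id("0467 sv train 1/")` and `dataset_id("no.16khz.0463-1/", lang="no")` each return the ID as a string. The Danish folders are not handled, since their convention is unknown.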

Run it in the same path as the folders. Lines 177-178 set which folders to convert; merge them or remove some however you feel like. The paths work on Ubuntu. This has no fancy threading. 1 core to rule them all.

Swedish:

  • sve.16khz.0467-1.tar.gz
  • sve.16khz.0467-2.tar.gz
  • sve.16khz.0467-3.tar.gz
  • sve.16khz.0468.tar.gz

https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-16&lang=en

Norwegian:

  • no.16khz.0463-1.tar.gz
  • no.16khz.0463-2.tar.gz
  • no.16khz.0463-3.tar.gz
  • no.16khz.0463-4.tar.gz
  • no.16khz.0464-testing.tar.gz

https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-13&lang=en

Danish (untested; needs recoding depending on the folder naming convention):

  • da.16kHz.0565-1.tar.gz
  • da.16kHz.0565-2.tar.gz
  • da.16kHz.0611.tar.gz

https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-19&lang=en

Vaguely based on https://github.com/codemandosch/taco2swe
