zzkzzk1996 / RU_capstone

Rutgers ECE capstone(2019): Multilingual ASR data collection

RU_capstone

Introduction

Crawl multilingual audio and text resources from the web, then run forced alignment on the collected data.

The project has two parts: the Crawler and the Aligner.

Crawler

In this part, we crawled two websites, collecting multilingual audio and the corresponding text data.
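As a rough illustration of the crawling step, the sketch below pulls audio file links out of a page's HTML using only the Python standard library. It is not the project's crawler: the class name `AudioLinkParser`, the helper `extract_audio_links`, the sample page, and the `example.org` base URL are all hypothetical, and the real sites may structure their pages differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: audio files can be recognized by their file extension.
AUDIO_EXTENSIONS = (".mp3", ".wav", ".ogg")


class AudioLinkParser(HTMLParser):
    """Collect absolute URLs of audio files referenced in a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.audio_urls = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # <a> tags carry the target in href; <audio>/<source> tags carry it in src.
        candidate = attr_map.get("href") if tag == "a" else attr_map.get("src")
        if candidate and candidate.lower().endswith(AUDIO_EXTENSIONS):
            # Resolve relative links against the page URL.
            self.audio_urls.append(urljoin(self.base_url, candidate))


def extract_audio_links(html, base_url):
    """Return the audio URLs found in html, in document order."""
    parser = AudioLinkParser(base_url)
    parser.feed(html)
    return parser.audio_urls


# Made-up page content, used only to exercise the parser.
SAMPLE_PAGE = """
<html><body>
  <a href="/audio/chapter1.mp3">Listen</a>
  <a href="chapter2.html">Next</a>
  <audio><source src="clips/intro.ogg"></audio>
</body></html>
"""

links = extract_audio_links(SAMPLE_PAGE, "https://example.org/bible/en/")
for url in links:
    print(url)
```

In a real crawl, the HTML would come from an HTTP request and each collected URL would be downloaded next to the text scraped from the same page, so that audio and text stay paired.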

WordProject

WordProject is a website that provides multilingual versions of the Bible. It supports 37 languages. The audio and text resources from this website match perfectly.

SBS News

SBS News is a news website that provides news in over 60 languages.

Aligner

In this part, we ran forced alignment on the crawled data using Montreal-Forced-Aligner and Kaldi.
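Before alignment, the crawled audio and text have to be arranged as a corpus. The sketch below assumes the layout that Montreal-Forced-Aligner expects, to our understanding: each sound file sits next to a transcript with the same base name (here a `.lab` file). The function `build_corpus`, the `utt0001`-style naming, and the sample data are hypothetical and are not taken from this repository.

```python
import shutil
import tempfile
from pathlib import Path


def build_corpus(pairs, corpus_dir):
    """Copy (audio_path, transcript) pairs into corpus_dir under shared base names.

    Assumption: the aligner matches audio and transcript by base name.
    Returns the list of base names that were written.
    """
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (audio_path, transcript) in enumerate(pairs, start=1):
        audio_path = Path(audio_path)
        base = f"utt{index:04d}"  # hypothetical naming scheme
        shutil.copyfile(audio_path, corpus_dir / f"{base}{audio_path.suffix}")
        # Collapse line breaks and repeated spaces into a single-line transcript.
        clean_text = " ".join(transcript.split())
        (corpus_dir / f"{base}.lab").write_text(clean_text + "\n", encoding="utf-8")
        written.append(base)
    return written


# Demo with a placeholder audio file; the bytes are not real audio.
with tempfile.TemporaryDirectory() as tmp:
    source_audio = Path(tmp) / "chapter1.wav"
    source_audio.write_bytes(b"RIFF")
    corpus = Path(tmp) / "corpus"
    names = build_corpus([(source_audio, "In the  beginning\nGod created")], corpus)
    created = sorted(p.name for p in corpus.iterdir())
    lab_text = (corpus / "utt0001.lab").read_text(encoding="utf-8")

print(created)
```

The resulting directory can then be passed to the aligner together with a pronunciation dictionary and an acoustic model for the target language.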

The output is a set of TextGrid files.

TextGrid demo:

TextGrid photo
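To give an idea of what such a file contains, the sketch below reads the intervals out of a TextGrid written in Praat's long text format. It is a simplified regular-expression reader, not a full TextGrid parser: the function `read_intervals` and the sample content are hypothetical, intervals from all tiers are returned in one flat list, and labels containing escaped double quotes are not handled.

```python
import re

# Matches one interval entry of the long text format:
#   intervals [n]:
#       xmin = <start>
#       xmax = <end>
#       text = "<label>"
INTERVAL_PATTERN = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([0-9.]+)\s*'
    r'xmax = ([0-9.]+)\s*'
    r'text = "([^"]*)"'
)


def read_intervals(textgrid_text):
    """Return (start_seconds, end_seconds, label) tuples for every interval found."""
    return [
        (float(start), float(end), label)
        for start, end, label in INTERVAL_PATTERN.findall(textgrid_text)
    ]


# Made-up alignment for illustration; not output from this project.
SAMPLE_TEXTGRID = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.25
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.25
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = "hello"
        intervals [2]:
            xmin = 0.5
            xmax = 0.75
            text = ""
        intervals [3]:
            xmin = 0.75
            xmax = 1.25
            text = "world"
'''

intervals = read_intervals(SAMPLE_TEXTGRID)
# Empty labels mark silence between words, so drop them.
words = [(start, end, label) for start, end, label in intervals if label]
for start, end, label in words:
    print(f"{start:.2f} {end:.2f} {label}")
```

Each tuple gives the start time, end time, and label of one aligned segment, which is the information needed to cut the audio into labeled pieces for ASR training.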

Video Demo

demo

Team Members

Mo Shi, Chaoji Zuo, Ziqi Wang, Zekun Zhang, Duc Le


Languages

Makefile 48.1%, C 30.2%, Python 21.5%, Shell 0.2%