spirineta / labphon2018

Variation in Georgian using large scale data collection

Background

Georgian is an understudied agglutinative language spoken in the Caucasus Mountains in Eastern Europe. It is a low-resource language: few software tools (such as spell checkers, dictionaries and search engine tokenizers) exist to support its use in written form. Georgian is spoken by 1.4 million speakers in the Republic of Georgia and by members of the Georgian diaspora throughout the world. Although Georgian is the national language of Georgia, most computer systems sold in Georgia are offered in Russian or English, and because most search engines lack support for Georgian, users perform internet searches using Russian or English keywords (Sherouse, 2014).

Methodology

In 2014 we created open source libraries and tools to facilitate use of the Georgian language by Georgian speakers (Dunham et al. 2014). One of these tools was Gismet, an Android application that Georgian speakers can use to train their Android smartphones to recognize their speech using PocketSphinx (Huggins-Daines et al. 2006). The software was released free of charge and open source on GitHub, a social coding site where developers can discover and contribute to the source code.

Participants discover the application in the Google Play store. After installing it, they are led through a tutorial in which they record 2-7 utterances to train the application to their voice. After training, they can add further training sentences or begin using the application anywhere in the Android system that accepts keyboard input. The training utterances are uploaded to a central server, where they are processed using Praat and the CMUSphinx language model toolkit (Walker et al. 2004) to customize the acoustic model for the speaker. The stimuli consist of 7 utterances chosen from among frequent SMS dictations in a corpus provided by 3 speakers in Batumi, Georgia.

Data

Since 2014, 1,000 users have used the application to train the default language model to their voices. The resulting dataset contains only elicited training recordings; no user-defined messages are included. In this paper we discuss preliminary findings on variation in the data along two dimensions: the correlation between prosodic variation and the GPS location of the recording, and prosodic variation across participants.
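The two directions of analysis could be sketched roughly as follows. This is a minimal illustration only, not the analysis code used for the paper: the record layout, the choice of F0 range in semitones as the prosodic measure, the use of longitude as the location variable, and all sample values are assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical record layout: (speaker_id, latitude, longitude, f0_range_semitones).
# These values are invented to exercise the code; they are NOT from the dataset.
RECORDINGS = [
    ("s1", 41.65, 41.64, 6.0),
    ("s1", 41.65, 41.64, 8.0),
    ("s2", 41.72, 44.79, 9.0),
    ("s2", 41.72, 44.79, 11.0),
    ("s3", 42.27, 42.70, 7.0),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_by_speaker(recordings):
    """Mean prosodic measure per speaker (variation across participants)."""
    groups = defaultdict(list)
    for speaker, _lat, _lon, f0_range in recordings:
        groups[speaker].append(f0_range)
    return {speaker: sum(vals) / len(vals) for speaker, vals in groups.items()}


if __name__ == "__main__":
    longitudes = [rec[2] for rec in RECORDINGS]
    f0_ranges = [rec[3] for rec in RECORDINGS]
    # Direction 1: prosodic measure against recording location.
    print("r(longitude, F0 range):", pearson(longitudes, f0_ranges))
    # Direction 2: prosodic measure across participants.
    print("mean F0 range by speaker:", mean_by_speaker(RECORDINGS))
```

A single coordinate is used here only to keep the sketch short; a real analysis would more plausibly use both coordinates or distances between recording sites, and would need to account for repeated recordings from the same speaker.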

Figures

Map of Georgia (source)

Spectrogram of "sad xar" with question intonation

Spectrogram of "sad xar" with statement intonation

References

@InProceedings{dunham-chiodo-horner:2014:W14-22,
  author    = {Dunham, Joel  and  Chiodo, Gina  and  Horner, Joshua},
  title     = {LingSync \& the Online Linguistic Database: New Models for the Collection and Management of Data for Language Communities, Linguists and Language Learners},
  booktitle = {Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {24--33},
  url       = {http://www.aclweb.org/anthology/W14-2204}
}


@article{SHEROUSE20141,
  title = "Hazardous digits: Telephone keypads and Russian numbers in Tbilisi, Georgia",
  journal = "Language \& Communication",
  volume = "37",
  number = "Supplement C",
  pages = "1--11",
  year = "2014",
  issn = "0271-5309",
  doi = "https://doi.org/10.1016/j.langcom.2014.03.001",
  url = "http://www.sciencedirect.com/science/article/pii/S0271530914000172",
  author = "Perry Sherouse",
  keywords = "Sociotechnical system, Telephone, Language ideology, Numeral system, Numbers"
}

@incollection{juhar2012recent,
  title={Recent progress in development of language model for Slovak large vocabulary continuous speech recognition},
  author={Juh{\'a}r, Jozef and Sta{\v{s}}, J{\'a}n and Hl{\'a}dek, Daniel},
  booktitle={New technologies-trends, innovations and research},
  year={2012},
  publisher={InTech}
}

@article{walker2004sphinx,
  title={Sphinx-4: A flexible open source framework for speech recognition},
  author={Walker, Willie and Lamere, Paul and Kwok, Philip and Raj, Bhiksha and Singh, Rita and Gouvea, Evandro and Wolf, Peter and Woelfel, Joe},
  year={2004},
  publisher={Sun Microsystems, Inc.}
}

@INPROCEEDINGS{Huggins-daines06pocketsphinx:a,
    author = {David Huggins-Daines and Mohit Kumar and Arthur Chan and Alan W Black and Mosur Ravishankar and Alex I. Rudnicky},
    title = {PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices},
    booktitle = {Proceedings of ICASSP},
    year = {2006}
}

TODO

  • Ioana Chitoran georgian clusters

Links

Georgia maps

World map

D3 maps

Georgian SMS

http://www.sciencedirect.com/science/article/pii/S0271530914000172
