Speech-Analysis-for-Speaker-Characteristics-Estimation

Table of Contents:

Introduction
Topics to be Covered
Age:
- Recent Publications
- Datasets
- Comparison of different techniques
- Resources & Code
Height:
- Recent Publications
- Datasets
- Comparison of different techniques
- Resources & Code
Accent:
- Recent Publications
- Datasets
- Comparison of different techniques
- Resources & Code
Miscellaneous

Introduction:

Hello everyone!
I am compiling this repository under the guidance of my thesis supervisor, Prof. Chng Eng Siong (School of Computing, Nanyang Technological University, Singapore), and Dr Van Tung Pham, in the sincere hope that it will benefit the community.
This is a one-stop repository for most of the recent work and developments in the domain of 'speech analysis for speaker characteristic recognition and profiling'.
I shall try to cover as much content as I can on this topic, including my own work.

Topics to be Covered:

For the time being, we shall be covering predominantly three aspects of speaker profiling:

Age
Height
Accent

For each of Age, Height, and Accent, I shall add the recent publications along with an analysis of each and a brief comparison of their results with other works. I shall also describe the popular datasets that have been used in the literature for these tasks over the years. Finally, I shall share all the resources and code that I compile or come across for this purpose.

Age:

Publications:

I am citing some of the recent works in the literature that I have come across and found useful:

Popular Datasets:

TIMIT:
- No. of utterances: 6300
- No. of Speakers: 630, across 8 dialects
- Sampling Rate: 16 kHz
- Purpose: ASR & Transcription
- Male : Female :: 2:1
- No. of Samples/ person: 10
- Includes time-aligned orthographic, phonetic and word transcriptions
- License: Copyright 1993 Linguistic Data Consortium
- Year: 1993

NIST SRE 2010:
- No. of utterances: 5583
- No. of Speakers: 442
- Quality: Telephone
- Contains 2,255 hours of American English telephone speech and speech recorded over a microphone channel involving an interview scenario
- Sampling Rate: 8 kHz
- Duration: ~ 5 mins
- Year: 2017

NIST SRE 2008:
- No. of utterances: ~ 3500
- No. of Speakers: ~ 350
- Quality: Telephone
- Contains 942 hours of multilingual telephone speech and English interview speech along with transcripts
- Sampling Rate: 8 kHz
- Duration: ~ 3 mins
- Year of publication: 2011

Comparison of different Techniques/ Works:

S. No.	Paper Cited	Dataset Used	Methodology	Gender	RMSE (years)	MAE (years)	Correlation Coefficient
1.	Kalluri et al. (2019)	TIMIT	GMM-Posteriors + DNN + SVR	Male	7.60	-	-
				Female	8.63	-	-
2.	Singh et al. (2016)	TIMIT	GMM + Random Forest	Male	8.10	5.70	-
				Female	9.10	6.20	-
3.	Babu et al. (2020)	TIMIT	Fstats + Formant + Harmonic Features + SVR	Male	8.10	5.20	-
				Female	8.70	5.60	-
4.	Poorjam et al. (2014)	NIST SRE	i-Vectors + Linear Kernel SVR	Male	-	-	0.76
				Female	-	-	0.85
5.	Fedorovai et al. (2015)	NIST SRE	MFCC + i-Vectors + ANN	Male	-	6.42	0.75
				Female	-	5.56	0.81
6.	Ghahremani et al. (2018)	NIST SRE	Fusion (i-Vector + x-Vector) + LDA	Male	-	5.84	0.83
				Female	-	4.68	0.92
7.	Bahari et al. (2012)	NIST SRE	i-Vectors + SVR	Male	-	7.63	-
				Female	-	7.61	-
8.	Zazo et al. (2018)	NIST SRE	MFCC + LSTMs	Male	-	7.79	0.48
				Female	-	6.97	0.65
9.	Sadjadi et al. (2016)	NIST SRE	fMLLR + i-Vectors	Male	-	4.70	0.89
				Female	-	4.70	0.91
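
The three metrics in the table above (RMSE, MAE, and the correlation coefficient, each reported separately for male and female speakers) can be computed as in the minimal sketch below. It uses only the Python standard library and assumes the correlation coefficient is Pearson's r, which the cited papers may or may not all use. The function names and the toy (gender, true age, predicted age) triples are hypothetical and are not taken from any of the cited works.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same unit as the labels (years for age)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def per_gender_metrics(samples):
    """Group (gender, true, predicted) triples by gender and score each group."""
    results = {}
    for gender in sorted({g for g, _, _ in samples}):
        y_true = [t for g, t, _ in samples if g == gender]
        y_pred = [p for g, _, p in samples if g == gender]
        results[gender] = {
            "rmse": rmse(y_true, y_pred),
            "mae": mae(y_true, y_pred),
            "corr": pearson(y_true, y_pred),
        }
    return results


# Hypothetical toy data: (gender, true age, predicted age), for illustration only.
samples = [
    ("Male", 25, 28), ("Male", 40, 35), ("Male", 60, 55),
    ("Female", 22, 24), ("Female", 35, 39), ("Female", 50, 44),
]

metrics = per_gender_metrics(samples)
for gender, m in metrics.items():
    print(f"{gender}: RMSE={m['rmse']:.2f} MAE={m['mae']:.2f} r={m['corr']:.2f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which is worth keeping in mind when comparing rows that report only one of the two.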

Resources & Codes:

I shall be updating this section as I go along with this project.

Height:

Publications:

Popular Datasets:

TIMIT:
- No. of utterances: 6300
- No. of Speakers: 630, across 8 dialects
- Sampling Rate: 16 kHz
- Purpose: ASR & Transcription
- Male : Female :: 2:1
- No. of Samples/ person: 10
- Includes time-aligned orthographic, phonetic and word transcriptions
- License: Copyright 1993 Linguistic Data Consortium
- Year: 1993

NIST SRE 2010:
- No. of utterances: 5583
- No. of Speakers: 442
- Quality: Telephone
- Contains 2,255 hours of American English telephone speech and speech recorded over a microphone channel involving an interview scenario
- Sampling Rate: 8 kHz
- Duration: ~ 5 mins
- Year: 2017

NIST SRE 2008:
- No. of utterances: ~ 3500
- No. of Speakers: ~ 350
- Quality: Telephone
- Contains 942 hours of multilingual telephone speech and English interview speech along with transcripts
- Sampling Rate: 8 kHz
- Duration: ~ 3 mins
- Year of publication: 2011

Comparison of different Techniques/ Works:

S. No.	Paper Cited	Dataset Used	Methodology	Gender	RMSE (cm)	MAE (cm)	Correlation Coefficient
1.	Kalluri et al. (2019)	TIMIT	GMM-Posteriors + DNN + SVR	Male	6.85	-	-
				Female	6.29	-	-
2.	Singh et al. (2016)	TIMIT	GMM + Random Forest	Male	7.00	5.30	-
				Female	6.50	5.20	-
3.	Babu et al. (2020)	TIMIT	Fstats + Formant + Harmonic Features + SVR	Male	6.80	5.20	-
				Female	6.10	4.80	-
4.	Williams et al. (2013)	TIMIT	Fusion (MFTR + GMM-HDBC)	Male	-	5.37	-
				Female	-	5.49	-
5.	Poorjam et al. (2015)	NIST SRE	i-Vector + LSSVR	Male	-	-	0.41
				Female	-	-	0.40
6.	Mporas et al. (2009)	TIMIT	MFCC + Bagging	Male	6.80	5.30	-
				Female	6.40	5.20	-
7.	Arsikere et al. (2013)	TIMIT	MFCC + SGR + GMM	Male	6.40	-	-
				Female	5.80	-	-

Resources & Codes:

I shall be updating this section as I go along with this project.

Accent:

Publications:

Popular Datasets:

CSLU Foreign-Accented English (FAE) Dataset:

No. of utterances: 4925
No. of Accents: 23
Quality: Telephone
Duration: ~20 sec
Type of Speakers: All Non-Native
Three native speakers of American English independently listened to each utterance and judged the speakers' accents on a 4-point scale:
- negligible/no accent,
- mild accent,
- strong accent and
- very strong accent.
Year: 2007

ETS Corpus of Non-Native Spoken English:

No. of utterances: 5132
No. of Accents: 11
Sampling Rate: 16 kHz
Type of Speakers: All Non-Native
Year: 2014

Speech Accent Archive:

No. of Samples: 2140
Type of Speakers: Native & non-native
Purpose: Speaker Profiling
Year: 2013
Speakers with 214 different native languages.
Speakers from 177 different countries
Questions answered by subjects:
- Where were you born?
- What is your native language?
- What other languages besides English and your native language do you know?
- How old are you?
- How old were you when you first began to study English?
- How did you learn English? (academically or naturalistically)
- How long have you lived in an English-speaking country? Which country?
License: CC BY-NC-SA 4.0

Comparison of different Techniques/ Works:

S. No.	Paper Cited	Dataset Used	Methodology	UAR	Accuracy/ Detection Rate
1.	Schuller et al. (2016)	ETS Corpus	16-bit signed integer PCM WAV + SVM	47.5%	-
2.	Choueiter et al. (2008)	CSLU Foreign Accented English Corpus	GT + MMI + HLDA	-	32.7%
3.	Ahmed et al. (2019)	Speech Accent Archive	Spectrograms + CNNs	-	70.33%
4.	Williams et al. (2013)	Speech Accent Archive	MFCC + LSTM	-	52.27%
5.	Poorjam et al. (2015)	ETS Corpus	MFCC + DNN + RNN	50.40%	50.20%
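
The table above reports both UAR (unweighted average recall) and accuracy. The two can differ noticeably when the accent classes are imbalanced: accuracy counts every utterance equally, whereas UAR averages the per-class recalls so that every accent counts equally. The sketch below illustrates the difference using only the Python standard library. The function names, the accent labels "HIN" and "GER", and the toy predictions are hypothetical and are not drawn from any of the cited works.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted accent matches the true accent."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls; every class has the same weight."""
    total = defaultdict(int)    # utterances per true class
    correct = defaultdict(int)  # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical imbalanced toy set: 8 utterances of one accent, 2 of another.
y_true = ["HIN"] * 8 + ["GER"] * 2
# The classifier gets every "HIN" utterance right but only one "GER" utterance.
y_pred = ["HIN"] * 8 + ["HIN", "GER"]

acc = accuracy(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
print(f"accuracy={acc:.2f} UAR={uar:.2f}")
```

In this toy case the majority class dominates accuracy, while the minority class pulls UAR down, which is one reason the two columns should not be compared directly across rows.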

