Liujingxiu23/TTS-arxiv-daily

Updated on 2024.06.13

Usage instructions: here

This page is modified from here

Table of Contents

TTS

Publish Date	Title	Authors	PDF	Code
2024-06-11	Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?	Qingkai Fang et.al.	2406.07289	null
2024-06-11	AudioMarkBench: Benchmarking Robustness of Audio Watermarking	Hongbin Liu et.al.	2406.06979	link
2024-06-11	Controlling Emotion in Text-to-Speech with Natural Language Prompts	Thomas Bott et.al.	2406.06406	link
2024-06-10	Meta Learning Text-to-Speech Synthesis in over 7000 Languages	Florian Lux et.al.	2406.06403	link
2024-06-10	MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance	Semin Kim et.al.	2406.05965	null
2024-06-11	WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark	Linhan Ma et.al.	2406.05763	null
2024-06-09	An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS	Xiaofei Wang et.al.	2406.05699	null
2024-06-11	Text-aware and Context-aware Expressive Audiobook Speech Synthesis	Dake Guo et.al.	2406.05672	null
2024-06-08	Autoregressive Diffusion Transformer for Text-to-Speech Synthesis	Zhijun Liu et.al.	2406.05551	null
2024-06-08	VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers	Sanyuan Chen et.al.	2406.05370	null
2024-06-07	Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis	Ryan Langman et.al.	2406.05298	null
2024-06-07	XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model	Edresson Casanova et.al.	2406.04904	null
2024-06-07	TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking	Junzuo Zhou et.al.	2406.04840	null
2024-06-07	Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study	Chong Zhang et.al.	2406.04633	null
2024-06-06	Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis	Théodor Lemerle et.al.	2406.04467	null
2024-06-06	Total-Duration-Aware Duration Modeling for Text-to-Speech Systems	Sefik Emre Eskimez et.al.	2406.04281	null
2024-06-06	Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining	Jinlong Xue et.al.	2406.03714	null
2024-06-06	Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model	Jinlong Xue et.al.	2406.03706	null
2024-06-05	Style Mixture of Experts for Expressive Text-To-Speech Synthesis	Ahad Jawaid et.al.	2406.03637	null
2024-06-07	Harder or Different? Understanding Generalization of Audio Deepfake Detection	Nicolas M. Müller et.al.	2406.03512	null
2024-06-05	LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes	Trung Dang et.al.	2406.02897	null
2024-06-04	Seed-TTS: A Family of High-Quality Versatile Speech Generation Models	Philip Anastassiou et.al.	2406.02430	null
2024-06-05	SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models	Dongchao Yang et.al.	2406.02328	null
2024-06-04	BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation	Hui-Peng Du et.al.	2406.02162	null
2024-06-04	Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis	Kun Zhou et.al.	2406.02009	null
2024-06-03	ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec	Shengpeng Ji et.al.	2406.01205	link
2024-06-03	Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training	Jan Melechovsky et.al.	2406.01018	null
2024-06-02	Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback	Chen Chen et.al.	2406.00654	null
2024-05-31	Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities	Vicky Zayats et.al.	2405.18669	null
2024-05-28	TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation	Chenyang Le et.al.	2405.17809	null
2024-05-27	RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis	Haoxiang Shi et.al.	2405.17028	null
2024-05-24	Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition	Zijin Gu et.al.	2405.15216	null
2024-05-23	Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models	Jingyi Chen et.al.	2405.14632	null
2024-05-22	A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction	Yue Li et.al.	2405.13477	null
2024-05-20	Multi-speaker Text-to-speech Training with Speaker Anonymized Data	Wen-Chin Huang et.al.	2405.11767	null
2024-05-19	VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications	Mikhail Konenkov et.al.	2405.11537	null
2024-05-18	Exploring speech style spaces with language models: Emotional TTS without emotion labels	Shreeram Suresh Chandra et.al.	2405.11413	null
2024-05-16	Faces that Speak: Jointly Synthesising Talking Face and Speech from Text	Youngjoon Jang et.al.	2405.10272	null
2024-05-16	Building a Luganda Text-to-Speech Model From Crowdsourced Data	Sulaiman Kagumire et.al.	2405.10211	null
2024-05-16	Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model	Siyang Wang et.al.	2405.09768	null
2024-05-15	Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer	Weifei Jin et.al.	2405.09470	null
2024-05-15	Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis	Sho Inoue et.al.	2405.09171	null
2024-05-14	PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset	Yang Hou et.al.	2405.08838	link
2024-04-30	Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech	Hankun Wang et.al.	2404.19723	null
2024-04-29	MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis	Xiang Li et.al.	2404.18398	null
2024-04-28	USAT: A Universal Speaker-Adaptive Text-to-Speech Approach	Wenbin Wang et.al.	2404.18094	link
2024-04-27	TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality	Tiantian Feng et.al.	2404.17983	null
2024-04-26	An RFP dataset for Real, Fake, and Partially fake audio detection	Abdulazeez AlAli et.al.	2404.17721	null
2024-04-23	StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations	Sen Liu et.al.	2404.14946	null
2024-04-23	Retrieval-Augmented Audio Deepfake Detection	Zuheng Kang et.al.	2404.13892	null
2024-04-14	Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling	Quanxiu Wang et.al.	2404.09192	null
2024-04-11	Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network	Mayura Manawadu et.al.	2404.07807	null
2024-04-18	Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness	Xincan Feng et.al.	2404.06714	null
2024-04-10	CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations	Leying Zhang et.al.	2404.06690	null
2024-04-10	The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge	Yiwei Guo et.al.	2404.06079	null
2024-04-07	Cross-Domain Audio Deepfake Detection: Dataset and Analysis	Yuang Li et.al.	2404.04904	null
2024-04-06	HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks	Yingting Li et.al.	2404.04645	link
2024-04-18	Open vocabulary keyword spotting through transfer learning from speech synthesis	Kesavaraj V et.al.	2404.03914	null
2024-04-06	RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	Detai Xin et.al.	2404.03204	null
2024-04-03	CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech	Jaehyeon Kim et.al.	2404.02781	null
2024-04-13	PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders	Yu Pan et.al.	2404.02702	null
2024-03-31	Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation	Rohan Chaudhury et.al.	2404.01339	link
2024-03-28	A Review of Multi-Modal Large Language and Vision Models	Kilian Carolan et.al.	2404.01322	null
2024-04-09	KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis	Adal Abilbekov et.al.	2404.01033	link
2024-03-31	CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models	Xiang Li et.al.	2404.00569	link
2024-03-25	VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild	Puyuan Peng et.al.	2403.16973	link
2024-03-20	Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning	Shivam Ratnakant Mhaskar et.al.	2403.15469	null
2024-03-20	UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge	Wataru Nakata et.al.	2403.13720	null
2024-03-20	Building speech corpus with diverse voice characteristics for its prompt-based representation	Aya Watanabe et.al.	2403.13353	null
2024-03-17	Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations	Claudio Pinhanez et.al.	2403.11209	null
2024-03-17	EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech	Ziqi Liang et.al.	2403.08164	null
2024-03-09	HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling	Chunhui Wang et.al.	2403.05989	null
2024-03-05	AttentionStitch: How Attention Solves the Speech Editing Problem	Antonios Alexos et.al.	2403.04804	null
2024-03-07	Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation	Sai Akarsh et.al.	2403.04178	null
2024-03-27	NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	Zeqian Ju et.al.	2403.03100	null
2024-03-04	Brilla AI: AI Contestant for the National Science and Maths Quiz	George Boateng et.al.	2403.01699	link
2024-03-02	Towards Accurate Lip-to-Speech Synthesis in-the-Wild	Sindhu Hegde et.al.	2403.01087	null
2024-02-29	Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data	Takaaki Saeki et.al.	2402.18932	null
2024-02-26	An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation	Ahmet Gunduz et.al.	2402.16380	link
2024-02-22	Efficient data selection employing Semantic Similarity-based Graph Structures for model training	Roxana Petcu et.al.	2402.14888	null
2024-02-22	Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition	Rendi Chevi et.al.	2402.14523	null
2024-02-19	On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models	Miri Varshavsky-Hassid et.al.	2402.12423	null
2024-02-19	Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting	Haolin Chen et.al.	2402.12220	null
2024-02-18	Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru	Zining Wang et.al.	2402.11571	null
2024-02-14	MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech	Shengpeng Ji et.al.	2402.09378	null
2024-02-15	BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data	Mateusz Łajszczak et.al.	2402.08093	null
2024-03-04	Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like	Naoyuki Kanda et.al.	2402.07383	null
2024-02-09	A New Approach to Voice Authenticity	Nicolas M. Müller et.al.	2402.06304	null
2024-02-08	Unified Speech-Text Pretraining for Spoken Dialog Modeling	Heeseung Kim et.al.	2402.05706	null
2024-02-05	Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations	Álvaro Martín-Cortinas et.al.	2402.03407	null
2024-02-02	Natural language guidance of high-fidelity text-to-speech with synthetic annotations	Dan Lyth et.al.	2402.01912	null
2024-01-23	Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization	Wei-Ping Huang et.al.	2402.01692	null
2024-02-01	Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech	Dong Yang et.al.	2402.00288	null
2024-02-01	PAM: Prompting Audio-Language Models for Audio Quality Assessment	Soham Deshmukh et.al.	2402.00282	link
2024-01-31	Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2	Jiatong Shi et.al.	2401.17619	link
2024-01-28	MunTTS: A Text-to-Speech System for Mundari	Varun Gumma et.al.	2401.15579	null
2024-01-30	VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech	Chenpeng Du et.al.	2401.14321	null
2024-01-25	Text to speech synthesis	Harini s et.al.	2401.13891	null
2024-01-25	SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation	Dong Zhang et.al.	2401.13527	link
2024-01-22	Benchmarking Large Multimodal Models against Common Corruptions	Jiawei Zhang et.al.	2401.11943	link
2024-01-22	Adversarial speech for voice privacy protection from Personalized Speech generation	Shihao Chen et.al.	2401.11857	null
2024-02-16	Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis	Vinotha R et.al.	2401.11771	null
2024-01-19	Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech	Abhinav Garg et.al.	2401.10465	null
2024-02-28	MLAAD: The Multi-Language Audio Anti-Spoofing Dataset	Nicolas M. Müller et.al.	2401.09512	null
2024-01-15	MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory	Robert G. Kimelman et.al.	2401.07967	null
2024-01-14	ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering	Yakun Song et.al.	2401.07333	null
2024-01-12	Multi-Task Learning for Front-End Text Processing in TTS	Wonjune Kang et.al.	2401.06321	link
2024-01-11	End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2	Aniket Tathe et.al.	2401.06183	null
2024-01-11	Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection	Lian Huang et.al.	2401.05614	null
2024-01-10	Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters	Kenichi Fujita et.al.	2401.05111	null
2024-01-07	Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments	Zhonghao Shi et.al.	2401.03581	null
2024-01-07	Transfer the linguistic representations from TTS to accent conversion with non-parallel data	Xi Chen et.al.	2401.03538	null
2024-01-03	Incremental FastPitch: Chunk-based High Quality Text to Speech	Muyang Du et.al.	2401.01755	null
2024-01-03	Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction	Minchan Kim et.al.	2401.01498	null
2023-12-18	Assisting Blind People Using Object Detection with Vocal Feedback	Heba Najm et.al.	2401.01362	null
2023-12-30	Boosting Large Language Model for Speech Synthesis: An Empirical Study	Hongkun Hao et.al.	2401.00246	null
2024-01-01	Normalization of Lithuanian Text Using Regular Expressions	Pijus Kasparaitis et.al.	2312.17660	null
2023-12-27	AE-Flow: AutoEncoder Normalizing Flow	Jakub Mosiński et.al.	2312.16552	null
2023-12-22	Creating New Voices using Normalizing Flows	Piotr Bilinski et.al.	2312.14569	null
2023-12-22	ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations	Cheng Gong et.al.	2312.14398	null
2023-12-19	External Knowledge Augmented Polyphone Disambiguation Using Large Language Model	Chen Li et.al.	2312.11920	null
2023-12-17	A review-based study on different Text-to-Speech technologies	Md. Jalal Uddin Chowdhury et.al.	2312.11563	null
2024-01-31	MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis	Wenhao Guan et.al.	2312.10687	null
2024-02-22	Amphion: An Open-Source Audio, Music and Speech Generation Toolkit	Xueyao Zhang et.al.	2312.09911	link
2023-12-11	Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism	Georgios Milis et.al.	2312.06613	link
2023-12-08	An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis	Via Nielson et.al.	2312.05415	null
2023-12-06	Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis	Zehua Chen et.al.	2312.03491	null
2023-12-02	Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning	Raviraj Joshi et.al.	2312.01107	null
2023-12-02	Code-Mixed Text to Speech Synthesis under Low-Resource Constraints	Raviraj Joshi et.al.	2312.01103	null
2023-11-29	Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes	Pavel Korshunov et.al.	2311.17655	null
2024-02-06	Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech	Enting Zhou et.al.	2311.14816	link
2023-12-07	Guided Flows for Generative Modeling and Decision Making	Qinqing Zheng et.al.	2311.13443	null
2023-11-27	HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis	Sang-Hoon Lee et.al.	2311.12454	link
2023-11-18	Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots	Farideh Majidi et.al.	2311.11116	null
2023-11-18	Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys	Gabriel Cosache et.al.	2311.11030	null
2023-11-17	A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness	Mathias Vogel et.al.	2311.10804	null
2023-11-16	Improving fairness for spoken language understanding in atypical speech with Text-to-Speech	Helin Wang et.al.	2311.10149	link
2024-02-02	DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation	Jianzong Wang et.al.	2311.07965	null
2023-11-12	ChatAnything: Facetime Chat with LLM-Enhanced Personas	Yilin Zhao et.al.	2311.06772	null
2023-11-11	NewsGPT: ChatGPT Integration for Robot-Reporter	Abdelhadi Hireche et.al.	2311.06640	link
2023-11-08	Synthetic Speaking Children -- Why We Need Them and How to Make Them	Muhammad Ali Farooq et.al.	2311.06307	null
2023-09-25	Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image	Minki Kang et.al.	2311.05844	null
2023-11-07	Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning	Rishabh Jain et.al.	2311.04313	link
2023-11-07	Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment	Jakir Hasan et.al.	2311.03792	null
2023-11-08	Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction	Minchan Kim et.al.	2311.02898	null
2023-11-02	Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations	Hanglei Zhang et.al.	2311.01260	null
2023-11-02	E3 TTS: Easy End-to-End Diffusion-based Text to Speech	Yuan Gao et.al.	2311.00945	null
2023-10-31	An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation	Yingjie Zhou et.al.	2310.20251	link
2023-10-27	Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN	Neeraj Kumar et.al.	2310.18169	null
2023-10-25	ArTST: Arabic Text and Speech Transformer	Hawau Olamide Toyin et.al.	2310.16621	link
2023-10-25	Generative Pre-training for Speech with Flow Matching	Alexander H. Liu et.al.	2310.16338	null
2023-10-23	DPP-TTS: Diversifying prosodic features of speech via determinantal point processes	Seongho Joo et.al.	2310.14663	null
2023-10-22	An overview of text-to-speech systems and media applications	Mohammad Reza Hasanabadi et.al.	2310.14301	null
2023-10-14	Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling	Tiberiu Boros et.al.	2310.09636	link
2023-10-14	Attentive Multi-Layer Perceptron for Non-autoregressive Generation	Shuyang Jiang et.al.	2310.09512	link
2023-12-22	Crowdsourced and Automatic Speech Prominence Estimation	Max Morrison et.al.	2310.08464	link
2023-10-12	On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition	Nick Rossenbach et.al.	2310.08132	null
2023-10-12	Vec-Tok Speech: speech vectorization and tokenization for neural speech generation	Xinfa Zhu et.al.	2310.07246	link
2023-10-10	Prosody Analysis of Audiobooks	Charuta Pethe et.al.	2310.06930	null
2023-10-09	JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions	Detai Xin et.al.	2310.06072	null
2024-01-09	Unified speech and gesture synthesis using flow matching	Shivam Mehta et.al.	2310.05181	null
2023-10-08	Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset	Ze Liu et.al.	2310.04982	null
2023-10-11	LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	Jiaming Wang et.al.	2310.04673	null
2024-01-22	Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis	Jae-Sung Bae et.al.	2310.03538	null
2023-10-07	The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains	Erica Cooper et.al.	2310.02640	null
2023-10-02	Towards human-like spoken dialogue generation between AI agents from written dialogue	Kentaro Mitsui et.al.	2310.01088	null
2023-10-01	Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech	Dareen Alharthi et.al.	2310.00706	null
2024-03-11	Fewer-token Neural Speech Codec with Time-invariant Codes	Yong Ren et.al.	2310.00014	link
2024-01-31	ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech	Wenhao Guan et.al.	2309.17056	null
2023-09-29	Low-Resource Self-Supervised Learning with SSL-Enhanced TTS	Po-chun Hsu et.al.	2309.17020	null
2023-09-29	Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features	Yuxiang Zhang et.al.	2309.16954	null
2023-12-18	High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models	Chunyu Qiang et.al.	2309.15512	null
2024-01-09	BiSinger: Bilingual Singing Voice Synthesis	Huali Zhou et.al.	2309.14089	link
2023-10-07	HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS	Dake Guo et.al.	2309.13907	null
2023-09-24	VoiceLDM: Text-to-Speech with Environmental Context	Yeonghyeon Lee et.al.	2309.13664	null
2023-09-24	Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control	Aya Watanabe et.al.	2309.13509	null
2023-09-22	DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis	Yu Gu et.al.	2309.12792	null
2023-09-22	Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts	Shun Lei et.al.	2309.11977	null
2023-09-21	The Impact of Silence on Speech Anti-Spoofing	Yuxiang Zhang et.al.	2309.11827	null
2023-09-21	Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech	Rui Liu et.al.	2309.11724	link
2023-09-20	Speak While You Think: Streaming Speech Synthesis During Text Generation	Avihu Dekel et.al.	2309.11210	null
2023-09-20	Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model	Xinyu Zhou et.al.	2309.11000	link
2023-09-19	Exploring Speech Enhancement for Low-resource Speech Synthesis	Zhaoheng Ni et.al.	2309.10795	null
2023-09-19	Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition	Ziyang Ma et.al.	2309.10294	null
2023-09-17	Augmenting text for spoken language understanding with Large Language Models	Roshan Sharma et.al.	2309.09390	null
2023-09-16	FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework	Jianzong Wang et.al.	2309.08837	null
2023-09-15	Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech	Dariusz Piotrowski et.al.	2309.08255	null
2023-09-15	HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods	Hyun-seo Shin et.al.	2309.08208	link
2023-12-27	PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions	Reo Shimizu et.al.	2309.08140	null
2023-09-15	Diversity-based core-set selection for text-to-speech with linguistic and acoustic features	Kentaro Seki et.al.	2309.08127	null
2023-09-14	Direct Text to Speech Translation System using Acoustic Units	Victoria Mingote et.al.	2309.07478	null
2023-10-07	FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec	Zhihao Du et.al.	2309.07405	link
2023-09-13	DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation	Zhichao Wu et.al.	2309.06787	null
2023-09-11	Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP	Jinzuomu Zhong et.al.	2309.05423	link
2024-01-16	VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching	Yiwei Guo et.al.	2309.05027	null
2023-09-08	Cross-Utterance Conditioned VAE for Speech Generation	Yang Li et.al.	2309.04156	null
2023-09-07	Large-Scale Automatic Audiobook Creation	Brendan Walsh et.al.	2309.03926	null
2023-09-11	GRASS: Unified Generation Model for Speech-to-Semantic Tasks	Aobo Xia et.al.	2309.02780	null
2023-09-12	MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023	Zhihang Xu et.al.	2309.02743	null
2023-10-12	PromptTTS 2: Describing and Generating Voices with Text Prompt	Yichong Leng et.al.	2309.02285	null
2023-09-04	A Comparative Analysis of Pretrained Language Models for Text-to-Speech	Marcel Granero-Moya et.al.	2309.01576	null
2023-09-02	DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin	Tao Li et.al.	2309.00883	null
2023-12-18	Learning Speech Representation From Contrastive Token-Acoustic Pretraining	Chunyu Qiang et.al.	2309.00424	null
2023-09-01	The FruitShell French synthesis system at the Blizzard 2023 Challenge	Xin Qi et.al.	2309.00223	null
2023-08-31	QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning	Haohan Guo et.al.	2309.00126	null
2024-01-23	SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models	Xin Zhang et.al.	2308.16692	link
2023-08-31	Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis	Weiqin Li et.al.	2308.16593	null
2023-08-31	Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information	Jie Chen et.al.	2308.16577	null
2023-08-31	LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech	Jie Chen et.al.	2308.16569	null
2023-08-30	CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis	Yi Meng et.al.	2308.16021	null
2023-09-01	The DeepZen Speech Synthesis System for Blizzard Challenge 2023	Christophe Veaux et.al.	2308.15945	null
2023-08-28	Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech	Hyungchan Yoon et.al.	2308.14909	null
2023-09-04	Rep2wav: Noise Robust text-to-speech Using self-supervised representations	Qiushi Zhu et.al.	2308.14553	null
2023-08-28	TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models	Shengpeng Ji et.al.	2308.14430	link
2023-09-02	Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder	Xuyuan Li et.al.	2308.13365	null
2023-08-24	Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations	Wenbin Wang et.al.	2308.13007	null
2023-09-22	Sparks of Large Audio Models: A Survey and Outlook	Siddique Latif et.al.	2308.12792	null
2023-10-25	SeamlessM4T: Massively Multilingual & Multimodal Machine Translation	Seamless Communication et.al.	2308.11596	link
2023-08-31	Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models	Heyang Xue et.al.	2308.10428	null
2023-08-16	AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis	Hrishikesh Viswanath et.al.	2308.08577	null
2023-08-14	SpeechX: Neural Codec Language Model as a Versatile Speech Transformer	Xiaofei Wang et.al.	2308.06873	null
2023-08-12	Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation	Zhichao Wang et.al.	2308.06457	link
2023-09-09	AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining	Haohe Liu et.al.	2308.05734	link
2023-08-09	Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay	Leixian Shen et.al.	2308.04703	null
2023-08-08	Towards an AI to Win Ghana's National Science and Maths Quiz	George Boateng et.al.	2308.04333	link
2023-08-08	WonderFlow: Narration-Centric Design of Animated Data Videos	Yun Wang et.al.	2308.04040	null
2023-08-04	Let's Give a Voice to Conversational Agents in Virtual Reality	Michele Yin et.al.	2308.02665	link
2023-08-03	Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation	Minsu Kim et.al.	2308.01831	link
2023-08-02	SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis	Ramanan Sivaguru et.al.	2308.01018	null
2023-07-07	Artificial Eye for the Blind	Abhinav Benagi et.al.	2308.00801	null
2023-07-31	Multilingual context-based pronunciation learning for Text-to-Speech	Giulia Comini et.al.	2307.16709	null
2023-07-31	Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech	Guangyan Zhang et.al.	2307.16679	null
2023-07-31	Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings	Manuel Sam Ribeiro et.al.	2307.16643	null
2023-07-31	DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training	Hyung-Seok Oh et.al.	2307.16549	link
2023-07-31	VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design	Jungil Kong et.al.	2307.16430	null
2023-07-30	Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation	Yuanhao Chen et.al.	2307.16199	link
2023-07-29	METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer	Xinfa Zhu et.al.	2307.15951	null
2023-12-18	Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding	Chunyu Qiang et.al.	2307.15484	null
2023-07-20	SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer	Daegyeom Kim et.al.	2307.10550	link
2023-07-18	SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs	Yinghao Aaron Li et.al.	2307.09435	null
2023-09-28	Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts	Ziyue Jiang et.al.	2307.07218	null
2023-07-13	Controllable Emphasis with zero data for text-to-speech	Arnaud Joly et.al.	2307.07062	null
2023-07-11	On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis	Siyang Wang et.al.	2307.05132	null
2023-07-10	The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task	Kun Song et.al.	2307.04630	null
2023-10-07	ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading	Yujia Xiao et.al.	2307.00782	null
2023-06-28	EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech	Daria Diatlova et.al.	2307.00024	link
2023-06-29	High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units	Junchen Lu et.al.	2306.17005	null
2023-06-28	UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data	Heeseung Kim et.al.	2306.16083	link
2023-10-19	Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale	Matthew Le et.al.	2306.15687	null
2023-06-27	GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech	Yahuan Cong et.al.	2306.15304	null
2023-06-25	DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech	Sen Liu et.al.	2306.14145	null
2023-06-21	Visual-Aware Text-to-Speech	Mohan Zhou et.al.	2306.12020	null
2023-06-21	Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer	Jakub Swiatkowski et.al.	2306.11662	null
2023-06-16	Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation	Kishor Kayyar Lakshminarayana et.al.	2306.10152	null
2023-06-16	CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages	Frederico S. Oliveira et.al.	2306.10097	null
2023-06-14	Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation	Zheng Liang et.al.	2306.08588	null
2023-06-14	Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects	Xinghua Qu et.al.	2306.08219	link
2023-11-20	StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models	Yinghao Aaron Li et.al.	2306.07691	null
2024-01-18	UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding	Chenpeng Du et.al.	2306.07547	null
2023-06-13	PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling	Ji-Sang Hwang et.al.	2306.07489	null
2023-06-09	Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech	Shijun Wang et.al.	2306.05709	null
2023-06-08	VIFS: An End-to-End Variational Inference for Foley Sound Synthesis	Junhyeok Lee et.al.	2306.05004	link
2023-07-11	Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge	Wenhao Guan et.al.	2306.04301	null
2023-06-06	Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias	Ziyue Jiang et.al.	2306.03509	null
2023-08-02	Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis	Zhenhui Ye et.al.	2306.03504	null
2023-06-05	Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis	Dengfeng Ke et.al.	2306.02593	null
2023-06-05	Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model	Hoyeon Lee et.al.	2306.02579	null
2023-06-05	Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming	Xinlei Niu et.al.	2306.02568	link
2023-06-02	Towards Robust FastSpeech 2 by Modelling Residual Multimodality	Fabian Kögel et.al.	2306.01442	link
2023-05-30	Towards Selection of Text-to-speech Data to Augment ASR Training	Shuo Liu et.al.	2306.00998	null
2023-06-01	EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis	Haobin Tang et.al.	2306.00648	null
2023-06-01	The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech	Phat Do et.al.	2306.00535	null
2023-05-31	Text-to-Speech Pipeline for Swiss German -- A comparison	Tobias Bollinger et.al.	2305.19750	null
2023-05-31	XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech	Linh The Nguyen et.al.	2305.19709	link
2023-06-01	PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions	Guanghou Liu et.al.	2305.19522	null
2023-05-30	Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages	Phat Do et.al.	2305.19396	null
2023-05-30	Make-A-Voice: Unified Voice Synthesis With Discrete Representation	Rongjie Huang et.al.	2305.19269	null
2023-05-30	STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions	Michel Plüss et.al.	2305.18855	null
2023-05-30	LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus	Yuma Koizumi et.al.	2305.18802	null
2023-10-09	An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization	Fei Kong et.al.	2305.18355	link
2023-05-29	ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation	Ambuj Mehrish et.al.	2305.18028	link
2023-05-29	Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis	Erik Ekstedt et.al.	2305.17971	null
2023-07-25	StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation	Kun Song et.al.	2305.17732	null
2023-05-28	Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS	Sewade Ogun et.al.	2305.17724	link
2023-07-19	Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing	Julia Kaiwen Lau et.al.	2305.17445	link
2023-05-26	DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction	Vineet Bhat et.al.	2305.16957	null
2023-05-25	Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion	Rui Liu et.al.	2305.16353	link
2023-05-22	Text Generation with Speech Synthesis for ASR Data Augmentation	Zhuangqun Huang et.al.	2305.16333	null
2023-05-25	VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation	Tianrui Wang et.al.	2305.16107	null
2023-05-25	Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration	Rustem Yeshpanov et.al.	2305.15749	link
2024-02-05	LAraBench: Benchmarking Arabic AI with Large Language Models	Ahmed Abdelali et.al.	2305.14982	null
2023-05-23	EfficientSpeech: An On-Device Text to Speech Model	Rowel Atienza et.al.	2305.13905	link
2023-05-23	ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models	Minki Kang et.al.	2305.13831	null
2023-05-22	U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech	Xin Jing et.al.	2305.13195	null
2023-05-25	EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels	Kari Ali Noriy et.al.	2305.13137	link
2023-05-22	ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer	Huadai Liu et.al.	2305.12708	null
2023-05-21	VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages	Shivam Mhaskar et.al.	2305.12518	null
2023-05-26	Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus	Detai Xin et.al.	2305.12442	link
2023-05-20	ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios	Yuyue Wang et.al.	2305.12200	null
2023-05-19	MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting	Neil Shah et.al.	2305.11926	null
2024-02-20	Data Redaction from Conditional Generative Models	Zhifeng Kong et.al.	2305.11351	null
2023-05-18	Parameter-Efficient Learning for Text-to-Speech Accent Adaptation	Li-Jen Yang et.al.	2305.11320	link
2023-05-19	Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation	Martijn Bartelds et.al.	2305.10951	link
2023-09-30	Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data	Yusheng Tian et.al.	2305.10891	link
2023-05-18	FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs	Won Jang et.al.	2305.10823	null
2023-05-18	CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training	Zhenhui Ye et.al.	2305.10763	null
2023-08-29	A Unified Front-End Framework for English Text-to-Speech Synthesis	Zelin Ying et.al.	2305.10666	null
2023-09-19	Controllable Speaking Styles Using a Large Language Model	Atli Thor Sigurgeirsson et.al.	2305.10321	null
2023-05-23	Better speech synthesis through scaling	James Betker et.al.	2305.07243	link
2023-10-29	CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model	Zhen Ye et.al.	2305.06908	link
2023-05-08	Accented Text-to-Speech Synthesis with Limited Data	Xuehao Zhou et.al.	2305.04816	null
2023-05-03	M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis	Jinlong Xue et.al.	2305.02269	null
2023-05-30	A Review of Deep Learning Techniques for Speech Processing	Ambuj Mehrish et.al.	2305.00359	null
2023-04-26	Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis	Ye-Xin Lu et.al.	2304.13270	null
2023-04-25	Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge	Chenpeng Du et.al.	2304.13121	null
2023-04-24	Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model	Kenichi Fujita et.al.	2304.11976	null
2023-04-23	DiffVoice: Text-to-Speech with Latent Diffusion	Zhijun Liu et.al.	2304.11750	null
2023-04-23	SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model	Jianzong Wang et.al.	2304.11547	null
2023-05-30	NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers	Kai Shen et.al.	2304.09116	null
2023-04-16	A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers	Juan Zuluaga-Gomez et.al.	2304.07842	null
2023-04-13	Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis	Shun Lei et.al.	2304.06359	null
2023-04-10	Enhancing Speech-to-Speech Translation with Multiple TTS Targets	Jiatong Shi et.al.	2304.04618	null
2023-04-07	ArmanTTS single-speaker Persian dataset	Mohammd Hasan Shamgholi et.al.	2304.03585	null
2023-04-03	Ensemble prosody prediction for expressive speech synthesis	Tian Huey Teh et.al.	2304.00714	null
2023-03-29	AraSpot: Arabic Spoken Command Spotting	Mahmoud Salhab et.al.	2303.16621	link
2023-03-28	Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages	Seongyeon Park et.al.	2303.15669	link
2023-03-27	Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis	Karren Yang et.al.	2303.14885	null
2023-03-24	Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis	Takuhiro Kaneko et.al.	2303.13909	null
2023-04-02	A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI	Chenshuang Zhang et.al.	2303.13336	null
2023-03-20	Code-Switching Text Generation and Injection in Mandarin-English ASR	Haibin Yu et.al.	2303.10949	null
2023-03-14	Controlling High-Dimensional Data With Sparse Input	Dan Andrei Iliescu et.al.	2303.09446	null
2023-03-09	Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports	Hyunseung Chung et.al.	2303.09395	link
2023-03-15	Cross-speaker Emotion Transfer by Manipulating Speech Style Latents	Suhee Jo et.al.	2303.08329	null
2023-03-14	QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis	Haobin Tang et.al.	2303.07682	null
2023-03-10	An End-to-End Neural Network for Image-to-Audio Transformation	Liu Chen et.al.	2303.06078	null
2023-03-09	Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation	Qi Chen et.al.	2303.05322	link
2023-03-07	Do Prosody Transfer Models Transfer Prosody?	Atli Thor Sigurgeirsson et.al.	2303.04289	null
2023-03-07	Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	Ziqiang Zhang et.al.	2303.03926	null
2023-03-02	Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding	Yingting Li et.al.	2303.03267	link
2023-03-08	FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model	Ruiqing Xue et.al.	2303.02939	null
2023-08-14	Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations	Yuma Koizumi et.al.	2303.01664	null
2023-03-11	Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities	Shijun Wang et.al.	2303.01508	null
2023-12-17	ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations	Neil Shah et.al.	2303.01261	null
2023-03-02	LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion	Chunfeng Wang et.al.	2303.01086	null
2023-03-02	Leveraging Large Text Corpora for End-to-End Speech Summarization	Kohei Matsuura et.al.	2303.00978	null
2023-03-01	DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction	Raviteja Anantha et.al.	2303.00171	null
2023-02-28	ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus	Ajinkya Kulkarni et.al.	2303.00069	null
2023-02-28	Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners	Jocelyn Huang et.al.	2302.14523	null
2023-06-12	CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis	Ji-Hoon Kim et.al.	2302.14370	null
2023-05-19	UniFLG: Unified Facial Landmark Generator from Text or Speech	Kentaro Mitsui et.al.	2302.14337	null
2023-02-27	Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech	Jiyoung Lee et.al.	2302.13700	link
2023-02-27	Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech	Dong Yang et.al.	2302.13652	null
2023-02-27	Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow	Yoonhyung Lee et.al.	2302.13458	null
2023-06-06	PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS	Junhyeok Lee et.al.	2302.12391	link
2023-02-21	Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition	Leyuan Qu et.al.	2302.09723	null
2023-02-23	QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion	Houjian Guo et.al.	2302.08296	link
2023-02-13	Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages	Sudhanshu Srivastava et.al.	2302.06227	null
2023-02-08	A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech	Li-Wei Chen et.al.	2302.04215	link
2023-02-07	Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision	Eugene Kharitonov et.al.	2302.03540	null
2023-02-15	MAC: A unified framework boosting low resource automatic speech recognition	Zeping Min et.al.	2302.03498	null
2023-06-25	InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt	Dongchao Yang et.al.	2301.13662	link
2023-03-01	UzbekTagger: The rule-based POS tagger for Uzbek language	Maksud Sharipov et.al.	2301.12711	null
2023-05-27	Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining	Takaaki Saeki et.al.	2301.12596	link
2023-01-31	Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker	Navjot Kaur et.al.	2301.12331	link
2023-01-26	On granularity of prosodic representations in expressive text-to-speech	Mikolaj Babianski et.al.	2301.11446	null
2023-01-26	Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study	Massa Baali et.al.	2301.09099	link
2023-01-20	Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions	Yinghao Aaron Li et.al.	2301.08810	null
2023-01-11	Modelling low-resource accents without accent-specific TTS frontend	Georgi Tinchev et.al.	2301.04606	null
2022-12-11	BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm	Yu-Wen Chen et.al.	2301.04120	link
2023-01-10	UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion	Haogeng Liu et.al.	2301.03801	null
2023-01-10	Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation	Abdullah Shahid et.al.	2301.03751	null
2023-09-19	Applying Automated Machine Translation to Educational Video Courses	Linden Wang et.al.	2301.03141	null
2023-01-06	Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition	David M. Chan et.al.	2301.02736	null
2023-01-05	Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	Chengyi Wang et.al.	2301.02111	link
2022-12-11	MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset	Kailin Liang et.al.	2301.00657	link
2022-12-30	ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech	Zehua Chen et.al.	2212.14518	null
2022-12-29	StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models	Yinghao Aaron Li et.al.	2212.14227	link
2022-12-22	HMM-based data augmentation for E2E systems for building conversational speech synthesis systems	Ishika Gupta et.al.	2212.11982	null
2022-12-21	ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement	Wei-Ning Hsu et.al.	2212.11377	null
2022-12-20	TTS-Guided Training for Accent Conversion Without Parallel Data	Yi Zhou et.al.	2212.10204	null
2023-06-28	Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling	Tuomo Raitio et.al.	2212.10075	null
2022-12-16	Speech Aware Dialog System Technology Challenge (DSTC11)	Hagen Soltau et.al.	2212.08704	null
2022-12-16	Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder	Yusuke Yasuda et.al.	2212.08329	null
2022-12-16	Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language	Yusuke Yasuda et.al.	2212.08321	null
2022-12-15	RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis	Shinhyeok Oh et.al.	2212.07939	link
2022-12-14	Probing Deep Speaker Embeddings for Speaker-related Tasks	Zifeng Zhao et.al.	2212.07068	null
2022-12-08	SpeechLMScore: Evaluating speech generation using speech language model	Soumi Maiti et.al.	2212.04559	link
2023-04-04	Learning to Dub Movies via Hierarchical Prosody Models	Gaoxiang Cong et.al.	2212.04054	link
2022-12-07	Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning	Ankur Debnath et.al.	2212.03558	null
2022-12-07	Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue	Daxin Tan et.al.	2212.03398	null
2022-12-06	UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis	Yi Lei et.al.	2212.01546	null
2022-11-30	SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech	Byoung Jin Choi et.al.	2211.16866	null
2022-11-29	Controllable speech synthesis by learning discrete phoneme-level prosodic representations	Nikolaos Ellinas et.al.	2211.16307	null
2023-05-25	Evaluating and reducing the distance between synthetic and real speech distributions	Christoph Minixhofer et.al.	2211.16049	null
2022-11-26	Contextual Expressive Text-to-Speech	Jianhong Tu et.al.	2211.14548	null
2022-12-05	Efficient Incremental Text-to-Speech on GPUs	Muyang Du et.al.	2211.13939	null
2023-03-21	Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?	Xuan Shi et.al.	2211.13868	link
2022-11-23	IMaSC -- ICFOSS Malayalam Speech Corpus	Deepa P Gopinath et.al.	2211.12796	null
2022-11-22	PromptTTS: Controllable Text-to-Speech with Text Descriptions	Zhifang Guo et.al.	2211.12171	null
2022-11-04	Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech	Xin Zhang et.al.	2211.09731	null
2023-02-17	Towards Building Text-To-Speech Systems for the Next Billion Users	Gokul Karthik Kumar et.al.	2211.09536	link
2023-02-16	EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance	Yiwei Guo et.al.	2211.09496	null
2022-11-17	Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation	Chunyu Qiang et.al.	2211.09495	null
2022-11-17	NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis	Hyeong-Seok Choi et.al.	2211.09407	null
2023-03-14	Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models	Minki Kang et.al.	2211.09383	null
2023-01-04	Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation	Xin Yuan et.al.	2211.09365	null
2022-11-14	SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech	Perry Lam et.al.	2211.07283	null
2023-05-24	Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing	Jacob J Webber et.al.	2211.06989	null
2023-05-29	OverFlow: Putting flows on top of neural transducers for better TTS	Shivam Mehta et.al.	2211.06892	link
2023-05-29	Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations	Yoori Oh et.al.	2211.06160	null
2022-12-04	ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech	Xiaoran Fan et.al.	2211.03545	link
2022-11-07	Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder	Jan Melechovsky et.al.	2211.03316	link
2022-11-06	Parallel Attention Forcing for Machine Translation	Qingyun Dou et.al.	2211.03237	null
2022-11-06	An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space	Jihwan Lee et.al.	2211.03078	null
2022-11-04	NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS	Dongchao Yang et.al.	2211.02448	null
2022-11-04	Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts	Detai Xin et.al.	2211.02336	null
2023-04-16	Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS	Ziqi Liang et.al.	2211.01948	null
2022-11-01	Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages	Anusha Prakash et.al.	2211.01338	null
2023-05-28	DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP	Kun Song et.al.	2211.01087	null
2022-11-22	Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement	Wei Song et.al.	2211.00967	null
2022-11-01	Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers	Cheng-Ping Hsieh et.al.	2211.00585	link
2023-06-11	Generating Multilingual Gender-Ambiguous Text-to-Speech Voices	Konstantinos Markopoulos et.al.	2211.00375	null
2023-05-07	Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features	Alexandra Vioni et.al.	2211.00342	null
2022-11-02	Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS	Kun Song et.al.	2210.17349	null
2024-02-27	Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation	Nikolaos Ellinas et.al.	2210.17264	null
2022-10-31	Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection	Luigi Attorresi et.al.	2210.17222	null
2022-10-31	Structured State Space Decoder for Speech Recognition and Synthesis	Koichi Miyazaki et.al.	2210.17098	null
2022-10-28	Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders	Jason Fong et.al.	2210.16045	null
2023-02-21	Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform	Masaya Kawamura et.al.	2210.15975	link
2023-02-22	Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis	Yuma Shirahata et.al.	2210.15964	null
2022-10-28	Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation	Nobuyuki Morioka et.al.	2210.15868	null
2023-03-15	Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech	Takaaki Saeki et.al.	2210.15447	null
2022-10-27	Explicit Intensity Control for Accented Text-to-speech	Rui Liu et.al.	2210.15364	null
2022-10-27	FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis	Yifan Hu et.al.	2210.15360	link
2022-10-26	Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection	Kentaro Seki et.al.	2210.14850	null
2022-10-25	Semi-Supervised Learning Based on Reference Model for Low-resource TTS	Xulong Zhang et.al.	2210.14723	null
2022-10-26	Cover Reproducible Steganography via Deep Generative Models	Kejiang Chen et.al.	2210.14632	null
2022-10-26	Improving Speech-to-Speech Translation Through Unlabeled Text	Xuan-Phi Nguyen et.al.	2210.14514	null
2022-10-26	The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Switching ASR Challenge	Yuhao Liang et.al.	2210.14448	null
2022-10-25	Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data	Xulong Zhang et.al.	2210.13803	null
2023-09-17	HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation	Chunhui Wang et.al.	2210.12740	null
2022-10-21	Low-Resource Multilingual and Zero-Shot Multispeaker TTS	Florian Lux et.al.	2210.12223	link
2022-10-21	Adaptive re-calibration of channel-wise features for Adversarial Audio Classification	Vardhan Dongre et.al.	2210.11722	null
2022-10-20	Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS	Chunyu Qiang et.al.	2210.11429	null
2022-10-17	Towards Relation Extraction From Speech	Tongtong Wu et.al.	2210.08759	link
2023-02-08	Generating Synthetic Speech from SpokenVocab for Speech Translation	Jinming Zhao et.al.	2210.08174	link
2022-10-17	LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge	Yan Jia et.al.	2210.07749	null
2022-10-20	Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy	Sarina Meyer et.al.	2210.07002	link
2022-10-13	Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar	Aolan Sun et.al.	2210.06877	null
2022-10-12	Can we use Common Voice to train a Multi-Speaker TTS system?	Sewade Ogun et.al.	2210.06370	null
2023-06-01	SQuId: Measuring Speech Naturalness in Many Languages	Thibault Sellam et.al.	2210.06324	null
2022-11-22	Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech	Byoung Jin Choi et.al.	2210.05979	null
2022-10-06	An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era	Andreas Triantafyllopoulos et.al.	2210.03538	null
2022-09-29	Facial Landmark Predictions with Applications to Metaverse	Qiao Han et.al.	2209.14698	link
2022-09-26	Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech	Yusuke Nakai et.al.	2209.12549	null
2022-09-22	EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models	Perry Lam et.al.	2209.10890	null
2022-09-22	MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline	Yifan Hu et.al.	2209.10848	link
2022-09-22	Controllable Accented Text-to-Speech Synthesis	Rui Liu et.al.	2209.10804	null
2022-09-16	TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection	Davide Salvi et.al.	2209.08000	null
2022-09-14	Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset	Michael Chinen et.al.	2209.06358	null
2022-09-08	SANIP: Shopping Assistant and Navigation for the visually impaired	Shubham Deshmukh et.al.	2209.03570	null
2022-09-07	Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech	Huu-Tien Dang et.al.	2209.02971	null
2022-09-02	Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model	Jennifer Drexler Fox et.al.	2209.01250	null
2022-08-28	Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks	Lev Finkelstein et.al.	2208.13183	null
2022-10-04	Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale	Aditya Agarwal et.al.	2208.09796	null
2022-08-21	Visualising Model Training via Vowel Space for Text-To-Speech Systems	Binu Abeysinghe et.al.	2208.09775	link
2022-08-15	Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0	Mohammed Salah Al-Radhi et.al.	2208.07122	null
2022-12-28	Speech Synthesis with Mixed Emotions	Kun Zhou et.al.	2208.05890	null
2022-08-03	A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis	Qibing Bai et.al.	2208.02189	null
2022-07-29	Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation	Giulia Comini et.al.	2207.14607	null
2022-07-25	Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis	Raul Fernandez et.al.	2207.12262	null
2022-07-01	A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese	Song Zhang et.al.	2207.12089	null
2022-07-20	When Is TTS Augmentation Through a Pivot Language Useful?	Nathaniel Robinson et.al.	2207.09889	link
2022-07-11	LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech	Harshvardhan Anand et.al.	2207.07118	null
2022-07-13	ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech	Rongjie Huang et.al.	2207.06389	link
2022-07-13	Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech	Zhengxi Liu et.al.	2207.06088	null
2022-07-13	SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate	Nabarun Goswami et.al.	2207.06011	null
2022-07-13	Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS	Yookyung Shin et.al.	2207.06000	null
2022-07-13	A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System	Yi-Chiao Wu et.al.	2207.05913	null
2022-07-12	Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition	Rodolfo Zevallos et.al.	2207.05498	null
2022-07-12	End-to-end speech recognition modeling from de-identified data	Martin Flechl et.al.	2207.05469	null
2022-07-11	Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data	Naoki Makishima et.al.	2207.04659	null
2022-07-11	DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders	Yanqing Liu et.al.	2207.04646	null
2023-01-02	Dreamento: an open-source dream engineering toolbox for sleep EEG wearables	Mahdad Jafarzadeh Esfahani et.al.	2207.03977	link
2022-07-07	BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus	Josh Meyer et.al.	2207.03546	link
2022-07-05	Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion	Yi Lei et.al.	2207.01832	null
2022-07-04	BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model	Brooke Stephenson et.al.	2207.01718	null
2022-07-04	Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)	Ariadna Sanchez et.al.	2207.01547	null
2022-07-04	Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)	Ziyao Zhang et.al.	2207.01507	null
2023-03-13	DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech	Keon Lee et.al.	2207.01063	link
2022-07-02	Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need	Daniel Korzekwa et.al.	2207.00774	null
2022-07-01	Building African Voices	Perez Ogayo et.al.	2207.00688	link
2022-07-01	Automatic Evaluation of Speaker Similarity	Deja Kamil et.al.	2207.00344	null
2022-08-03	Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding	Wei-Ping Huang et.al.	2206.15427	null
2022-06-30	R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS	Kyle Kastner et.al.	2206.15276	null
2022-07-01	Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems	Hyun-Wook Yoon et.al.	2206.15067	null
2022-06-30	TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder	Eunwoo Song et.al.	2206.14984	null
2022-06-29	Improving Deliberation by Text-Only and Semi-Supervised Training	Ke Hu et.al.	2206.14716	null
2022-06-29	Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody	Peter Makarov et.al.	2206.14643	null
2022-06-28	Expressive, Variable, and Controllable Duration Modelling in TTS	Ammar Abbas et.al.	2206.14165	null
2022-06-28	Comparison of Speech Representations for the MOS Prediction System	Aki Kunikoshi et.al.	2206.13817	null
2022-06-22	A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data	Raviraj Joshi et.al.	2206.13240	null
2022-06-25	Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations	Chin-Cheng Hsu et.al.	2206.12662	null
2022-10-21	Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech	Florian Lux et.al.	2206.12229	link
2022-06-24	SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech	Hyunjae Cho et.al.	2206.12132	null
2022-06-24	End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue	Kentaro Mitsui et.al.	2206.12040	null
2022-05-29	Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning	Sameea Naeem et.al.	2206.11860	null
2022-06-21	Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS	Kenta Udagawa et.al.	2206.10256	null
2022-06-24	Towards Optimizing OCR for Accessibility	Peya Mowar et.al.	2206.10254	null
2022-06-16	Automatic Prosody Annotation with Pre-Trained Text-Speech Model	Ziqian Dai et.al.	2206.07956	link
2022-11-16	NatiQ: An End-to-end Text-to-Speech System for Arabic	Ahmed Abdelali et.al.	2206.07373	null
2022-06-15	Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning	Rui Liu et.al.	2206.07229	link
2022-12-12	A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation	Junhui Zhang et.al.	2206.04922	null
2022-06-09	Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos	Alexander Waibel et.al.	2206.04523	null
2022-06-07	FlexLip: A Controllable Text-to-Lip System	Dan Oneata et.al.	2206.03206	null
2022-10-11	UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder	Jiachen Lian et.al.	2206.02512	null
2023-10-19	Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech	Ziyue Jiang et.al.	2206.02147	link
2022-11-02	AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation	Kun Song et.al.	2206.00208	null
2022-05-31	Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish	Alp Öktem et.al.	2205.15599	link
2023-11-20	StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis	Yinghao Aaron Li et.al.	2205.15439	link
2022-05-30	Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data	Sungwon Kim et.al.	2205.15370	null
2022-05-26	QSpeech: Low-Qubit Quantum Speech Application Toolkit	Zhenhou Hong et.al.	2205.13221	link
2022-11-10	T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation	Paul-Ambroise Duquenne et.al.	2205.12216	null
2022-05-20	PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit	Hui Zhang et.al.	2205.12007	link
2022-05-24	TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS	Xulong Zhang et.al.	2205.11824	null
2022-10-12	GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech	Rongjie Huang et.al.	2205.07211	link
2022-05-13	Talking Face Generation with Multilingual TTS	Hyoung-Kyu Song et.al.	2205.06421	null
2022-05-10	NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality	Xu Tan et.al.	2205.04421	link
2022-05-09	Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech	Yang Li et.al.	2205.04120	link
2022-05-09	ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence	Sangshin Oh et.al.	2205.04104	null
2022-07-14	Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss	Efthymios Georgiou et.al.	2204.13437	null
2022-04-25	SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech	Zhenhui Ye et.al.	2204.11792	null
2022-04-22	LibriS2S: A German-English Speech-to-Speech Translation Corpus	Pedro Jeuris et.al.	2204.10593	link
2022-07-05	Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation	Ryo Terashima et.al.	2204.10020	null
2022-04-21	FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis	Rongjie Huang et.al.	2204.09934	link
2022-04-20	Audio Deep Fake Detection System with Neural Stitching for ADD 2022	Rui Yan et.al.	2204.08720	null
2022-04-14	Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech	Cong Zhang et.al.	2204.07228	null
2022-12-09	Study of Indian English Pronunciation Variabilities relative to Received Pronunciation	Priyanshi Pal et.al.	2204.06502	null
2022-04-12	Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch	Hanbin Bae et.al.	2204.05753	null
2023-01-30	The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance	Lin Zhang et.al.	2204.05177	null
2022-10-27	Fine-grained Noise Control for Multispeaker Speech Synthesis	Karolos Nikitaras et.al.	2204.05070	null
2022-08-31	Karaoker: Alignment-free singing voice synthesis with speech training data	Panos Kakoulidis et.al.	2204.04127	null
2022-08-15	Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech	Jae-Sung Bae et.al.	2204.04004	null
2022-04-07	Arabic Text-To-Speech (TTS) Data Preparation	Hala Al Masri et.al.	2204.03255	null
2022-04-07	Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis	Yutian Wang et.al.	2204.03238	null
2022-08-24	SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis	Georgia Maniati et.al.	2204.03040	null
2022-09-13	Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation	Sravya Popuri et.al.	2204.02967	null
2022-07-02	Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification	Jin Woo Lee et.al.	2204.02639	null
2023-08-28	Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech	Hyungchan Yoon et.al.	2204.02172	null
2022-09-07	Deliberation Model for On-Device Spoken Language Understanding	Duc Le et.al.	2204.01893	null
2022-12-14	Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck	Youngsik Eom et.al.	2204.01387	null
2022-11-11	Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis	Yixuan Zhou et.al.	2204.00990	null
2022-06-30	VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature	Chenpeng Du et.al.	2204.00768	null
2022-04-01	AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios	Yihan Wu et.al.	2204.00436	null
2022-04-01	Text-To-Speech Data Augmentation for Low Resource Speech Recognition	Rodolfo Zevallos et.al.	2204.00291	null
2022-07-19	Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech	Guangyan Zhang et.al.	2203.17190	null
2022-03-31	An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer	Wenlin Dai et.al.	2203.16954	link
2022-07-11	WavThruVec: Latent speech representation as intermediate features for neural speech synthesis	Hubert Siuzdak et.al.	2203.16930	null
2022-03-31	A Character-level Span-based Model for Mandarin Prosodic Structure Prediction	Xueyuan Chen et.al.	2203.16922	link
2022-07-01	JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech	Dan Lim et.al.	2203.16852	link
2022-03-31	Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset	Zehui Yang et.al.	2203.16844	null
2022-03-31	NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism	Jingbei Li et.al.	2203.16838	link
2022-03-31	Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition	Anirudh Gupta et.al.	2203.16823	null
2022-04-21	Does Audio Deepfake Detection Generalize?	Nicolas M. Müller et.al.	2203.16263	null
2022-03-30	End to End Lip Synchronization with a Temporal AutoEncoder	Yoav Shalev et.al.	2203.16224	link
2022-08-15	Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition	Junrui Ni et.al.	2203.15796	link
2022-06-29	DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning	Takaaki Saeki et.al.	2203.15683	null
2022-11-05	Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation	Rendi Chevi et.al.	2203.15643	link
2022-10-06	Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus	Minchan Kim et.al.	2203.15447	null
2022-07-11	VoiceMe: Personalized voice generation in TTS	Pol van Rijn et.al.	2203.15379	link
