AITRICS’ Two Papers Accepted at ICASSP 2025, the World’s Largest Conference on Acoustics, Speech, and Signal Processing
2025-04-15
Proved the Strength of Its Speech AI Technology Even in Data-Constrained Environments
Laid the Technical Foundation for Expansion from Text-Based to Multimodal LLMs

AITRICS (CEO Kwang Joon Kim), a company specializing in artificial intelligence (AI) technology, announced on April 15 that two of its research papers were accepted at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025, the world’s largest academic conference in the field of acoustics, speech, and signal processing, held in Hyderabad, India, from April 6 to 11.
The accepted papers are ▲ Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting and ▲ Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping.
AITRICS presented its advanced speech AI technology through these two poster presentations.
The first paper proposes Stable-TTS, a speaker-adaptive TTS framework that naturally replicates a speaker’s tone and intonation from only a small amount of speech data. The model was developed to overcome the sound-quality instability common in existing speaker-adaptive TTS models, enabling stable speech synthesis even from limited and noisy recordings.
The approach leverages high-quality speech samples from the pretraining corpus through a Prosody Language Model (PLM) and a prior-preservation technique to maintain robust synthesis performance. The study demonstrated that the model generates natural, speaker-similar speech even from low-quality or minimal data.
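As a rough illustration of the general idea only, the Python sketch below shows how a handful of target-speaker samples might be mixed with high-quality pretraining samples as a prior-preservation term during adaptation. The toy model, tensor shapes, and loss weighting are hypothetical stand-ins and are not taken from the Stable-TTS paper.

# Minimal, illustrative sketch of speaker adaptation with a prior-preservation term.
# All module names, dimensions, and weights are assumptions for illustration only.
import torch
import torch.nn as nn

class ToyTTS(nn.Module):
    """Stand-in for a speaker-adaptive TTS model (text + prosody prompt -> mel frames)."""
    def __init__(self, text_dim=64, prosody_dim=32, mel_dim=80):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(text_dim + prosody_dim, 128), nn.ReLU(), nn.Linear(128, mel_dim)
        )

    def forward(self, text_emb, prosody_prompt):
        return self.backbone(torch.cat([text_emb, prosody_prompt], dim=-1))

model = ToyTTS()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def batch(n=4, text_dim=64, prosody_dim=32, mel_dim=80):
    # Random tensors stand in for (text embedding, prosody prompt, target mel) triples.
    return torch.randn(n, text_dim), torch.randn(n, prosody_dim), torch.randn(n, mel_dim)

lambda_prior = 0.5  # hypothetical weight on the prior-preservation loss
for step in range(100):
    # A few (possibly noisy) samples from the new target speaker.
    t_text, t_pros, t_mel = batch()
    # High-quality samples from the pretraining corpus act as a "prior"
    # so adaptation does not degrade overall synthesis quality.
    p_text, p_pros, p_mel = batch()

    loss_target = nn.functional.l1_loss(model(t_text, t_pros), t_mel)
    loss_prior = nn.functional.l1_loss(model(p_text, p_pros), p_mel)
    loss = loss_target + lambda_prior * loss_prior

    opt.zero_grad()
    loss.backward()
    opt.step()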
The second paper presents Face-StyleSpeech, a zero-shot TTS model that generates realistic speech using only a face image. By extracting speaker-specific features from the face image and combining them with prosody codes, the model produces more natural and lifelike speech. The method significantly improves the mapping between facial features and speech style, yielding better overall voice quality than previous face-based speech synthesis models.
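In purely schematic form, the sketch below shows how a face-derived speaker embedding could be fused with prosody codes to condition a synthesis decoder. The encoder, decoder, and dimensions are illustrative assumptions rather than the actual Face-StyleSpeech architecture.

# Minimal sketch of conditioning a toy TTS decoder on a face-derived speaker
# embedding plus prosody codes. Everything here is a hypothetical placeholder.
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Maps a face image to a speaker embedding (placeholder CNN)."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, emb_dim)
        )

    def forward(self, face):
        return self.net(face)

class FaceConditionedTTS(nn.Module):
    """Fuses a face-derived speaker embedding with prosody codes to drive synthesis."""
    def __init__(self, text_dim=64, emb_dim=128, prosody_dim=32, mel_dim=80):
        super().__init__()
        self.face_encoder = FaceEncoder(emb_dim)
        self.decoder = nn.Sequential(
            nn.Linear(text_dim + emb_dim + prosody_dim, 256), nn.ReLU(),
            nn.Linear(256, mel_dim)
        )

    def forward(self, text_emb, face, prosody_code):
        spk = self.face_encoder(face)                         # face image -> speaker identity
        cond = torch.cat([text_emb, spk, prosody_code], -1)   # fuse identity and prosody
        return self.decoder(cond)                             # -> mel-spectrogram frames

# Usage with random stand-ins for real inputs (batch of 2, 64x64 RGB faces).
model = FaceConditionedTTS()
mel = model(torch.randn(2, 64), torch.randn(2, 3, 64, 64), torch.randn(2, 32))
print(mel.shape)  # torch.Size([2, 80])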
Wooseok Han, a researcher at AITRICS, commented, “This research shows that it is possible to generate stable and natural speech with limited data, which makes it highly applicable in real-world environments like healthcare, where data availability is often limited. We believe this study is a step toward expanding from text-based LLMs to multimodal LLMs that integrate voice and images. We will continue our research and development efforts to deliver medical AI services that offer enhanced user experience and high reliability.”