Abstract
Concatenative synthesis, the dominant Text-to-Speech (TTS) method, often produces audible discontinuities at unit joins due to mismatched phonemic and prosodic contexts. Previous linear cross-fading approaches improved smoothness but generated unnatural formant trajectories. This thesis proposes a unit-dependent, parameterized cross-fading algorithm guided by a perceptual cost function that predicts speech quality from acoustic distance measures. Using a custom corpus and perceptual experiments, we show that output quality depends on the shape of the formant trajectory across the vowel and correlates with both the absolute distance and its derivative. The results demonstrate the feasibility of perceptual-cost-based optimization for natural-sounding TTS, advancing speech synthesis beyond traditional concatenation techniques.
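To make the contrast between a fixed linear cross-fade and a parameterized one concrete, the following minimal Python sketch blends two waveform units over an overlap region, with a single exponent controlling the fade shape. The function name, the alpha parameter, and the power-curve fade are illustrative assumptions for this sketch only; they are not the thesis's actual unit-dependent parameterization or its perceptual cost function.

import numpy as np

def crossfade(unit_a, unit_b, overlap, alpha=1.0):
    """Blend the tail of unit_a into the head of unit_b over `overlap` samples.

    alpha = 1.0 reproduces the conventional linear cross-fade; other values
    bend the fade curve, standing in for a unit-dependent parameterization.
    (Illustrative sketch only, not the thesis's algorithm.)
    """
    t = np.linspace(0.0, 1.0, overlap)    # normalized time within the overlap
    fade_in = t ** alpha                   # weight applied to unit_b
    fade_out = 1.0 - fade_in               # complementary weight for unit_a
    blended = unit_a[-overlap:] * fade_out + unit_b[:overlap] * fade_in
    return np.concatenate([unit_a[:-overlap], blended, unit_b[overlap:]])

# Toy usage: join two sine "units" with a 200-sample overlap.
sr = 16000
a = np.sin(2 * np.pi * 220 * np.arange(sr // 4) / sr)
b = np.sin(2 * np.pi * 260 * np.arange(sr // 4) / sr)
out = crossfade(a, b, overlap=200, alpha=1.5)

In the thesis, the shape parameters would instead be chosen per unit pair by minimizing a perceptual cost predicted from acoustic distance measures, rather than being fixed by hand as in this toy example.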