This study investigates the performance of the VITS model for Lithuanian speech synthesis under different training configurations. Experiments covered phoneme-based and grapheme-based text representations, accented text, and both single-speaker and multi-speaker setups. The goal was to evaluate how linguistic pre-processing and speaker diversity influence synthesis quality. Model outputs were compared using objective measures. The results provide insights into the impact of phoneme representation and accent information on the quality of Lithuanian neural TTS systems.

This work is licensed under a Creative Commons Attribution 4.0 International License.