Please use this identifier to cite or link to this item: https://repositorio.ufpe.br/handle/123456789/52538


Title: A two-level item response theory model to evaluate automatic speech synthesis and recognition systems
Author: OLIVEIRA, Chaina Santos
Keywords: Computational intelligence; Speech benchmark; Speech recognition
Publication date: 19-Jun-2023
Publisher: Universidade Federal de Pernambuco
Citation: OLIVEIRA, Chaina Santos. A two-level item response theory model to evaluate automatic speech synthesis and recognition systems. 2023. Tese (Doutorado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2023.
Abstract: Automatic speech recognition systems (ASRs) have become popular in different applications. Ideally, ASRs should be tested under different scenarios by adopting diverse speech test data (e.g., diverse sentences and speakers). Relying on audio test data recorded by human speakers is time-consuming. An alternative is to use text-to-speech (TTS) tools to synthesize audios given a set of sentences and virtual speakers. The ASR under test receives the synthesized audios and the transcription errors are recorded for evaluation. Despite the availability of TTS tools, not all synthesized speeches have the same quality. It is important to evaluate the usefulness of speakers and the relevance of sentences for ASR evaluation. In this work, we propose a two-level Item Response Theory (IRT) model to simultaneously evaluate ASRs, speakers and sentences, which is original in the literature. IRT is a paradigm from psychometrics to estimate the ability of human respondents based on their responses to items with different levels of difficulty. In the first level of the proposed model, an item is a synthesized speech, a respondent is an ASR system, and each response is the transcription accuracy observed when a synthesized speech is adopted for testing an ASR system. IRT is then used to estimate the difficulty of each synthesized speech as well as the ability of each ASR system. In the second level, the difficulty of each synthesized speech is decomposed into the sentence's difficulty and discrimination and the speaker's quality. The difficulty of a synthesized speech tends to be high when it is generated from a difficult sentence and a bad speaker, and sentences with greater discrimination tend to better differentiate between good and bad speakers. In turn, an ASR's ability is high when it is robust to difficult speeches.
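The two-level structure described in the abstract can be sketched as follows, assuming a standard two-parameter logistic (2PL) IRT formulation for the first level and an illustrative additive decomposition for the second; the exact functional form and signs used in the thesis may differ:

```latex
% First level (assumed 2PL form): expected transcription accuracy of
% ASR j with ability \theta_j on synthesized speech i, which has
% difficulty \delta_i and discrimination a_i
P_{ij} = \frac{1}{1 + e^{-a_i(\theta_j - \delta_i)}}

% Second level (illustrative decomposition): speech i is sentence s
% uttered by speaker k; its difficulty rises with the sentence's
% difficulty b_s and falls with the speaker's quality q_k, scaled by
% the sentence's discrimination d_s
\delta_i = b_s - d_s \, q_k
```

Under this sketch, a high-discrimination sentence (large \(d_s\)) spreads speakers apart, matching the abstract's claim that discriminative sentences better differentiate good and bad speakers.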
Before performing the experiments with the two-level IRT model proposed in this work, we executed a preliminary case study to verify the viability of applying IRT in the context of speech evaluation. In this first case study, IRT was applied to evaluate 62 speakers (from four TTS tools) and to characterize the difficulty of 12 different sentences. The experiments provided interesting insights into the relevance of applying IRT to evaluate sentences and speakers, which inspired us to explore other scenarios. We then formulated the two-level IRT model introduced above and executed the second case study: four ASR systems were adopted to transcribe synthesized speeches from 100 benchmark sentences and 75 speakers. The experiments revealed useful insights on how the quality of speech synthesis and recognition can be affected by distinct factors (e.g., sentence difficulty and speaker ability). We also explored the impact of pitch, rate, and noise insertion on parameter estimation and system performance.
URI : https://repositorio.ufpe.br/handle/123456789/52538
Appears in collections: Dissertações de Mestrado - Ciência da Computação

Files in this item:
TESE Chaina Santos Oliveira.pdf (2.65 MB, Adobe PDF) - item embargoed until 2025-09-30


This item is protected by original copyright



This item is licensed under a Creative Commons License