Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
PAPERS
Multi-setting acoustic feature training for data augmentation of speech recognition
Sei Ueno, Akinobu Lee
JOURNAL OPEN ACCESS

2024 Volume 45 Issue 4 Pages 195-203

Abstract

This paper presents simple multi-setting log Mel-scale filter bank (lmfb) training methods that narrow the gap between real and synthesized speech in data augmentation for automatic speech recognition (ASR). End-to-end ASR suffers from a shortage of real speech data, and synthesizing training data with a text-to-speech (TTS) system has significantly improved its performance. However, speech generated by a TTS model is often monotonous and lacks the natural variation of real speech, which degrades ASR performance. To mitigate this problem, we propose using multi-setting lmfb features in the data augmentation scheme: multiple lmfb features are extracted under multiple short-time Fourier transform (STFT) parameter settings taken from well-known configurations for ASR and TTS tasks. We also propose training a single TTS model on multi-setting lmfb features, with the setting ID embedded in the text-to-Mel network. Experimental evaluations showed that both proposed multi-setting training methods achieved better ASR performance than the baseline augmentation methods trained with a single setting.
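
To make the idea of multi-setting extraction concrete, the following is a minimal sketch, not the authors' implementation, of computing lmfb features from one waveform under several STFT parameter settings. The three settings, the 16 kHz sampling rate, the 80 Mel channels, the Hann window, and the function names (`lmfb`, `extract_multi_setting`) are all assumptions chosen for illustration; the abstract does not state the actual values used in the paper.

```python
import numpy as np

# Hypothetical STFT settings keyed by setting ID (NOT the paper's values).
SETTINGS = {
    0: dict(n_fft=512, win_length=400, hop_length=160),    # 25 ms / 10 ms at 16 kHz
    1: dict(n_fft=1024, win_length=1024, hop_length=256),  # a common TTS-style choice
    2: dict(n_fft=2048, win_length=1200, hop_length=300),  # another TTS-style choice
}


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m:m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def lmfb(wave, sr, n_fft, win_length, hop_length, n_mels=80):
    """Log Mel filter bank features, shape (n_frames, n_mels)."""
    if len(wave) < win_length:
        wave = np.pad(wave, (0, win_length - len(wave)))
    n_frames = 1 + (len(wave) - win_length) // hop_length
    # Index matrix that slices the waveform into overlapping frames.
    idx = np.arange(win_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(win_length)
    # rfft zero-pads each windowed frame from win_length up to n_fft.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.maximum(mel, 1e-10))  # floor avoids log(0)


def extract_multi_setting(wave, sr=16000):
    """Return {setting_id: lmfb features} for every entry in SETTINGS."""
    return {sid: lmfb(wave, sr, **cfg) for sid, cfg in SETTINGS.items()}


if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    feats = extract_multi_setting(np.sin(2 * np.pi * 440.0 * t))
    for sid, f in feats.items():
        print(sid, f.shape)
```

Each setting yields a different number of frames and a different time-frequency resolution for the same utterance, which is the kind of variation the multi-setting scheme relies on. In this sketch the dictionary key plays the role of the setting ID; a TTS model conditioned on that ID, as the abstract describes, would presumably look it up in a learned embedding table, though how the paper does so is not specified here.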

© 2024 by The Acoustical Society of Japan

This article is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International license.
https://creativecommons.org/licenses/by-nd/4.0/