Robustness over time-varying channels in DNN-HMM ASR based human-robot interaction

Novoa, José; Wuth, Jorge; Escudero, Juan Pablo; Fredes, Josué; Mahu, Rodrigo; Stern, Richard; Becerra Yoma, Nestor

Keywords: speech recognition, human-computer interaction, time varying channels, locally-normalized filter banks


This paper addresses the problem of time-varying channels in speech-recognition-based human-robot interaction using Locally-Normalized Filter-Bank (LNFB) features and training strategies that compensate for microphone response and room acoustics. Testing utterances were generated by re-recording the Aurora-4 testing database with a PR2 mobile robot equipped with a Kinect audio interface, while the robot performed head rotations and moved toward and away from a fixed source. Three training conditions were evaluated: Clean, 1-IR and 33-IR. With Clean training, the DNN-HMM system was trained on the Aurora-4 clean training database. With 1-IR training, the same training data were convolved with a single impulse response estimated at one meter from the source with no rotation of the robot head. With 33-IR training, the Aurora-4 training data were convolved with impulse responses estimated at one, two and three meters from the source and at 11 angular positions of the robot head (3 distances × 11 angles = 33 impulse responses). The 33-IR training method reduced WER by more than 50% relative to Clean training with both LNFB and conventional Mel filter-bank (MelFB) features. In addition, with 33-IR training, LNFB features yielded a WER 23% lower than MelFB features. Combining 33-IR training with LNFB features reduced WER by 64% compared to Clean training with MelFB features.
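
The 1-IR and 33-IR conditions both rest on convolving clean training speech with measured impulse responses. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the authors' pipeline: the function names `simulate_channel` and `build_ir_grid`, the synthetic impulse responses, the trimming to the original length, and the peak rescaling are all hypothetical choices, and the abstract does not say how utterances were assigned to impulse responses.

```python
import numpy as np


def simulate_channel(speech, impulse_response):
    """Convolve clean speech with one room/microphone impulse response."""
    reverberant = np.convolve(speech, impulse_response, mode="full")
    # Assumption: trim the reverberant tail so the output keeps the
    # original number of samples.
    reverberant = reverberant[: len(speech)]
    # Assumption: rescale so the peak level matches the clean signal.
    peak = np.max(np.abs(reverberant))
    if peak > 0:
        reverberant = reverberant * (np.max(np.abs(speech)) / peak)
    return reverberant


def build_ir_grid(distances_m, n_angles):
    """List every (distance, angle index) pair in the measurement grid."""
    return [(d, a) for d in distances_m for a in range(n_angles)]


rng = np.random.default_rng(0)

# Stand-in for one clean training utterance (1 s at 16 kHz).
speech = rng.standard_normal(16000)

# Three distances and 11 head angles, as in the 33-IR condition.
grid = build_ir_grid([1, 2, 3], 11)

# Placeholder impulse responses: exponentially decaying noise.
# Measured responses would be loaded here instead.
decay = np.exp(-np.arange(800) / 200.0)
irs = {key: rng.standard_normal(800) * decay for key in grid}

# One convolved copy of the utterance per impulse response.
training_set = [simulate_channel(speech, irs[key]) for key in grid]
```

In this sketch the 1-IR condition corresponds to using a single entry of `irs`, and the Clean condition to using `speech` unchanged.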

More information

Publisher: ISCA
Publication date: 2017
First page: 839
Last page: 343
Language: English