Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data

12/26/2021

Speech recognition is very challenging in student learning environments, which are characterized by significant cross-talk and background noise. To address this problem, we present a bilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations. We demonstrate the use of our system in generating a complex audio dataset whose significant cross-talk and background noise approximate real-life classroom recordings. We then test our proposed system with real-life recordings. In terms of the distance of the speakers from the microphone, our interactive video analysis system obtained a better average error rate of 10.83% compared to 33.12% [...] 27.92 [...]. In terms of 9 important keywords, our approach gave an average sensitivity of 38% compared to 24% [...] average specificity of 90% [...]. On average, sensitivity improved from 24% to 38%. On the other hand, specificity remained high for both methods (90% [...]).
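For readers unfamiliar with the keyword metrics reported above, the following is a minimal sketch of how per-keyword sensitivity and specificity are conventionally computed from detection counts. The function name and the counts in the usage line are hypothetical illustrations, not values or code from the paper, and the paper's exact evaluation protocol may differ.

```python
import math


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) for one keyword.

    tp: segments where the keyword was spoken and detected
    fn: segments where the keyword was spoken but missed
    tn: segments where the keyword was absent and not detected
    fp: segments where the keyword was absent but detected
    """
    # Sensitivity (recall): fraction of true keyword occurrences that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    # Specificity: fraction of keyword-free segments correctly left unflagged.
    specificity = tn / (tn + fp) if (tn + fp) > 0 else math.nan
    return sensitivity, specificity


# Hypothetical counts, chosen only to illustrate the arithmetic.
sens, spec = sensitivity_specificity(tp=3, fn=5, tn=9, fp=1)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Averaging these two values over all keywords gives the kind of summary figures quoted in the abstract. A method can keep specificity high while sensitivity stays low if it rarely raises false alarms but misses many true occurrences.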
