Authors
Georg F Meyer, Jeffrey B Mulligan, Sophie M Wuerger
Publication date
2004/6/1
Journal
Information Fusion
Volume
5
Issue
2
Pages
91-101
Publisher
Elsevier
Description
Audio–visual speech recognition systems can be categorised into systems that integrate audio–visual features before decisions are made (feature fusion) and those that integrate decisions of separate recognisers for each modality (decision fusion). Decision fusion has been applied at the level of individual analysis time frames, phone segments and for isolated word recognition but in its basic form cannot be used for continuous speech recognition because of the combinatorial explosion of possible word string hypotheses that have to be evaluated. We present a case for decision fusion at the utterance level and propose an algorithm that can be applied efficiently to continuous speech recognition tasks, which we call N-best decision fusion. The system was tested on a single-speaker, continuous digit recognition task where the audio stream was contaminated by additive multi-speaker babble noise. The audio …
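The utterance-level fusion described in the abstract can be illustrated with a minimal sketch: an audio recogniser produces an N-best list of word-string hypotheses with log-likelihoods, each hypothesis is rescored by a visual model, and the two streams are combined log-linearly before the final decision. The function names, the weighting scheme, and the toy scores below are hypothetical illustrations, not the paper's actual implementation.

```python
def nbest_decision_fusion(audio_nbest, visual_score, audio_weight=0.7):
    """Rescore an audio N-best list with visual log-likelihoods.

    audio_nbest  : list of (hypothesis, audio_log_likelihood) pairs,
                   as produced by the audio-only recogniser.
    visual_score : callable mapping a hypothesis string to a visual
                   log-likelihood (hypothetical interface).
    audio_weight : stream weight on audio; (1 - audio_weight) on visual.

    Returns the (hypothesis, fused_score) pair with the highest
    combined score -- the fused decision.
    """
    fused = [
        (hyp, audio_weight * a_ll + (1.0 - audio_weight) * visual_score(hyp))
        for hyp, a_ll in audio_nbest
    ]
    return max(fused, key=lambda pair: pair[1])


# Toy example: audio alone slightly prefers the wrong string,
# but the visual stream disambiguates it on the N-best list.
audio_nbest = [("one two three", -12.0), ("one two free", -11.5)]
visual = {"one two three": -3.0, "one two free": -9.0}
best_hyp, best_score = nbest_decision_fusion(audio_nbest, visual.get)
```

Restricting fusion to the N hypotheses already proposed by the audio recogniser is what avoids the combinatorial explosion the abstract mentions: only N word strings are rescored, rather than every possible string.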