Authors
Xin Wang, Chuan Xie, Qiang Wu, Huayi Zhan, Ying Wu
Publication date
2022
Conference
INTERSPEECH
Pages
4775-4779
Description
Text-independent speaker identification has attracted growing attention, yet it remains challenging to extract speaker-specific features from speech with arbitrary content. End-to-end systems trained on utterance-level features suffer performance degradation caused by variation in speech content. To address this issue, this paper proposes a novel phoneme-based approach with the following key features: first, it restricts the variety of speech content by splitting each utterance into a set of phoneme segments and develops phoneme-constrained models to extract segment-level speaker embeddings; second, it leverages a soft-voting mechanism with mono-phonemic thresholds and weights to combine the results of different phonemes. Experimental results on the AISHELL and ASRU2019 datasets show that the proposed approach is effective and robust, outperforming state-of-the-art methods in both EER and accuracy, especially under a larger phonemic mismatch between the enrollment and test utterances. In addition, the proposed system is efficient and can be trained well on a small-scale dataset.
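The soft-voting step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `soft_vote`, the use of cosine-similarity scores, and the specific combination rule (weighted average of per-phoneme threshold decisions) are assumptions made for clarity.

```python
def soft_vote(similarities, thresholds, weights):
    """Combine per-phoneme scores into one speaker-match score.

    similarities: dict phoneme -> similarity between enrollment and test
                  segment embeddings for that phoneme (illustrative input)
    thresholds:   dict phoneme -> mono-phonemic acceptance threshold
    weights:      dict phoneme -> per-phoneme reliability weight
    """
    num, den = 0.0, 0.0
    for ph, score in similarities.items():
        # Each phoneme casts a binary vote against its own threshold;
        # the votes are then blended by the phoneme weights.
        vote = 1.0 if score >= thresholds[ph] else 0.0
        num += weights[ph] * vote
        den += weights[ph]
    return num / den if den > 0 else 0.0
```

Per-phoneme thresholds let reliable phonemes (e.g. long vowels) count for more than short, noisy ones, which is one plausible reading of the "mono-phonemic thresholds and weights" in the abstract.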