蓋晁旭, 梁隆愷, 何勇軍
Abstract: Speaker recognition has been studied extensively over the past few decades, and many effective methods have been proposed. Mainstream approaches such as the Gaussian mixture model with universal background model (GMM-UBM) and the Gaussian mixture model with support vector machine (GMM-SVM) achieve good recognition performance, but both require ample training and test data. This requirement is often hard to satisfy in real applications, causing their recognition rates to drop sharply. To address this problem, a speaker recognition method based on sparse coding is proposed. In the training stage, the method learns a speech dictionary for each speaker; in the recognition stage, the test speech is sparsely represented over each dictionary and scored by its reconstruction error. Experiments show that, with small amounts of clean training and test speech, the proposed method achieves better recognition results than GMM-UBM and GMM-SVM.
Keywords: speaker recognition; Gaussian mixture model; support vector machine; sparse coding
DOI: 10.15938/j.jhust.2017.03.003
CLC number: TN912.3
Document code: A
Article ID: 1007-2683(2017)03-0013-06
Abstract: Speaker recognition has attracted broad and deep research in the past few decades, and many methods have been proposed. At present, popular methods such as the Gaussian mixture model with universal background model (GMM-UBM) and the Gaussian mixture model with support vector machine (GMM-SVM) achieve good recognition results, but they require large amounts of training and testing data. In practical applications, where this data requirement often cannot be satisfied, they suffer severe performance degradation. To solve this problem, a speaker recognition method based on sparse coding is presented. In the training stage, the method learns a dictionary for each speaker; in the recognition stage, it sparsely represents the test speech over each dictionary and scores it by the reconstruction error. Experiments show that the proposed method achieves better recognition results than GMM-UBM and GMM-SVM when the training and testing data are clean and limited.
Keywords: speaker recognition; Gaussian mixture model; support vector machine; sparse coding
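The pipeline described in the abstract (one dictionary per speaker, test speech coded over each dictionary, scoring by reconstruction error) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: `omp` is a basic orthogonal matching pursuit, and the feature vectors (e.g. MFCC frames) are assumed to have been extracted already.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit: code x over dictionary D (columns are
    unit-norm atoms) with at most k nonzero coefficients."""
    residual = x.copy()
    support = []
    for _ in range(k):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # re-fit coefficients on the chosen support by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    return support, coef, residual

def identify(dicts, x, k=5):
    """Score a test vector x against each speaker's dictionary by
    reconstruction error; the speaker with the smallest error wins."""
    errors = {spk: float(np.linalg.norm(omp(D, x, k)[2]))
              for spk, D in dicts.items()}
    return min(errors, key=errors.get), errors
```

In practice a whole utterance would be scored by accumulating the per-frame errors, and the dictionaries would be learned from each speaker's training speech rather than drawn at random.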
The DET curves obtained from the experiments show that, under both conditions (training duration fixed at 8 s with test durations of 2 s, 3 s, 4 s, and 5 s; and training durations of 4 s, 5 s, 6 s, and 7 s with test duration fixed at 2 s), the recognition results of the compared methods improve to varying degrees as the amount of speech data increases. In every experiment, the sparse-coding-based method clearly outperforms GMM-UBM and GMM-SVM. This is because sparse coding exploits the inherent sparsity of speech signals, and therefore represents speech features better than Gaussian mixture models when speech data are relatively scarce.
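For reference, the equal error rate commonly read off a DET curve can be computed from trial scores as below. This is a hypothetical sketch assuming higher scores mean stronger evidence for the claimed speaker; with reconstruction-error scores as used here, one would negate the errors first.

```python
import numpy as np

def eer(genuine, impostor):
    """Approximate equal error rate: sweep the decision threshold over all
    observed scores and take the point where the larger of the false-accept
    and false-reject rates is smallest."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best = 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)  # impostor trials wrongly accepted
        frr = np.mean(genuine < t)    # genuine trials wrongly rejected
        best = min(best, max(far, frr))
    return best
```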
3 Conclusion
In current real-world applications, the recognition rates of mainstream GMM-based speaker recognition methods such as GMM-UBM and GMM-SVM drop sharply as the training and test data shrink; reducing a method's data requirements while preserving its recognition performance is therefore of great significance. This paper proposed a speaker recognition method based on sparse coding. The dictionaries are obtained by training rather than by collecting exemplars, which further ensures that speech is sparse over each dictionary. The test speech is then scored over each trained dictionary, and the final recognition result is given by the scores. The experimental results show that, in a noise-free environment with little training and test speech, the sparse-coding-based method achieves better recognition results than GMM-UBM and GMM-SVM.
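A trained (rather than collected) dictionary, as emphasized above, can be obtained by alternating a sparse-coding step with a dictionary-update step. K-SVD is a common choice in this line of work; the sketch below instead uses a cheap top-k thresholding coder and a MOD-style least-squares update as an illustrative stand-in, with `n_atoms`, `k`, and the MFCC-frame input all assumed rather than taken from the paper.

```python
import numpy as np

def learn_dictionary(X, n_atoms, k, n_iter=10, seed=0):
    """Illustrative dictionary learning by alternating minimization.
    X: (dim, n_frames) feature matrix, e.g. MFCC frames of one speaker."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)                # unit-norm atoms
    for _ in range(n_iter):
        C = D.T @ X                               # correlations (n_atoms, n_frames)
        kth = np.sort(np.abs(C), axis=0)[-k]      # k-th largest magnitude per frame
        C = np.where(np.abs(C) >= kth, C, 0.0)    # crude k-sparse codes
        D = X @ np.linalg.pinv(C)                 # least-squares dictionary update
        D /= np.linalg.norm(D, axis=0) + 1e-12    # renormalize (guard unused atoms)
    return D
```

One such dictionary would be learned per speaker from that speaker's training frames, giving the per-speaker codebooks that the recognition stage scores against.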
Since the method recognizes well with little, noise-free speech data, it has practical value and can be applied to speaker recognition tasks in relatively clean acoustic environments. In future work we will reduce the method's computational cost and improve its noise robustness, so as to enhance its real-time performance and its resilience to environmental noise.
(Editor: 溫澤宇)