Human Mouth-State Recognition Based on Image Warping and Sparse Representation Combined with Homotopy
LI Cui-mei (李翠梅)1, ZENG Ping-ping (曾萍萍)2, ZHU Jin-qiang (朱勁強)3, WU Jian-hua (吳建華)3*
1 School of Communication and Electronics, Jiangxi Science & Technology Normal University, Nanchang 330031, China
2 College of Science and Technology, Nanchang University, Nanchang 330029, China
3 Department of Electronic Information Engineering, Nanchang University, Nanchang 330031, China
It is often necessary to recognize human mouth-states in order to detect the number of audio sources and improve the speech recognition capability of an intelligent robot auditory system. A human mouth-state recognition method based on image warping and sparse representation (SR) combined with homotopy is proposed. Using properly warped training mouth-state images as atoms of the overcomplete dictionary overcomes the impact of the diversity of the mouths' scales, shapes and positions, so that the robustness is further improved and the requirement for a large number of training samples is relieved. The homotopy method is employed to compute the expansion coefficients efficiently, i.e., for sparse coding. Orthogonal matching pursuit (OMP) is also tested and compared with the homotopy method. Experimental results and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed approach.
Key words: mouth-state recognition; image warping; sparse representation (SR); sparse coding; homotopy
In the application of a conventional intelligent robot auditory system, several speech signals need to be detected from a mixture of audio signals and noise[1-3]. The recognition performance of the auditory system may be degraded by the noise when only the auditory signal is used. To address this problem, visual information can be employed to facilitate speech signal recognition. This means determining the number of audio sources by observing the state of the mouth, opened or closed. Although visual information about the human mouth-state cannot fully determine the number of speakers, it helps increase the accuracy of this process. Recognizing the mouth-state is a pattern recognition problem, a meaningful and well-known application in computer vision. Pattern recognition technology was born in the 1920s and established in the early 1960s[4]. A test pattern is identified by a classifier trained with relevant algorithms: either unsupervised classification algorithms, which have the advantage of requiring no training set, or supervised classification algorithms such as the likelihood classifier[5], the support vector machine (SVM)[6], the nearest neighbors (NN)[7] and the minimum distance (MD), which need enough training samples[8]. Although traditional supervised classification algorithms like SVM bring excellent results, working directly with high-dimensional original data of varying scales, shapes and positions is generally difficult, so feature extraction is important to pattern recognition. The extracted low-dimensional features are more robust and cheaper to classify than the original data. In the past decade, many feature extraction methods were proposed, such as principal component analysis (PCA)[9], active appearance model (AAM)[10-11] and independent component analysis (ICA)[12]. Despite these many methods, feature extraction remains a challenging task.
In 2009, Wright et al. proposed sparse representation-based classification (SRC)[13]. It shows that the choice of features is no longer critical when sparsity in the recognition problem is properly harnessed. Thus the down-sampled images can be used as features and stacked as the dictionary directly. This method not only performs dimensionality reduction but also is robust to noise, shape and illumination. There are two key steps in the sparse representation (SR) approach involved in the SRC method. One is sparse coding, which optimally represents the test signal with a linear combination of a few elements selected from a dictionary. The other is generating this dictionary beforehand.
In the sparse coding phase, the $\ell_0$-norm minimization is regarded as the penalization in the optimization process of SR. But solving the $\ell_0$-norm minimization is a non-deterministic polynomial hard (NP-hard) problem. Instead, many researchers have addressed it by convex optimization[14], for example, using the $\ell_1$-norm minimization in place of the $\ell_0$-norm minimization as the penalization. Many related algorithms have emerged, such as basis pursuit (BP)[15], matching pursuit (MP)[16], orthogonal matching pursuit (OMP)[17] and homotopy[18]. Different sparse coding methods can result in different performances. Compared with other approaches, the homotopy method has the advantages of high speed and robustness to noisy data when sufficient sparsity is present, and it is much better suited for classification[18-20].
Dictionaries can be classified into two generic groups: unlearned dictionaries and learned ones. The former are generated by stacking a standard orthogonal basis or the training samples. The latter are learned from training samples by relevant algorithms, like the method of optimal directions (MOD) and the K-SVD algorithm[21-22]. Although the K-SVD algorithm can decrease the number of atoms in the dictionary, it pays attention only to the representational power of the dictionary rather than its discrimination power[23]. In the SRC method, meanwhile, the dictionary is built rather crudely by stacking down-sampled versions of the unprocessed original images.
In this paper, SR is applied to human mouth-state recognition. Except for our team's previous work[22], it has rarely been applied to this task. To further improve the discrimination power of the features and the dictionary, the original images should be pre-processed first. In Ref. [24], Li et al. paid more attention to pre-processing such as accurate face alignment. Their experimental results showed that a pre-processing approach was of great significance for the later classification task: at minor computational cost, it improved classification performance remarkably. Inspired by this, we propose a human mouth-state recognition algorithm based on image warping and sparse representation (SR) combined with homotopy. It is the first time that the image warping method[25-27] and homotopy are used for human mouth-state recognition. At the outset, the lip contour is extracted automatically from a mouth-state image[28]. Then the extracted mouth region is warped into a standard template, which is constructed from the average feature points of all the mouth images in the training set, including both mouth-opened and mouth-closed samples. In this way all the warped mouth-state images are of the same size and their feature points are in fixed locations. The dictionary involved in our method is generated by stacking the down-sampled images of the warped training samples. It is robust and discriminative even if there are only a limited number of samples or some defective ones. Experimental results show that the proposed method leads to higher classification rates (CR) than other approaches.
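Consider $n_i$ grayscale mouth image patches of size $w \times h$ pixels from the $i$th class, which are reordered as column vectors $x_{i,j} \in \mathbb{R}^m$ ($m = w \times h$) of a matrix $A_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}] \in \mathbb{R}^{m \times n_i}$, $i = 1, 2$. The matrix $A_i$ is treated as a sub-dictionary and each of its columns as an atom. Every test sample $y \in \mathbb{R}^m$ from the same class can be approximately represented sparsely over the sub-dictionary, i.e., as a linear combination of the training samples (atoms) associated with class $i$:
$$y = a_{i,1} x_{i,1} + a_{i,2} x_{i,2} + \cdots + a_{i,n_i} x_{i,n_i} \tag{1}$$
where $a_{i,j} \in \mathbb{R}$, $j = 1, 2, \ldots, n_i$.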
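There are only two mouth states, mouth-closed and mouth-opened, associated with two sub-dictionaries $A_1$ and $A_2$, respectively. A new matrix $A = [A_1, A_2]$ is defined as a whole dictionary for the entire two training sets:
$$A = [A_1, A_2] = [x_{1,1}, \ldots, x_{1,n_1}, x_{2,1}, \ldots, x_{2,n_2}] \in \mathbb{R}^{m \times (n_1 + n_2)} \tag{2}$$
Eq. (1) can then be rewritten as:
$$y = A\alpha \in \mathbb{R}^m \tag{3}$$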
If a test sample $y$ belongs to the mouth-closed class, ideally, the SR coefficient vector can be expected to be $\alpha = [a_{1,1}, \ldots, a_{1,n_1}, 0, \ldots, 0]^T \in \mathbb{R}^n$; otherwise, if it belongs to the mouth-opened class, $\alpha = [0, \ldots, 0, a_{2,1}, \ldots, a_{2,n_2}]^T \in \mathbb{R}^n$, where $n = n_1 + n_2$. For simplicity, we define a vector $\delta_i(\alpha)$ ($i = 1, 2$) as a new SR coefficient vector which keeps only the nonzero entries in $\alpha$ that are associated with the $i$th class. So we can approximately reconstruct $y$ by using only the coefficients associated with the $i$th class, i.e., $\hat{y}_i = A\delta_i(\alpha)$. Then $y$ can be recognized based on the two approximations by assigning it to the class that minimizes the reconstruction error (the difference between $y$ and $\hat{y}_i$):
$$\text{identity}(y) = \arg\min_{i} e_i(y), \quad e_i(y) = \|y - A\delta_i(\alpha)\|_2 \tag{4}$$
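The decision rule described above (keep only the coefficients of one class, reconstruct, and compare the reconstruction errors) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' Matlab code: the function name `src_classify`, the `labels` array and the class encoding (1 for mouth-closed, 2 for mouth-opened) are assumptions made for the example, and the coefficient vector is simply supplied instead of being computed by a sparse solver.

```python
import numpy as np

def src_classify(A, labels, alpha, y):
    """Assign y to the class with the smallest reconstruction error.

    A      : (m, n) dictionary whose columns are training atoms
    labels : (n,) class index of every atom (assumed: 1 = closed, 2 = opened)
    alpha  : (n,) sparse coefficient vector with y ~ A @ alpha
    y      : (m,) test feature vector
    """
    errors = {}
    for c in np.unique(labels):
        # delta_i(alpha): keep the entries of class c, zero out the rest
        delta = np.where(labels == c, alpha, 0.0)
        # e_i(y): l2 distance between y and its class-c reconstruction
        errors[int(c)] = float(np.linalg.norm(y - A @ delta))
    best = min(errors, key=errors.get)
    return best, errors

# Toy data (hypothetical): 4 atoms per class, test sample built from class 2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
A /= np.linalg.norm(A, axis=0)          # unit l2-norm atoms
labels = np.array([1, 1, 1, 1, 2, 2, 2, 2])
alpha_true = np.zeros(8)
alpha_true[[4, 6]] = [0.7, 0.5]
y = A @ alpha_true

predicted, errors = src_classify(A, labels, alpha_true, y)
print(predicted, errors)
```

In practice `alpha` would come from a sparse coding step, so both errors are nonzero and the smaller one decides the class.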
It is obvious that finding the correct SR coefficient vector $\alpha$ is the key task in SRC. Traditionally, the $\ell_2$-norm minimization is used for solving the problem:
$$\hat{\alpha}_2 = \arg\min_{\alpha} \|\alpha\|_2 \quad \text{subject to} \quad y = A\alpha \tag{5}$$
Although $\hat{\alpha}_2$ can be simply calculated, it includes many nonzero entries spanning the two classes, i.e., it is not sparse. As we know, the sparser the SR coefficients are, the easier it is to recognize the class the test sample $y$ belongs to. So the $\ell_2$-norm minimization seems powerless here, and researchers have focused instead on finding the solution to the $\ell_0$-norm minimization problem:
$$\hat{\alpha}_0 = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{subject to} \quad y = A\alpha \tag{6}$$
where $\|\alpha\|_0$ is the $\ell_0$-norm of $\alpha$ and equals the number of nonzero entries in $\alpha$. However, finding the sparsest solution $\alpha$ is an NP-hard problem due to its combinatorial nature. Several greedy pursuit methods[29] have been proposed, such as MP[16] and OMP[17]. Recently, research in the field of SR and compressed sensing[30] has shown that the $\ell_0$-norm minimization problem can be replaced by the $\ell_1$-norm minimization problem (P1), which can be addressed by linear programming or by the homotopy method when the solution $\alpha$ is sparse enough:
$$(P_1): \quad \hat{\alpha}_1 = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad y = A\alpha \tag{7}$$
This optimization not only guarantees sufficient sparsity of $\alpha$, but also is easy to implement.
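To make the contrast between a dense minimum $\ell_2$-norm solution and a greedy sparse one concrete, the following sketch implements a plain version of OMP and compares it with the pseudo-inverse solution on synthetic data. It is an illustrative sketch under stated assumptions: the function name `omp`, the stopping tolerance and the random dictionary are hypothetical, and the code is not the OMP variant of Ref. [17].

```python
import numpy as np

def omp(A, y, sparsity, tol=1e-10):
    """Greedy OMP: repeatedly pick the atom most correlated with the
    residual, then re-fit all selected atoms by least squares."""
    n = A.shape[1]
    selected = np.zeros(n, dtype=bool)
    support = []
    coef = np.zeros(0)
    residual = y.astype(float).copy()
    for _ in range(sparsity):
        if np.linalg.norm(residual) <= tol:
            break
        correlations = np.abs(A.T @ residual)
        correlations[selected] = 0.0          # never pick an atom twice
        j = int(np.argmax(correlations))
        if correlations[j] == 0.0:            # nothing left to explain
            break
        selected[j] = True
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    alpha = np.zeros(n)
    if support:
        alpha[support] = coef
    return alpha

# Synthetic overcomplete dictionary (hypothetical data, unit-norm atoms).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 60))
A /= np.linalg.norm(A, axis=0)
alpha_true = np.zeros(60)
alpha_true[[5, 40]] = [1.0, -0.8]
y = A @ alpha_true

alpha_omp = omp(A, y, sparsity=10)            # at most 10 nonzero entries
alpha_l2 = np.linalg.pinv(A) @ y              # minimum l2-norm solution of y = A alpha

print("nonzeros (OMP):", np.count_nonzero(alpha_omp))
print("nonzeros (l2): ", np.count_nonzero(np.abs(alpha_l2) > 1e-8))
```

The sparsity budget of 10 mirrors the sparsity prior quoted in the experiments; the minimum $\ell_2$-norm solution typically spreads energy over almost all atoms, which is why it is of little use for class-wise reconstruction.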
Human mouth-state recognition as addressed here is a two-class pattern recognition problem. The principle of the proposed algorithm is shown in Fig.1.
Fig.1 The systematic diagram of the proposed mouth-state recognition algorithm
In the training phase, each mouth-closed or mouth-opened training sample is warped into a pre-defined standard template, and then all warped mouth-closed training samples and all warped mouth-opened training samples are stacked into two sub-dictionary matrices, respectively. The resulting sub-dictionaries are then merged into a single dictionary $A_w$ according to Eq. (2). A test mouth-state sample $y$ can be sparsely represented by Eq. (3) over the dictionary $A_w$, and the SR coefficient vector is solved with the homotopy method. Finally, the class to which the test sample belongs is determined according to Eq. (4).
2.1 Lip outer contour extraction
The automatic lip contour extraction method in Ref. [28] is employed. A 16-point lip model with some geometric constraints is used to describe the lip contour. A total of 16 fixed locations $L_t$ are defined as shown in Fig.2. A region-based cost function which maximizes the joint probability of the lip region and the non-lip region is adopted to extract the optimum lip contour after several iterations[31]. Thus the 16 points of the optimum lip contour $L_s$ are obtained (Fig.3).
Fig.2 Feature points in a mouth image: (a) the standard template image and (b) 16 standard lip feature points
Fig.3 Image warping procedure of a frontal mouth-closed image: (a) original, (b) 16 lip feature points, (c) filtered by the morphological filtering, and (d) the warped image
2.2 Image warping
The mouth images with extracted lip contour are warped into the standard template shown in Fig.2(b). The thin-plate spline (TPS) based image warping method in Ref. [27] is employed. The two TPS interpolation functions, which serve as the coordinate mapping functions, are defined as Eq. (8).
Then, Eq. (8) together with Eq. (9) can be reformulated into the linear system of Eq. (10), whose entries $k_{ij}$ are defined alongside it. Simplifying Eq. (10) to $TC = S$, the TPS coefficients can be calculated by $C = T^{-1}S$. With the TPS coefficients $C$ and Eq. (8), every lip region point $(x, y)$ of a training or testing mouth-state image can be warped into the point $(x', y')$ in the standard template image. All the warped images have the same size (36×72), with the feature points in fixed locations. The results are shown in Fig.3 for a mouth-closed image and Fig.4 for a mouth-opened one.
Fig.4 Image warping procedure of a frontal mouth-opened image: (a) original, (b) 16 mouth feature points, (c) filtered by the morphological filtering, and (d) the warped image
The redundancy is increased in the warped images, resulting in a sparser SR and better performance of mouth-state recognition. Unlike conventional affine transformations such as scaling, shifting and rotation, image warping can also be applied to lateral facial images. Figure 5 shows several examples of lateral mouth images.
Fig.5 Examples of profile mouth images (top) and their warped results (bottom): (a) left side view of a mouth-closed image, (b) right side view of a mouth-closed image, (c) left side view of a mouth-opened image, and (d) right side view of a mouth-opened image
2.3 Homotopy method
The homotopy method in Ref. [18] is employed to compute the SR coefficient vector $\alpha$ involved in the $\ell_1$-norm minimization. It pursues a solution path, parameterized by a parameter which evolves from an accessible initial value to the desired value. The homotopy method is based on iterative calculation, and the step size needs to be calculated at each iteration. It is proved that the $\ell_1$-norm minimization problem (P1) can be replaced by the following objective function $f_\lambda(\alpha)$:
$$f_\lambda(\alpha) = \frac{1}{2}\|y - A\alpha\|_2^2 + \lambda\|\alpha\|_1 \tag{11}$$
where $\lambda \geq 0$. The solution path starts at a large value of $\lambda$ with the zero vector for $\alpha_\lambda$, and terminates when $\lambda = 0$, at which point $\alpha_\lambda$ converges to the desired solution of (P1).
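The idea of following a solution path from a large $\lambda$ (where the zero vector is optimal) down to a small $\lambda$ can be illustrated with a simple continuation scheme. Note that this is deliberately not the homotopy algorithm of Ref. [18], which tracks the exact breakpoints of the path: the sketch swaps in iterative soft-thresholding (ISTA) with warm starts on a geometric sequence of $\lambda$ values, assuming the standard penalized form $\frac{1}{2}\|y - A\alpha\|_2^2 + \lambda\|\alpha\|_1$. The function names, the shrink factor and the iteration counts are hypothetical choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise shrinkage operator associated with the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(A, y, alpha, lam):
    """0.5 * ||y - A alpha||_2^2 + lam * ||alpha||_1."""
    r = y - A @ alpha
    return 0.5 * float(r @ r) + lam * float(np.sum(np.abs(alpha)))

def l1_continuation(A, y, n_lambdas=30, inner_iters=200, shrink=0.7):
    """Warm-started ISTA along a decreasing sequence of lambda values."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    lam = float(np.max(np.abs(A.T @ y)))        # for lam >= this value, alpha = 0 is optimal
    alpha = np.zeros(A.shape[1])
    path = [(lam, alpha.copy())]
    for _ in range(n_lambdas):
        lam *= shrink                           # move towards lambda = 0
        for _ in range(inner_iters):
            grad = A.T @ (A @ alpha - y)
            alpha = soft_threshold(alpha - step * grad, step * lam)
        path.append((lam, alpha.copy()))
    return path

# Synthetic example (hypothetical data).
rng = np.random.default_rng(2)
A = rng.standard_normal((20, 40))
A /= np.linalg.norm(A, axis=0)
alpha_true = np.zeros(40)
alpha_true[[3, 17, 29]] = [1.0, -0.6, 0.8]
y = A @ alpha_true

path = l1_continuation(A, y)
lam_end, alpha_end = path[-1]
print("final lambda:", lam_end)
print("residual norm:", np.linalg.norm(y - A @ alpha_end))
```

The warm start is what makes the path idea cheap: each new $\lambda$ begins from the previous solution, so only a few entries of $\alpha$ change between consecutive values.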
3.1 Database used
The database used in this paper is built from mouth-state images cropped from images randomly downloaded from Google. It contains two kinds of mouth-state images of men and women of all age groups, 883 mouth-closed images (783 for training and 100 for test) and 1001 mouth-opened ones (901 for training and 100 for test), under variable illuminations, scales and poses. Part of them are shown in Figs.6(a) and (b).
Fig.6 Part of images from the established mouth-state image database:(a)100 mouth-closed images,(b)100 mouth-opened images,(c)100 warped mouth-closed images of size 36×72,and(d)100 warped mouth-opened images of size 36×72
3.2 Design of experiments
Similar to the processing in Ref. [22], every mouth image from both training sets is down-sampled to size 10×12, reordered into a column vector and then $\ell_2$-norm normalized. The two sub-dictionary matrices of the original case are constructed with the mouth-closed and mouth-opened samples from the original training set, respectively, and then merged into the dictionary $A_o$. In a similar way, the dictionary $A_w$ is constructed with the warped training set. The test samples are processed in the same way as the training samples; the corresponding images reconstructed with each dictionary matrix, together with the reconstruction errors, are shown in Fig.8 for mouth-closed and in Fig.9 for mouth-opened. The estimated SR coefficient vectors $\alpha$ corresponding to the related test mouth-state images with different dictionary matrices are shown in Fig.10. The approach proposed in this paper is compared with state-of-the-art methods such as SVM, neural network (NN), MD and SRC based on a dictionary pre-trained by the K-SVD algorithm (SRC-KSVD), with the original training set and the warped training set, respectively. In SRC-KSVD, one set of dictionaries is trained from the original training set and the other from the warped training set.
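The feature pipeline described above (down-sample, reorder into a column vector, $\ell_2$-normalize, stack into a dictionary) can be sketched as below. The interpolation used for down-sampling is not stated in the text, so nearest-neighbour sampling is assumed here; the function names, the reading of 10×12 as rows × columns, and the random stand-in images are likewise hypothetical.

```python
import numpy as np

def downsample(img, out_shape=(10, 12)):
    """Nearest-neighbour down-sampling: keep evenly spaced rows and columns."""
    rows = np.linspace(0, img.shape[0] - 1, out_shape[0]).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, out_shape[1]).round().astype(int)
    return img[np.ix_(rows, cols)]

def to_atom(img, out_shape=(10, 12)):
    """Down-sample, flatten to a vector and scale to unit l2 norm."""
    v = downsample(img, out_shape).astype(float).ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def build_dictionary(closed_imgs, opened_imgs):
    """Stack class-wise atoms into A = [A1, A2] and remember their labels."""
    A1 = np.column_stack([to_atom(im) for im in closed_imgs])
    A2 = np.column_stack([to_atom(im) for im in opened_imgs])
    A = np.hstack([A1, A2])
    labels = np.array([1] * A1.shape[1] + [2] * A2.shape[1])
    return A, labels

# Stand-in data: random 36x72 "warped" images instead of real mouth images.
rng = np.random.default_rng(3)
closed_imgs = [rng.integers(0, 256, size=(36, 72)) for _ in range(5)]
opened_imgs = [rng.integers(0, 256, size=(36, 72)) for _ in range(7)]

A, labels = build_dictionary(closed_imgs, opened_imgs)
print(A.shape, labels)
```

With the training set sizes reported for the database, the same construction yields a 120-row dictionary with far more columns than rows, i.e., an overcomplete dictionary; the toy example above is too small to be overcomplete.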
Fig.8 Reconstructed results of original and warped mouth-closed samples: (a) original; (b) down-sampled version of (a) (as the feature); (c) reconstructed result by $A_o$, error e = 0.774;
Fig.9 Reconstructed results of original and warped mouth-opened samples: (a) original; (b) down-sampled version of (a) (as the feature); (c) reconstructed result by $A_o$, error e = 1.330;
Fig.10 The locations and values of the nonzero entries of the coefficient vector $\alpha$ corresponding to (a) Fig.8(b), (b) Fig.8(g), (c) Fig.9(b), and (d) Fig.9(g)
The parameters are set similarly to those in Ref. [22], i.e., 10 for the sparsity prior and 50 for the maximum number of iterations. We also compare the CR of homotopy with that of another sparse coding solver, OMP. The test mouth-state samples are processed in the same way as the corresponding training samples. All the experiments are carried out on the two groups of training sets, respectively. The results are shown in Table 1. In Tables 1-2, W represents image warping, and the data format is CR with warped set/CR with un-warped set. The results show that our method always has the best performance, with the highest CR reaching 97.5%. Also according to Table 2, image warping clearly benefits mouth-state recognition, i.e., it boosts the CR by 1.786% on average and the highest increment reaches 4.5%, as shown in columns 2 and 7 of Table 1. To further illustrate the benefits of image warping and the homotopy method, extended experiments with different numbers of training samples (the number of mouth-closed training samples being the same as that of mouth-opened ones) have been done, and the results are shown in Fig.11. The homotopy method always performs better than the OMP method under the same conditions. The CRs are improved by image warping in most cases, especially when the number of training samples is less than 50 per class; the highest average increase reaches 9.0% for the MD method and 5.214% for our method, as shown in Table 2. The CRs of mouth-closed decrease slightly in some cases, since the TPS interpolation used in image warping may reverse the order of the reconstruction errors when $e_1$ is approximately equal to $e_2$. But this decrease is acceptable since image warping makes the CRs of mouth-closed and mouth-opened more balanced and reasonable. All the experiments are carried out on a PC (Intel Core i3-3220 CPU, 3.30 GHz) with Matlab R2010a.
Table 1 Comparison of CR (%) of different methods with warped and un-warped training sets
Table 2 Comparison of CR (%) of different methods with warped and un-warped training sets in case of small training sets
Fig.11 CRs for different number of training samples
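For reference, the classification rate (CR) used throughout the experiments is simply the percentage of correctly recognized test samples, which can be computed per class and overall as in the sketch below. The helper name and the prediction counts are hypothetical and are not taken from Tables 1-2; they are only chosen so that 195 of 200 test samples are correct, which corresponds to an overall CR of 97.5%.

```python
def classification_rates(true_labels, predicted_labels):
    """Return the CR (%) of every class and the overall CR (%)."""
    correct, total = {}, {}
    for t, p in zip(true_labels, predicted_labels):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    rates = {c: 100.0 * correct[c] / total[c] for c in total}
    rates["overall"] = 100.0 * sum(correct.values()) / sum(total.values())
    return rates

# Hypothetical outcome on 100 mouth-closed (1) and 100 mouth-opened (2) tests.
true_labels = [1] * 100 + [2] * 100
predicted = [1] * 97 + [2] * 3 + [2] * 98 + [1] * 2
rates = classification_rates(true_labels, predicted)
print(rates)
```

Reporting the per-class rates next to the overall rate makes it visible when a gain for one mouth state is paid for by a loss for the other, which is the balance issue discussed for the warped training set.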
To improve the performance of an intelligent robot auditory system, the effective audio sources in the acoustic mixtures should be detected by recognizing the state of the mouth. In this paper, we have proposed a novel approach for human mouth-state recognition based on image warping and SR combined with the homotopy method. Experiments have been done to compare the proposed method with state-of-the-art methods on two different training sets: the original training set and the warped training set. The results show that our method is more efficient and effective than the others for human mouth-state recognition. Homotopy is selected because it runs faster than general LP solvers, as shown by Donoho et al.[18] The experiments further show that the homotopy method is much more effective than the OMP method in terms of classification. In addition, image warping enables our method to obtain a higher CR, even when only limited training samples are available. In the future, we will investigate how to extract feature points not only along the outline of the mouth but also between the lips, such as the pixels of teeth in mouth-opened images. We will also investigate whether the CR will increase if an adaptive template is used instead of the standard template with fixed size. Finally, we will extend our algorithm to multi-class recognition problems, such as human gesture recognition in a service robot system.
References
[1] Rivet B, Wang W, Naqvi S M, et al. Audio-Visual Speech Source Separation[J]. IEEE Signal Processing Magazine, 2014, 31(3): 125-134.
[2] Liu Q, Wang W W, Jackson P. Use of Bimodal Coherence to Resolve Permutation Problem in Convolutive BSS[J]. Signal Processing, 2012, 92(8): 1916-1927.
[3] Missaoui I, Zied L. Cepstral Smoothing of Binary Masks for Convolutive Blind Separation of Speech Mixtures[J]. International Journal of Digital Content Technology and Its Applications, 2012, 6(17): 532-541.
[4] Bucak S S, Rong J, Jain A K. Multiple Kernel Learning for Visual Object Recognition: a Review[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1354-1369.
[5] Loog M, Jensen A C. Semi-supervised Nearest Mean Classification through a Constrained Log-likelihood[J]. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(5): 995-1006.
[6] Zheng J, Lu B L. A Support Vector Machine Classifier with Automatic Confidence and Its Application to Gender Classification[J]. Neurocomputing, 2011, 74(11): 1926-1935.
[7] Cavalcanti G D C, Ren T I, Vale B A. Data Complexity Measures and Nearest Neighbor Classifiers: a Practical Analysis for Meta-learning[C]. IEEE 24th International Conference on Tools with Artificial Intelligence, Athens, Greece, 2012: 1065-1069.
[8] Bag S, Sanyal G. An Efficient Face Recognition Approach Using PCA and Minimum Distance Classifier[C]. IEEE International Conference on Image Information Processing, Himachal Pradesh, India, 2011: 3-5.
[9] Wang C L, Lan L, Zhang Y W, et al. Face Recognition Based on Principle Component Analysis and Support Vector Machine[C]. IEEE 3rd International Workshop on Intelligent Systems and Applications, Wuhan, China, 2011: 1-4.
[10] Cootes T F, Edwards G J, Taylor C J. Active Appearance Models[J]. Computer Vision ECCV'98, 1998, 1407: 484-498.
[11] Chen Y, Yu F, Ai C. Sequential Active Appearance Model Based on Online Instance Learning[J]. IEEE Signal Processing Letters, 2013, 20(6): 567-570.
[12] Wang S L, Liew A W C. ICA-Based Lip Feature Representation for Speaker Authentication[C]. International IEEE Conference on Signal-Image Technologies and Internet-Based System, Shanghai, China, 2007: 763-767.
[13] Wright J, Yang A Y, Ganesh A, et al. Robust Face Recognition via Sparse Representation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(2): 210-227.
[14] Donoho D L. For Most Large Underdetermined Systems of Linear Equations the Minimal l1-Norm Solution is Also the Sparsest Solution[J]. Communications on Pure and Applied Mathematics, 2006, 59(6): 797-829.
[15] Qin Q, Jiang Z N, Feng K, et al. A Novel Scheme for Fault Detection of Reciprocating Compressor Valves Based on Basis Pursuit, Wave Matching and Support Vector Machine[J]. Measurement, 2012, 45(5): 897-908.
[16] Moussallam M, Daudet L, Richard G. Matching Pursuits with Random Sequential Subdictionaries[J]. Signal Processing, 2012, 92(10): 2532-2544.
[17] Karahanoglu N B, Erdogan H. A* Orthogonal Matching Pursuit: Best-First Search for Compressed Sensing Signal Recovery[J]. Digital Signal Processing, 2012, 22(4): 555-568.
[18] Donoho D, Tsaig Y. Fast Solution of l1-Norm Minimization Problems When the Solution May be Sparse[J]. IEEE Transactions on Information Theory, 2008, 54(11): 4789-4812.
[19] Ul Haq Q S, Shi L X, Tao L M, et al. Hyperspectral Data Classification via Sparse Representation in Homotopy[C]. IEEE 2nd International Conference on Information Science and Engineering, Hangzhou, China, 2010: 3748-3752.
[20] Cao H B, Deng H W, Li M, et al. Classification of Multicolor Fluorescence in Situ Hybridization (M-FISH) Images with Sparse Representation[J]. IEEE Transactions on Nanobioscience, 2012, 11(2): 111-118.
[21] Aharon M, Elad M, Bruckstein A. K-SVD: an Algorithm for Designing Overcomplete Dictionaries for Sparse Representation[J]. IEEE Transactions on Signal Processing, 2006, 54(11): 4311-4322.
[22] Zhang Y, Qu S, Wu J H. Human Mouth-Type Recognition via Learned Dictionary and Sparse Representation[J]. International Journal of Digital Content Technology and its Applications, 2013, 7(4): 599-606.
[23] Zhang Q, Li B X. Discriminative K-SVD for Dictionary Learning in Face Recognition[C]. IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 2010: 2691-2698.
[24] Li H X, Wang P, Shen C H. Robust Face Recognition via Accurate Face Alignment and Sparse Representation[C]. International Conference on Digital Image Computing: Techniques and Applications, Sydney, Australia, 2010: 265-269.
[25] Pishchulin L, Gass T, Dreuw P, et al. Image Warping for Face Recognition: from Local Optimality towards Global Optimization[J]. Pattern Recognition, 2012, 45(9): 3131-3140.
[26] Elad M, Goldenberg R, Kimmel R. Low Bit-Rate Compression of Facial Images[J]. IEEE Transactions on Image Processing, 2007, 16(9): 2379-2383.
[27] Nejati M, Amirfattahi R, Sadri S. A Fast Hybrid Approach for Approximating a Thin-Plate Spline Surface[C]. The 18th Iranian Conference on Electrical Engineering, Isfahan, Iran, 2010: 204-208.
[28] Sum K L, Lau W H, Leung S H, et al. A New Optimization Procedure for Extracting the Point-Based Lip Contour Using Active Shape Model[C]. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, 2001: 1485-1488.
[29] Tropp J A, Gilbert A C, Strauss M J. Algorithms for Simultaneous Sparse Approximation, Part I: Greedy Pursuit[J]. Signal Processing, 2006, 86(3): 572-588.
[30] Donoho D, Huo X. Uncertainty Principles and Ideal Atomic Decomposition[J]. IEEE Transactions on Information Theory, 2001, 47(7): 2845-2862.
[31] Wang S L, Lau W H, Leung S H. Automatic Lip Contour Extraction from Color Images[J]. Pattern Recognition, 2004, 37(12): 2375-2387.
CLC number: TN911.73; O235
Document code: A
Article ID: 1672-5220(2015)04-0658-07
Received date: 2014-11-05
Foundation items: National Natural Science Foundation of China (No. 61210306074); Natural Science Foundation of Jiangxi Province, China (No. 2012BAB201025); the Scientific Program of Jiangxi Provincial Education Department, China (Nos. GJJ14583, GJJ13008)
*Correspondence should be addressed to WU Jian-hua, Email: jhwu@ncu.edu.cn
Journal of Donghua University (English Edition), 2015, Issue 4