ZENG Mingru, ZHENG Zisheng, LUO Shun
CLC number: TN911.73-34; TP391.41        Document code: A        Article ID: 1004-373X(2019)19-0037-04
Abstract: To better capture the temporal information between consecutive frames of a video, a novel two-stream convolutional network structure is proposed for human behavior recognition in videos. Without changing the spatial-stream structure of the two-stream convolution, a long short-term memory (LSTM) network is added to the convolutional model of the temporal stream, and, unlike previous two-stream convolutional architectures, the temporal stream is trained end to end. A combined error function is also tried on the new network structure to obtain better optical flow information. Experiments were carried out on the KTH and UCF101 human behavior video datasets. The results show that the proposed two-stream convolution with LSTM and the combined error function achieves a clearly higher recognition rate than both the ordinary two-stream convolution and the two-stream convolution with LSTM using the previous error function.
Keywords: LSTM; two-stream convolution; human behavior recognition; convolutional neural network; optical flow information; model fusion
The field of human behavior recognition has developed rapidly over the past few years, but recognizing human behavior in video remains a major challenge. Compared with static image classification, the temporal information in video provides an important cue for recognition, because most actions can be identified accurately from the motion information in the temporal stream. Consequently, much recent research has focused on how to extract temporal information from video frames and thereby obtain motion information [1-2].
At first, traditional hand-crafted feature extraction methods prevailed in behavior recognition for some time, for example the improved dense trajectories (IDT) proposed in reference [3] to represent motion features, or features based on spatio-temporal interest points. Later, reference [4] showed that a single convolutional architecture is faster than traditional hand-crafted methods for video processing, but performs worse, because a single convolutional architecture has difficulty capturing the motion information between video frames. Reference [5] then proposed the two-stream convolutional architecture to solve this problem. It adds an extra convolutional network (the temporal stream) to the earlier architecture to compute temporal information; compared with a single convolutional architecture its accuracy is clearly higher, and it is still faster than traditional hand-crafted feature extraction for video processing.
A shortcoming of the two-stream convolutional architecture is that sampled frames are often used as the input for video classification, which may make the video-level label information incomplete or even missing [6]. The innovation of this paper is to add a long short-term memory (LSTM) network to the temporal stream. LSTM extends the recurrent neural network with memory cells that store information, which makes it easier to capture long-range optical flow information when processing video frames and avoids using sampled frames as input. In addition, unlike previous two-stream convolutional architectures, the temporal convolutional stream is trained end to end, which reduces the extra processing of the input data. A new error function is also tried on the new network structure to obtain better optical flow information.
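As an illustration of the temporal stream described above, the following is a minimal PyTorch-style sketch (the framework, the layer sizes, the hidden dimension and the input layout are illustrative assumptions, not the exact configuration of this paper): a small CNN encodes each optical-flow field and an LSTM accumulates the encoded features over the whole clip, so no frame sampling is required.

import torch
import torch.nn as nn

class TemporalStreamLSTM(nn.Module):
    """Temporal stream sketch: a CNN encodes each flow field, an LSTM
    accumulates motion information across the whole clip."""
    def __init__(self, num_classes=101, hidden_size=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # -> (batch*time, 64, 1, 1)
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, flow):                             # flow: (batch, time, 2, H, W)
        b, t = flow.shape[:2]
        feats = self.cnn(flow.flatten(0, 1)).flatten(1)  # per-frame features (batch*time, 64)
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)                        # memory cells carry long-range motion
        return self.classifier(out[:, -1])               # class scores from the last time step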
The two-stream convolutional network imitates the human visual process by dividing video processing into two streams (a spatial stream and a temporal stream) [7], as shown in Fig. 1. Each stream uses a deep convolutional network connected to a softmax classifier, and the classification results of the two streams are finally fused.
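A minimal sketch of this late fusion step is given below, assuming two already trained stream models that return raw class scores; the fusion weight w and the simple weighted average of softmax scores are illustrative assumptions standing in for whatever fusion rule Fig. 1 uses.

import torch
import torch.nn.functional as F

def fuse_two_stream(spatial_model, temporal_model, rgb_frame, flow_clip, w=0.5):
    """Late fusion: each stream ends in a softmax classifier, and the
    clip-level prediction is a weighted average of the two score vectors."""
    with torch.no_grad():
        p_spatial = F.softmax(spatial_model(rgb_frame), dim=1)    # appearance cue
        p_temporal = F.softmax(temporal_model(flow_clip), dim=1)  # motion cue
    return w * p_spatial + (1.0 - w) * p_temporal                 # fused class probabilities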
The above experiments show that the two-stream convolutional neural network combined with LSTM brings a considerable improvement in the accuracy of human behavior recognition. The new network structure was evaluated on the UCF101 dataset and achieved an accuracy of 78.1%, higher than the recognition accuracy of the plain two-stream convolutional neural network.
This paper improves on the two-stream convolutional neural network and verifies the improvement experimentally on the UCF101 dataset. In the proposed network model, an LSTM network is introduced into the temporal stream of the two-stream convolutional neural network; its memory cells store previous information, allowing the new network to capture information over longer sequences of video frames. The network also adopts a new error function, which integrates the standard pixel reconstruction error, a smoothness error and an SSIM error, exploiting the strengths of the three terms to obtain better optical flow information from the video frames. The experiments on the UCF101 dataset show that the two-stream convolutional neural network combined with LSTM obtains better optical flow information to a certain extent and raises the recognition accuracy of the two-stream convolutional network considerably. It also performs better than the plain two-stream convolutional neural network when the motion background is complex and the camera itself is moving.
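The sketch below illustrates one plausible form of such a combined error function, assuming the common unsupervised optical-flow setting in which the next frame is warped back to the current frame using the predicted flow; the Charbonnier penalty, the 3x3 SSIM window and the weights a, b, c are illustrative assumptions rather than the exact terms and tuned values used in this paper.

import torch
import torch.nn.functional as F

def charbonnier(x, eps=1e-3):
    # Robust penalty often used in place of the plain L1/L2 pixel error
    return torch.sqrt(x * x + eps * eps)

def ssim_loss(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-scale SSIM over 3x3 windows (average pooling)."""
    mu_a, mu_b = F.avg_pool2d(a, 3, 1, 1), F.avg_pool2d(b, 3, 1, 1)
    var_a = F.avg_pool2d(a * a, 3, 1, 1) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, 3, 1, 1) - mu_b ** 2
    cov = F.avg_pool2d(a * b, 3, 1, 1) - mu_a * mu_b
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return (1 - ssim).clamp(0, 2).mean() / 2

def smoothness_loss(flow):
    """Penalize large spatial gradients of the predicted flow field."""
    dx = charbonnier(flow[:, :, :, 1:] - flow[:, :, :, :-1]).mean()
    dy = charbonnier(flow[:, :, 1:, :] - flow[:, :, :-1, :]).mean()
    return dx + dy

def combined_flow_loss(frame_t, warped_t1, flow, a=1.0, b=0.1, c=0.5):
    """Weighted sum of pixel reconstruction, smoothness and SSIM terms;
    the weights a, b, c are illustrative, not the paper's values."""
    recon = charbonnier(frame_t - warped_t1).mean()  # pixel reconstruction error
    return a * recon + b * smoothness_loss(flow) + c * ssim_loss(frame_t, warped_t1)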
References
[1] CHEN B. Deep learning of invariant spatio-temporal features from video [D]. Vancouver: The University of British Columbia, 2010.
[2] YEFFET L, WOLF L. Local trinary patterns for human action recognition [C]// 2009 IEEE 12th International Conference on Computer Vision. Kyoto: IEEE, 2009: 492-497.
[3] WANG H, SCHMID C. Action recognition with improved trajectories [C]// IEEE International Conference on Computer Vision. Sydney: IEEE, 2014: 3551-3558.
[4] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 1725-1732.
[5] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos [J]. Advances in neural information processing systems, 2014, 1(4): 568-576.
[6] NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification [C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston: IEEE, 2015: 4694-4702.
[7] WANG Xinpei. Research on two-stream CNN based abnormal behavior classification algorithm [D]. Harbin: Harbin Institute of Technology, 2017.
[8] DONAHUE J, HENDRICKS L A, ROHRBACH M, et al. Long-term recurrent convolutional networks for visual recognition and description [J]. IEEE transactions on pattern analysis & machine intelligence, 2014, 39(4): 677-691.
[9] ZHAO H, GALLO O, FROSIO I, et al. Loss functions for image restoration with neural networks [J]. IEEE transactions on computational imaging, 2017, 3(1): 47-57.
[10] JI S, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition [J]. IEEE transactions on pattern analysis & machine intelligence, 2012, 35(1): 221-231.