Research on text classification based on improved naive Bayes algorithm
XIN Ziming, WANG Fang
(School of Science, Yanshan University, Qinhuangdao, Hebei 066004, China)
Abstract: Given the output class, the naive Bayes algorithm assumes that attributes are mutually independent. In reality this assumption rarely holds, so when the number of attributes is large or the correlations between attributes are strong, classification performance suffers. To solve this problem, this paper improves the naive Bayes algorithm with an optimized fuzzy C-means clustering and weight calculation method. First, an adaptive function of the number of clusters based on the JS divergence is constructed to optimize the fuzzy clustering algorithm, and the optimized algorithm is used to reorganize the texts. Then, a TF-IDF algorithm optimized with a word frequency factor is used to compute the feature weight of each sample after clustering, and classification is carried out by combining the sample weights with the Bayes formula. Finally, to demonstrate the effectiveness and superiority of the improved naive Bayes algorithm, it is compared experimentally with the original naive Bayes algorithm and with other improved algorithms. The results show that the improved algorithm effectively reduces the naive Bayes model's dependence on the independence of feature terms, improves the accuracy of classification decisions, and offers certain advantages in classification performance and efficiency.
Keywords: naive Bayes; text classification; fuzzy clustering; feature weight; independence assumption
CLC number: TP391    Document code: A    DOI: 10.3969/j.issn.1007-791X.2023.01.009
0 Introduction
With the rapid development of Internet technology and the arrival of the big data era, the volume of text information has grown explosively, and retrieving information and classifying data quickly and accurately has become an important problem. Text classification is one of the core tasks of data mining: a classifier assigns each item in a data set to a fixed category. The basic steps are text preprocessing, indexing and word frequency statistics, feature extraction, classifier construction, and evaluation of the classification results, as sketched below. Many text classification algorithms exist, such as support vector machines [1], decision trees [2], K-nearest neighbors [3], neural networks [4], and Bayesian classifiers [5].
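For concreteness, the following is a minimal sketch of such a pipeline in Python using scikit-learn; the toy corpus, labels, and variable names are hypothetical illustrations, not the implementation used in this paper.

# Minimal text-classification pipeline sketch: tokenization, TF-IDF
# features, classifier construction, and prediction on a new document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["stock market rises again",            # hypothetical toy corpus
        "team wins the final match",
        "new vaccine enters clinical trial"]
labels = ["finance", "sports", "science"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["the match ended with a win"]))   # expected: ['sports']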
The naive Bayes algorithm first computes the prior probability of each class and then uses the Bayes formula to compute the posterior probability that a sample belongs to each class. Because it is efficient and stable, it is widely used in data analysis: Zhang et al. [6] studied spam filtering with a Bayesian algorithm; Yang et al. [7] used a Bayesian algorithm for automatic classification of library bibliographies; Ding et al. [8] applied naive Bayes to facial expression recognition. Although naive Bayes has broad applications, its use presupposes that all attributes are mutually independent. Since the feature vectors of the words in a text do not all satisfy this independence condition, applying naive Bayes directly may reduce classification accuracy. Many researchers have proposed improvements for this problem. Kononenko [9] proposed the semi-naive Bayes model, which considers dependencies among some attributes, typically assuming that each attribute depends on at most one other attribute. Hall [10] combined naive Bayes with decision trees, building a decision tree to estimate attribute weights. Webb et al. [11] weakened the attribute independence assumption by averaging over all constrained classifiers. Zadrozny et al. [12] proposed methods for obtaining calibrated probability estimates from decision trees and naive Bayes classifiers. Qiu et al. [13] proposed a Chinese text classification method based on an extended naive Bayes model with weighted positive features. Qin et al. [14] proposed an algorithm that uses density-function likelihood ratios to add separability information to feature words. Li [15] constructed a measure of inter-attribute correlation and optimized the naive Bayes classification model through improved attribute weighting. Huang et al. [16] improved naive Bayes using the cosine similarity between word vectors. These methods weaken the impact of the independence assumption to some extent, but they ignore the grammar and word order of the text and are relatively complicated to apply, which makes them time-consuming.
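To make the independence assumption concrete: in the multinomial model, a document d with words w_1, ..., w_n is scored per class c as log P(c|d) ∝ log P(c) + Σ_i log P(w_i|c), i.e., the per-word likelihoods are simply multiplied as if the words occurred independently given the class. Below is a small from-scratch sketch with Laplace smoothing; the toy corpus is hypothetical.

import math
from collections import Counter

# From-scratch multinomial naive Bayes sketch. The independence assumption
# is visible as a plain sum of per-word log-likelihoods.
def train(docs, labels):
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc.split())
    vocab = {w for cnt in counts.values() for w in cnt}
    return prior, counts, vocab

def predict(doc, prior, counts, vocab):
    scores = {}
    for c in prior:
        total = sum(counts[c].values())
        score = math.log(prior[c])
        for w in doc.split():
            if w in vocab:   # words unseen in training are ignored
                score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

model = train(["cheap pills online", "meeting at noon"], ["spam", "ham"])
print(predict("cheap pills now", *model))   # expected: spam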
To address these shortcomings, this paper improves the naive Bayes algorithm using an optimized fuzzy C-means clustering method [17] and an improved weight calculation. First, an adaptive function of the number of clusters is constructed from the JS divergence to optimize the fuzzy clustering algorithm, and the improved algorithm is used to reorganize the texts. Then, a TF-IDF algorithm optimized with a word frequency factor is used to compute the weights of the samples within each cluster; these weights, together with position weights, are substituted into the Bayes formula for the final classification. Comparative experiments with the traditional naive Bayes algorithm and with other improved algorithms show that the proposed algorithm improves both the accuracy and the efficiency of classification decisions, reduces the naive Bayes model's dependence on the feature independence assumption, and offers certain advantages over the other improved algorithms.
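The details of these steps are developed in the sections that follow. As a rough illustration of the clustering step only, the sketch below pairs a standard fuzzy C-means implementation with a JS-divergence score over candidate cluster counts; the selection criterion shown (mean pairwise JS divergence between normalized centroids) is an assumed stand-in for this paper's adaptive function, and all function names are hypothetical.

import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions.
    p = np.clip(p, eps, None); p = p / p.sum()
    q = np.clip(q, eps, None); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fcm(X, c, m=2.0, iters=100, tol=1e-5, seed=0):
    # Standard fuzzy C-means; returns (centroids, membership matrix U).
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)       # memberships sum to 1 per sample
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = d ** (-2.0 / (m - 1))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

def choose_cluster_count(X, candidates=range(2, 7)):
    # Assumed criterion: pick the c whose centroids (viewed as term
    # distributions) are most mutually separated under mean pairwise JSD.
    best_c, best_score = None, -np.inf
    for c in candidates:
        centers, _ = fcm(X, c)
        P = np.clip(centers, 1e-12, None)
        P /= P.sum(axis=1, keepdims=True)
        score = np.mean([js_divergence(P[i], P[j])
                         for i in range(c) for j in range(i + 1, c)])
        if score > best_score:
            best_c, best_score = c, score
    return best_c

With nonnegative term-frequency vectors as input, choose_cluster_count(X) returns the candidate count scoring highest under this stand-in criterion.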
As Table 1 shows, the improved naive Bayes algorithm improves on the traditional algorithm in every category except science, where precision and F1 are slightly lower. On average, the P, R, and F1 of the improved algorithm are 3.07%, 2.84%, and 2.91% higher than those of the traditional algorithm, and the classification time is reduced by 1 min 3 s. The experimental results show that the improved naive Bayes algorithm effectively improves the performance and efficiency of text classification and reduces the impact of the independence assumption.
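For reference, macro-averaged P, R, and F1 figures of this kind are computed per category and then averaged; the following is a sketch with hypothetical labels, using scikit-learn rather than this paper's evaluation code.

from sklearn.metrics import precision_recall_fscore_support

y_true = ["finance", "sports", "science", "sports", "finance", "science"]
y_pred = ["finance", "sports", "science", "finance", "finance", "science"]

# Macro averaging: per-category P/R/F1, then the unweighted mean.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")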
To further demonstrate the superiority of the proposed algorithm for text classification, comparative experiments were conducted against the improved algorithm combining naive Bayes and decision trees [11], the improved algorithm based on attribute weighting [15], and the improved algorithm based on word-vector cosine similarity [16]; the results are shown in Table 2.
Comparing the experimental data of the four algorithms shows that the proposed model attains the highest mean F1 score, i.e., the best classification performance, and the shortest average running time, i.e., the highest classification efficiency. Comparing the 12 P and R indicators (six categories, two indicators each) of the four models against the original model, every model except that of reference [15] has some indicators below the original model. The proposed model has two indicators slightly below the original model, so its stability is second only to the model of reference [15]. The experimental results show that the proposed model has certain advantages in classification performance and efficiency but insufficient stability; future research will address the stability of the model.
4 Conclusion
To address the independence assumption of the naive Bayes algorithm and its improper allocation of feature weights, this paper reorganizes the data set with fuzzy clustering before classification and computes the weights of the different categories afterwards, yielding an improved naive Bayes algorithm. In the comparative experiments, the mean P, R, and F1 of the improved algorithm are about 3% higher than those of the original algorithm and about 1%-2% higher than those of the other three improved algorithms, with the shortest average running time. The results show that the proposed algorithm effectively reduces the naive Bayes model's dependence on feature independence, improves the accuracy and efficiency of classification decisions, and offers certain advantages over the other improved algorithms. In future work, relevant English texts will be added to the present text categories, enlarging the data set to strengthen the stability of the model while enabling multilingual text classification.
References
[1] SHEVADE S K, KEERTHI S S, BHATTACHARYYA C, et al. Improvements to the SMO algorithm for SVM regression[J]. IEEE Transactions on Neural Networks, 2000, 11(5): 1188-1193.
[2] WANG L M, LI X L, CAO C H, et al. Combining decision tree and naive Bayes for classification[J]. Knowledge-Based Systems, 2006, 19(7): 511-515.
[3] CHEN Q, LI D, TANG C K. KNN matting[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(9): 2175-2188.
[4] LI Y M, WEI B G, LIU Y H, et al. Incorporating knowledge into neural network for text representation[J]. Expert Systems with Applications, 2018, 96: 103-114.
[5] ZHENG G, TIAN Y. Chinese web text classification system model based on naive Bayes[C]//International Conference on E-product and E-entertainment, Zhengzhou, China, 2010: 1-4.
[6] ZHANG F Z, WU Z F, YAO F. Research and improvement of spam filtering technology based on Bayesian algorithm[J]. Journal of Yanshan University, 2009, 33(1): 47-52. (in Chinese)
[7] YANG X H, GAO H Y. Bibliographic automatic classification algorithm based on improved Bayes[J]. Computer Science, 2018, 45(8): 203-207. (in Chinese)
[8] DING T X, YU S P. Facial expression recognition based on improved naive Bayes algorithm[J]. Software Guide, 2021, 20(1): 68-71. (in Chinese)
[9] KONONENKO I. Semi-naive Bayesian classifier[C]//Proceedings of the European Working Session on Learning. Berlin: Springer, 1991: 206-219.
[10] HALL M. A decision tree-based attribute weighting filter for naive Bayes[J]. Knowledge-Based Systems, 2007, 20(2): 120-126.
[11] WEBB G I, BOUGHTON J R, WANG Z. Not so naive Bayes: aggregating one-dependence estimators[J]. Machine Learning, 2005, 58(1): 5-24.
[12] ZADROZNY B, ELKAN C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers[C]//International Conference on Machine Learning, Williamstown, USA, 2001: 609-616.
[13] QIU Y Y, YANG G M, TAN Z H. Chinese text classification based on extended naive Bayes model with weighted positive features[C]//IEEE International Conference on Software Engineering and Service Science, Beijing, China, 2010: 243-246.
[14] QIN B, ZHENG S F, LIU T, et al. Chinese webpage classification based on improved Bayesian model[C]//Proceedings of the Sixth Joint Conference on Computational Linguistics. Beijing: Tsinghua University Press, 2001: 373-378. (in Chinese)
[15] LI F. Naive Bayes classification model based on improved attribute weighting[J]. Computer Engineering and Applications, 2006, 46(4): 132-133. (in Chinese)
[16] HUANG Y, LUO W H, ZHANG R S. Application of improved naive Bayes algorithm in text categorization[J]. Technology Innovation and Application, 2019(5): 24-27. (in Chinese)
[17] PAL S K, KING R A, HASHIM A A. Automatic gray level thresholding through index of fuzziness and entropy[J]. Pattern Recognition Letters, 1983, 1(3): 141-146.
[18] BEZDEK J C, HALL L O, CLARKE L P. Review of MR image segmentation techniques using pattern recognition[J]. Medical Physics, 1993, 20(4): 1033-1048.
[19] CHEN J N, MATZINGER H, ZHAI H Y, et al. Centroid estimation based on symmetric KL divergence for multinomial text classification problem[C]//Proceedings of the IEEE International Conference on Machine Learning and Applications, Orlando, USA, 2018: 1174-1177.
[20] WANG C W, HOU F, SHEN S, et al. Design and implementation of PDF document management system based on text information[J]. Journal of Yanshan University, 2020, 44(6): 603-608. (in Chinese)
[21] HE X J. Improvement and experimental research on TF-IDF algorithm[D]. Changchun: Jilin University, 2017: 14-24. (in Chinese)