莫贊 蓋彥蓉 樊冠龍
CLC number: TP391
Document code: A
Abstract: Concerning the poor performance of traditional single classifiers on imbalanced data, a new binary-class imbalanced data classification algorithm was proposed based on Generative Adversarial Nets (GAN) and ensemble learning, namely Generative Adversarial Nets-Adaptive Boosting-Decision Tree (GAN-AdaBoost-DT). Firstly, GAN training was adopted to obtain a generative model, which produced minority-class samples to reduce the imbalance ratio. Then, the minority-class samples were brought into the Adaptive Boosting (AdaBoost) learning framework and their weights were adjusted to improve the AdaBoost model and the classification performance of AdaBoost with Decision Tree (DT) as the base classifier. The Area Under the receiver operating characteristic Curve (AUC) was used to evaluate classifier performance on imbalanced classification problems. The experimental results on a credit card fraud data set illustrate that, compared with the synthetic minority over-sampling ensemble learning method, the accuracy of the proposed algorithm was increased by 4.5% and its AUC was improved by 6.5%; compared with the modified synthetic minority over-sampling ensemble learning method, the accuracy was increased by 4.9% and the AUC was improved by 5.9%; compared with the random under-sampling ensemble learning method, the accuracy was increased by 4.5% and the AUC was improved by 5.4%. The experimental results on other data sets from UCI and KEEL illustrate that the proposed algorithm can improve the accuracy of imbalanced classification and the overall classifier performance.
Key words: Generative Adversarial Nets (GAN); ensemble learning; imbalanced classification; binary-class classification; Adaptive Boosting (AdaBoost); Decision Tree (DT); credit card fraud
0 Introduction
Imbalanced data refers to data sets in which the number of samples of one or more classes is far larger than that of the others; the classes with more samples are usually called the majority classes, and those with fewer samples the minority classes [1]. In imbalanced data sets, recognizing the minority class is particularly important. In fault diagnosis [2], for example, machine faults belong to the minority class; if a fault is diagnosed as normal, engineering delays and unnecessary losses follow. Owing to the complex characteristics of imbalanced data sets, traditional classification algorithms learn fewer, and less effective, rules for predicting the minority class than for the majority class [3]; this is the imbalanced classification problem. It has become one of the challenges in data mining [4] and is now widespread in fields such as bank credit rating [5], anomaly detection [6], face recognition [7], medical diagnosis [8], and e-mail classification [9].
The credit card fraud detection problem studied in this paper is also an imbalanced classification problem. In credit card fraud detection, a bank predicts, from feature variables related to a customer's credit status, whether a payment record is a fraudulent transaction. Although fraudulent transactions form the minority class, the financial loss caused by misclassifying a single fraudulent transaction cannot be recovered by correctly classifying hundreds of normal ones. To avoid losses caused by credit risk, recognizing fraudulent transaction records is therefore especially important.
Current methods for handling the imbalance problem fall into two categories. A common approach operates at the data level, using under-sampling or over-sampling to redistribute the class distribution; examples are the Synthetic Minority Over-sampling Technique (SMOTE) proposed in [10] and the Adaptive Synthetic Sampling Approach (ADASYN) proposed in [11]. Under-sampling can improve a model's classification performance on minority-class samples, but it discards information from the majority class, so the model cannot make full use of the available data. Traditional over-sampling can generate minority-class data, but because the new samples are derived only from the information already contained in the minority class, they lack diversity and can cause overfitting to some degree.
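As a point of comparison for the GAN-based generation used later, the interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is an illustrative simplification of the technique in [10], not the reference implementation; the function name and the brute-force neighbour search are choices made here for brevity.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours (simplified SMOTE-style sketch)."""
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    # brute-force pairwise distances within the minority class
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)          # exclude each point itself
    k = min(k, n - 1)
    neigh = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per point
    synth = np.empty((n_new, d))
    for i in range(n_new):
        a = rng.integers(n)                  # random minority sample
        b = neigh[a, rng.integers(k)]        # one of its neighbours
        lam = rng.random()                   # interpolation coefficient in [0, 1)
        synth[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synth
```

Because every synthetic point is a convex combination of two existing minority points, the generated data never leaves the region spanned by the original minority samples, which is exactly the diversity limitation discussed above.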
The other category operates at the algorithm level and includes ensemble learning and cost-sensitive learning. Ensemble learning combines multiple classifiers to avoid the bias that a single classifier incurs when predicting on imbalanced data [12]; examples are SMOTEBoost [13], which applies SMOTE in every iteration of the Adaptive Boosting (AdaBoost) model, and RUSBoost [14], which applies Random Under-Sampling (RUS) in every AdaBoost iteration. Cost-sensitive learning assigns a higher cost to misclassified minority-class samples during the iterations of an algorithm [15] and is usually combined with ensemble learning. Cost-sensitive methods modify only the algorithm itself, add no extra overhead, are efficient, and can effectively improve classification on imbalanced data; however, because the cost-sensitive loss is introduced subjectively, the design of the loss function affects the iterative behaviour, and such methods generally transfer poorly across data sets [16].
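The AdaBoost reweighting that these boosting variants build on can be written out directly. The sketch below shows one round of the standard AdaBoost update for labels in {-1, +1} (weighted error, vote weight alpha, exponential reweighting); it is a generic illustration of the algorithm in [12], not the modified updates used by SMOTEBoost or RUSBoost.

```python
import numpy as np

def adaboost_round(y_true, y_pred, w):
    """One AdaBoost reweighting step for labels in {-1, +1}:
    compute the weak learner's weighted error, its vote weight alpha,
    and the renormalised sample weights."""
    eps = np.sum(w * (y_pred != y_true))            # weighted error
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
    # misclassified samples (y * pred = -1) get their weight increased
    w_new = w * np.exp(-alpha * y_true * y_pred)
    return alpha, w_new / w_new.sum()
```

With four uniformly weighted samples and one mistake, the weighted error is 0.25, so alpha = 0.5·ln(3) and the misclassified sample ends up holding half of the total weight, which is what forces the next weak learner to focus on it.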
This paper therefore generates minority-class samples at the data level to balance the data and thereby improve the performance of traditional classification algorithms. Generative Adversarial Nets (GAN) [17], a generative model proposed in 2014, can, unlike traditional generative models, produce synthetic data that approximates real data without being constructed directly from it, which extends data diversity and avoids overfitting.
Since a single method can hardly meet the requirements of different imbalanced data sets and generalises poorly, while a combined model can exploit the strengths of each individual model and improve overall prediction, this paper proposes the Generative Adversarial Nets-Adaptive Boosting-Decision Tree (GAN-AdaBoost-DT) algorithm for imbalanced binary classification. The algorithm first uses GAN to generate minority-class samples so that the data becomes balanced, and then applies the AdaBoost ensemble learning framework with Decision Tree (DT) as the base classifier, using the ensemble idea to improve the classification ability of DT on imbalanced data sets. The Area Under the receiver operating characteristic Curve (AUC) is used as the criterion for evaluating classifier performance.
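The AUC used as the evaluation criterion has a convenient rank interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is why it is robust to class imbalance. A minimal NumPy sketch (an O(P×N) pairwise formulation chosen here for clarity, not efficiency):

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive is scored above a
    random negative (rank formulation; ties count as 1/2)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # compare every positive score against every negative score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A classifier that scores all samples identically gets AUC 0.5 regardless of the imbalance ratio, whereas overall accuracy on a 99:1 data set can reach 0.99 by always predicting the majority class; this is the reason AUC, rather than accuracy alone, is used to assess imbalanced classification.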
1 相關(guān)工作
1.1 GAN Algorithm
GAN is a generative model proposed in 2014 on the basis of zero-sum game theory. It consists of a neural-network-based generative model (G) and a discriminative model (D): the generator produces data from a noise space z, and the discriminator judges whether a sample is real or was produced by the generator. The process amounts to a two-player game: G is trained so that the data it generates approaches the distribution of the real data, while D is trained to distinguish real data from generated data. The two are optimised against each other iteratively, continually strengthening both, until the networks reach a dynamic equilibrium in which the discriminator assigns a probability close to 0.5 to generated data being real; at this point the generator's output approximates real data. The computation flow is shown in Fig. 1.
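The two objectives in this minimax game can be made concrete. The sketch below computes the discriminator's loss from the value function E[log D(x)] + E[log(1 − D(G(z)))] and the commonly used non-saturating generator loss −E[log D(G(z))], given the discriminator's outputs on real and generated batches; it is a generic illustration of the objectives in [17], not the training code used in this paper.

```python
import numpy as np

def gan_losses(d_real, d_fake):
    """GAN minimax objectives as minimisation losses.
    d_real: D's probabilities on a batch of real samples.
    d_fake: D's probabilities on a batch of generated samples G(z).
    The discriminator minimises -(E[log D(x)] + E[log(1 - D(G(z)))]);
    the generator minimises the non-saturating form -E[log D(G(z))]."""
    d_loss = -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
    g_loss = -np.mean(np.log(d_fake))
    return d_loss, g_loss
```

At the dynamic equilibrium described above, where D outputs 0.5 on every sample, the discriminator loss settles at 2·log 2 ≈ 1.386, a value often watched during training as a sign that D can no longer tell real from generated data.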
4 結(jié)語
To address the poor performance of traditional classification algorithms on imbalanced classification problems, this paper proposed the GAN-AdaBoost-DT algorithm for imbalanced binary classification. The algorithm improves AdaBoost with a generative adversarial network: in each AdaBoost iteration, GAN generates minority-class data to reduce the imbalance ratio and thereby improve the classification performance of AdaBoost-DT. Experimental results on a credit card fraud data set show that the method improves the recognition rate on imbalanced data and the overall classifier performance. Experimental results on five data sets from UCI and KEEL show that the method achieves a higher recognition rate and better classification performance than the compared algorithms.
References:
[1] SEARLE S R. Linear Models for Unbalanced Data [M]. New York: John Wiley & Sons, 1987: 145-153.
[2] YANG Z, TANG W H, SHINTEMIROV A, et al. Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers [J]. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications and Reviews, 2009, 39(6): 597-610.
[3] SUN Y, KAMEL M S, WONG A K C, et al. Cost-sensitive boosting for classification of imbalanced data [J]. Pattern Recognition, 2007, 40(12): 3358-3378.
[4] YANG Q, WU X. 10 challenging problems in data mining research [J]. International Journal of Information Technology & Decision Making, 2006, 5(4): 597-604.
[5] BROWN I, MUES C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets [J]. Expert Systems with Applications, 2012, 39(3): 3446-3453.
[6] TAVALLAEE M, STAKHANOVA N, GHORBANI A A. Toward credible evaluation of anomaly-based intrusion-detection methods [J]. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications and Reviews, 2010, 40(5): 516-524.
[7] LIU Y-H, CHEN Y-T. Total margin based adaptive fuzzy support vector machines for multiview face recognition [C]// Proceedings of the 2005 IEEE International Conference on Systems, Man and Cybernetics. Washington, DC: IEEE Computer Society, 2005, 2: 1704-1711.
[8] MAZUROWSKI M A, HABAS P A, ZURADA J M, et al. Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance [J]. Neural Networks, 2008, 21(2/3): 427-436.
[9] BERMEJO P, GAMEZ J A, PUERTA J M. Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets [J]. Expert Systems with Applications, 2011, 38(3): 2072-2080.
[10] CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: Synthetic Minority Over-Sampling Technique [J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.
[11] HE H, BAI Y, GARCIA E A, et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning [C]// Proceedings of the 2008 International Joint Conference on Neural Networks. Piscataway, NJ: IEEE, 2008: 1322-1328.
[12] FREUND Y, SCHAPIRE R E. Experiments with a new boosting algorithm [C]// Proceedings of the Thirteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 1996: 148-156.
[13] CHAWLA N V, LAZAREVIC A, HALL L O, et al. SMOTEBoost: improving prediction of the minority class in boosting [C]// Proceedings of the 2003 European Conference on Knowledge Discovery in Databases, LNCS 2838. Berlin: Springer, 2003: 107-119.
[14] SEIFFERT C, KHOSHGOFTAAR T M, van HULSE J, et al. RUSBoost: a hybrid approach to alleviating class imbalance [J]. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 2010, 40(1): 185-197.
[15] FAN W, STOLFO S J, ZHANG J, et al. AdaCost: misclassification cost-sensitive boosting [C]// Proceedings of the 16th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 1999: 97-105.
[16] CATENI S, COLLA V, VANNUCCI M. A method for resampling imbalanced datasets in binary classification tasks for real-world problems [J]. Neurocomputing, 2014, 135: 32-41.
[17] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// NIPS'14 Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014, 2: 2672-2680.
[18] GOODFELLOW I. NIPS 2016 tutorial: generative adversarial networks [EB/OL]. (2016-12-31) [2017-09-24]. https://arxiv.org/pdf/1701.00160.pdf.
[19] LI J, MONROE W, SHI T, et al. Adversarial learning for neural dialogue generation [EB/OL]. (2017-07-13) [2018-05-02]. https://arxiv.org/pdf/1701.06547v1.pdf.
[20] YU L, ZHANG W, WANG J, et al. SeqGAN: sequence generative adversarial nets with policy gradient [EB/OL]. (2017-08-25) [2018-05-02]. https://arxiv.org/pdf/1609.05473.pdf.
[21] HU W, TAN Y. Generating adversarial malware examples for black-box attacks based on GAN [EB/OL]. (2017-02-20) [2018-05-02]. https://arxiv.org/pdf/1702.05983v1.pdf.
[22] CHIDAMBARAM M, QI Y. Style transfer generative adversarial networks: learning to play chess differently [EB/OL]. (2017-05-07) [2018-07-02]. https://arxiv.org/pdf/1702.06762v1.pdf.
[23] FREUND Y, SCHAPIRE R E. A decision-theoretic generalization of on-line learning and an application to boosting [J]. Journal of Computer & System Sciences, 1997, 55(1): 119-139.
[24] HUNT E, KRIVANEK J. The effects of pentylenetetrazole and methylphenoxypropane on discrimination learning [J]. Psychopharmacology, 1966, 9(1): 1-16.
[25] BOSE I, FARQUAD M A H. Preprocessing unbalanced data using support vector machine [J]. Decision Support Systems, 2012, 53(1): 226-233.
[26] 張順,張化祥.用于多標(biāo)記學(xué)習(xí)的K近鄰改進算法[J].計算機應(yīng)用研究,2011,28(12):4445-4450. (ZHANG S, ZHANG H X. Modified KNN algorithm for multi-label learning [J]. Application Research of Computers, 2011, 28(12): 4445-4450.)
[27] 李詒靖,郭海湘,李亞楠,等.一種基于Boosting的集成學(xué)習(xí)算法在不均衡數(shù)據(jù)中的分類 [J].系統(tǒng)工程理論與實踐,2016,36(1):189-199. (LI Y J, GUO H X, LI Y N, et al. A boosting based on ensemble learning algorithm in imbalanced data classification [J]. Systems Engineering — Theory & Practice, 2016, 36(1): 189-199.)