李建春 李智 萬里 李健
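Abstract: Missing data is a common and unavoidable problem in clinical trials. Albumin values may be missing because medical equipment is lacking or because patients skip the albumin test. With the widespread adoption of machine learning, many researchers have applied it to the estimation of missing data. This paper proposes an algorithm that combines random forests with clustering, the double random forest regression method, and applies it to estimating missing albumin values. In accuracy and robustness, double random forest regression improves to varying degrees on the nearest-neighbor, decision tree, and random forest methods. The algorithm offers a new approach to handling missing values effectively and can serve as a reference for other research on missing-value estimation.
Keywords: hemodialysis; albumin; random forest; missing value; missing data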
DOI:10.11907/rjdk.173135
CLC number: TP319
Document code: A  Article ID: 1672-7800(2018)005-0124-03
Abstract: Missing data is a common problem in clinical trials. Albumin (ALB) is an important indicator because it is associated with prognosis and mortality in patients with renal failure. However, albumin values may be missing because medical equipment is lacking or because patients skip the albumin test. With the widespread adoption of machine learning, many researchers have applied it to the estimation of missing data in order to improve dataset quality, and their work has produced good results. This paper proposes a method that combines random forests with clustering, namely Random Forest Regression-Kmeans-Random Forest Regression (RKR), and applies it to estimating missing albumin values. The algorithm exploits the strength of random forests in predicting nonlinear data sets and proceeds in three stages. First, random forest regression imputes the missing albumin values. Second, the Kmeans method clusters the data set into six classes. Third, random forest regression is applied again to re-impute the missing albumin values. In terms of accuracy and robustness, the method performs better than nearest-neighbor regression, decision tree regression, and random forest regression. The algorithm offers a new approach to handling missing values efficiently and can serve as a reference for other researchers who study missing-value estimation.
Key Words: hemodialysis; albumin; random forest; missing value; missing data
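0 Introduction
Missing data is a common and unavoidable problem in clinical trials. Albumin (ALB) is a very important indicator for patients with renal failure and is associated with their prognosis and mortality [1-4]. However, ALB values may be missing because medical equipment is lacking or because patients skip the albumin test. With the widespread adoption of machine learning, many researchers have applied it to missing-data estimation, using methods such as multiple linear regression, K-Nearest Neighbor (KNN), Bayesian Principal Component Analysis (BPCA) [11], and Decision Tree (DT) [5-8]. These methods, however, do not fully exploit the particular characteristics of patient examination data, and their estimation accuracy is limited [10-12]. Random Forest (RF) [9] builds on the DT algorithm; its advantage is that it mitigates the overfitting problem of DT, which makes it a feasible tool for handling missing data. It nevertheless has two problems: ① RF regression takes the average of the individual trees' outputs as the final prediction, which introduces some error; ② many researchers, when estimating missing values, only predict the missing values and do not consider the influence of the characteristics of the missing values, which introduces further error [14-15].
To address these problems, this paper proposes a missing-value estimation method that combines random forests with K-means clustering, the double random forest regression method (Random Forest Regression-Kmeans-Random Forest Regression, RKR), and uses the Normalized Mean Square Error (NMSE) [13] and the Normalized Root Mean Square Deviation (NRMSD) [6] to measure the accuracy and stability of the algorithm.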
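The paper does not spell out the formulas for the two metrics. As a minimal sketch, assuming one common convention (NMSE as the mean squared error divided by the variance of the true values, NRMSD as the root mean squared error divided by the range of the true values), they could be computed as follows. The function names `nmse` and `nrmsd` and the sample values are illustrative, not taken from the paper.

```python
import math


def nmse(actual, predicted):
    """Mean squared error divided by the variance of the true values (assumed convention)."""
    n = len(actual)
    mean = sum(actual) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    variance = sum((a - mean) ** 2 for a in actual) / n
    return mse / variance  # undefined when all true values are identical


def nrmsd(actual, predicted):
    """Root mean squared error divided by the range of the true values (assumed convention)."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return rmse / (max(actual) - min(actual))  # undefined when the range is zero


# Illustrative values only.
true_alb = [30.0, 35.0, 40.0, 45.0]
estimated_alb = [32.0, 34.0, 41.0, 43.0]
print(nmse(true_alb, estimated_alb), nrmsd(true_alb, estimated_alb))
```

Under either convention, lower is better and a perfect estimate scores 0; other normalizations (for example, dividing by the mean) are also in use, so reported numbers are only comparable when the same definition is applied.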
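1 Basic Principles and Methods
1.1 The Double Random Forest (RKR) Method
Double random forest (RKR) fuses random forests with K-means clustering. Random Forest Regression (RFR) first produces an initial estimate of each missing value and fills it in; K-means clustering is then run on the completed data. In our experiments, six clusters gave the best results. Once the six sub-samples are obtained, RFR is run again within each sub-sample that contains missing values to re-estimate them. The experimental results show that the algorithm effectively improves the accuracy of missing-value estimation.
The procedure is as follows: ① Obtain the complete data set DataSet0 and randomly select a specified proportion of records to form the training set DataSetTrain; clear the values of the target indicator in the remaining records to form the test set DataSetTest. ② Train a Random Forest (RF) on DataSetTrain and estimate the missing values in DataSetTest, yielding a new data set DataSetTest1. Merge DataSetTest1 with DataSetTrain into a new data set DataSet1, and use K-means clustering to partition DataSet1 into six clusters: DataCluster0, DataCluster1, DataCluster2, DataCluster3, DataCluster4, and DataCluster5. ③ In DataCluster0, clear the target-indicator values of the records that also appear in DataSetTest1; the records in DataCluster0 whose target indicator is not empty form DataClusterTrain0, and the remaining records form DataClusterTest0. ④ Train an RF on DataClusterTrain0, predict the missing target-indicator values in DataClusterTest0, and store the predictions in the data set DataSetPredicted. ⑤ Repeat steps ③ and ④ for DataCluster1 through DataCluster5.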
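Steps ② to ⑤ can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes scikit-learn and NumPy, and the function name `rkr_impute`, the hyperparameters (100 trees, `n_init=10`), clustering on unscaled features, and the fallback to the first-pass estimate when a cluster has no observed records are all assumptions that the paper does not specify.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def rkr_impute(X, y, missing, n_clusters=6, seed=0):
    """Estimate y[missing] with RFR -> K-means -> per-cluster RFR.

    X: (n, d) complete feature matrix.
    y: (n,) target; entries where `missing` is True are ignored (may be NaN).
    missing: boolean mask of records whose target must be estimated.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.asarray(missing, dtype=bool)
    observed = ~missing
    if not missing.any():
        return np.empty(0)

    # Step 2: first-pass forest trained on observed records fills the gaps.
    first = RandomForestRegressor(n_estimators=100, random_state=seed)
    first.fit(X[observed], y[observed])
    filled = y.copy()
    filled[missing] = first.predict(X[missing])

    # Step 2 (continued): cluster the merged, completed data set.
    merged = np.column_stack([X, filled])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(merged)

    # Steps 3-5: inside each cluster, retrain on the observed records
    # and re-predict the records whose target was originally missing.
    result = filled.copy()
    for c in range(n_clusters):
        train = (labels == c) & observed
        test = (labels == c) & missing
        if test.any() and train.any():
            second = RandomForestRegressor(n_estimators=100, random_state=seed)
            second.fit(X[train], y[train])
            result[test] = second.predict(X[test])
        # Assumption: otherwise the first-pass estimate is kept.
    return result[missing]
```

Called as `rkr_impute(X, y, missing)`, the function returns one estimate per masked record, in the order the records appear in `X`. Because the first-pass estimates feed into the clustering, the cluster assignment of a masked record depends on how good that first estimate is.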
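2 Experimental Results and Analysis
Five experiments were run in total, comparing four algorithms: K-nearest-neighbor regression (KNeighbors Regressor, KNR), decision tree regression (DecisionTree Regressor, DTR), random forest regression (Random Forest Regressor, RFR), and the double random forest regression proposed in this paper (Random Forest Regressor-Kmeans-Random Forest Regressor, RKR). Each algorithm estimated the missing values with the test set making up 1%, 5%, 10%, 15%, and 20% of the data, and the Normalized Mean Square Error (NMSE) and the Normalized Root Mean Square Deviation (NRMSD) were used to measure accuracy and stability.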
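A minimal sketch of this comparison protocol for the three baseline regressors is given below. It assumes scikit-learn, which the paper does not name as its software, and default hyperparameters; the function name `compare_baselines` and the choice to normalize the squared error by the variance of the full target vector are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

FRACTIONS = (0.01, 0.05, 0.10, 0.15, 0.20)  # test-set proportions from the text


def compare_baselines(X, y, fractions=FRACTIONS, seed=0):
    """Return {(model name, test fraction): normalized MSE} for KNR, DTR and RFR."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    results = {}
    for frac in fractions:
        # Random split: the held-out records play the role of the missing values.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=frac, random_state=seed)
        models = {
            "KNR": KNeighborsRegressor(),
            "DTR": DecisionTreeRegressor(random_state=seed),
            "RFR": RandomForestRegressor(n_estimators=100, random_state=seed),
        }
        for name, model in models.items():
            pred = model.fit(X_tr, y_tr).predict(X_te)
            mse = np.mean((y_te - pred) ** 2)
            # Assumption: normalize by the variance of all true values.
            results[(name, frac)] = float(mse / np.var(y))
    return results
```

The result is a dictionary keyed by (model, fraction), which maps directly onto one curve per model in a plot of error against test-set proportion. Repeating the call with different `seed` values would give the repeated trials needed to judge stability.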
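2.1 Experimental Data
The experimental data come from the nephrology department of the General Hospital of Chengdu Military Region and cover January 2013 to November 2015. After preprocessing, laboratory data for 511 dialysis patients were selected, comprising: albumin (ALB), blood urea nitrogen (Bun), sex (SEX), age (AGE), height (HEIGHT), weight (WEIGHT), body mass index (BMI), diastolic blood pressure (DBP), systolic blood pressure (SBP), calcium (CA), phosphorus (P), potassium (K), parathyroid hormone (PTH), alkaline phosphatase (AP), sodium (NA), and serum creatinine (SCR). These 16 items serve as features and are the indicators that dialysis patients should monitor most closely. Albumin (ALB) is the indicator to be estimated (the dependent variable), and the other indicators are the independent variables. The original data are randomly split into a training set and a test set; the regression models are fitted on the training set and then applied to the test set to obtain the estimates.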
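The random split with a blanked target can be sketched with the standard library alone. This is an illustration under assumptions: records are represented as dictionaries keyed by the indicator abbreviations listed above, and the function name `mask_target` and the rounding rule for the test-set size are hypothetical.

```python
import random


def mask_target(records, target="ALB", test_fraction=0.10, seed=0):
    """Split records at random and blank the target in the test part.

    Returns (train, test, truth): `test` holds copies with the target set to
    None, and `truth` holds the withheld values in the same order, so that
    estimates can be scored afterwards.
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(records) * test_fraction))
    test_idx = set(rng.sample(range(len(records)), n_test))

    train, test, truth = [], [], []
    for i, rec in enumerate(records):
        if i in test_idx:
            truth.append(rec[target])
            blanked = dict(rec)     # copy, so the input is not modified
            blanked[target] = None  # simulate the missing value
            test.append(blanked)
        else:
            train.append(dict(rec))
    return train, test, truth


# Illustrative records only; real records would carry all 16 indicators.
demo = [{"ALB": 30.0 + i, "AGE": 40 + i} for i in range(100)]
train, test, truth = mask_target(demo, test_fraction=0.20)
print(len(train), len(test))  # 80 20
```

With 511 records, a 1% test fraction under this rounding rule withholds 5 records, which is worth bearing in mind when reading error figures at the smallest proportion.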
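2.2 Experimental Results
Figures 1 and 2 compare the four algorithms under the two evaluation metrics.
Figure 1 shows that, with NMSE as the metric, decision tree regression (DTR) gives the worst predictions and double random forest (RKR) the best at every test-set proportion. When the test-set proportion is below 10%, K-nearest-neighbor regression (KNR), random forest (RFR), and double random forest all perform well; above 10%, KNR predicts worse than random forest and double random forest.
Figure 2 shows that, with NRMSD as the metric, decision tree regression (DTR) again gives the worst predictions and double random forest (RKR) the best at every test-set proportion. When the test-set proportion is below 5%, K-nearest-neighbor regression (KNR), random forest (RFR), and double random forest all perform well; above 5%, KNR predicts worse than random forest and double random forest.
In summary, the experimental comparison with the K-nearest-neighbor, decision tree, and random forest methods shows that the double random forest algorithm fills in missing albumin (ALB) values for dialysis patients relatively accurately and with high stability.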
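3 Conclusion
To address missing data in clinical trials, this paper proposes an algorithm that combines random forests with clustering, the double random forest regression method, and applies it to estimating missing albumin values. Compared with the nearest-neighbor, decision tree, and random forest methods, double random forest regression improves accuracy and robustness to varying degrees. The algorithm offers a new approach to handling missing values effectively and can serve as a reference for other research on missing-value estimation.
References: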
[1] PAN Shaokang, LIU Dongwei, LIU Zhangsuo. Effect of different dialysis modalities on the prognosis of acute kidney injury[J]. 實用醫院臨床雜志,2017(2):16-19. (in Chinese)
[2] MA L, ZHAO S. Risk factors for mortality in patients undergoing hemodialysis: a systematic review and meta-analysis[J]. International Journal of Cardiology,2017.
[3] ERIGUCHI R, OBI Y, STREJA E, et al. Longitudinal associations among renal urea clearance–corrected normalized protein catabolic rate, serum albumin, and mortality in patients on hemodialysis[J]. Clinical Journal of the American Society of Nephrology,2017.
[4] FAN H, YANG J, LIU L, et al. Effect of serum albumin on the prognosis of elderly patients with stage 3-4 chronic kidney disease[J]. International Urology & Nephrology,2017.
[5] LUO S, LAWSON A B, HE B, et al. Bayesian multiple imputation for missing multivariate longitudinal data from a Parkinson's disease clinical trial[J]. Statistical Methods in Medical Research,2012.
[6] WANG X, JIANG Z, FENG H. Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme[J]. BMC Bioinformatics,2006,7(1):1-10.
[7] SHAH A D, BARTLETT J W, CARPENTER J, et al. Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study[J]. American Journal of Epidemiology,2014,179(6):764.
[8] BABU G A, SUMANA G, RAJASEKHAR M. Computer-aided diagnosis of polycystic kidney disease using ANN[J]. World Academy of Science, Engineering and Technology, International Journal of Medical, Health, Biomedical, Bioengineering and Pharmaceutical Engineering,2013,7(12):933-937.
[9] ZHANG H, WU P, YIN A, et al. Prediction of soil organic carbon in an intensively managed reclamation zone of eastern China: a comparison of multiple linear regressions and the random forest model[J]. Science of the Total Environment,2017,592:704-713.
[10] TROYANSKAYA O, CANTOR M, SHERLOCK G, et al. Missing value estimation methods for DNA microarrays[J]. Bioinformatics,2001,17(6):520.
[11] OBA S, SATO M A, TAKEMASA I, et al. A Bayesian missing value estimation method for gene expression profile data[J]. Bioinformatics,2003,19(16):2088-2096.
[12] KIM H, GOLUB G H. Missing value estimation for DNA microarray gene expression data: local least squares imputation[J]. Bioinformatics,2005,21(2):187-198.
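[13] LI Ruihong, LI Zhi, TONG Ling. Application of an ant-colony path-optimized decision tree to the staging diagnosis of chronic kidney disease[J]. 軟件導刊,2017,16(2):135-138. (in Chinese)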
[14] ZHANG S, WU X, ZHU M. Efficient missing data imputation for supervised learning[M]. 2010.
[15] LI H, ZHAO C, SHAO F, et al. A hybrid imputation approach for microarray missing value estimation[J]. BMC Genomics,2015,16(S9):S1.
(Responsible editor: 黃)