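Weipeng Hu, Youping Li, Xiuqing Zhang
MHC-I epitope presentation prediction based on transfer learning
Weipeng Hu1,2,3, Youping Li2,3,4, Xiuqing Zhang2,3,4
1. School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006; 2. BGI-Shenzhen, Shenzhen 518083; 3. BGI-GenoImmune, Wuhan 430079; 4. BGI Education Center, University of Chinese Academy of Sciences, Shenzhen 518083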
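In neoantigen-based tumor immunotherapy, accurate prediction of antigen presentation is a key step in screening for T cell-specific epitopes. Epitope data identified by mass spectrometry (MS) are valuable for building antigen presentation prediction models. Although MS data have continued to accumulate in recent years, the number of peptides available for most human leukocyte antigen (HLA) alleles is still relatively small, which makes it impossible to build reliable prediction models for them. This study therefore applied transfer learning: a model was first trained on mixed-allele epitope data to capture features shared by presented epitopes, and allele-specific data were then used on top of this pre-trained model to build the antigen presentation prediction model Pluto. On the same validation sets, the average 0.1% positive predictive value (PPV) of Pluto was 0.078 higher than that of a model trained from scratch. In an independent evaluation on external MS data, Pluto achieved an average 0.1% PPV of 0.4255, higher than the model trained from scratch (0.3824) and other mainstream tools, including MixMHCpred (0.3369), NetMHCpan4.0-EL (0.4000), NetMHCpan4.0-BA (0.3188) and MHCflurry (0.3002). In an evaluation of immunogenicity prediction, Pluto also recovered more neoantigens than the other tools. Pluto is open source at https://github.com/weipenegHU/Pluto.
immunotherapy; neoantigen; antigen presentation; deep learning; transfer learning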
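Proteins in tumor cells that carry tumor-specific mutations can be digested into peptides of various lengths. Peptides containing a mutation can bind the major histocompatibility complex (MHC) in the endoplasmic reticulum to form peptide-MHC complexes, which are then presented on the cell surface. If a peptide-MHC complex is specifically recognized by T cells, it can trigger apoptosis of the tumor cell; such peptides are called neoantigens. In recent years, neoantigen-based tumor immunotherapy has achieved remarkable breakthroughs in the treatment of different cancer types[1~7], and neoantigens are also valuable for predicting treatment response and patient prognosis[8~12]. The mainstream approach to screening neoantigens is currently to use binding affinity prediction tools, such as the NetMHCpan series[13~15] and MHCflurry[16], to predict whether a peptide can bind human leukocyte antigen (HLA). However, most of the training data used by these tools come from in vitro experiments and do not truly reflect peptide-HLA binding inside cells. With advances in mass spectrometry (MS), researchers can directly obtain the peptides presented on the cell surface. Compared with affinity data from traditional in vitro experiments, MS data more faithfully reflect the natural path of a peptide from intracellular processing to presentation and therefore contain more information. As MS data have accumulated and their importance for predicting peptide immunogenicity has gained increasing attention[17,18], antigen presentation prediction models trained on MS data have emerged, such as MixMHCpred[19,20] and EDGE[17].
Although a considerable amount of MS-identified epitope data has accumulated, the data are unevenly distributed across HLA alleles: most alleles have only a few thousand peptides, and some only a few hundred. Under these conditions, reliable allele-specific antigen presentation models cannot be developed. Transfer learning may help improve this situation. Its basic principle is to transfer what was learned on a similar task to the task that ultimately needs to be solved; typically the former has abundant data and the latter only a little. To test this idea, this study first used mixed-allele MHC-I epitope data (that is, training data consisting of epitopes restricted by different MHC-I alleles) to train a model that distinguishes epitopes from ordinary protein-derived peptides. Mono-allelic epitope data covering 16 HLA alleles were then used, on top of the pre-trained model, to train the final allele-specific antigen presentation models, named Pluto. Pluto was then compared with models trained from scratch on the same validation sets, and with current mainstream software on an independent test set. Pluto may offer new ideas for related work and contribute to the field of immunotherapy.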
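The positive set for the pre-trained model came from the data generated by Pearson et al.[21] and Bassani-Sternberg et al.[22] and from the SysteMHC database of MS-identified peptides[23]. After these datasets were merged, peptides shorter than 8 or longer than 14 residues were removed and duplicates were removed by peptide and HLA allele, yielding nearly 160,000 peptides in total (Table 1). The negative set consisted of peptides cut at random from the human proteome (excluding any peptide present in the positive set). An equal number of negative peptides was selected and combined with the positive set to form the training set, and 5000 positive and 5000 negative peptides were then drawn from the training set to form the validation set.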
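The negative-set construction described above can be illustrated with a minimal sketch. It assumes that "random cutting" means choosing a random protein, a random length between 8 and 14 and a random start position; the function name `sample_negatives`, the retry limit and the toy sequences are hypothetical and are not taken from the Pluto repository.

```python
import random

def sample_negatives(proteome, positives, n, min_len=8, max_len=14, seed=0):
    """Draw n distinct random peptides from the proteome, skipping known positives."""
    rng = random.Random(seed)
    proteins = [p for p in proteome if len(p) >= min_len]
    negatives = set()
    for _ in range(1000 * n):  # bounded retries so a tiny proteome cannot loop forever
        if len(negatives) == n:
            break
        protein = rng.choice(proteins)
        length = rng.randint(min_len, min(max_len, len(protein)))
        start = rng.randint(0, len(protein) - length)
        peptide = protein[start:start + length]
        if peptide not in positives:  # exclude peptides seen in the positive set
            negatives.add(peptide)
    if len(negatives) < n:
        raise ValueError("could not draw enough distinct negative peptides")
    return sorted(negatives)

# Toy, made-up sequences standing in for the human proteome (assumption).
toy_proteome = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQ"]
toy_positives = {"AYIAKQRQI"}
negatives = sample_negatives(toy_proteome, toy_positives, n=5)
```

In a real run the number of negatives would equal the number of positive peptides, as stated in the text.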
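The positive training data for the antigen presentation models came from the 16 mono-allelic cell lines studied by Abelin et al.[24]: A01:01, A02:01, A02:03, A02:04, A02:07, A03:01, A24:02, A29:02, A31:01, A68:02, B35:01, B44:02, B44:03, B51:01, B54:01 and B57:01, comprising about 27,000 peptides in total (Table 1). A separate model was built for each of the 16 alleles. The data for each allele were split 8:2 into a positive training set and a positive validation set. Negative peptides were drawn from randomly cut protein-derived peptides: 100 times the number of positive training peptides were combined with the positive training set to form the training set, and 999 times the number of positive validation peptides were combined with the positive validation set to form the validation set.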
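The split and the negative-to-positive ratios can be written down as a short sketch. The helper names and the shuffling step are assumptions made for illustration. The last line shows why a 999:1 ratio is convenient: positives then make up exactly 0.1% of the validation set.

```python
import random

def split_positives(peptides, train_fraction=0.8, seed=0):
    """Shuffle the positive peptides and split them 8:2 into training and validation."""
    rng = random.Random(seed)
    shuffled = list(peptides)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def negatives_needed(n_train_pos, n_valid_pos, train_ratio=100, valid_ratio=999):
    """Number of negative peptides to draw for the training and validation sets."""
    return n_train_pos * train_ratio, n_valid_pos * valid_ratio

# Hypothetical allele with 1000 positive peptides (placeholder names, not real data).
toy_peptides = [f"PEP{i}" for i in range(1000)]
train_pos, valid_pos = split_positives(toy_peptides)
n_train_neg, n_valid_neg = negatives_needed(len(train_pos), len(valid_pos))
positive_share = len(valid_pos) / (len(valid_pos) + n_valid_neg)
```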
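The models in this study mainly make predictions for short peptides of 8 to 14 residues. Peptides used for both the pre-trained model and the antigen presentation models were therefore first padded with the wildcard 'X' to a uniform length of 14, and each peptide was then one-hot encoded as a 294-dimensional vector (14 × 21; counting the wildcard 'X', each residue is encoded as a 21-dimensional vector).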
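A minimal sketch of this encoding follows. The text does not say where the 'X' padding is placed or how the residues are ordered, so right-padding and the alphabet order used here are assumptions.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus the wildcard 'X' (order assumed)
MAX_LEN = 14

def encode_peptide(peptide):
    """Pad a peptide to 14 residues with 'X' and one-hot encode it (14 x 21 = 294 values)."""
    if not 8 <= len(peptide) <= MAX_LEN:
        raise ValueError("peptide length must be between 8 and 14")
    padded = peptide + "X" * (MAX_LEN - len(peptide))  # right-padding is an assumption
    vector = []
    for residue in padded:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(residue)] = 1
        vector.extend(one_hot)
    return vector

encoded = encode_peptide("GILGFVFTL")  # a 9-mer, so five 'X' positions are appended
```

Each of the 14 positions contributes exactly one non-zero entry, so the vector always has length 294 and sums to 14.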
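The pre-trained model consists of an input layer, five hidden layers and an output layer (Figure 1A). The five hidden layers contain 100, 30, 100, 30 and 10 neurons, respectively. Dropout (dropout rate = 0.4) is applied to the first and third hidden layers to control overfitting, and every hidden layer uses the exponential linear unit (ELU) as its activation function. The pre-trained model was trained by mini-batch gradient descent, with each batch containing 1024 peptides (half positive and half negative), for 100 iterations in total. Five-fold cross-validation was used to assess the accuracy of the pre-trained model.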
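To make the size of this network concrete, the sketch below lists the layer widths, defines the ELU activation and counts the weights and biases of a fully connected network with these widths. It is plain Python rather than the TensorFlow code used in the study, and it assumes dense layers with biases and a single output unit; the output width is not stated in the text.

```python
import math

# Input (294), five hidden layers, and an output layer assumed to have one unit.
LAYER_SIZES = [294, 100, 30, 100, 30, 10, 1]
DROPOUT_RATE = {1: 0.4, 3: 0.4}  # hidden layer number -> dropout rate

def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def count_parameters(sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

n_params = count_parameters(LAYER_SIZES)  # 38981 under the assumptions above
```

Under these assumptions the first hidden layer alone holds 29,500 of the 38,981 parameters, which reflects the width of the one-hot input.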
Table 1 Summary of the training sets
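The structure of Pluto extends that of the pre-trained model by adding one hidden layer between the last hidden layer and the output layer. All neurons before this new hidden layer take the parameters of the corresponding neurons in the pre-trained model and are not trained further; only the neurons of the new hidden layer and the output layer are trained. The model was again trained by mini-batch gradient descent, with each batch containing all positive peptides plus 10 times as many negative peptides, for 1000 iterations in total. After each iteration, the current model scored the validation set, the peptides were ranked by score, and the positive predictive value (PPV) was calculated among the top-ranked 0.1% of peptides (a number equal to the number of positive peptides). The final model was selected according to its 0.1% PPV. The 0.1% PPV was used as the evaluation metric because, according to previous reports[25~27], the peptides that can be presented by a cell account for about 0.1% of the human proteome, so this metric better reflects the real situation. The models were implemented and trained with the TensorFlow framework[28].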
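The 0.1% PPV metric can be sketched in a few lines. The function name `top_fraction_ppv` and the toy scores are hypothetical; the logic follows the description above: rank by score, keep the top 0.1%, and report the fraction of true positives among them.

```python
def top_fraction_ppv(scores, labels, fraction=0.001):
    """PPV among the top-scoring fraction of peptides (labels: 1 = presented, 0 = not)."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return hits / n_top

# Toy validation set: 2 positives and 1998 negatives (999 negatives per positive).
toy_scores = [0.9, 0.1, 0.8] + [0.0] * 1997
toy_labels = [1, 1, 0] + [0] * 1997
ppv = top_fraction_ppv(toy_scores, toy_labels)
```

In the toy data the top 0.1% is two peptides, one positive and one negative, so the PPV is 0.5. Because positives make up 0.1% of the set, a perfect ranking gives a PPV of 1.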
Figure 1 Construction of Pluto
A: structure of the pre-trained model; B: structure of Pluto. The weights of the first five hidden layers are taken from the pre-trained model and are locked during training, that is, they do not change, and only the weights of the newly added hidden layer and the output layer are trained.
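The weight-locking scheme in Figure 1B can be illustrated by marking which layers remain trainable. The width of the added hidden layer is not given in the text, so `new_hidden_units` is left as a parameter and the value 10 in the example is purely illustrative; the single output unit and the layer names are also assumptions.

```python
# Input width plus the five hidden layers transferred from the pre-trained model.
PRETRAINED_SIZES = [294, 100, 30, 100, 30, 10]

def build_pluto_layers(new_hidden_units, output_units=1):
    """Describe each dense layer of Pluto and whether it is updated during training."""
    sizes = PRETRAINED_SIZES + [new_hidden_units, output_units]
    n_frozen = len(PRETRAINED_SIZES) - 1  # the five transferred hidden layers
    layers = []
    for index, (n_in, n_out) in enumerate(zip(sizes, sizes[1:])):
        layers.append({
            "name": f"dense_{index + 1}",
            "n_params": n_in * n_out + n_out,
            "trainable": index >= n_frozen,  # only the new hidden layer and the output
        })
    return layers

layers = build_pluto_layers(new_hidden_units=10)  # 10 is an illustrative value
frozen = sum(layer["n_params"] for layer in layers if not layer["trainable"])
trainable = sum(layer["n_params"] for layer in layers if layer["trainable"])
```

With these illustrative sizes only a small fraction of the parameters is updated, which is why the approach suits alleles with few peptides. In a Keras-style implementation the same effect would typically be obtained by setting `trainable = False` on the transferred layers.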
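Peptide data from HeLa mono-allelic cell lines generated by Trolle et al.[29] were collected for an independent evaluation of the models. These MS-identified peptides were combined with 999 times as many negative peptides to build the test set, which was used to evaluate the 0.1% PPV of Pluto, the models trained from scratch, MixMHCpred, NetMHCpan4.0-EL, NetMHCpan4.0-BA and MHCflurry.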
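Peptides experimentally verified as immunogenic were collected from Strønen et al.[30] and Gros et al.[31]. Strønen et al. validated mutation-containing peptides directly with tetramer assays, so the immunogenicity of each peptide is unambiguous. Gros et al. instead validated tandem minigenes (TMGs); these TMGs were cut into overlapping, contiguous peptides of 8 to 11 amino acids containing the mutation site. Peptides from non-immunogenic TMGs were labeled as negative. Peptides from immunogenic TMGs that had not been tested in peptide-pulsing experiments were discarded, because it cannot be determined whether they are recognized by T cells; the remaining peptides were labeled positive or negative according to the results of the peptide-pulsing experiments. Pluto, MixMHCpred, NetMHCpan4.0-EL, NetMHCpan4.0-BA and MHCflurry were then used to score these peptides, and the rankings they assigned to the immunogenic peptides were compared.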
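Cutting a minigene into overlapping peptides that cover the mutation site can be sketched as follows. The function name and the toy sequence are hypothetical, and the mutation is assumed to be a single substituted residue at a known 0-based index.

```python
def mutant_peptides(sequence, mutation_index, min_len=8, max_len=11):
    """All windows of length min_len..max_len that contain sequence[mutation_index]."""
    peptides = []
    for length in range(min_len, max_len + 1):
        first_start = max(0, mutation_index - length + 1)
        last_start = min(mutation_index, len(sequence) - length)
        for start in range(first_start, last_start + 1):
            peptides.append(sequence[start:start + length])
    return peptides

# Toy 25-residue minigene with the mutated residue 'W' at index 12 (made-up sequence).
toy_minigene = "A" * 12 + "W" + "A" * 12
toy_windows = mutant_peptides(toy_minigene, mutation_index=12)
```

When the mutation sits far enough from both ends, a window of length L can start at L different positions, which gives 8 + 9 + 10 + 11 = 38 candidate peptides per mutation.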
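We hypothesized that the pre-trained model learns features from the large pool of presented epitopes that are not allele-specific, and that these features can improve allele-specific antigen presentation models. To test this hypothesis, all parameters of the full Pluto network were trained from scratch on the mono-allelic training sets, without using the parameters of the pre-trained model, and the results were compared with Pluto on the same validation sets (Figure 2A).
The pre-trained model, trained on mixed-allele epitope data, reached an average accuracy of 90.77% in five-fold cross-validation, indicating that it learned features that separate epitopes from ordinary protein-derived peptides. The 0.1% PPV of Pluto and of the models trained from scratch was then evaluated on the 16 mono-allelic validation sets. Pluto had a higher 0.1% PPV than the model without pre-training on every validation set, with an average improvement of 0.078. We then examined the relationship between training set size and the improvement in performance. Figure 2B shows a trend: transfer learning helps more for alleles with little data, whereas the gain is smaller for alleles with more data.
These results indicate that the pre-trained model can learn features shared by epitopes of different alleles and can help improve allele-specific antigen presentation models, and that the size of the improvement may depend on the size of the training set of the allele-specific model.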
The mono-allelic MS data generated by Trolle et al.[29] were used to evaluate Pluto and to compare it with the models trained from scratch and with mainstream prediction tools (MixMHCpred (v2.0)[19,20], NetMHCpan4.0-EL[13], NetMHCpan4.0-BA[13] and MHCflurry[16]). On the independent test set, Pluto achieved an average 0.1% PPV of 0.4255, significantly better than the models trained from scratch, MixMHCpred, NetMHCpan4.0-BA and MHCflurry, whose average 0.1% PPVs were 0.3824, 0.3369, 0.3188 and 0.3002, respectively (P = 0.02538, 0.002035, 0.01102 and 0.01929, paired t-test). Notably, although the average 0.1% PPV of Pluto was not significantly higher than that of NetMHCpan4.0-EL (0.4255 vs. 0.4000, P = 0.05311), Pluto performed better than NetMHCpan4.0-EL on every allele (Figure 3).
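For readers unfamiliar with the paired t-test used for these comparisons, the sketch below computes the test statistic from per-allele PPV pairs using only the standard library. The numbers are invented placeholders and are not the values behind Figure 3.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Hypothetical per-allele 0.1% PPVs for two tools (placeholders, not study data).
tool_a = [0.5, 0.4, 0.6, 0.3]
tool_b = [0.4, 0.35, 0.5, 0.3]
t_value, dof = paired_t_statistic(tool_a, tool_b)
```

The P value then comes from the t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` performs the whole calculation when SciPy is available.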
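MixMHCpred is an antigen presentation prediction model based on position-specific scoring matrices (PSSMs) and trained only on MS data. A PSSM is a kind of linear model that assumes each position of the peptide is independent. In the analysis shown in Figure 3, the models trained from scratch and NetMHCpan4.0-EL performed significantly better than MixMHCpred (P = 0.03048 and 5.674e-05, paired t-test), which suggests that different positions of a peptide may interact rather than contribute in a purely linear way. (The models trained from scratch were compared with NetMHCpan4.0-EL and MixMHCpred here because all three are antigen presentation models, whereas NetMHCpan4.0-BA and MHCflurry are affinity prediction models.)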
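The independence assumption behind a PSSM can be seen in a generic sketch: the score of a peptide is a sum of per-position log-odds terms, so no term depends on two positions at once. This is a textbook PSSM with a uniform background and a pseudocount, both assumptions; it is not MixMHCpred's actual implementation, and the toy binders are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(peptides, pseudocount=1.0):
    """Per-position log-odds scores against a uniform background (assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for position in range(len(peptides[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for peptide in peptides:
            counts[peptide[position]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm

def pssm_score(pssm, peptide):
    """Sum of independent per-position contributions."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))

toy_binders = ["ALAKAAAAL", "ALAKAAAAV", "GLAKAAAAL"]  # made-up 9-mers
toy_pssm = build_pssm(toy_binders)
```

Because the score is additive, a PSSM cannot represent a preference that holds only when two positions co-occur, whereas a multilayer network can.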
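Figure 2 Performance comparison between Pluto and the models trained from scratch
A: on the same validation sets for the 16 mono-allelic datasets, the 0.1% PPV of Pluto is higher than that of the model trained from scratch in every case; B: the size of the improvement that pre-training brings to Pluto depends on the training set size.
Figure 3 Independent evaluation on external MS data
The average 0.1% PPV of Pluto is significantly higher than that of the models trained from scratch, MixMHCpred, NetMHCpan4.0-BA and MHCflurry. It is not significantly higher than that of NetMHCpan4.0-EL, but Pluto scores higher than NetMHCpan4.0-EL on every allele. * P < 0.05, ** P < 0.005 (paired t-test).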
In summary, the independent evaluation confirmed that Pluto matches or outperforms current mainstream antigen presentation prediction tools.
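To assess whether Pluto's ability to predict antigen presentation can be used to find neoantigens, seven experimentally verified immunogenic peptides were collected from the studies of Strønen et al.[30] and Gros et al.[31] and used to evaluate how well Pluto, MixMHCpred, NetMHCpan4.0-EL, NetMHCpan4.0-BA and MHCflurry predict neoantigens. As shown in Table 2, among the top 10 ranked peptides for each patient, Pluto recovered four of the seven immunogenic peptides, MixMHCpred and NetMHCpan4.0-EL recovered two, and NetMHCpan4.0-BA and MHCflurry recovered only one. The evaluation therefore indicates that Pluto is valuable for identifying tumor neoantigens.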
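The per-patient top-10 count reported above can be computed as in the sketch below. The patient identifiers, peptide names and scores are invented for illustration, and the helper names are hypothetical.

```python
def rank_of(peptide, scores):
    """1-based rank of a peptide when all peptides are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(peptide) + 1

def found_in_top_k(scores_by_patient, immunogenic_by_patient, k=10):
    """Count immunogenic peptides ranked within the top k of their own patient."""
    found = 0
    for patient, scores in scores_by_patient.items():
        for peptide in immunogenic_by_patient.get(patient, []):
            if rank_of(peptide, scores) <= k:
                found += 1
    return found

# Invented example: 20 candidate peptides per patient with descending scores.
toy_scores = {
    "patient_1": {f"pep{i}": 1.0 - 0.01 * i for i in range(20)},
    "patient_2": {f"pep{i}": 1.0 - 0.01 * i for i in range(20)},
}
toy_immunogenic = {"patient_1": ["pep3"], "patient_2": ["pep15"]}
n_found = found_in_top_k(toy_scores, toy_immunogenic)
```

Ranking within each patient, rather than across all patients, mirrors how candidate lists would be prioritized for an individual tumor.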
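Accurate prediction of antigen presentation is a key step in determining whether a neoantigen can activate neoantigen-specific T cells to kill tumor cells. Although MS technology has advanced rapidly in recent years and a good deal of MS-identified epitope data has accumulated, the data available for any individual allele are still limited and insufficient for building a deep learning-based allele-specific epitope prediction model. This study therefore used transfer learning: a deep learning model was built from a large amount of mixed-allele epitope data and random peptides from the proteome to capture features shared by epitopes that distinguish them from ordinary peptides. On top of this pre-trained model, allele-specific data were then used to train the antigen presentation prediction model Pluto.
This study first showed that the pre-trained model can separate most epitopes from ordinary peptides of the proteome, indicating that it learned some features shared by epitopes. Because of the nature of deep learning, however, which shared features the pre-trained model learned cannot yet be determined, and this deserves close attention in future work. Pluto performed clearly better than the models trained from scratch, but the size of the improvement depended on the size of the allele-specific training set. There are three possible reasons for this. First, as the amount of allele-specific epitope data increases, it carries more information and overlaps more with the information provided by the mixed-allele epitopes, so the features learned by the pre-trained model contribute less. Second, the model may already be close to saturation, so that more data help little. Third, as the amount of data increases, a more complex network may be needed to learn more features before performance can improve. In the independent evaluation on external data and in identifying neoantigens, Pluto also outperformed the models trained from scratch and other mainstream tools in the field. All training and evaluation data used in this paper are available at https://github.com/weipenegHU/Pluto.
For an epitope to be presented, its source protein must be expressed and then cleaved by the proteasome, and the resulting peptides must be transported into the endoplasmic reticulum and bind MHC-I molecules before they can finally be presented on the cell surface. In this study, Pluto judges whether a peptide can be presented based only on the information contained in the epitope sequence itself, and the information provided by the sequence alone is very limited. Peptide expression level has been reported to strongly affect antigen presentation[17,24,32]. In addition, the sequences flanking an epitope can help predict whether the peptide will be cleaved by the proteasome[17,24,33]. It has also been reported that only a subset of cellular proteins give rise to epitopes[21], and that proteins contain hotspots of epitope generation[34]. We expect these features to further improve Pluto; developing and exploiting them will be an important direction for future work.
Table 2 Rankings of immunogenic peptides by Pluto and mainstream tools
Although selecting neoantigens by their likelihood of presentation is effective to some extent[17], a presented peptide is not necessarily immunogenic (immunogenicity here refers to whether a peptide can be recognized by T cells, leading to the killing of tumor cells)[35~37]. Predicting peptide immunogenicity is therefore also important, in addition to predicting antigen presentation. At present, however, immunogenicity data are scarce, which makes it difficult to build immunogenicity prediction models. In the future, collaborative efforts to generate more immunogenicity data, together with better experimental methods for studying the interaction between TCRs and peptide-MHC molecules[38,39], should yield larger datasets and a deeper biological understanding of immunogenicity, and ultimately allow immunogenicity to be predicted more accurately.
In summary, this study used transfer learning to build a new antigen presentation prediction tool, Pluto, which performed at least as well as, and in most comparisons significantly better than, current mainstream prediction software. The results also show that transfer learning helps address the current difficulty of building reliable antigen presentation prediction models when allele-specific epitope data are scarce.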
[1] Gros A, Parkhurst MR, Tran E, Pasetto A, Robbins PF, Ilyas S, Prickett TD, Gartner JJ, Crystal JS, Roberts IM, Trebska-Mcgowan K, Wunderlich JR, Yang JC, Rosenberg SA. Prospective identification of neoantigen-specific lymphocytes in the peripheral blood of melanoma patients., 2016, 22(4): 433–438.
[2] Malekzadeh P, Pasetto A, Robbins PF, Parkhurst MR, Paria BC, Jia L, Gartner JJ, Hill V, Yu Z, Restifo NP, Sachs A, Tran E, Lo W, Somerville RPT, Rosenberg SA, Deniger DC. Neoantigen screening identifies broad TP53 mutant immunogenicity in patients with epithelial cancers., 2019, 129(3): 1109–1114.
[3] Robbins PF, Lu YC, El-Gamil M, Li YF, Gross C, Gartner J, Lin JC, Teer JK, Cliften P, Tycksen E, Samuels Y, Rosenberg SA. Mining exomic sequencing data to identify mutated antigens recognized by adoptively transferred tumor-reactive T cells., 2013, 19(6): 747–752.
[4] Sahin U, Derhovanessian E, Miller M, Kloke BP, Simon P, Löwer M, Bukur V, Tadmor AD, Luxemburger U, Schrörs B, Omokoko T, Vormehr M, Albrecht C, Paruzynski A, Kuhn AN, Buck J, Heesch S, Schreeb KH, Müller F, Ortseifer I, Vogler I, Godehardt E, Attig S, Rae R, Breitkreuz A, Tolliver C, Suchan M, Martic G, Hohberger A, Sorn P, Diekmann J, Ciesla J, Waksmann O, Brück AK, Witt M, Zillgen M, Rothermel A, Kasemann B, Langer D, Bolte S, Diken M, Kreiter S, Nemecek R, Gebhardt C, Grabbe S, Höller C, Utikal J, Huber C, Loquai C, Türeci O. Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer., 2017, 547(7662): 222–226.
[5] Tran E, Ahmadzadeh M, Lu YC, Gros A, Turcotte S, Robbins PF, Gartner JJ, Zheng Z, Li YF, Ray S, Wunderlich JR, Somerville RP, Rosenberg SA. Immunogenicity of somatic mutations in human gastrointestinal cancers., 2015, 350(6266): 1387–1390.
[6] Tran E, Robbins PF, Lu YC, Prickett TD, Gartner JJ, Jia L, Pasetto A, Zheng Z, Ray S, Groh EM, Kriley IR, Rosenberg SA. T-Cell transfer therapy targeting mutant KRAS in cancer., 2016, 375(23): 2255–2262.
[7] Zacharakis N, Chinnasamy H, Black M, Xu H, Lu YC, Zheng Z, Pasetto A, Langhan M, Shelton T, Prickett T, Gartner J, Jia L, Trebska-Mcgowan K, Somerville RP, Robbins PF, Rosenberg SA, Goff SL, Feldman SA. Immune recognition of somatic mutations leading to complete durable regression in metastatic breast cancer., 2018, 24(6): 724–730.
[8] Strickland KC, Howitt BE, Shukla SA, Rodig S, Ritterhouse LL, Liu JF, Garber JE, Chowdhury D, Wu CJ, D'Andrea AD. Association and prognostic significance of BRCA1/2-mutation status with neoantigen load, number of tumor-infiltrating lymphocytes and expression of PD-1/PD-L1 in high grade serous ovarian cancer., 2016, 7(12): 13587–13598.
[9] Lu HZ, Wang DK, Wang Z. Correlation analysis of the prognosis of HPV positive oropharyngeal cancer patients with T cell infiltration and neoantigen load. Hereditas (Beijing), 2019, 41(8): 725–735.
[10] Brown SD, Warren RL, Gibb EA, Martin SD, Spinelli JJ, Nelson BH, Holt RA. Neo-antigens predicted by tumor genome meta-analysis correlate with increased patient survival., 2014, 24(5): 743–750.
[11] Shukla SA, Howitt BE, Wu CJ, Konstantinopoulos PA. Predicted neoantigen load in non-hypermutated endometrial cancers: Correlation with outcome and tumor-specific genomic alterations., 2016, 19: 42–45.
[12] Sa HL, Ma KW, Gao Y, Wang DQ. Predictive value of tumor mutation burden in immunotherapy for lung cancer. Chinese Journal of Lung Cancer, 2019, 22(6): 380–384.
[13] Jurtz V, Paul S, Andreatta M, Marcatili P, Peters B, Nielsen M. NetMHCpan-4.0: Improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data., 2017, 199(9): 3360–3368.
[14] Nielsen M, Andreatta M. NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets., 2016, 8(1): 33.
[15] Nielsen M, Lundegaard C, Blicher T, Lamberth K, Harndahl M, Justesen S, Røder G, Peters B, Sette A, Lund O, Buus S. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence., 2007, 2(8): e796.
[16] O'Donnell TJ, Rubinsteyn A, Bonsack M, Riemer AB, Laserson U, Hammerbacher J. MHCflurry: open-source class I MHC binding affinity prediction., 2018, 7(1): 129–132.e4.
[17] Bulik-Sullivan B, Busby J, Palmer CD, Davis MJ, Murphy T, Clark A, Busby M, Duke F, Yang A, Young L, Ojo NC, Caldwell K, Abhyankar J, Boucher T, Hart MG, Makarov V, Montpreville VT, Mercier O, Chan TA, Scagliotti G, Bironzo P, Novello S, Karachaliou N, Rosell R, Anderson I, Gabrail N, Hrom J, Limvarapuss C, Choquette K, Spira A, Rousseau R, Voong C, Rizvi NA, Fadel E, Frattini M, Jooss K, Skoberne M, Francis J, Yelensky R. Deep learning using tumor HLA peptide mass spectrometry datasets improves neoantigen identification., 2018, 37(1): 55–63.
[18] Gfeller D, Bassani-Sternberg M. Predicting antigen presentation—what could we learn from a million peptides?, 2018, 9: 1716.
[19] Bassani-Sternberg M, Chong C, Guillaume P, Solleder M, Pak H, Gannon PO, Kandalaft LE, Coukos G, Gfeller D. Deciphering HLA-I motifs across HLA peptidomes improves neo-antigen predictions and identifies allostery regulating HLA specificity., 2017, 13(8): e1005725.
[20] Gfeller D, Guillaume P, Michaux J, Pak HS, Daniel RT, Racle J, Coukos G, Bassani-Sternberg M. The length distribution and multiple specificity of naturally presented HLA-I ligands., 2018, 201(12): 3705–3716.
[21] Pearson H, Daouda T, Granados DP, Durette C, Bonneil E, Courcelles M, Rodenbrock A, Laverdure JP, Coté C, Mader S, Lemieux S, Thibault P, Perreault C. MHC class I-associated peptides derive from selective regions of the human genome., 2016, 126(12): 4690–4701.
[22] Bassani-Sternberg M, Pletscher-Frankild S, Jensen LJ, Mann M. Mass spectrometry of human leukocyte antigen class I peptidomes reveals strong effects of protein abundance and turnover on antigen presentation., 2015, 14(3): 658–673.
[23] Shao W, Pedrioli PGA, Wolski W, Scurtescu C, Schmid E, Vizcaíno JA, Courcelles M, Schuster H, Kowalewski D, Marino F, Arlehamn CSL, Vaughan K, Peters B, Sette A, Ottenhoff THM, Meijgaarden KE, Nieuwenhuizen N, Kaufmann SHE, Schlapbach R, Castle JC, Nesvizhskii AI, Nielsen M, Deutsch EW, Campbell DS, Moritz RL, Zubarev RA, Ytterberg AJ, Purcell AW, Marcilla M, Paradela A, Wang Q, Costello CE, Ternette N, van Veelen PA, van Els CACM, Heck AJR, de Souza GA, Sollid LM, Admon A, Stevanovic S, Rammensee HG, Thibault P, Perreault C, Bassani-Sternberg M, Aebersold R, Caron E. The SysteMHC atlas project., 2018, 46(D1): D1237–D1247.
[24] Abelin JG, Keskin DB, Sarkizova S, Hartigan CR, Zhang W, Sidney J, Stevens J, Lane W, Zhang GL, Eisenhaure TM, Clauser KR, Hacohen N, Rooney MS, Carr SA, Wu CJ. Mass spectrometry profiling of HLA-Associated peptidomes in Mono-allelic cells enables more accurate epitope prediction., 2017, 46(2): 315–326.
[25] Vita R, Overton JA, Greenbaum JA, Ponomarenko J, Clark JD, Cantrell JR, Wheeler DK, Gabbard JL, Hix D, Sette A, Peters B. The immune epitope database (IEDB) 3.0., 2015, 43(Database issue): D405–412.
[26] Rammensee HG, Friede T, Stevanović S. MHC ligands and peptide motifs: first listing., 1995, 41(4): 178–228.
[27] Hunt DF, Henderson RA, Shabanowitz J, Sakaguchi K, Michel H, Sevilir N, Cox AL, Appella E, Engelhard VH. Characterization of peptides bound to the class I MHC molecule HLA-A2.1 by mass spectrometry., 1992, 255(5049): 1261–1263.
[28] Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia YQ, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng XQ. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[29] Trolle T, McMurtrey CP, Sidney J, Bardet W, Osborn SC, Kaever T, Sette A, Hildebrand WH, Nielsen M, Peters B. The length distribution of class I-restricted T cell epitopes is determined by both peptide supply and MHC allele-specific binding preference., 2016, 196(4): 1480–1487.
[30] Strønen E, Toebes M, Kelderman S, van Buuren MM, Yang W, van Rooij N, Donia M, Böschen ML, Lund-Johansen F, Olweus J, Schumacher TN. Targeting of cancer neoantigens with donor-derived T cell receptor repertoires., 2016, 352(6291): 1337–1341.
[31] Gros A, Parkhurst MR, Tran E, Pasetto A, Robbins PF, Ilyas S, Prickett TD, Gartner JJ, Crystal JS, Roberts IM. Prospective identification of neoantigen-specific lymphocytes in the peripheral blood of melanoma patients., 2016, 22(4): 433–438.
[32] Hu WP, Qiu S, Li YP, Lin XX, Zhang L, Xiang HT, Han X, Chen L, Li S, Li WH, Ren Z, Hou GX, Lin ZL, Lu JL, Liu G, Li B, Lee LJ. EPIC: MHC-I epitope prediction integrating mass spectrometry derived motifs and tissue-specific expression profiles., 2019, 567081.
[33] Nielsen M, Lundegaard C, Lund O, Kesmir C. The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage., 2005, 57(1–2): 33–41.
[34] Müller M, Gfeller D, Coukos G, Bassani-Sternberg M. 'Hotspots' of antigen presentation revealed by human leukocyte antigen ligandomics for neoantigen prioritization., 2017, 8: 1367.
[35] McGranahan N, Furness AJ, Rosenthal R, Ramskov S, Lyngaa R, Saini SK, Jamal-Hanjani M, Wilson GA, Birkbak NJ, Hiley CT, Watkins TB, Shafi S, Murugaesu N, Mitter R, Akarca AU, Linares J, Marafioti T, Henry JY, Van Allen EM, Miao D, Schilling B, Schadendorf D, Garraway LA, Makarov V, Rizvi NA, Snyder A, Hellmann MD, Merghoub T, Wolchok JD, Shukla SA, Wu CJ, Peggs KS, Chan TA, Hadrup SR, Quezada SA, Swanton C. Clonal neoantigens elicit T cell immunoreactivity and sensitivity to immune checkpoint blockade., 2016, 351(6280): 1463–1469.
[36] Calis JJ, Maybeno M, Greenbaum JA, Weiskopf D, de Silva AD, Sette A, Keşmir C, Peters B. Properties of MHC class I presented peptides that enhance immunogenicity., 2013, 9(10): e1003266.
[37] Assarsson E, Sidney J, Oseroff C, Pasquetto V, Bui HH, Frahm N, Brander C, Peters B, Grey H, Sette A. A quantitative analysis of the variables affecting the repertoire of T cell specificities recognized after vaccinia virus infection., 2007, 178(12): 7890–7901.
[38] Bentzen AK, Such L, Jensen KK, Marquard AM, Jessen LE, Miller NJ, Church CD, Lyngaa R, Koelle DM, Becker JC, Linnemann C, Schumacher TNM, Marcatili P, Nghiem P, Nielsen M, Hadrup SR. T cell receptor fingerprinting enables in-depth characterization of the interactions governing recognition of peptide–MHC complexes., 2018, 36(12): 1191–1196.
[39] Bentzen AK, Marquard AM, Lyngaa R, Saini SK, Ramskov S, Donia M, Such L, Furness AJ, McGranahan N, Rosenthal R, Straten PT, Szallasi Z, Svane IM, Swanton C, Quezada SA, Jakobsen SN, Eklund AC, Hadrup SR. Large-scale detection of antigen-specific T cells using peptide-MHC-I multimers labeled with DNA barcodes., 2016, 34(10): 1037–1045.
MHC-I epitope presentation prediction based on transfer learning
Weipeng Hu1,2,3, Youping Li2,3,4, Xiuqing Zhang2,3,4
Accurate prediction of epitope presentation is a key step in neoantigen-based tumour immunotherapy for selecting T cell-specific epitopes. Epitopes identified by mass spectrometry (MS) are valuable for training epitope presentation prediction models. Despite the accelerating accumulation of MS data, the number of epitopes available for most human leukocyte antigen (HLA) alleles is relatively small, which makes it difficult to build reliable prediction models. This study therefore used transfer learning: a model was first trained to learn features common to epitopes from mixed alleles, and allele-specific epitopes were then used on top of this pre-trained model to train the final epitope presentation prediction model, termed Pluto. On the same validation datasets, the average 0.1% positive predictive value (PPV) of Pluto exceeded that of the model without pre-training by 0.078. On external HLA eluted ligand datasets, Pluto achieved an average 0.1% PPV of 0.4255, better than the model without pre-training (0.3824) and other popular methods, including MixMHCpred (0.3369), NetMHCpan4.0-EL (0.4000), NetMHCpan4.0-BA (0.3188) and MHCflurry (0.3002). In an evaluation of immunogenicity prediction, Pluto also identified more neoantigens than the other tools. Pluto is publicly available at https://github.com/weipenegHU/Pluto.
immunotherapy; neoantigen; epitope presentation; deep learning; transfer learning
2019-06-21;
2019-09-17
Supported by the National Natural Science Foundation of China (Nos. 81702826, 81772910), the Science, Technology and Innovation Commission of Shenzhen Municipality (No. JCYJ20170303151334808) and the Shenzhen Municipal Government of China (No. 20170731162715261).
Weipeng Hu, master's student, specializing in genomics. E-mail: huweipeng@genomics.cn. Youping Li, master's student, specializing in genomics. E-mail: liyouping@genomics.cn. Weipeng Hu and Youping Li are co-first authors.
Xiuqing Zhang, PhD, professor, research interests: genomics and immunotherapy. E-mail: zhangxq@genomics.cn
10.16288/j.yczz.19-155
2019/11/8 13:27:56
URI: http://kns.cnki.net/kcms/detail/11.1913.R.20191107.1628.005.html
(Responsible editor: Yaofeng Zhao)