Li Xiaowei, Yan Guihai, Han Yinhe
Abstract: A high-throughput computing system consists of a massive number of computing nodes and storage nodes connected by an interconnection network. At this scale, reliability becomes a serious problem: component failure is the norm rather than the exception, so system design must take fault tolerance into account. We need a new reliability assurance framework for high-throughput computing systems that meets the reliability requirements of the different layers of such systems, together with cross-layer reliable computing techniques that span the chip level to the system level. Toward this goal, the study is organized into three parts: (1) fault detection and fault-tolerant design methods for high-throughput processing chips; (2) failure detection and recovery methods for high-throughput computing systems; and (3) a supporting environment for fault self-prediction, self-detection, self-localization, self-isolation, and self-healing (5S) from the chip level to the system level. As of 2013, all tasks were progressing steadily according to the original plan in the task specification, and some had reached interim milestones. Breakthroughs were made on four technical points: (1) online prediction of NBTI aging faults; (2) system fault prediction techniques, including deep learning; (3) register fault diagnosis; and (4) communication isolation for networks-on-chip. Six papers were published or accepted in IEEE Transactions journals, and one in another journal. In terms of coverage, the research points deployed so far cover every research plan defined in the task specification, and some of them have been refined further.
Key Words: Reliability design; Fault detection; Deep learning; Online prediction; Communication isolation
Full-text link (real-name registration required): http://www.nstrs.cn/xiangxiBG.aspx?id=50730&flag=1