Multi-dimensional Feature Analysis of Common Chinese Characters with Python

2020-04-01 15:08:07 Wynchem Sadiq, Buzhiguri Vasley, Hayhanguri Sadiq, Muhtar Shadick

Education Teaching Forum (教育教学论坛), 2020, Issue 10



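Abstract: This article uses Python to carry out a multi-dimensional feature analysis of commonly used Chinese characters, covering the relationships among part of speech, pinyin, finals, and tones. Starting from the setup of the development environment, it walks through each step and its code in detail.

Keywords: Python; Jieba; python-docx-master; python-pinyin-master

CLC number: G642.0     Document code: A     Article ID: 1674-9324(2020)10-0120-02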

I. Setting Up the Environment

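Analyzing Chinese characters with Python requires more than a Python development environment. Three further components are needed: jieba, the Chinese word-segmentation component; python-docx-master, the component for processing Word documents; and python-pinyin-master, the component for converting Chinese characters to pinyin. This article uses Anaconda Spyder as the development environment. Download the archive of each component from its website, extract it into the working directory, change into each extracted directory on the command line, and run python setup.py install to complete the setup.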
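Before running the analysis, it can help to confirm that the three components are importable. The following is a minimal sketch and not part of the original article. It assumes the usual import names (`jieba`, `docx` for python-docx, and `pypinyin` for python-pinyin); the helper name `check_components` is hypothetical.

```python
import importlib.util

# Import names assumed for the three components named in the text.
REQUIRED = {
    "jieba": "jieba",
    "python-docx": "docx",
    "python-pinyin": "pypinyin",
}

def check_components(required=REQUIRED):
    """Return {component: True/False} depending on whether it can be imported."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in required.items()
    }

if __name__ == "__main__":
    for name, ok in check_components().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

The check only looks up each module and does not import it, so it is safe to run in a fresh environment.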

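II. Explanation of the Main Code

(1) Import the required components and open the common-character txt file with gb18030 encoding

(The code that imports the components is omitted.)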

# Open the common-character txt file
with open('sys_Char2500.txt', encoding='gb18030') as f:
    text = f.read()
# Keep only the alphabetic characters from text
char_changyong = [char for char in text if char.isalpha()]
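`str.isalpha()` is true for any Unicode letter, so Latin letters in the file would also pass the filter above. If only Chinese characters are wanted, a stricter filter is possible. The sketch below is an illustration, not the authors' code. The helper name `keep_hanzi` is hypothetical, and it checks only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is assumed to be sufficient for a list of common characters.

```python
def keep_hanzi(text):
    """Keep only characters in the basic CJK Unified Ideographs block."""
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

sample = "中a文1，字\n"
print(keep_hanzi(sample))                     # ['中', '文', '字']
print([ch for ch in sample if ch.isalpha()])  # ['中', 'a', '文', '字']
```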

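(2) Define a function that gets the part of speech of a character and converts the English abbreviation of the part of speech into its Chinese name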

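def get_peg(arg):
    # Get the part of speech of arg
    pegc = peg.cut(arg)
    flag2 = ''
    # Convert the English abbreviation of the part of speech to its Chinese name
    for peg1, flag1 in pegc:
        if 'n' == flag1[0]:
            flag2 = '名詞'    # noun
        elif 't' == flag1[0]:
            flag2 = '時間詞'  # time word
        …
        else:
            flag2 = flag1
    return flag2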
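The chain of `if`/`elif` tests on the first letter of the flag could also be written as a dictionary lookup. The following is a sketch under that assumption, not the authors' implementation. The names `POS_NAMES` and `flag_to_name` are hypothetical, and only the two mappings shown in the article are listed because the remaining ones are elided there.

```python
# Only the two mappings shown in the article; the others are elided in the source.
POS_NAMES = {
    "n": "名詞",    # noun
    "t": "時間詞",  # time word
}

def flag_to_name(flag, names=POS_NAMES):
    """Map a POS flag to a Chinese name by its first letter; fall back to the flag."""
    return names.get(flag[:1], flag)

print(flag_to_name("nr"))  # 名詞
print(flag_to_name("v"))   # v
```

Slicing with `flag[:1]` rather than indexing with `flag[0]` keeps the lookup from raising an error on an empty flag.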

(3) Get the part-of-speech, pinyin, and final features of each character, store them in a dictionary, and sort and count them

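…
for char in char_changyong:
    # Get the pinyin of the character with its tone number
    yin3 = ''.join(lazy_pinyin(char, style=Style.TONE3))
    # Get the pinyin of the character without the tone
    pyin = ''.join(lazy_pinyin(char))
    # Get the final of the character
    yunm = ''.join(lazy_pinyin(char, style=Style.FINALS))
    …
    # Store the features obtained above in the dictionary and lists
    char_flag_dict[char] = (tone, pegc, pyin, yunm)
…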
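The article elides how `tone` is derived and how the `*_all_count` lists used later are built. One plausible approach is sketched below; it is an assumption, not the authors' code. It assumes that `tone` is the trailing digit of the TONE3-style pinyin, that a syllable without a digit is neutral tone (recorded here as 5 by convention), and that the count lists come from `Counter.most_common()`. The helper name `split_tone` and the sample features are hypothetical.

```python
from collections import Counter

def split_tone(yin3):
    """Split TONE3-style pinyin such as 'zhong1' into ('zhong', 1).

    Assumption: a syllable without a trailing digit is neutral tone,
    recorded here as 5.
    """
    if yin3 and yin3[-1].isdigit():
        return yin3[:-1], int(yin3[-1])
    return yin3, 5

# Hand-written sample features (tone, part of speech, pinyin, final),
# not output from jieba or pypinyin.
char_flag_dict = {
    "中": (1, "名詞", "zhong", "ong"),
    "人": (2, "名詞", "ren", "en"),
    "大": (4, "a", "da", "a"),
}

# Count how often each part of speech occurs, most frequent first.
pegc_all_count = Counter(val[1] for val in char_flag_dict.values()).most_common()

print(split_tone("zhong1"))  # ('zhong', 1)
print(split_tone("de"))      # ('de', 5)
print(pegc_all_count)        # [('名詞', 2), ('a', 1)]
```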

(4) Analyze the part of speech of the characters against their tones and store the results in a table of the Document object

…
# Iterate over the list of part-of-speech counts
for pegc, count in pegc_all_count:
    …
    # Iterate over the dictionary that stores the character features
    for char, val in char_flag_dict.items():
        if pegc == val[1]:
            tones = tones + str(val[0])
            chars = chars + str(char)
    pegc_tones[pegc] = tones
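The nested loops above scan the whole feature dictionary once for every part of speech. The same grouping can be done in a single pass and then ordered by group size. The sketch below is one way to do that under stated assumptions, not the authors' code: the helper name `tones_by_pos` is hypothetical, the sample features are hand-written, and each tone is assumed to be a single digit so that the length of the tone string equals the number of characters.

```python
from collections import defaultdict

# Hand-written sample features (tone, part of speech, pinyin, final).
char_flag_dict = {
    "中": (1, "名詞", "zhong", "ong"),
    "大": (4, "a", "da", "a"),
    "人": (2, "名詞", "ren", "en"),
}

def tones_by_pos(features):
    """Collect, per part of speech, the tone digits and the characters."""
    tones = defaultdict(str)
    chars = defaultdict(str)
    for char, val in features.items():
        tones[val[1]] += str(val[0])
        chars[val[1]] += char
    # Order groups by size, largest first, as a frequency-sorted list would.
    order = sorted(tones, key=lambda k: len(tones[k]), reverse=True)
    return {k: tones[k] for k in order}, {k: chars[k] for k in order}

pegc_tones, pegc_chars = tones_by_pos(char_flag_dict)
print(pegc_tones)  # {'名詞': '12', 'a': '4'}
print(pegc_chars)  # {'名詞': '中人', 'a': '大'}
```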

# Create the Document object
doc_new = Document()
doc_new.add_heading('一、詞性統計：', 0)  # "I. Part-of-speech statistics:"
# Create the table
table = doc_new.add_table(rows=1, cols=8)
hdr_cells = table.rows[0].cells
# Set the column headers
hdr_cells[0].text = '序號'  # "No."
…
# Count the tones for each part of speech
# and store the results in the table
for key, val in pegc_tones.items():
    len_tones = len(pegc_tones[key])
    count = Counter(pegc_tones[key])
    row_cells = table.add_row().cells
    row_cells[0].text = str(i)
    …
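Because each tone is stored as one digit in a string, `Counter` applied to that string yields the number of characters per tone. The sketch below shows how one table row could be assembled from it. The column layout (serial number, key, total, then tones 1 to 5) is a hypothetical one chosen to fill eight cells; the article elides the actual columns, and the helper name `tone_row` is also an assumption.

```python
from collections import Counter

def tone_row(index, key, tones):
    """Build one table row as strings: serial no., key, total, then tones 1-5."""
    count = Counter(tones)
    cells = [str(index), key, str(len(tones))]
    cells += [str(count.get(str(t), 0)) for t in range(1, 6)]
    return cells

print(tone_row(1, "名詞", "12144"))
# ['1', '名詞', '5', '2', '1', '0', '2', '0']
```

Each string in the returned list could then be assigned to the `text` attribute of the corresponding cell.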

(5) Analyze the pinyin of the characters against their tones and store the results in a table of the Document object

…
# Iterate over the list of pinyin counts
for pyin, count in pyin_all_count:
    …
    # Iterate over the dictionary that stores the character features
    for char, val in char_flag_dict.items():
        if pyin == val[2]:
            tones = tones + str(val[0])
            chars = chars + str(char)
    pyin_tones[pyin] = tones
# Add the heading
doc_new.add_heading('二、拼音統計：', 0)  # "II. Pinyin statistics:"
# Create the table
table = doc_new.add_table(rows=1, cols=8)
hdr_cells = table.rows[0].cells
hdr_cells[0].text = '序號'  # "No."
…
# Count the tones for each pinyin
# and store the results in the table
for key, val in pyin_tones.items():
    len_tones = len(pyin_tones[key])
    count = Counter(pyin_tones[key])
    row_cells = table.add_row().cells
    row_cells[0].text = str(i)
    …
    row_cells[7].text = str(count6)
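The loops that fill the tables use `i` as the serial number, but the article elides where it is initialized and incremented. One way to obtain it without a manual counter is `enumerate` with `start=1`. The sketch below is an illustration with hand-written sample data, not the authors' code.

```python
pyin_tones = {"zhong": "14", "ren": "2"}  # hand-written sample

rows = []
for i, (key, val) in enumerate(pyin_tones.items(), start=1):
    # Serial number, pinyin, and number of characters with that pinyin.
    rows.append((str(i), key, str(len(val))))

print(rows)  # [('1', 'zhong', '2'), ('2', 'ren', '1')]
```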

(6) Analyze the finals of the characters against their tones and store the results in the Document object

…
# Iterate over the list of final counts
for yunm, count in yunm_all_count:
    …
    # Iterate over the dictionary that stores the character features
    for char, val in char_flag_dict.items():
        if yunm == val[3]:
            tones = tones + str(val[0])
            chars = chars + str(char)
    yunm_tones[yunm] = tones
# Add the heading
doc_new.add_heading('三、韻母統計：', 0)  # "III. Final statistics:"
# Create the table
table = doc_new.add_table(rows=1, cols=8)
hdr_cells = table.rows[0].cells
hdr_cells[0].text = '序號'  # "No."
…
# Count the tones for each final
# and store the results in the table
for key, val in yunm_tones.items():
    count = Counter(yunm_tones[key])
    len_tones = len(yunm_tones[key])
    row_cells = table.add_row().cells
    row_cells[0].text = str(i)
    …
    row_cells[7].text = str(count6)

(7) Save the statistics to a Word document for use in the next stage of analysis

doc_new.save('漢字統計分析.docx')

Multi-dimensional Feature Analysis of Common Words with Python

Wynchem Sadiq 1, Buzhiguri Vasley 2, Hayhanguri Sadiq 3, Muhtar Shadick 4

(1. Kashgar Shule County Secondary Vocational and Technical School, Kashgar, Xinjiang 844200, China;

2. College of Mathematics and Science, Xinjiang Institute of Education, Urumqi, Xinjiang 830043, China;

3. Hanan Like Town Middle School, Kashgar Shule County, Kashgar, Xinjiang 844207, China;

4. Education Management Information Center of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang 830049, China)

Abstract: In this paper, Python is used to analyze the multi-dimensional features of common Chinese characters, namely part of speech, pinyin, finals, and tone. Starting from the setup of the development environment, each step and its code are introduced in detail.

Key words: Python; Jieba; python-docx-master; python-pinyin-master