Design and Research of a Word Translator Based on a Web Crawler

ZHOU Youyu, SUN Hongbo, MEI Liangcai*

(Beijing Institute of Technology, Zhuhai, Zhuhai, Guangdong Province, 519088 China)

Science & Technology Information (科技資訊), 2021, Issue 16; posted 2021-09-13

Abstract: This paper presents the design and development workflow of a word translator built on web crawler technology, which the authors situate within machine learning. First, the paper targets the Iciba website: after a preliminary analysis of its URL structure, a page-specific crawler is written with requests to retrieve word definitions and example sentences. Second, this targeted requests crawler is embedded in a general word-lookup program framework, so that the latest definitions and example sentences are fetched in real time. Finally, a GUI window is designed to display the results, giving the tool good practicality and effectiveness. The authors regard the proposed method as an extension of machine-learning-related research, and the result offers a usable application for education-related fields.

Key Words: requests framework; web crawler; GUI programming; Python

CLC number: TP391    Document code: A    Article ID: 1672-3791(2021)06(a)-0004-03

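A web crawler is a tool for collecting data from the Internet, and many researchers use crawlers to obtain research data [1]. Machine learning is the science of discovering, from existing data, the patterns that relate data features; researchers have applied machine-learning methods in many interdisciplinary areas, including translator design and data prediction [2-4]. In addition, the desktop versions of most word-lookup apps on the market are not convenient or fast enough. Given this situation, this paper designs a word-lookup app around the following tasks.

(1) Analyze the URL structure of the web page in advance to locate the corresponding word definitions and example sentences.

(2) Write code to crawl the content of the specific tags in the HTML structure.

(3) Design a GUI to display the word definitions and example sentences.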

1  Package Installation and Description

Because the program involves both GUI programming and web crawling, the following packages are required.

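# GUI toolkit (PyQt5), HTML parsing (BeautifulSoup) and HTTP (requests)
from PyQt5 import QtCore, QtGui, QtWidgets
from bs4 import BeautifulSoup
from PyQt5.QtCore import QRect
import requests
from PyQt5.QtWidgets import QApplication, QWidget
import sys
import trans  # local module; not described further in the paper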

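2  Crawler Implementation

The Iciba domain is http://www.iciba.com/. Appending word?w= to the domain, followed by the word to look up (for example, book), yields the URL http://www.iciba.com/word?w=book, which performs the search. The URL is shown in Figure 1.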
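The URL pattern described above can be sketched as a small helper. This is an illustrative sketch rather than the authors' code: the function name `build_lookup_url` is hypothetical, and stripping whitespace and percent-encoding the word with `urllib.parse.quote` are added precautions (assumptions, not stated in the paper) so that spaces or non-ASCII input do not break the URL.

```python
from urllib.parse import quote

# Base lookup URL taken from the text above.
URL_ROOT = "http://www.iciba.com/word?w="


def build_lookup_url(word):
    """Return the Iciba lookup URL for `word` (hypothetical helper).

    Stripping whitespace and percent-encoding are assumptions added
    for robustness; the paper simply concatenates the raw input.
    """
    return URL_ROOT + quote(word.strip())
```

For the paper's own example, `build_lookup_url("book")` produces the same URL given in the text.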

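As Figure 1 shows, the word definitions all sit under the ul tag with class=Mean_part_1RA2V, and each li tag carries one line of definition. Within each li, the i tag holds the part of speech for that line and the span tags hold the Chinese explanations. Likewise, example sentences sit under div tags with class=NormalSentence_sentence_3q5Wk, whose three p tags hold the English example sentence, its Chinese translation, and its source, respectively.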
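To make this tag layout concrete, the sketch below runs against a hand-written HTML fragment that mimics the structure just described. It deliberately uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so that it runs without third-party packages, so it is not the authors' implementation. The names `MeaningParser`, `parse_meanings`, and `SAMPLE_HTML` are hypothetical, and the real page may contain additional markup.

```python
from html.parser import HTMLParser

# Hand-written fragment mimicking the layout described above (an assumption).
SAMPLE_HTML = (
    '<ul class="Mean_part_1RA2V">'
    "<li><i>n.</i><div><span>书</span><span>著作</span></div></li>"
    "<li><i>v.</i><div><span>预订</span></div></li>"
    "</ul>"
)


class MeaningParser(HTMLParser):
    """Collect one text line per <li> inside <ul class="Mean_part_1RA2V">."""

    def __init__(self):
        super().__init__()
        self.in_list = False   # True while inside the definitions <ul>
        self.lines = []        # finished definition lines
        self._parts = None     # text pieces of the <li> being read

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and ("class", "Mean_part_1RA2V") in attrs:
            self.in_list = True
        elif tag == "li" and self.in_list:
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False
        elif tag == "li" and self._parts is not None:
            self.lines.append(" ".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Keep non-empty text found inside the current <li>.
        if self._parts is not None and data.strip():
            self._parts.append(data.strip())


def parse_meanings(html):
    """Return the definition lines found in `html`."""
    parser = MeaningParser()
    parser.feed(html)
    return parser.lines
```

Each returned line starts with the part of speech from the i tag, followed by the explanations from the span tags.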

With this, the main framework of the crawler is as follows.

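# Excerpt from the button handler in Section 3; url, text and self.label are defined there.
try:
    r = requests.get(url)  # moved inside try so that network errors are caught too
    soup = BeautifulSoup(r.text, 'html.parser')
    # Definitions: one <li> per line under <ul class="Mean_part_1RA2V">
    meaning = soup.find('ul', class_='Mean_part_1RA2V').children
    for li in meaning:
        text += li.i.string  # part of speech
        text += ' '
        for span in li.div.children:
            text += span.text  # Chinese explanation
            text += ' '
        text += '\n'
    text += '\n例句：\n'  # "Example sentences:"
    # Example sentences: keep at most the first 9 blocks
    for div in soup.find_all('div', class_='NormalSentence_sentence_3q5Wk')[:9]:
        # Keep the first two <p> children (English sentence, Chinese translation); skip the source
        for i, p in enumerate(div.children):
            if i == 2:
                break
            text += p.text
            text += '\n'
        text += '\n'
    self.label.setText(text)
except Exception:
    self.label.setText('搜索失敗')  # "Search failed"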

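The try-except statement filters out malformed or nonsensical queries: when the returned page does not contain the expected tags, parsing raises an exception and the label shows the failure message instead.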
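The try-except filter reacts only after a request has been made. As a complementary sketch (an assumption, not part of the paper), obviously invalid input could be rejected before any network call. The function name `is_valid_query` and the rule it applies (ASCII letters, optionally joined by single spaces, hyphens, or apostrophes) are hypothetical choices for English headwords.

```python
import re

# Hypothetical rule: groups of ASCII letters, joined by a single space,
# hyphen or apostrophe (e.g. "ice cream", "well-known", "o'clock").
_QUERY_PATTERN = re.compile(r"[A-Za-z]+(?:[ '\-][A-Za-z]+)*")


def is_valid_query(query):
    """Return True if `query` looks like an English word or short phrase."""
    return _QUERY_PATTERN.fullmatch(query.strip()) is not None
```

A handler could call this first and show the failure message immediately when it returns False, keeping try-except for pages that load but lack the expected tags.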

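3  GUI Implementation

A GUI displays the collected results intuitively and is a good tool for presenting crawled data [5-6]. Using class definitions, the authors wrote the following GUI based on the example from the official website.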

from PyQt5 import QtCore, QtGui, QtWidgets
from bs4 import BeautifulSoup
from PyQt5.QtCore import QRect
import requests


class Ui_Form(object):
    def setupUi(self, Form):
        Form.setObjectName("Form")
        Form.resize(412, 800)
        # Search button
        self.Buttons = QtWidgets.QPushButton(Form)
        self.Buttons.setGeometry(QtCore.QRect(300, 10, 93, 28))
        self.Buttons.setObjectName("Buttons")
        # Input box for the word to look up
        self.lineEdit = QtWidgets.QLineEdit(Form)
        self.lineEdit.setGeometry(QtCore.QRect(10, 10, 271, 31))
        self.lineEdit.setObjectName("lineEdit")
        # Result label: wraps long lines and aligns text to the top
        self.label = QtWidgets.QLabel(Form)
        self.label.setGeometry(QtCore.QRect(10, 50, 381, 711))
        self.label.setText("")
        self.label.setObjectName("label")
        self.label.setWordWrap(True)
        self.label.setAlignment(QtCore.Qt.AlignTop)
        # Run the lookup when the button is clicked
        self.Buttons.clicked.connect(self.sOnClicked)
        self.retranslateUi(Form)
        QtCore.QMetaObject.connectSlotsByName(Form)

    def sOnClicked(self):
        text = '釋義：\n'  # "Definitions:"
        url_root = 'http://www.iciba.com/word?w='
        url = url_root + self.lineEdit.text()
        try:
            r = requests.get(url)  # moved inside try so that network errors are caught too
            soup = BeautifulSoup(r.text, 'html.parser')
            # Definitions: one <li> per line under <ul class="Mean_part_1RA2V">
            meaning = soup.find('ul', class_='Mean_part_1RA2V').children
            for li in meaning:
                text += li.i.string  # part of speech
                text += ' '
                for span in li.div.children:
                    text += span.text  # Chinese explanation
                    text += ' '
                text += '\n'
            text += '\n例句：\n'  # "Example sentences:"
            # Example sentences: keep at most the first 9 blocks
            for div in soup.find_all('div', class_='NormalSentence_sentence_3q5Wk')[:9]:
                # Keep the first two <p> children (English sentence, Chinese translation)
                for i, p in enumerate(div.children):
                    if i == 2:
                        break
                    text += p.text
                    text += '\n'
                text += '\n'
            self.label.setText(text)
        except Exception:
            self.label.setText('搜索失敗')  # "Search failed"

    def retranslateUi(self, Form):
        _translate = QtCore.QCoreApplication.translate
        Form.setWindowTitle(_translate("Form", "Form"))
        self.Buttons.setText(_translate("Form", "搜詞"))  # "Look up"
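One way to make the text assembly in sOnClicked testable without Qt or a network connection is to move it into a pure function. The sketch below is an illustration under assumptions: `format_result` and its parameters are hypothetical names, definitions are passed as (part of speech, explanation) pairs, and examples as (English, Chinese) pairs. The limit of 9 examples mirrors the slice in the listing, while the exact spacing only approximates the listing's output.

```python
def format_result(meanings, examples, max_examples=9):
    """Build the label text from already-parsed data (hypothetical helper).

    meanings: iterable of (part_of_speech, explanation) string pairs
    examples: iterable of (english_sentence, chinese_translation) pairs
    """
    lines = ["釋義："]  # "Definitions:", the heading used in the listing
    for pos, explanation in meanings:
        lines.append(f"{pos} {explanation}")
    lines.append("")
    lines.append("例句：")  # "Example sentences:"
    # Keep at most max_examples pairs, each followed by a blank line.
    for english, chinese in list(examples)[:max_examples]:
        lines.extend([english, chinese, ""])
    return "\n".join(lines)
```

The click handler would then only fetch and parse the page, pass the parsed pairs to this function, and hand the returned string to `self.label.setText`.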

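4  Summary and Evaluation

(1) Innovations. With GUI programming, the program has an interface through which it interacts with the user. With a web crawler, a translator can be built quickly without maintaining its own dictionary. The interface is adaptive, so words and sentences longer than the window wrap automatically. Parts of speech, definitions, and example sentences are all provided.

(2) Shortcomings and improvements. Lookups require a network connection, and the program keeps no local copy of the data.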
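The missing local backup noted in (2) could be addressed by caching each successful lookup. The following is a minimal sketch of one possible approach using the standard-library `sqlite3` module; it is not part of the authors' program. The class name `LookupCache`, the table layout, and the default file name are hypothetical.

```python
import sqlite3


class LookupCache:
    """Store lookup results locally so repeated queries work offline."""

    def __init__(self, path="lookup_cache.db"):
        # Pass ":memory:" for a throwaway in-memory cache.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS lookups ("
            "word TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )

    def get(self, word):
        """Return the cached result text for `word`, or None if absent."""
        row = self.conn.execute(
            "SELECT result FROM lookups WHERE word = ?", (word.lower(),)
        ).fetchone()
        return row[0] if row else None

    def put(self, word, result):
        """Insert or overwrite the cached result for `word`."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO lookups (word, result) VALUES (?, ?)",
                (word.lower(), result),
            )
```

In such a design, the click handler would consult the cache before sending a request and store the formatted text after a successful parse.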

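The app can be used for everyday English study: words can be looked up at any time, there are no superfluous features, the program is small, and the definitions and example sentences returned are complete.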

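References

[1] ZHU Ce, XU Hong, LIN Xin, et al. Energy policy monitoring based on web crawler[J]. Science and Technology Innovation Herald, 2019, 16(35): 141-142. (in Chinese)

[2] YANG Haobo. Research and application of key technologies of neural machine translation[D]. Chengdu: University of Electronic Science and Technology of China, 2020. (in Chinese)

[3] LIANG Juan. Design and function realization of a speech recognition system for an English translator[J]. Microcomputer Applications, 2018, 34(12): 46-48. (in Chinese)

[4] JI Chunyuan, XIONG Zejin, HOU Yanfang, et al. Design of a networked intelligent translation system based on human-computer interaction[J]. Automation & Instrumentation, 2019(8): 25-28. (in Chinese)

[5] LIU Jiang, LIU Guoxi, ZHANG Yan, et al. Design and implementation of a web crawler bird audio data acquisition system based on multithreading and translation[J]. Modern Computer, 2018(30): 85-88, 92. (in Chinese)

[6] Mingri Technology. Python from Introduction to Mastery[M]. Beijing: Tsinghua University Press, 2018. (in Chinese)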