文/朱利叶斯·切尔尼奥斯卡斯 译/云天
近来,互联网正经历着与18 世纪早期“采金热”类似的现象,特别是在数据提取方面。数据因其巨大的价值而被某些分析师称为“新石油”。数据领域仍然对大大小小的参与者开放,但这也导致了若干不专业的行为,甚至有人设法获取有密码保护的数据。
2 尽管许多网站确实包含IP禁令等防御措施,但由于竞争加剧和各种经济因素,网络爬虫和服务器之间的无形冲突仍在持续,并愈演愈烈。尽管大多数人很乐意利用亿客行、谷歌购物、PriceGrabber 和天巡网等聚合网站的低价优势,但人们并没有意识到上述冲突正发生在不同的电商平台之间。
3 使用工具的目的有好有坏,网页数据抓取也不例外。一种相当常见的情况是以营销为目的抓取个人数据。数亿用户通过电商平台上的服务协议条款同意公开他们的数据,无论他们是否意识到了这一操作。然而,数据遭泄露的问题在于,这些数据由社交媒体机构提取,却为僵尸网站所用。这类网站在未经用户许可的情况下创建个人资料,并罗列出个人的详细信息。
4 结果,网页数据抓取的负面新闻越来越多,这使得公众对自身数据价值和隐私的认识有所提高。网页数据抓取本身并没有什么不道德的,因为它不过是把人们通常需要手动操作的活动自动化了。主要的区别在于,网页数据抓取使用机器人程序,在极短时间内爬取大量网站、提取海量信息,从而实现更大规模的信息搜集。
5 提取公开的数据需要代理。简单来说,代理是网络爬虫和服务器之间的中介。使用代理可以将数据请求均匀地分配到服务器,这样能确保以合理的速率请求数据,也可保证请求方匿名。
6 不道德抓取所采用的数据提取方式可能损害个人隐私,导致服务器过载。
7 尽管很多网站试图通过IP禁令来防止不道德抓取,但这渐渐变得徒劳,因为使用了代理,而且这些代理能够模拟人类行为来规避服务器问题。这最终可能导致服务器过载(使在线企业耗费资金)、互联网透明度降低、公众在隐私问题上的不信任加重。
8 网页数据抓取大有裨益,但这有赖于有自由且透明的互联网可用。我确信,如果我们能遵循一些准则,使局面对每个人都公平,那么网页数据抓取将有益于整个科技领域:
1. 只抓取公开的网页
2. 研究目标网站的法律文件以确定你依照法律是否接受其服务条款。如果接受,确定自己是否不会违背
3. 合理请求数据以保证服务器功能不受损害(DDoS 攻击)
4. 尊重源网站对所获得的任何数据的隐私保护
5. 使用以合乎道德的手段获取的代理
9 众所周知,当今正在运行的某些代理,其获取方式并不道德。许多代理通常是人们从下载到个人设备里的应用程序中获取的。很难确定这些用户是否意识到了他们的设备正在被使用。但可以肯定的是,如果用户同意了具有误导性或是容易混淆的服务条款,从而不情愿地将个人设备变成住宅代理网络中的参与者,那么将这类程序用作代理一定是不道德的。
10 现代网页数据抓取的某些方面缺乏明确性,需要道德规范来为行业带来秩序。如果业内人士能够就专业的网页数据抓取方法达成共识,这将有助于维护一个公平、开放、自由的网络环境,使企业与消费者双赢。关于数据抓取在各行各业所能发挥的最大潜能,我们对此的了解仍处在早期阶段,所以让我们抓住这个大好时机,以最合乎道德的方式来推动创新、促进发展。 □
The internet is currently undergoing a phenomenon similar to the gold rushes of the early eighteenth century, specifically when it comes to data extraction. With data now dubbed by some analysts as the “new oil” in terms of its value, the field is still open to small and large players alike, which has led to some unprofessional activities that extend all the way to the acquisition of password-protected data.
2 While many websites do contain defensive measures such as IP bans, the invisible conflicts between scrapers[1] and servers are ongoing and gaining in intensity, due to increased competition and economic factors. Most people don’t realise these are taking place between e-commerce stores, although they are happily taking advantage of the low prices found on aggregator websites[2] like Expedia, Google Shopping, PriceGrabber and Skyscanner.
[1] scraper 网络爬虫,一种按照一定的规则,自动抓取万维网信息的程序或脚本。后文的抓取、爬取,均指从万维网上收集数据。
[2] aggregator website 聚合网站,指的是通过人为技术方式收集其他网站的热点内容,进而将相关链接内容分类聚合成为自己网站内容的网站。
3 Tools can be used for positive and negative purposes, and web scraping is no exception. A fairly common scenario is the scraping of personal data for marketing purposes. Hundreds of millions of users agree to release their data through terms of service agreements on e-commerce sites, whether they realise it or not. The issue with the exposed data, however, is that it has been extracted by social media agencies and used by now-defunct websites that create profiles and list personal details without user permission.
4 As a result, web scraping is increasingly subjected to negative press, which has made the public more aware of the value and privacy of their data. There is nothing inherently unethical about web scraping, as it automates activities that people often do manually. The main difference is that web scraping does it on a much bigger scale by using bots to crawl numerous websites and extract huge amounts of information in seconds.
5 Extracting publicly available data requires proxies[3]. In short, proxies act as intermediaries between the web scraper and the web server. Employing proxies allows data requests to be distributed evenly to the web server, ensuring that the data is requested at a fair rate, as well as providing anonymity for the requesting party.
[3] proxy 代理,一种特殊的网络服务。它允许客户端通过这个服务与服务器进行连接。
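As a rough illustration of what distributing requests evenly can look like, the sketch below assigns URLs to a pool of proxies in round-robin order. The proxy addresses and the `assign_proxies` helper are hypothetical, made up for this example and not taken from any particular proxy service, and no network request is actually sent.

```python
from collections import Counter
from itertools import cycle

# Hypothetical proxy endpoints (documentation-only addresses), for illustration.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


# Nine placeholder URLs spread over three proxies.
urls = [f"https://example.com/page/{i}" for i in range(9)]
plan = assign_proxies(urls, PROXIES)

# Count how many requests each proxy would carry.
load = Counter(proxy for _, proxy in plan)
print(load)
```

With nine URLs and three proxies, each proxy carries three requests, so no single address is responsible for the whole load. Round-robin is only one possible policy; a real scraper might also weigh proxy health or response times.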
6 Unethical scraping uses data extraction in a way that may compromise[4] privacy and result in server overload.
[4] compromise 危及,损害。
7 While many websites try to prevent it through IP bans, this is becoming futile[5] due to the use of proxies and their function in circumventing[6] server issues by simulating human behaviour. The end results can be server overloads that cost online businesses money, reduced internet transparency and more distrust from the public with respect to privacy issues.
[5] futile 徒劳的。
[6] circumvent 逃避(规则或限制)。
8 Web scraping has many benefits that depend upon the availability of a free and transparent internet. I believe it would benefit the entire tech space if we adopted a few guidelines in order to make the landscape fair for everyone:
1. Scrape publicly available web pages only
2. Study the target website’s legal documents to determine whether you can legally accept their terms of service and, if you do accept them, whether you can avoid breaching them
3. Make reasonable requests for data in order to ensure that server function is not compromised (DDoS attack[7])
4. Respect privacy concerns of source websites with regard to any data obtained
5. Make use of proxies procured in an ethical manner
[7] DDoS attack 即distributed denial-of-service attack,分散式阻断服务攻击,一种网络攻击手法。该手法的目的在于将目标电脑的网络资源及系统资源耗尽,待目标电脑负荷过重而倒下后,通过系统漏洞入侵目标电脑。
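Guideline 3 can be sketched as a simple throttle that enforces a minimum gap between consecutive requests. The `RateLimiter` class and the 0.05-second interval are assumptions chosen for illustration; what counts as a reasonable rate depends on the target site, and the article does not prescribe a number.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Sleep until min_interval has passed since the last call.

        Returns the number of seconds this call asked to sleep.
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# A real scraper would call limiter.wait() right before each HTTP request.
limiter = RateLimiter(min_interval=0.05)
delays = [limiter.wait() for _ in range(3)]
print(delays)
```

The first call goes through immediately, and each later call pauses just long enough to keep requests at least `min_interval` seconds apart. The clock and sleep functions are injectable so the throttle can be checked without real waiting.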
9 It is commonly known that some proxies operating today are not ethically sourced, with many often obtained through applications downloaded by people on their devices. Whether these individuals are aware that their device is being used is difficult to ascertain. What’s certain is that it’s definitely not ethical to use them as a proxy in cases where they consented to misleading or confusing terms of service that unwillingly turn their device into a participant in a residential proxy network.
10 Some aspects of modern web scraping activity lack clarity, and a code of ethics is needed to bring order to the industry. If those in the industry can come together in agreement over a professional approach to web scraping, it will help to maintain a fair, open and free internet that will benefit both businesses and consumers. We are still in the early stages of discovering the full potential of data scraping in different industries, so let’s take advantage of this golden opportunity to drive innovation and create growth in the most ethical way possible. ■