Notes on Web Scraping with Python

Uploaded by: r*** · IP location: Guizhou · Upload date: 2020-06-22 · Format: DOC · Pages: 5 · Size: 32KB


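1. Introduction to BeautifulSoup

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Open a URL. The host part of every URL is missing from these notes,
    # so paths are shown exactly as they appear in the source.
    html = urlopen("/pages/page1.html")
    # Build a BeautifulSoup object from the page's HTML (html.read())
    bsObj = BeautifulSoup(html.read(), "html.parser")
    print(bsObj.h1)  # extract the h1 tag

After importing urlopen, call html.read() to get the page's HTML content and pass it to the BeautifulSoup object. bsObj.h1 then extracts the h1 tag from that object. Information from any node of any HTML file can be extracted in the same way.

Handling exceptions

The line html = urlopen("/pages/page1.html") can fail in two ways:

- The page does not exist on the server (or an error occurs while retrieving it). The server returns an HTTP error and urlopen raises HTTPError:

      try:
          html = urlopen("/pages/page1.html")
      except HTTPError as e:
          print(e)
          # return None, break, or fall back to another plan
      else:
          pass  # the program continues

- The server does not exist (the site is down or the URL is mistyped). urlopen does not return None in this case; it raises URLError, so a check such as "if html is None" never fires. Catch the exception instead:

      try:
          html = urlopen("/pages/page1.html")
      except URLError as e:
          print("URL is not found")
      else:
          pass  # the program continues

HTTPError is a subclass of URLError, so when both are caught in separate clauses the HTTPError clause has to come first.

A first scraper:

    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
    from bs4 import BeautifulSoup

    def getTitle(url):
        # Network failure or HTTP error: no page to parse
        try:
            html = urlopen(url)
        except (HTTPError, URLError) as e:
            return None
        # A missing <html> or <head> tag makes the chained lookup fail
        try:
            bsObj = BeautifulSoup(html.read(), "html.parser")
            title = bsObj.html.head.title
        except AttributeError as e:
            return None
        return title

    title = getTitle("/#signin")
    if title is None:  # the notes had "title = None", which is a syntax error
        print("title could not be found")
    else:
        print(title)

2. Advanced HTML Parsing

Fetch the whole page at /pages/warandpeace.html, then create a BeautifulSoup object from it:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("/pages/warandpeace.html")
    bsObj = BeautifulSoup(html, "html.parser")

With the BeautifulSoup object, findAll can extract only the text contained in a particular tag:

    namelist = bsObj.findAll("span", {"class": "green"})
    for name in namelist:
        print(name.get_text())

This yields a Python list of the character names.

find() and findAll()

    findAll(tag, attributes, recursive, text, limit, keywords)
    find(tag, attributes, recursive, text, keywords)

tag: a tag name, or a list of tag names, to match. For example:

    .findAll(["h1", "h2", "h3", "h4", "h5"])

attributes: a Python dictionary of attributes and their values. For example, this returns both the red and the green span tags:

    .findAll("span", {"class": {"green", "red"}})

text: match on the text content of tags. For example:

    namelist = bsObj.findAll(text="the prince")
    print(len(namelist))

Other BeautifulSoup objects

- BeautifulSoup objects
- Tag objects
- NavigableString objects
- Comment objects

Navigating the tree

1. Children and other descendants: .children and .descendants. Both are attributes of a tag, not functions. To get only the direct children, use .children:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("/pages/warandpeace.html")
    bsObj = BeautifulSoup(html, "html.parser")
    for child in bsObj.find("table", {"id": "giftlist"}).children:
        print(child)

2. Siblings: .next_siblings makes collecting table data simple:

    for sibling in bsObj.find("table", {"id": "giftlist"}).tr.next_siblings:
        print(sibling)

This prints every product row in the table except the title row, because a tag cannot be its own sibling.

3. Parents: .parent and .parents.

Regular expressions and BeautifulSoup

Accessing attributes

For a tag object, all of its attributes are available through:

    myTag.attrs

This returns a Python dictionary, so the attributes can be read and modified. For example, to get the source location (src) of an image:

    myImgTag.attrs["src"]

Lambda expressions

For example, to select every tag that has exactly two attributes:

    soup.findAll(lambda tag: len(tag.attrs) == 2)

3. Starting to Crawl

Traversing a single domain

Python code that fetches any Wikipedia page and extracts the links on it:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # url: the address of any Wikipedia page; it is missing from the source
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a"):
        if "href" in link.attrs:
            print(link.attrs["href"])

The output includes links that are not wanted, such as sidebar and footer links. Links that point to article pages have three things in common: they sit inside the div whose id is bodyContent, their URLs contain no colon, and their URLs share a common prefix (/wiki/ in the regular expression below). The loop header can therefore be changed to:

    for link in bsObj.find("div", {"id": "bodyContent"}).findAll(
            "a", href=re.compile("^(/wiki/)((?!:).)*$")):

A random walk from one article to the next:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import datetime
    import random
    import re

    # Placeholder: the site root was stripped from the source
    BASE_URL = "..."

    # Seed with a number; recent Python versions reject a datetime object
    random.seed(datetime.datetime.now().timestamp())

    def getLinks(articleUrl):
        html = urlopen(BASE_URL + articleUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        return bsObj.find("div", {"id": "bodyContent"}).findAll(
            "a", href=re.compile("^(/wiki/)((?!:).)*$"))

    links = getLinks("/wiki/Kevin_Bacon")
    while len(links) > 0:
        # Pick a random article link on the current page and follow it
        newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)

Crawling an entire site

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    BASE_URL = "..."  # placeholder: the site root was stripped from the source
    pages = set()     # links that have already been collected

    def getLinks(pageUrl):
        global pages
        html = urlopen(BASE_URL + pageUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            if "href" in link.attrs:
                if link.attrs["href"] not in pages:
                    # We have encountered a new page
                    newPage = link.attrs["href"]
                    print(newPage)
                    pages.add(newPage)
                    getLinks(newPage)

    getLinks("")

getLinks is first called with an empty URL, which resolves to the Wikipedia home page because the function prepends the site root to whatever it is given. It then walks every link on that page and checks whether the link is already in the global set pages (the set of pages collected so far). If it is not, the link is printed, added to pages, and passed to getLinks recursively.

Collecting data across an entire site

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    BASE_URL = "..."  # placeholder: the site root was stripped from the source
    pages = set()

    def getLinks(pageUrl):
        global pages
        html = urlopen(BASE_URL + pageUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        try:
            print(bsObj.h1.get_text())
            print(bsObj.find(id="mw-content-text").findAll("p")[0])
            print(bsObj.find(id="ca-edit").find("span").find("a").attrs["href"])
        except AttributeError:
            print("The page is missing some attributes, but there is no need to worry")

The source text breaks off here, partway through the loop over the page's links ("for lin").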
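The two failure modes described under "Handling exceptions" can be exercised without a network connection. The sketch below is only an illustration: fetch, opener, and the three stand-in functions are hypothetical names, with opener taking the place of urlopen so that each error path can be triggered on purpose.

```python
import io
from urllib.error import HTTPError, URLError


def fetch(url, opener):
    """Return opener(url), or None if the request fails.

    `opener` is a hypothetical stand-in for urllib.request.urlopen.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        print("server returned an error:", e.code)
        return None
    except URLError as e:
        print("server could not be reached:", e.reason)
        return None


def missing_page(url):
    # Simulates a server that answers with 404
    raise HTTPError(url, 404, "Not Found", None, io.BytesIO(b""))


def missing_server(url):
    # Simulates a host that cannot be contacted at all
    raise URLError("name resolution failed")


def working_page(url):
    return "<html><h1>ok</h1></html>"


results = [
    fetch("/pages/page1.html", missing_page),
    fetch("/pages/page1.html", missing_server),
    fetch("/pages/page1.html", working_page),
]
```

Returning None from the wrapper gives the caller a single value to test, which is the same shape getTitle uses in the notes.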
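The attribute filters and the lambda filter from section 2 are easier to check against a small inline document than against a live page. The HTML below is made up for illustration and is not the War and Peace page from the notes. The sketch uses find_all, the current spelling of findAll in Beautiful Soup 4, and assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the notes
html = """
<html><body>
<h1>Sample</h1>
<p><span class="green">Anna Pavlovna</span> spoke to
<span class="red">the prince</span>.</p>
<p><span class="green">Prince Vasili</span> replied.</p>
<img src="a.jpg" alt="portrait">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of every span whose class is "green"
green_names = [s.get_text() for s in soup.find_all("span", {"class": "green"})]

# A list of values matches a span carrying any one of them
green_or_red = soup.find_all("span", {"class": ["green", "red"]})

# A function filter: keep tags that have exactly two attributes
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)

# .attrs is a dictionary, so a single attribute is read by key
img_src = two_attrs[0].attrs["src"]
```

Only the img tag has two attributes here (src and alt), so the lambda filter returns a single result.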
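The regular expression used to pick out article links in section 3 can be tried on its own with the standard re module. The href values below are invented examples of the kinds of links a page might contain; only the pattern itself comes from the notes.

```python
import re

# Starts with /wiki/ and contains no colon anywhere after it
article_link = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical href values for illustration
hrefs = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_actors",   # colon: a namespace page
    "/wiki/File:Kevin_Bacon.jpg",       # colon: a file page
    "//example.org/wiki/Python",        # does not start with /wiki/
    "#cite_note-1",                     # in-page anchor
    "/wiki/Six_Degrees_of_Kevin_Bacon",
]

article_links = [h for h in hrefs if article_link.search(h)]
```

The negative lookahead (?!:) is applied before each character is consumed, which is what rules out a colon at any position after the prefix.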
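The site-wide crawl in section 3 relies on a set of already-seen links to avoid visiting a page twice. The sketch below shows that bookkeeping on an in-memory link graph, so it runs without any network access. SITE, crawl, and get_links are hypothetical names. It also swaps the recursion from the notes for an explicit stack, since Python's default recursion limit (about 1000 frames) would stop a recursive crawl on a large site.

```python
# Hypothetical link graph: path -> links found on that page
SITE = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C", ""],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": ["/wiki/A", "/wiki/D"],
    "/wiki/D": [],
}


def crawl(start, get_links):
    """Return every link reachable from start, in order of discovery."""
    pages = {start}      # links already seen, as in the notes' global set
    discovered = []
    stack = [start]      # pages whose links still need to be read
    while stack:
        page = stack.pop()
        for link in get_links(page):
            if link not in pages:
                # A new page: record it and schedule it for a visit
                pages.add(link)
                discovered.append(link)
                stack.append(link)
    return discovered


discovered = crawl("", lambda page: SITE[page])
```

Each page's links are read once even though the graph contains cycles, because a link is pushed onto the stack only the first time it is seen.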
