Textbook: Web Scraping with Python: Collecting Data from the Modern Web, © 2015 by Ryan Mitchell

Web crawlers get their name because they crawl across the web. At their core they are recursive: to find URL links, a crawler must first retrieve a page, examine its content, look for another URL, then retrieve the page at that URL, and repeat the process over and over.

Extracting the links on a page:

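from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
# Name the parser explicitly so BeautifulSoup does not have to guess one
bsObj = BeautifulSoup(html, "html.parser")

# Print the href of every <a> tag that has one
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])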
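To experiment with link extraction without network access, the same idea can be sketched using only the standard library's html.parser module instead of BeautifulSoup. This is a swapped-in technique, not the book's method; the LinkCollector class, the extract_links helper, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; anchors without href are skipped
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Made-up sample page, not real Wikipedia markup
SAMPLE_HTML = """
<div id="bodyContent">
  <a href="/wiki/Footloose_(1984_film)">Footloose</a>
  <a name="top">no href here</a>
  <a href="/wiki/Category:American_actors">Category</a>
</div>
"""

for href in extract_links(SAMPLE_HTML):
    print(href)
```

Only the two anchors that carry an href are printed, which mirrors the `'href' in link.attrs` check in the BeautifulSoup version.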

Filtering out unwanted links:

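Take extracting only "article links" as an example. Compared with "other links", article links:

- all sit inside the div tag whose id is bodyContent
- have URLs that contain no colon
- have URLs that begin with /wiki/

Use the find() and findAll() methods together with a regular expression to filter out the "other links":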

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with the current timestamp; newer Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only links inside bodyContent that start with /wiki/ and contain no colon
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")

# Follow a random article link from each page until a page has none
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
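To see what the regular expression used in getLinks accepts and rejects, it can be tried on a handful of hrefs without touching the network. The sample strings and the is_article_link helper below are made up for illustration.

```python
import re

# Same pattern as in getLinks: starts with /wiki/ and has no colon afterwards
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")

# Made-up sample hrefs of the kinds found on a Wikipedia page
samples = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_film_actors",
    "/wiki/Talk:Kevin_Bacon",
    "//wikimediafoundation.org/",
    "#cite_note-1",
]


def is_article_link(href):
    """Return True if href looks like an article link under the pattern above."""
    return ARTICLE_LINK.match(href) is not None


for href in samples:
    print(href, is_article_link(href))
```

Only the first sample passes. The negative lookahead `(?!:)` is checked before every character after `/wiki/`, so a colon anywhere in the rest of the path rejects the link, which is what removes namespace pages such as Category: and Talk:.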

Deduplicating links:

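To avoid collecting the same page twice, deduplicating links is essential. While the code runs, every link discovered so far is gathered in one place and stored in a structure that is quick to query (the example below uses Python's set type). Only "new" links are collected, and each new page is then searched for further links.

The code walks through every link on the home page and checks whether it is already in the global set pages (the set of pages collected so far). If it is not, the link is printed to the screen and added to pages, and getLinks is called recursively on it.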

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Every link seen so far; a set makes the membership test fast
pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

# Start from the home page (empty path)
getLinks("")
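The recursive getLinks above calls itself once for every new link, and CPython's default recursion limit is 1000, so on a large site it can eventually fail with RecursionError. One possible way to keep the same set-based deduplication without deep recursion is an explicit queue. The sketch below runs against a small in-memory link graph rather than a real site; SITE, get_links, and crawl are hypothetical names, and in a real crawler get_links would fetch and parse the page instead.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it
SITE = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C", ""],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": ["/wiki/A", "/wiki/C"],
}


def get_links(page_url):
    """Stand-in for fetching a page and extracting its /wiki/ links."""
    return SITE.get(page_url, [])


def crawl(start_url):
    """Visit every reachable page exactly once and return the visit order."""
    pages = {start_url}          # everything seen so far
    order = []                   # pages in the order they were visited
    queue = deque([start_url])   # pages seen but not yet visited
    while queue:
        current = queue.popleft()
        order.append(current)
        for link in get_links(current):
            if link not in pages:
                # New page: record it before queueing so it is never queued twice
                pages.add(link)
                queue.append(link)
    return order


print(crawl(""))
```

Each page appears once in the result even though the sample graph contains cycles and a self-link, because a link is added to pages at the moment it is first discovered.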

A combined program that collects data across the whole site:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        # Title, first paragraph of the body text, and the edit link
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        # A missing element makes find() return None, so the next call raises AttributeError
        print("This page is missing something! No worries though!")

    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

Python Web Crawler Study Notes, CH3: Collecting Data
