Python Web Scraping: Day 06

Processes

Process concepts

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n4" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">In CPython, multithreading is not truly parallel: the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. To make full use of a multi-core CPU, most Python programs need multiple processes.

What a process is:
A process is one execution of a program, that is, a running task. The CPU is what actually executes the task.

Process lifetime:
When the operating system needs to carry out a task, it creates a process. When the process finishes, the system destroys it and reclaims the resources it held. The span from creation to destruction is the lifetime of the process.

Processes run concurrently:
A system has many processes at the same time, and they take turns using the CPU and other resources.

Parallelism vs. concurrency:
To the user, both look like tasks running at the same time. A process or a thread is just a task;
the CPU does the actual work, and a single-core CPU can execute only one task at any instant.
Parallelism: multiple tasks truly run at the same instant. This requires multiple CPUs (or cores); with N of them, N tasks can execute at the same instant.
Number of CPUs >= number of tasks
Concurrency: pseudo-parallelism. Tasks appear to run simultaneously, but a single CPU is actually switching back and forth between them.
Number of CPUs < number of tasks

Synchronous vs. asynchronous:
Synchronous: when a process issues a request that takes a while to return, the process waits until the result arrives before continuing.
Asynchronous: the process does not wait; it carries on with other work regardless of the state of the request. When the result arrives, the system notifies the process so it can handle it, which can improve throughput.
For example, a phone call is synchronous communication, while sending a text message is asynchronous.

Threads vs. processes:
For CPU-bound work, use multiple processes;
for I/O-bound work, use multiple threads. Creating a thread is much cheaper than creating a process.
</pre>
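The advice that I/O-bound work suits threads can be illustrated with a minimal sketch. It assumes a hypothetical `fake_download` function that sleeps in place of a real network request; `run_sync` and `run_threaded` are likewise illustrative names, not part of the original notes.

```python
import threading
import time


def fake_download(page):
    """Hypothetical stand-in for an I/O-bound request: sleeps instead of using the network."""
    time.sleep(0.1)
    return page * 2


def run_sync(pages):
    # Synchronous: each "download" must finish before the next one starts.
    return [fake_download(p) for p in pages]


def run_threaded(pages):
    # One thread per page: the waits overlap, so total time is roughly one sleep.
    results = [None] * len(pages)

    def worker(index, page):
        results[index] = fake_download(page)

    threads = [threading.Thread(target=worker, args=(i, p)) for i, p in enumerate(pages)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    pages = [1, 2, 3, 4]
    for runner in (run_sync, run_threaded):
        start = time.perf_counter()
        result = runner(pages)
        print(runner.__name__, result, "%.2fs" % (time.perf_counter() - start))
```

Both functions return the same results; only the elapsed time should differ, because a sleeping thread releases the GIL while it waits.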

Creating processes

Using multiprocessing.Process

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n7" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import time

def func(arg):
    pname = multiprocessing.current_process().name
    pid = multiprocessing.current_process().pid
    print("Current process ID=%d, name=%s" % (pid, pname))

    for i in range(5):
        print(arg)
        time.sleep(1)

if __name__ == "__main__":
    p = multiprocessing.Process(target=func, args=("hello",))
    p.daemon = True  # daemon process: terminated when the main process exits
    p.start()

    # Poll a few times, then let the main process exit while the child is
    # still running; the daemon child is terminated along with it.
    for _ in range(3):
        print("Is the child process alive?", p.is_alive())
        time.sleep(1)
    print("main over")
</pre>
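When the main process should wait for a child instead of polling `is_alive()`, `join()` is the usual tool. The following is a minimal sketch under that assumption; the `greet` function is a hypothetical worker introduced only for illustration.

```python
import multiprocessing


def greet(arg, times):
    """Hypothetical worker: print the argument a fixed number of times."""
    for _ in range(times):
        print(arg)


if __name__ == "__main__":
    p = multiprocessing.Process(target=greet, args=("hello", 3))
    p.start()
    p.join()  # block until the child has finished
    # After join() the child is no longer alive; exitcode is 0 on a clean exit.
    print("alive:", p.is_alive(), "exit code:", p.exitcode)
```

Unlike the daemon example, this child is not a daemon, so the main process would wait for it at exit even without the explicit `join()`.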

Custom processes by subclassing Process

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n9" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import os

# Define a custom process by subclassing Process
class MyProcess(multiprocessing.Process):
    def __init__(self, name, url):
        super().__init__()
        self.name = name
        self.url = url  # custom attribute

    # Override run(); it executes in the child process after start()
    def run(self):
        pid = os.getpid()
        ppid = os.getppid()
        pname = multiprocessing.current_process().name
        print("Current process name:", pname)
        print("Current process ID:", pid)
        print("Parent process ID:", ppid)

if __name__ == '__main__':
    # Create 3 processes
    MyProcess("Squad 1", "").start()
    MyProcess("Squad 2", "").start()
    MyProcess("Squad 3", "").start()
    print("Main process ID:", multiprocessing.current_process().pid)

    # Number of CPU cores
    coreCount = multiprocessing.cpu_count()
    print("This machine has %d CPU cores" % coreCount)

    # List the child processes that are still alive
    print(multiprocessing.active_children())</pre>

Synchronous and asynchronous execution, and process locks

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n11" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import random
import time

def fn():
    name = multiprocessing.current_process().name
    print("Process started:", name)
    time.sleep(random.randint(1, 4))
    print("Process finished:", name)

# Multiple processes
# Asynchronous: start both processes without waiting for either
def processAsync():
    p1 = multiprocessing.Process(target=fn, name="Squad 1")
    p2 = multiprocessing.Process(target=fn, name="Squad 2")
    p1.start()
    p2.start()

# Synchronous: join() blocks, so p2 starts only after p1 has finished
def processSync():
    p1 = multiprocessing.Process(target=fn, name="Squad 1")
    p2 = multiprocessing.Process(target=fn, name="Squad 2")
    p1.start()
    p1.join()
    p2.start()
    p2.join()

# With a lock
def processLock():
    # Process lock shared by both children
    lock = multiprocessing.Lock()
    p1 = multiprocessing.Process(target=fn2, name="Squad 1", args=(lock,))
    p2 = multiprocessing.Process(target=fn2, name="Squad 2", args=(lock,))
    p1.start()
    p2.start()

def fn2(lock):
    name = multiprocessing.current_process().name
    print("Process started:", name)

    # Acquire the lock
    # Option 1: explicit acquire/release
    # if lock.acquire():
    #     print("Working...")
    #     time.sleep(random.randint(1, 4))
    #     lock.release()

    # Option 2: context manager (releases the lock even if an exception is raised)
    with lock:
        print("%s: working..." % name)
        time.sleep(random.randint(1, 4))

    print("%s: finished" % name)

if __name__ == '__main__':
    # processAsync()  # asynchronous execution
    # processSync()  # synchronous execution
    processLock()  # with a process lock
</pre>
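A lock matters most when processes update shared state. The sketch below is an illustration under assumptions not found in the original notes: a shared integer created with `multiprocessing.Value` and a hypothetical `add_many` worker. It shows the lock guarding a read-modify-write update.

```python
import multiprocessing


def add_many(counter, lock, times):
    """Hypothetical worker: increment a shared counter, holding the lock for each update."""
    for _ in range(times):
        with lock:  # the read-modify-write must not interleave with other processes
            counter.value += 1


if __name__ == "__main__":
    lock = multiprocessing.Lock()
    # lock=False creates a raw shared integer, so the explicit Lock is the only guard.
    counter = multiprocessing.Value("i", 0, lock=False)
    workers = [
        multiprocessing.Process(target=add_many, args=(counter, lock, 1000))
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("final count:", counter.value)
```

Without the `with lock:` line, increments from different processes could overwrite each other and the final count could come out lower than the number of increments performed.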

Limiting the maximum number of concurrent processes with Semaphore

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n13" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import time

def fn(sem):
    with sem:  # at most 3 processes can be inside this block at once
        name = multiprocessing.current_process().name
        print("Child process started:", name)
        time.sleep(3)
        print("Child process finished:", name)

if __name__ == '__main__':
    sem = multiprocessing.Semaphore(3)
    for i in range(8):
        multiprocessing.Process(target=fn, name="Squad %d" % i, args=(sem,)).start()
</pre>

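Exercise: scrape Lianjia with multiple processes https://sz.lianjia.com/ershoufang/rs/
Exercise: scrape Lianjia with multiple processes plus coroutines https://sz.lianjia.com/ershoufang/rs/
Exercise: scrape the paginated Douyu streamer listing with multiple threads https://www.douyu.com/gapi/rkc/directory/2_201/4
Exercise: scrape the paginated Douyu streamer listing with multiple processes https://www.douyu.com/gapi/rkc/directory/2_201/4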

Extensions

Thread pool

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n21" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import threading
import threadpool  # third-party package

import time
import random

# =================================================================
def fn(who):
    tname = threading.current_thread().name

    print("%s started %s..." % (tname, who))
    time.sleep(random.randint(1, 5))
    print("-----%s,%s-----" % (tname, who))

# =================================================================
# Callback invoked when a request finishes
# request = the completed request
# result = the task's return value
def cb(request, result):
    print("cb", request, result)

if __name__ == '__main__':
    # Create a thread pool with a maximum concurrency of 4 (4 threads)
    pool = threadpool.ThreadPool(4)

    argsList = ["Zhang Sanfeng", "Zhao Si", "Wang Wu", "Liu Ye", "Hong Qigong", "Zhu Chongba"]
    # Build the requests, with a completion callback
    requests = threadpool.makeRequests(fn, argsList, callback=cb)

    for req in requests:
        pool.putRequest(req)

    # Block until every request has returned (the pool's worker threads are daemon threads by default)
    pool.wait()
    print("Over")
</pre>
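If the third-party `threadpool` package is not available, the standard library's `concurrent.futures` offers a similar pattern. The sketch below is an alternative under that assumption, not the approach used in the notes above; `greet` and `run_pool` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def greet(who):
    """Hypothetical task: report which pool thread handled the argument."""
    tname = threading.current_thread().name
    return "%s handled %s" % (tname, who)


def run_pool(names, max_workers=4):
    """Submit one task per name and collect results as they complete."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(greet, who): who for who in names}
        for future in as_completed(futures):  # yields futures in completion order
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    for who, message in run_pool(["Zhang Sanfeng", "Zhao Si", "Wang Wu"]).items():
        print(who, "->", message)
```

Leaving the `with` block waits for all submitted tasks, which plays the role of `pool.wait()` in the `threadpool` version.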

Process pool

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n23" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import random
import time

def fn1(arg, name):
    print("Running task 1: {}...".format(arg))
    time.sleep(random.randint(1, 5))
    print("Process %d done!" % name)
    return name  # the return value is passed to the apply_async callback

def fn2(arg, name):
    print("Running task 2: {}...".format(arg))
    time.sleep(random.randint(1, 5))
    print("Process %d done!" % name)
    return name

# Callback: receives the task's return value
def onback(result):
    print("Got result {}".format(result))

if __name__ == "__main__":
    # Functions to run concurrently
    funclist = [fn1, fn2, fn1, fn2]

    # Create a process pool with 3 worker processes
    pool = multiprocessing.Pool(3)

    # Submit each function in the list to the pool
    for i in range(len(funclist)):
        # Synchronous: blocks until the task finishes
        # pool.apply(func=funclist[i], args=("hello", i))
        # Asynchronous: returns immediately; onback runs when the task completes
        pool.apply_async(func=funclist[i], args=("hello", i), callback=onback)

    pool.close()  # close the pool; no new tasks are accepted
    pool.join()  # block the main process until every task in the pool has finished
</pre>
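When every task runs the same function over a list of inputs, `Pool.map` is often simpler than submitting tasks one by one. This is a minimal sketch assuming a hypothetical `square` function as the workload.

```python
import multiprocessing


def square(n):
    """Hypothetical CPU-bound task."""
    return n * n


if __name__ == "__main__":
    # The with-block shuts the pool down once map() has returned all results.
    with multiprocessing.Pool(3) as pool:
        # map() splits the inputs across the workers and preserves input order.
        print(pool.map(square, range(6)))
```

`map` blocks until all results are ready, so no explicit `close()` and `join()` calls are needed in this form.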

Introduction to the Scrapy framework

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" contenteditable="true" cid="n26" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is widely used.
With Scrapy, you only need to customize a few modules to build a crawler that fetches page content and images.
Scrapy handles network communication with Twisted, an event-driven asynchronous networking framework (its main rival is Tornado). This speeds up downloads without you having to implement an asynchronous framework yourself, and Scrapy also exposes a range of middleware interfaces for meeting different needs flexibly.</pre>

Scrapy architecture diagram

[Figure: Scrapy architecture diagram; the image failed to upload in the original post]

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n31" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Scrapy consists mainly of the following components:
Scrapy Engine:
Handles communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler:
Accepts the Requests sent by the engine, organizes and queues them in a defined order, and hands them back when the engine asks for them.

Downloader:
Downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.

Spider:
Processes all Responses, parses them to extract the data needed for the Item fields, and submits any URLs to follow to the engine, which sends them back into the Scheduler.

Item Pipeline:
Post-processes the Items produced by the Spider (detailed analysis, filtering, storage, and so on).

Downloader Middlewares:
Components that let you customize and extend the download functionality.

Spider Middlewares:
Components that let you customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).
</pre>

Installing Scrapy

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n34" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Installing Scrapy
Official Scrapy documentation: http://doc.scrapy.org/en/latest
Chinese-language Scrapy documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Installation steps:
1. Install wheel
pip install wheel
2. Install lxml
pip install lxml
3. Install pyopenssl
pip install pyopenssl
4. Install Twisted
Download Twisted yourself, then install it. The page below hosts prebuilt Python packages; pick the Twisted build that matches your Python version and operating system: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

# For Python 3.6 (the number after "cp" is the Python version)
pip install Twisted-18.9.0-cp36-cp36m-win_amd64.whl

5. Install pywin32 (Windows only)
pip install pywin32
6. Install scrapy
pip install scrapy

After installation, type scrapy in a terminal to check that it installed correctly.
</pre>

Using Scrapy

Building a crawler follows these steps:

  1. Create a Scrapy project

  2. Define the Items to extract

  3. Write a spider that crawls the site and extracts Items

  4. Write an Item Pipeline to store the extracted Items (the data)

1. Create a project (scrapy startproject)

Create a new Scrapy project to crawl the data at http://www.meijutt.com/new100.html with the following command:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n50" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy startproject meiju</pre>

Create the spider

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n52" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">cd meiju
scrapy genspider meijuSpider meijutt.com

Where:
meijuSpider is the spider name, which also names the generated spider file
meijutt.com is the domain of the site to crawl</pre>

After the Scrapy project is created, several files are generated automatically. The main ones are described briefly below:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n54" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy.cfg:
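Project configuration. It mainly provides basic settings for the Scrapy command-line tool. (The crawler's own settings live in settings.py.)
items.py:
Defines the templates for stored data, used to structure the data, similar to a Django Model
pipelines.py:
Data-processing behavior, typically persisting the structured data
settings.py:
Configuration file, for example crawl depth, concurrency, and download delay
spiders:
Spider directory, where you create spider files and write the crawling rules

Note: spider files are usually named after the site's domain</pre>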

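2. Define the Item

An Item is a container for scraped data and is used much like a Python dict. Although Scrapy lets you use a plain dict directly, an Item adds a safeguard: assigning to a field that was not declared (for example, because of a typo) raises an error instead of silently creating a new key.

As with Model fields in an ORM, you declare the fields to scrape by subclassing scrapy.Item.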

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n58" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import scrapy

class MeijuItem(scrapy.Item):
    name = scrapy.Field()</pre>

3. Write the spider

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n60" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
import scrapy
from lxml import etree
from meiju.items import MeijuItem

class MeijuspiderSpider(scrapy.Spider):
    # Spider name
    name = 'meijuSpider'
    # Allowed domains
    allowed_domains = ['meijutt.com']
    # URLs to start crawling from
    start_urls = ['http://www.meijutt.com/new100.html']

    # Process the response
    def parse(self, response):
        # response is the response object; parse it with XPath
        mytree = etree.HTML(response.text)
        movie_list = mytree.xpath('//ul[@class="top-list fn-clear"]/li')

        for movie in movie_list:
            # xpath() returns a list; keep the first match, or None if there is none
            names = movie.xpath('./h5/a/text()')

            # Create the item (a dict-like object)
            item = MeijuItem()
            item['name'] = names[0] if names else None
            yield item
</pre>

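Enabling an Item Pipeline component

To enable an Item Pipeline component, add its class to the ITEM_PIPELINES setting in settings.py and assign it a priority. The integer assigned to each class determines the order in which the pipelines run: items pass through them from the lowest number to the highest. By convention these numbers are kept in the 0-1000 range (any value in the range will do; the lower the number, the higher the priority).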

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n63" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">ITEM_PIPELINES = {
'meiju.pipelines.MeijuPipeline': 300,
}</pre>
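To make the ordering rule concrete, here is a sketch with three pipelines. Only `MeijuPipeline` appears in the text above; `CleanNamePipeline` and `JsonExportPipeline` are hypothetical names, and the `run_order` line is purely illustrative and would not belong in a real settings.py.

```python
# Only MeijuPipeline comes from the notes; the other two entries are hypothetical.
ITEM_PIPELINES = {
    'meiju.pipelines.MeijuPipeline': 300,
    'meiju.pipelines.CleanNamePipeline': 100,   # hypothetical: lowest value, runs first
    'meiju.pipelines.JsonExportPipeline': 800,  # hypothetical: highest value, runs last
}

# Illustration only: items pass through the pipelines in ascending order of value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

The order of the keys in the dict does not matter; only the integer values decide the sequence.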

Setting the User-Agent

Set the value of USER_AGENT in settings.py

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n66" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
</pre>

4. Write a Pipeline to store the extracted Items (the data)

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n68" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">class SomethingPipeline(object):
    def __init__(self):
        # Optional: initialize parameters and so on
        pass

    def process_item(self, item, spider):
        # item (Item object): the scraped item
        # spider (Spider object): the spider that scraped the item
        # Required: Scrapy calls this method on every item pipeline component.
        # It must return an Item object (or raise DropItem); a dropped item is
        # not processed by any later pipeline component.
        return item

    def open_spider(self, spider):
        # spider (Spider object): the spider that was opened
        # Optional: called when the spider is opened.
        pass

    def close_spider(self, spider):
        # spider (Spider object): the spider that was closed
        # Optional: called when the spider is closed.
        pass
</pre>
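As one possible way to fill in the skeleton, the sketch below writes each item to a JSON Lines file. The class name `JsonLinesPipeline` and the default file name `meiju.jl` are assumptions made for illustration; the class needs no Scrapy import because a pipeline is a plain Python class.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON object per line."""

    def __init__(self, path="meiju.jl"):  # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item instances.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Close the file when the spider finishes.
        self.file.close()
```

To use a class like this, it would be placed in pipelines.py and registered in ITEM_PIPELINES, for example under the hypothetical key 'meiju.pipelines.JsonLinesPipeline'.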

Running the spider:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n70" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy crawl meijuSpider

# --nolog mode: suppress log output
scrapy crawl meijuSpider --nolog</pre>

The simplest way to save scraped data is the -o option, which writes the output to a file in the given format:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n72" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy crawl meijuSpider -o meiju.json
scrapy crawl meijuSpider -o meiju.csv
scrapy crawl meijuSpider -o meiju.xml</pre>

Exercise: use Scrapy to crawl Sina News and store the results in a database http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_1.shtml