Processes
The concept of a process
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n4" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Multithreading in Python is not truly parallel: in CPython the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. To make full use of a multi-core CPU, Python programs in most cases need multiple processes.

The concept of a process:
A process is one execution of a program, that is, a running activity or task; the CPU is what actually carries the task out.
Process lifetime:
When the operating system needs to carry out a task, it creates a process. When the process has finished the task, the system destroys it and reclaims the resources it held. The span from creation to destruction is the lifetime of the process.

Processes are concurrent:
A system hosts many processes at the same time. They take turns using the CPU and other resources.

Parallelism vs. concurrency:
Whether parallel or concurrent, tasks appear to the user to run at the same time. A process or a thread is just a task;
the real work is done by the CPU, and one CPU (single core) can execute only one task at any instant.
Parallel: several tasks really run at the same time. This is only possible with multiple CPUs; the number of CPUs is the number of tasks that can execute at the same instant.
number of CPUs >= number of tasks
Concurrent: pseudo-parallelism. Tasks appear to run simultaneously, but a single CPU is actually switching back and forth between programs.
number of CPUs < number of tasks

Synchronous vs. asynchronous:
Synchronous: when a process issues a request that takes some time to return, the process waits until it receives the reply before it continues.
Asynchronous: the process does not wait; it carries on with subsequent operations regardless of the state of other processes. When a reply arrives, the system notifies the process to handle it, which can improve efficiency.
For example, a phone call is synchronous communication; sending a text message is asynchronous communication.

Threads vs. processes:
For CPU-bound work, use multiple processes;
for I/O-bound work, use multiple threads. Creating a thread is much cheaper than creating a process.
</pre>
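The rule of thumb above (processes for CPU-bound work, threads for I/O-bound work) can be sketched with the standard library's `concurrent.futures`. This is only an illustration: the names `cpu_task` and `io_task` and the workload sizes are made up, and `time.sleep` stands in for real I/O.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_task(n):
    """CPU-bound stand-in: sum of the squares below n."""
    return sum(i * i for i in range(n))


def io_task(delay):
    """I/O-bound stand-in: sleeping simulates waiting on a network call."""
    time.sleep(delay)
    return delay


if __name__ == "__main__":
    # CPU-bound work: separate processes can run on separate cores.
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(cpu_task, [10, 100, 1000])))

    # I/O-bound work: threads overlap their waiting time.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=3) as pool:
        print(list(pool.map(io_task, [0.2, 0.2, 0.2])))
    print("elapsed: %.2fs" % (time.perf_counter() - start))
```

Because the three simulated I/O calls wait at the same time, the elapsed time should be close to one delay rather than the sum of all three.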
Creating processes
Using multiprocessing.Process
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n7" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import time

def func(arg):
    pname = multiprocessing.current_process().name
    pid = multiprocessing.current_process().pid
    print("Current process ID=%d, name=%s" % (pid, pname))

    for i in range(5):
        print(arg)
        time.sleep(1)

if __name__ == "__main__":
    p = multiprocessing.Process(target=func, args=("hello",))
    p.daemon = True  # daemon process: it is terminated when the main process exits
    p.start()

    # Poll once per second until the child process has finished
    while p.is_alive():
        print("Is the child process alive?", p.is_alive())
        time.sleep(1)
    print("main over")
</pre>
Defining a custom process by subclassing Process
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n9" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import os

# Define a custom process by subclassing Process
class MyProcess(multiprocessing.Process):
    def __init__(self, name, url):
        super().__init__()
        self.name = name
        self.url = url  # custom attribute

    # Override run(); this method executes in the child process
    def run(self):
        pid = os.getpid()
        ppid = os.getppid()
        pname = multiprocessing.current_process().name
        print("Current process name:", pname)
        print("Current process id:", pid)
        print("Parent process id:", ppid)

if __name__ == '__main__':
    # Create 3 processes
    MyProcess("Squad 1", "").start()
    MyProcess("Squad 2", "").start()
    MyProcess("Squad 3", "").start()
    print("Main process ID:", multiprocessing.current_process().pid)

    # Number of CPU cores
    coreCount = multiprocessing.cpu_count()
    print("This machine has %d CPU cores" % coreCount)

    # List the child processes that are still alive
    print(multiprocessing.active_children())</pre>
Synchronous and asynchronous execution, and process locks
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n11" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import random
import time

def fn():
    name = multiprocessing.current_process().name
    print("Process started:", name)
    time.sleep(random.randint(1, 4))
    print("Process finished:", name)

# Multiple processes
# Asynchronous execution: start both processes without waiting
def processAsync():
    p1 = multiprocessing.Process(target=fn, name="Squad 1")
    p2 = multiprocessing.Process(target=fn, name="Squad 2")
    p1.start()
    p2.start()

# Synchronous execution: wait for each process before starting the next
def processSync():
    p1 = multiprocessing.Process(target=fn, name="Squad 1")
    p2 = multiprocessing.Process(target=fn, name="Squad 2")
    p1.start()
    p1.join()
    p2.start()
    p2.join()

# With a lock
def processLock():
    # Process lock shared by both children
    lock = multiprocessing.Lock()
    p1 = multiprocessing.Process(target=fn2, name="Squad 1", args=(lock,))
    p2 = multiprocessing.Process(target=fn2, name="Squad 2", args=(lock,))
    p1.start()
    p2.start()

def fn2(lock):
    name = multiprocessing.current_process().name
    print("Process started:", name)

    # Acquire the lock
    # Option 1: explicit acquire/release
    if lock.acquire():
        try:
            print("Working...")
            time.sleep(random.randint(1, 4))
        finally:
            lock.release()

    # Option 2: context manager (releases the lock automatically)
    with lock:
        print("%s: working..." % name)
        time.sleep(random.randint(1, 4))

    print("%s: finished" % name)


if __name__ == '__main__':
    processAsync()  # asynchronous execution
    processSync()   # synchronous execution
    processLock()   # with a process lock
</pre>
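The lock above only serializes printing and sleeping. A more typical use is protecting shared state, which can be sketched as follows. The helper `add_many`, the worker count, and the iteration count are assumptions made for illustration; `multiprocessing.Value` with `lock=False` creates a raw shared integer so that the explicit `Lock` is the only protection.

```python
import multiprocessing


def add_many(counter, lock, times):
    """Increment a shared counter `times` times, holding the lock for each update."""
    for _ in range(times):
        with lock:
            counter.value += 1


if __name__ == "__main__":
    # Raw shared integer without a built-in lock, so the explicit lock matters.
    counter = multiprocessing.Value("i", 0, lock=False)
    lock = multiprocessing.Lock()

    workers = [
        multiprocessing.Process(target=add_many, args=(counter, lock, 1000))
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print("final count:", counter.value)
```

With the lock in place every increment is applied, so four workers doing 1000 increments each should leave the counter at 4000. Without the lock, the read-modify-write in `counter.value += 1` can interleave between processes and lose updates.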
Limiting the maximum concurrency of processes with Semaphore
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n13" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import time

def fn(sem):
    with sem:
        name = multiprocessing.current_process().name
        print("Child process started:", name)
        time.sleep(3)
        print("Child process finished:", name)

if __name__ == '__main__':
    sem = multiprocessing.Semaphore(3)  # at most 3 processes may hold the semaphore at once
    for i in range(8):
        multiprocessing.Process(target=fn, name="Squad %d" % i, args=(sem,)).start()
</pre>
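Exercise: scrape Lianjia with multiple processes https://sz.lianjia.com/ershoufang/rs/
Exercise: scrape Lianjia with multiple processes plus coroutines https://sz.lianjia.com/ershoufang/rs/
Exercise: scrape Douyu streamer listings page by page with multiple threads https://www.douyu.com/gapi/rkc/directory/2_201/4
Exercise: scrape Douyu streamer listings page by page with multiple processes https://www.douyu.com/gapi/rkc/directory/2_201/4
Extensions
Thread pool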
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n21" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import threading
import threadpool  # third-party package (pip install threadpool)

import time
import random

# =================================================================
def fn(who):
    tname = threading.current_thread().name

    print("%s starting %s..." % (tname, who))
    time.sleep(random.randint(1, 5))
    print("-----%s,%s-----" % (tname, who))

# =================================================================
# Callback invoked when a request has finished
# request = the completed request
# result = the task's return value
def cb(request, result):
    print("cb", request, result)

if __name__ == '__main__':

    # Create a thread pool with a maximum concurrency of 4 (4 threads)
    pool = threadpool.ThreadPool(4)

    argsList = ["Zhang Sanfeng", "Zhao Si", "Wang Wu", "Liu Ye", "Hong Qigong", "Zhu Chongba"]
    # Attach the callback
    requests = threadpool.makeRequests(fn, argsList, callback=cb)

    for req in requests:
        pool.putRequest(req)

    # Block until all requests have returned (the pool's worker threads are daemon threads by default)
    pool.wait()
    print("Over")
</pre>
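The same pattern can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor`, which needs no third-party package. The following is a sketch of that alternative, not a drop-in replacement for the code above: the helper `run_all` is a made-up name, and a short fixed sleep stands in for the random delay.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fn(who):
    tname = threading.current_thread().name
    time.sleep(0.1)  # stand-in for real work
    return "%s handled %s" % (tname, who)


def run_all(names, max_workers=4):
    """Run fn for every name on a pool of threads and collect the results."""
    results = []
    # Leaving the `with` block waits for all worker threads to finish.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, name) for name in names]
        # as_completed yields each future as soon as its task is done.
        for future in as_completed(futures):
            results.append(future.result())
    return results


if __name__ == "__main__":
    for line in run_all(["Zhang Sanfeng", "Zhao Si", "Wang Wu"]):
        print(line)
```

Here `future.result()` plays the role of the callback's `result` argument; `Future.add_done_callback` is available when a callback style is preferred.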
Process pool
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n23" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import multiprocessing
import random
import time

def fn1(arg, name):
    print("Running task 1: {}...".format(arg))
    time.sleep(random.randint(1, 5))
    print("Process %d done!" % (name))
    return name  # handed to the callback by apply_async

def fn2(arg, name):
    print("Running task 2: {}...".format(arg))
    time.sleep(random.randint(1, 5))
    print("Process %d done!" % (name))
    return name  # handed to the callback by apply_async


# Callback function
def onback(result):
    print("Got result {}".format(result))

if __name__ == "__main__":
    # List of functions to run concurrently
    funclist = [fn1, fn2, fn1, fn2]

    # Create a process pool with 3 worker processes
    pool = multiprocessing.Pool(3)

    # Walk the function list and submit each function to the pool
    for i in range(len(funclist)):
        # Synchronous: blocks until the task has finished
        pool.apply(func=funclist[i], args=("hello", i))
        # Asynchronous: returns immediately; onback receives the return value
        pool.apply_async(func=funclist[i], args=("hello", i), callback=onback)

    pool.close()  # close the pool; no new tasks are accepted
    pool.join()   # block the main process until every task in the pool has finished
</pre>
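When the return values matter more than a callback, `Pool.map` and the `AsyncResult` object returned by `apply_async` are often simpler. A minimal sketch, assuming a made-up worker function `square`:

```python
import multiprocessing


def square(x):
    return x * x


if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:
        # map() blocks and returns the results in input order.
        print(pool.map(square, range(5)))

        # apply_async() returns at once; get() blocks until the result is ready.
        pending = [pool.apply_async(square, (n,)) for n in range(5)]
        print([p.get(timeout=10) for p in pending])
```

The worker function is defined at module level so that it can be pickled and sent to the worker processes.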
Introduction to the Scrapy framework
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" contenteditable="true" cid="n26" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Scrapy is an application framework written in pure Python for crawling websites and extracting structured data; it is used very widely.
With Scrapy, you only need to customize a few modules to build a crawler that fetches page content and images.
Scrapy uses Twisted, an asynchronous networking framework (its main rival is Tornado), to handle network communication. This speeds up downloads without requiring you to write your own asynchronous framework, and Scrapy also exposes a range of middleware interfaces so that it can be adapted flexibly to different needs.</pre>
Scrapy architecture diagram
[Figure: Scrapy architecture diagram; the image failed to upload]
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n31" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Scrapy consists mainly of the following components:
Scrapy Engine:
Handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.

Scheduler:
Accepts the Requests sent by the engine, orders and enqueues them, and hands them back to the engine when the engine asks for them.

Downloader:
Downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.

Spider:
Processes all Responses, parses them to extract the data needed for the Item fields, and submits the URLs that need following to the engine, from where they enter the Scheduler again.

Item Pipeline:
Post-processes the Items obtained by the Spider (detailed analysis, filtering, storage, and so on).

Downloader Middlewares:
Components that let you customize and extend the download functionality.

Spider Middlewares:
Components that let you customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).
</pre>
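The way these components hand data to one another can be sketched as a toy loop in plain Python. This is not how Scrapy is implemented (Scrapy is asynchronous and far more elaborate); every name below, including `FAKE_SITE`, `download`, `parse`, and `run_engine`, is hypothetical and exists only to show the direction of the data flow.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, links found on the page)
FAKE_SITE = {
    "page1": ("title one", ["page2"]),
    "page2": ("title two", []),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    return FAKE_SITE[url]


def parse(response):
    """Spider: extract one item and the URLs to follow from a response."""
    text, links = response
    return {"name": text}, links


def run_engine(start_urls):
    """Engine: move requests and responses between the other components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # avoid scheduling the same URL twice
    stored = []                    # stands in for the Item Pipeline's storage
    while scheduler:
        url = scheduler.popleft()
        item, links = parse(download(url))
        stored.append(item)        # Item Pipeline: store the item
        for link in links:
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return stored


if __name__ == "__main__":
    print(run_engine(["page1"]))
```

The middlewares described above would sit on the two hand-off points in this loop: between the engine and `download`, and between the engine and `parse`.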
Installing Scrapy
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n34" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">How to install Scrapy
Scrapy official documentation: http://doc.scrapy.org/en/latest
Scrapy Chinese documentation (community-maintained): http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
Installation steps:
1. Install wheel
pip install wheel
2. Install lxml
pip install lxml
3. Install pyopenssl
pip install pyopenssl
4. Install Twisted
Download Twisted yourself and then install it. The following page hosts prebuilt packages for many Python dependencies; pick the Twisted build that matches your Python version and operating system: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
For Python 3.6 (the number after "cp" is the Python version):
pip install Twisted-18.9.0-cp36-cp36m-win_amd64.whl
5. Install pywin32
pip install pywin32
6. Install scrapy
pip install scrapy
After installation, type scrapy in a terminal to check that the installation succeeded.
</pre>
Using Scrapy
Building a crawler follows these steps:
Create a Scrapy project
Define the Items to extract
Write a spider that crawls the site and extracts Items
Write an Item Pipeline to store the extracted Items (the data)
1. Create a project (scrapy startproject)
Create a new Scrapy project to crawl the data at http://www.meijutt.com/new100.html with the following command:
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n50" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy startproject meiju</pre>
Create the spider
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n52" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">cd meiju
scrapy genspider meijuSpider meijutt.com

Where:
meijuSpider is the spider name (and the name of the generated spider file)
meijutt.com is the domain of the site to crawl</pre>
Creating a Scrapy project generates several files automatically. The main ones are described briefly below:
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n54" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy.cfg:
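Project configuration, mainly basic settings for the Scrapy command-line tool. (The settings that actually govern the crawler live in settings.py.)
items.py:
Defines the templates used to structure scraped data, similar to a Django Model.
pipelines.py:
Data-processing behavior, for example persisting the structured data.
settings.py:
Configuration file: crawl depth, concurrency, download delay, and so on.
spiders:
The spider directory, where you create spider files and write the crawl rules.
Note: spider files are usually named after the site's domain.</pre>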
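As an illustration of the kind of options settings.py holds, the fragment below sets the three examples mentioned above. `DEPTH_LIMIT`, `CONCURRENT_REQUESTS`, and `DOWNLOAD_DELAY` are Scrapy setting names; the values are arbitrary placeholders, not recommendations.

```python
# settings.py (fragment); the values are illustrative only
DEPTH_LIMIT = 3          # maximum crawl depth (0 means no limit)
CONCURRENT_REQUESTS = 8  # maximum number of concurrent requests
DOWNLOAD_DELAY = 1       # seconds to wait between requests to the same site
```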
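2. Define the Item
An Item is a container for scraped data. It is used much like a Python dict; a plain dict also works in Scrapy, but an Item adds protection against undefined-field errors caused by misspelled field names.
Much like defining fields on a Model in an ORM, you declare the fields to scrape by subclassing scrapy.Item.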
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n58" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import scrapy

class MeijuItem(scrapy.Item):
    name = scrapy.Field()</pre>
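The protection against misspelled field names can be pictured with a plain-Python stand-in. This is not Scrapy's implementation: `StrictItem` and `MeijuLikeItem` are hypothetical classes that only imitate the behavior in which assigning to an undeclared field raises `KeyError`.

```python
class StrictItem(dict):
    """Hypothetical stand-in: a dict that only accepts declared field names."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(
                "%s does not support field: %s" % (type(self).__name__, key)
            )
        super().__setitem__(key, value)


class MeijuLikeItem(StrictItem):
    fields = ("name",)


if __name__ == "__main__":
    item = MeijuLikeItem()
    item["name"] = "some show"
    try:
        item["nmae"] = "typo"  # misspelled field name
    except KeyError as exc:
        print("rejected:", exc)
```

A plain dict would silently accept the misspelled key, and the mistake would surface only later, when the expected field turned out to be missing.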
3. Write the spider
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n60" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
import scrapy
from lxml import etree
from meiju.items import MeijuItem

class MeijuspiderSpider(scrapy.Spider):
    # Spider name
    name = 'meijuSpider'
    # Allowed domains
    allowed_domains = ['meijutt.com']
    # URLs to start crawling from
    start_urls = ['http://www.meijutt.com/new100.html']

    # Process the response
    def parse(self, response):
        # response is the response object; parse it with XPath
        mytree = etree.HTML(response.text)
        movie_list = mytree.xpath('//ul[@class="top-list fn-clear"]/li')

        for movie in movie_list:
            # xpath() returns a list of the matching text nodes
            name = movie.xpath('./h5/a/text()')

            # Create the item (a dict-like object)
            item = MeijuItem()
            item['name'] = name
            yield item
</pre>
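Enable an Item Pipeline component
To enable an Item Pipeline component, add its class to the ITEM_PIPELINES setting in settings.py and assign it a priority. The integer assigned to each class determines the order in which the pipelines run: items pass through them from the lowest number to the highest. These numbers are conventionally kept in the 0-1000 range (any value in that range works; the lower the number, the higher the component's priority).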
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n63" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">ITEM_PIPELINES = {
'meiju.pipelines.MeijuPipeline': 300,
}</pre>
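The ordering rule can be illustrated with a few lines of plain Python. This is a sketch of the rule only, not of Scrapy's internals; `SaveToFilePipeline` is a hypothetical second pipeline added for the example, and `pipeline_order` is a made-up helper.

```python
# Hypothetical setting with two pipelines; SaveToFilePipeline is invented here.
ITEM_PIPELINES = {
    "meiju.pipelines.SaveToFilePipeline": 800,
    "meiju.pipelines.MeijuPipeline": 300,
}


def pipeline_order(pipelines):
    """Return the pipeline paths in the order items pass through them (low to high)."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    print(pipeline_order(ITEM_PIPELINES))
```

With these values an item would reach MeijuPipeline (300) before SaveToFilePipeline (800), regardless of the order in which the entries are written in the dict.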
Setting the User-Agent
Set the value of USER_AGENT in settings.py
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n66" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
</pre>
4. Write a Pipeline to store the extracted Items (the data)
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n68" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">class SomethingPipeline(object):
    def __init__(self):
        # Optional: initialize parameters and so on
        pass

    def process_item(self, item, spider):
        # item (Item object): the scraped item
        # spider (Spider object): the spider that scraped the item
        # This method is required; it is called on every item pipeline component.
        # It must return an Item object (or raise DropItem to discard the item);
        # dropped items are not processed by later pipeline components.
        return item

    def open_spider(self, spider):
        # spider (Spider object): the spider that was opened
        # Optional: called when the spider is opened.
        pass

    def close_spider(self, spider):
        # spider (Spider object): the spider that was closed
        # Optional: called when the spider is closed.
        pass
</pre>
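As a concrete instance of the skeleton above, the following sketch writes each item to a file as one JSON object per line. The class name `JsonLinesPipeline` and the default file name `items.jl` are assumptions made for this example; to use it in the project it would also need an entry in ITEM_PIPELINES (for instance under a path such as 'meiju.pipelines.JsonLinesPipeline', which is likewise hypothetical).

```python
import json


class JsonLinesPipeline:
    """Write every item to a file as one JSON object per line."""

    def __init__(self, path="items.jl"):  # the file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider is opened: open the output file once.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for dict-like Item objects.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called when the spider is closed: flush and close the file.
        self.file.close()
```

Opening the file in `open_spider` and closing it in `close_spider` keeps a single file handle for the whole crawl instead of reopening the file for every item.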
Run the spider:
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n70" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy crawl meijuSpider

# nolog mode: suppress log output
scrapy crawl meijuSpider --nolog</pre>
The simplest way to save what Scrapy extracts is the -o option, which writes the output to a file in the specified format:
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n72" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">scrapy crawl meijuSpider -o meiju.json
scrapy crawl meijuSpider -o meiju.csv
scrapy crawl meijuSpider -o meiju.xml</pre>