Launching one or more Scrapy spiders through the core API

1. You can use the API to run Scrapy from a script instead of the typical way, scrapy crawl. Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor. Two APIs can run one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.

2. The first utility for launching spiders is scrapy.crawler.CrawlerProcess. This class starts the Twisted reactor for you, configures logging, and sets shutdown handlers. It is the class used by all Scrapy commands.

Example running a single spider:


import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

Inside a Scrapy project, you can pass the project settings to CrawlerProcess by using get_project_settings to obtain a Settings instance.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished

Another Scrapy utility gives you finer control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper around some simple helpers for running multiple crawlers, but it will not start or interfere with an existing reactor in any way.

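With this class, you run the reactor explicitly. If your application already uses Twisted (for example, a crawl is already running) and you want to start Scrapy in the same process, use CrawlerRunner instead of CrawlerProcess.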

Note that you have to shut down the Twisted reactor yourself once the spider has finished, which you do by adding a callback to the Deferred returned by the CrawlerRunner.crawl method.


Here is an example of its usage, with a callback that manually stops the reactor after MySpider has finished running.

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Running multiple spiders in the same process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

Here is an example that runs multiple spiders simultaneously:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

The same example using CrawlerRunner:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished

The same example, but running the spiders one after another by chaining the Deferreds (each crawl still runs asynchronously):

import scrapy  # needed for scrapy.Spider below
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
