1. Instead of the typical scrapy crawl command, you can use the API to run Scrapy from a script. Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor. Two APIs can run one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.
2. The first utility for starting spiders is scrapy.crawler.CrawlerProcess. This class starts the Twisted reactor for you, configures logging, and sets up shutdown handlers. It is the class used by all Scrapy commands.
Example that runs a single spider:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
Inside a Scrapy project, you can pass a spider's name and its arguments to CrawlerProcess.crawl, and use get_project_settings to obtain a Settings instance that holds the project's settings.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
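There is another Scrapy utility that gives you more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper around a few simple helpers for running multiple spiders, but it will not start or interfere with existing reactors in any way.
When using this class, you have to run the reactor explicitly after scheduling your spiders. If your application already uses Twisted and you want to run Scrapy in the same reactor, CrawlerRunner is recommended over CrawlerProcess.
Note that you also have to shut down the Twisted reactor yourself once the spider has finished. You can do this by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.
Here is an example of its usage, with a callback that manually stops the reactor after MySpider has finished running.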
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
Running multiple spiders in the same process
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
The same example using CrawlerRunner:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
The same example, but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next.
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished