
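Requirement 1

Scrape all of the day's trending content from the Sina home page. Text posts yield text and image posts yield images, although in practice it is enough to collect the link to each post.

Scraped content is keyed primarily by the post's author.

Collect the author's nickname, the publication date, the post link, and the post title.

No login is needed here, because the trending content is on the Weibo home page. The main requirement is to use selenium as the rendering tool to collect the dynamic content.

Problem 1:

Sina Weibo loads more content as the page is scrolled down, so selenium has to execute JavaScript to perform the scroll.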

js='window.scrollTo(0,document.body.scrollHeight);'
browser.execute_script(js)
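
The scroll can be repeated until the page stops growing. The following is a minimal sketch of that idea, not the code used in this post: `scroll_until_stable`, `SCROLL_JS` and `HEIGHT_JS` are hypothetical names, and `browser` is assumed to be any object with an `execute_script(js)` method, such as a selenium WebDriver.

```python
import time

# Same scroll statement as above, plus a script that reports the page height.
SCROLL_JS = 'window.scrollTo(0,document.body.scrollHeight);'
HEIGHT_JS = 'return document.body.scrollHeight;'


def scroll_until_stable(browser, max_rounds=10, pause=1.0):
    """Scroll down until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that were performed.
    """
    last_height = browser.execute_script(HEIGHT_JS)
    rounds = 0
    while rounds < max_rounds:
        browser.execute_script(SCROLL_JS)
        rounds += 1
        time.sleep(pause)  # give the page time to load new items
        new_height = browser.execute_script(HEIGHT_JS)
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

With a real driver this would be called as `scroll_until_stable(browser, max_rounds=100, pause=2)`. A fixed pause is a simplification; a slow network may need a longer one.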

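Problem 2:

Locate the relevant elements, extract the required content, and clean it.

# user nickname and post time, read from the first info box in b (b is located a few lines below)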
name=b[0].find_element_by_xpath('//div[@class="list_des"]/div[@class="subinfo_box clearfix"]/a[2]/span').text
ptime=b[0].find_element_by_xpath('//div[@class="list_des"]/div[@class="subinfo_box clearfix"]/span[@class="subinfo S_txt2"]').text
# nickname and time combined in one query
browser.find_elements_by_xpath('//div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/div[@class="subinfo_box clearfix"]')
# full info boxes for all three list layouts
b=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/div[@class="subinfo_box clearfix"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]/div[@class="list_des"]/div[@class="subinfo_box clearfix"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]/div[@class="subinfo_box clearfix"]')
# extract user and date
for i in b:
    i.text.split('\n')[0:2]
# post text elements
b=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/h3|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]/div[@class="list_des"]/h3|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]/h3')
# post text with newlines and '#' removed
content=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/h3|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]/div[@class="list_des"]/h3|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]/h3')
content=[i.text.replace('\n','').replace('#','') for i in content]
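
The cleaning steps above work on plain strings, so they can be tried without a browser. This is a small sketch under assumptions: `parse_info` and `clean_title` are hypothetical helper names, and unlike the slicing above, `parse_info` also drops empty lines and tolerates a missing time.

```python
def parse_info(text):
    """Split the text of one info box into (nickname, post_time).

    Only the first two non-empty lines are used; missing parts become None.
    """
    parts = [p.strip() for p in text.split('\n') if p.strip()]
    name = parts[0] if len(parts) > 0 else None
    ptime = parts[1] if len(parts) > 1 else None
    return name, ptime


def clean_title(text):
    """Remove newlines and '#' topic markers from a post title."""
    return text.replace('\n', '').replace('#', '').strip()
```

For example, `[parse_info(i.text) for i in b]` would produce one `(nickname, post_time)` tuple per info box.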

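Problem 3:

Waiting. Selenium effectively opens a browser, so it behaves like a normal one. Depending on the network, the page may load slowly, and selenium may then try to locate an element before it has loaded and raise an error.

Here I use an explicit wait, plus a time.sleep delay each time new content is loaded. The call below waits up to 10 seconds for the element with the following ID to be loaded.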

"PCD_pictext_i_v5"

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(browser,10).until(EC.presence_of_element_located((By.ID,"PCD_pictext_i_v5")))
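
An explicit wait repeatedly checks a condition until it returns something truthy or the timeout expires. The sketch below shows that polling idea in plain Python; `wait_until` is a hypothetical helper written for illustration and is not part of selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} s'.format(timeout))
        time.sleep(poll)  # wait a little before checking again
```

The benefit over a fixed `time.sleep` is that the script continues as soon as the element appears, and only waits the full timeout when something is wrong.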

Below is the complete code. The result is saved as JSON at the end:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
import json
path='F:/NBA_TEAM/slang.json'
option=webdriver.FirefoxOptions()
option.add_argument('-headless')
browser=webdriver.Firefox(options=option)
browser.get('https://weibo.com/?category=0')
WebDriverWait(browser,10).until(EC.presence_of_element_located((By.ID,"PCD_pictext_i_v5")))
num=10
for i in range(num):
    # b=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div')
    # nickname and time combined in one query
    # browser.find_elements_by_xpath('//div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/div[@class="subinfo_box clearfix"]')
    # full info boxes for all three list layouts
    user_id_time=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/div[@class="subinfo_box clearfix"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]/div[@class="list_des"]/div[@class="subinfo_box clearfix"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]/div[@class="subinfo_box clearfix"]')
    # extract user and date
    user_id_time=[i.text.split('\n')[0:2] for i in user_id_time]

    # post text
    content=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/h3|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]/div[@class="list_des"]/h3|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]/h3')
    content=[i.text.replace('\n','').replace('#','') for i in content]

    # post links
    content_link=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]')
    content_link=['https:'+i.get_attribute('href') for i in content_link]
    
   
    js='window.scrollTo(0,document.body.scrollHeight);'
    browser.execute_script(js)
    time.sleep(random.randint(1,3))
    print(i)
data={}
for i in range(len(content_link)):
    data[str(i)]={
        "name":str(user_id_time[i][0]),
        "time":user_id_time[i][1],
        "content":content[i],
        "content_link":content_link[i]
    }
with open(path,'w',encoding='utf-8') as f:
    json.dump(data,fp=f,ensure_ascii=False)
print('all')

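However, this approach takes a long time, and a script written this way can only save after all the data has been scraped. I also noticed that every iteration builds a new list and discards the previous one, because each pass scrapes again from the top of the page. That in itself is not a problem, it only means the results need deduplication. The real problem is saving the data. When I use scrapy later, data will certainly be saved as it is scraped. I tried scrolling 100 times: the total time was 3151 seconds, about 52 minutes. Even after subtracting up to 300 seconds of sleep it is still about 47 minutes, for a total of 800 records. That is really slow.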
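
One way to save while scraping, even without a database, is to append each new record to a JSON Lines file and keep a set of links already written. This is only a sketch under assumptions: `append_new_records` is a hypothetical helper, and each record is assumed to be a dict with a `content_link` key, as in the `data` dict above.

```python
import json


def append_new_records(path, records, seen_links):
    """Append records whose 'content_link' is not yet in seen_links.

    Each record is written as one JSON object per line. Returns the number
    of records written and updates seen_links in place.
    """
    written = 0
    with open(path, 'a', encoding='utf-8') as f:
        for rec in records:
            link = rec['content_link']
            if link in seen_links:
                continue  # already saved in an earlier scroll round
            seen_links.add(link)
            f.write(json.dumps(rec, ensure_ascii=False) + '\n')
            written += 1
    return written
```

Called once per scroll round, this keeps everything scraped so far on disk, so an interruption loses at most the current round.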


As for deduplication, a Redis set handles it well. It also makes it possible to save while scrolling, so data is not lost if something unexpected happens.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
import json
import redis
r=redis.Redis(host='localhost',port=6379,decode_responses=True)
path='F:/NBA_TEAM/slang.json'
option=webdriver.FirefoxOptions()
option.add_argument('-headless')
browser=webdriver.Firefox(options=option)
browser.get('https://weibo.com/?category=0')
WebDriverWait(browser,10).until(EC.presence_of_element_located((By.ID,"PCD_pictext_i_v5")))
num=100
start=time.time()
for i in range(num):
    print(i)
    
    # full info boxes for all three list layouts
    user_id_time=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/div[@class="subinfo_box clearfix"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]/div[@class="list_des"]/div[@class="subinfo_box clearfix"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]/div[@class="subinfo_box clearfix"]')
    # extract user and date, then add them to Redis sets
    user_id_time=[i.text.split('\n')[0:2] for i in user_id_time]
    for i in user_id_time:
        r.sadd('nameid',i[0])
        r.sadd('time',i[1])

    # post text
    content=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]/h3|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]/div[@class="list_des"]/h3|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]/h3')
    content=[i.text.replace('\n','').replace('#','') for i in content]
    for i in content:
        r.sadd('content',i)
    # post links
    content_link=browser.find_elements_by_xpath('//ul[@class="pt_ul clearfix"]/div[@class="UG_list_v2 clearfix"]/div[@class="list_des"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_b"]|//ul[@class="pt_ul clearfix"]/div[@class="UG_list_a"]')
    content_link=['https:'+i.get_attribute('href') for i in content_link]
    for i in content_link:
        r.sadd('contentlink',i)
    
    js='window.scrollTo(0,document.body.scrollHeight);'
    browser.execute_script(js)
    time.sleep(random.randint(1,3))
    

end=time.time()

print('elapsed time: {}'.format(end-start))

print('all')
browser.quit()
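
The script above puts names, times, text and links into four separate sets, which deduplicates each field but loses the connection between the fields of one post. A possible alternative, sketched here under assumptions, is to deduplicate on the link and store each post as a hash. `save_post` and the key names `seen_links` and `post:<link>` are hypothetical; `r` is assumed to be a `redis.Redis` client like the one created above.

```python
def save_post(r, post):
    """Store one post keyed by its link. Returns True if the post was new.

    `post` is a dict with 'name', 'time', 'content' and 'content_link'.
    """
    link = post['content_link']
    # SADD returns 1 only when the member was not already in the set.
    if r.sadd('seen_links', link) == 0:
        return False
    r.hset('post:' + link, mapping={
        'name': post['name'],
        'time': post['time'],
        'content': post['content'],
    })
    return True
```

Inside the loop this would be called once per post, after zipping the three lists together, so that every scroll round only writes posts not seen before.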

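Requirement 2:

Log in, then scrape the posts of the users followed by a personal Sina Weibo account.

Scraped content is keyed primarily by the post's author.

Collect the author's nickname, the publication date, and the post content.

The main task here is logging in to Weibo with selenium. As of 2020-04-26 the Weibo web login had no verification step. The script is as follows: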

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
option=webdriver.FirefoxOptions()
option.add_argument('-headless')
browser=webdriver.Firefox(options=option)
browser.get('https://weibo.com/')
browser.get('https://weibo.com/?category=0')
WebDriverWait(browser,10).until(EC.presence_of_element_located((By.ID,"PCD_pictext_i_v5")))

# log in
usr=browser.find_element_by_xpath('//input[@id="loginname"]')
usr.clear()
usr.send_keys('**********')
pas=browser.find_element_by_xpath('//input[@class="W_input"]')
pas.click()

pasw=browser.find_element_by_xpath('//input[@class="W_input W_input_focus"]')
pasw.clear()
pasw.send_keys('**********')
browser.find_element_by_xpath('//div[@class="info_list login_btn"]').click()
# go to the page that lists the followed users
browser.get('https://weibo.com/5748544426/follow?rightmod=1&wvr=6')
# my own user name after login
browser.find_element_by_xpath('//div[@class="pf_username"]').text
# followed users on one page
fo=browser.find_elements_by_xpath('//div[@class="mod_info"]/div/a[@class="S_txt1"]')
for i in fo:
    i.text  # followed user's name
    i.get_attribute('href')  # followed user's home page

# next page (the link text on the page means "next page")
browser.find_element_by_link_text("下一页").click()

# keep clicking until there is no next page
while True:
    if browser.find_element_by_link_text("下一页").get_attribute('href') is None:
        print('ok')
        break
    else:
        browser.find_element_by_link_text("下一页").click()

# after opening a followed user's home page
browser.get('https://weibo.com/u/5856472352?from=myfollow_all')

# user nickname
browser.find_elements_by_xpath('//h1[@class="username"]')[0].text
# post times
fo=browser.find_elements_by_xpath('//div[@class="WB_from S_txt2"]')
# post content
fo=browser.find_elements_by_xpath('//div[@class="WB_text W_f14"]')

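Reading the content through .text seems to return only 16 posts.

Next I describe how to complete the task by hooking selenium into scrapy.

Crawling approach

Log in
Get the page links of all followed users
Visit each page
Extract the time, nickname, and page content

Program design

I want to use scrapy together with selenium to scrape the posts of the users I follow.

Problem 1: how to hook in selenium? Write a scrapy downloader middleware that intercepts the spider's requests and issues them through selenium instead, then loads the rendered page_source into the response body and returns it to the spider for parsing.
Problem 2: how to log in? Override start_requests in the spider, call selenium again inside it to log in and obtain the cookies, and store the cookies as an instance attribute. Because selenium is blocking, running several tasks this way is somewhat awkward.

Here I kept the code that logs in and saves the cookies separate from scrapy, and scrapy simply reads the saved cookies. The project is small and runs only briefly, so this makes the code easier to debug.

Login code to obtain the cookies: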

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import json
option=webdriver.FirefoxOptions()
option.add_argument('-headless')
browser=webdriver.Firefox(options=option)

browser.get('https://weibo.com/5748544426/follow?rightmod=1&wvr=6')
time.sleep(5)
# until() returns the element once it is present, or raises TimeoutException
if WebDriverWait(browser,10).until(EC.presence_of_element_located((By.ID,"loginname"))):
    # log in
    print('logging in')
    time.sleep(10)
    usr=browser.find_element_by_xpath('//input[@id="loginname"]')
    usr.clear()
    usr.send_keys('*******')

    browser.find_element_by_xpath('//div[@class="info_list login_btn"]').click()
    time.sleep(10)
    pasw=browser.find_element_by_xpath('//input[@class="W_input W_input_focus"]')
    pasw.clear()
    pasw.send_keys('******')
    browser.find_element_by_xpath('//div[@class="info_list login_btn"]').click()

    # after login the browser lands on the follow page; grab the cookies
    time.sleep(10)
    cookie=browser.get_cookies()
    browser.quit()
    print("done")

# save the cookies to a file
path="C:/Users/CAPONEKD/sl/sinlangspider/sinlangspider/spiders/cookies.txt"
with open(path, "w") as fp:
    json.dump(cookie,fp)

# check the saved cookies: load them into a fresh browser and reopen the follow page
with open(path, "r") as fp:
    cook = json.load(fp)

browser=webdriver.Firefox()
browser.get('https://weibo.com/5748544426/follow?rightmod=1&wvr=6')
for cookie in cook:
    browser.add_cookie(cookie)
browser.get('https://weibo.com/5748544426/follow?rightmod=1&wvr=6')

time.sleep(6)

If the account gets blocked later while debugging and the site returns 404, change the password and obtain the cookies again.
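
The saved file holds a list of cookie dicts as returned by selenium's `get_cookies()`, each with at least a `name` and a `value`. If the cookies are ever needed in a plain name-to-value form, for example for the `cookies` argument of a scrapy `Request`, a small conversion is enough. This is a sketch only: `load_cookies` and `cookies_as_dict` are hypothetical helpers, and this post does not use them (the middleware later adds the cookies to the browser instead).

```python
import json


def load_cookies(path):
    """Read the cookie list that was saved with json.dump."""
    with open(path, 'r') as fp:
        return json.load(fp)


def cookies_as_dict(cookies):
    """Reduce selenium-style cookie dicts to a {name: value} mapping."""
    return {c['name']: c['value'] for c in cookies}
```

Whether the site accepts requests that carry only these cookies, without a rendered browser session, is not verified here.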

Writing the scrapy project

Writing the item

The item defines the data stored for the whole project. This project has only three result fields, so the data structure is simple:
the followed user's ID, the post content, and the publication date.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class SinlangspiderItem(Item):
    name_id = Field()  # user ID
    content = Field()  # post content
    date = Field()  # publication date

Writing the spider

The spider starts the crawl and parses the pages. In this project it scrapes the users I follow and their recent posts, so there are two tasks: find out which users I follow, and get each user's posts. Because selenium is hooked in and blocks, the spider first finishes collecting all followed users and only then scrapes each followed user's posts.

import scrapy
from scrapy.http import Request
from sinlangspider import items

class QuotesSpider(scrapy.Spider):
    name = "slan"
    allowed_domains = ['weibo.com']  # allowed domains

    def __init__(self,**kwargs):
        super().__init__(**kwargs)
        self.follow_name = set()
        self.follow_url = set()

    def start_requests(self):
        yield Request('https://weibo.com/5748544426/follow?rightmod=1&wvr=6',
                        callback = self.parse)

    def parse(self, response):
        print('---------crawling-------')
        follow_info = response.xpath(
            '//div[@class="mod_info"]/div/a[@class="S_txt1"]')
        for i in follow_info:
            self.follow_name.add(i.xpath('text()').get())  # followed user's name
            self.follow_url.add(i.xpath('@href').get())  # followed user's home page
        # print(self.follow_name,self.follow_url)
        next_page = response.xpath('//a[@class="page next S_txt1 S_line1"]/@href').get()
        if next_page:
            print('next page',next_page)
            next_page = next_page.split('#')  # drop everything after '#'
            url = 'https://weibo.com'+next_page[0]
            yield Request(url,
                    callback = self.parse)
        elif self.follow_url:
            # no more list pages: start on the first followed user
            new_url='https://weibo.com'+self.follow_url.pop()
            yield Request(new_url,
                    callback = self.parse2)

    def parse2(self,response):
        name_id = response.xpath('//h1[@class="username"]/text()').get()
        info=response.xpath('//div[@class="WB_detail"]')
        for i in info:
            date = i.xpath('div[@class="WB_from S_txt2"]/a[@class="S_txt2"][1]/text()').get()
            content = i.xpath('div[@class="WB_text W_f14"]/text()').get()
            item = items.SinlangspiderItem(name_id = name_id,
                                content=content,date = date)
            yield item
        print('done')
        if self.follow_url:
            new_url = 'https://weibo.com'+self.follow_url.pop()
            print('next user',new_url)
            yield Request(new_url,
                callback = self.parse2)
        else:
            print('all finished')
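
The spider builds the next URL by cutting the href at `#` and prepending the site root. The same step can be written with the standard library, which also copes with absolute or protocol-relative hrefs. This is a sketch, not the code used in the spider: `build_next_url` is a hypothetical helper and the example paths in the test are made up.

```python
from urllib.parse import urldefrag, urljoin


def build_next_url(href, base='https://weibo.com'):
    """Turn a (possibly relative) href into an absolute URL without fragment.

    Returns None when href is empty or None.
    """
    if not href:
        return None
    url, _fragment = urldefrag(href)  # drop everything after '#'
    return urljoin(base, url)
```

Inside a scrapy callback, `response.urljoin(href)` does the joining part relative to the current page.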

Writing the pipeline

The pipeline connects to the MongoDB database and saves the data.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
import json
from scrapy.exceptions import DropItem
class SinlangspiderPipeline(object):
    def __init__(self,mongo_url,mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls,crawler,**kwargs):
        return cls(mongo_url = crawler.settings.get('MONGO_URL'),
            mongo_db = crawler.settings.get('MONGO_DATABASE'))
    
    def open_spider(self,spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]
        self.mycol = self.db['slan']

    def close_spider(self,spider):
        self.client.close()

    def process_item(self, item, spider):
        print('saving item')
        if item['name_id']:
            self.mycol.insert_one(dict(item))
            
            return item
        else:
            raise DropItem('missing name_id')

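Writing the middlewares

The middlewares hook in selenium and attach the cookies to every request. A random User-Agent is also set as a countermeasure against anti-scraping checks. If possible, it is best to add proxy IPs as well.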

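import json
import random
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class CookieMiddleware(object):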
    
    def __init__(self,**kwargs):
        self.option=webdriver.FirefoxOptions()
        self.option.add_argument('-headless')
        self.browser=webdriver.Firefox(options=self.option)

    def __del__(self):
        self.browser.quit()

    def process_request(self, request, spider):
        path="C:/Users/CAPONEKD/sl/sinlangspider/sinlangspider/spiders/cookies.txt"
        with open(path, "r") as fp:
            cook = json.load(fp)
        try:
            self.browser.get(request.url)
            for cookie in cook:
                self.browser.add_cookie(cookie)
            print('new url',request.url)
            self.browser.get(request.url)
            element = WebDriverWait(self.browser,10).until(
                EC.presence_of_element_located((By.XPATH, '//h1[@class="username"]')))
        
            return HtmlResponse(
                url=request.url, body=self.browser.page_source,
                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(
                    url=request.url, status=500, request=request)

    
class RandomUserAgent(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self,agents):
        self.agents=agents
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        # s = cls()
        # crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
      
        request.headers.setdefault('User-Agent',random.choice(self.agents))
        print(request.headers['User-Agent'])

Settings

import os
# project directory, used when adding paths
PROJECT_DIR = os.path.dirname(os.path.abspath(os.path.curdir))
# do not obey robots.txt
ROBOTSTXT_OBEY = False
# enable the downloader middlewares
DOWNLOADER_MIDDLEWARES = {
   #'sinlangspider.middlewares.SinlangspiderDownloaderMiddleware': 543,
    'sinlangspider.middlewares.RandomUserAgent':100,
    'sinlangspider.middlewares.CookieMiddleware':543
}
# MongoDB URL
MONGO_URL = 'mongodb://localhost:27017/'
# MongoDB database name
MONGO_DATABASE = "runoobdb"
# enable the pipeline
ITEM_PIPELINES = {
   'sinlangspider.pipelines.SinlangspiderPipeline': 300,
}

# User-Agent list
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
   ]

Finally, deploying with scrapyd

First install scrapyd

pip install scrapyd

Start scrapyd: just type scrapyd on the command line. By default scrapyd runs in the background and listens on port 6800. Open http://127.0.0.1:6800 in a browser.


scrapyd reports its status as JSON. The corresponding endpoints are as follows

Get the scrapyd status: http://127.0.0.1:6800/daemonstatus.json, GET request.
Get the project list: http://127.0.0.1:6800/listprojects.json, GET request.
Get the list of spiders published under a project: http://127.0.0.1:6800/listspiders.json?project=myproject, GET request. The project name is myproject.
Get the list of published versions: http://127.0.0.1:6800/listversions.json?project=myproject, GET request. The parameter project is the project name.
Get the status of spider jobs: http://127.0.0.1:6800/listjobs.json?project=myproject, GET request.
Start a spider on the server: http://127.0.0.1:6800/schedule.json, POST request. Parameters: 'project': myproject, 'spider': myspider, where myproject is the project name and myspider is the spider name.
Delete one version of a project: http://127.0.0.1:6800/delversion.json, POST request. Parameters: 'project': myproject, 'version': myversion, where myproject is the project name and myversion is the version.
Delete a project together with all of its versions: http://127.0.0.1:6800/delproject.json, POST request. Parameter: 'project': myproject, where myproject is the project name.
Add a version to a project, creating the project if it does not exist: http://127.0.0.1:6800/addversion.json, POST request. Parameters: 'project': myproject, 'version': myversion, plus the packaged project as the egg parameter, where myproject is the project name and myversion is the project version.
Cancel a running spider job: http://127.0.0.1:6800/cancel.json, POST request. Parameters: 'project': myproject, 'job': jobid, where myproject is the project name and jobid is the job ID.
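
The GET endpoints above can also be queried from Python with only the standard library. The sketch below assumes scrapyd is running on 127.0.0.1:6800; `SCRAPYD`, `scrapyd_url` and `scrapyd_get` are hypothetical names introduced for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD = 'http://127.0.0.1:6800'  # assumed address of the local scrapyd


def scrapyd_url(endpoint, **params):
    """Build the URL of a scrapyd JSON endpoint, e.g. 'listspiders'."""
    url = '{}/{}.json'.format(SCRAPYD, endpoint)
    if params:
        url += '?' + urlencode(params)
    return url


def scrapyd_get(endpoint, **params):
    """Send a GET request and decode the JSON reply (needs a running scrapyd)."""
    with urlopen(scrapyd_url(endpoint, **params)) as resp:
        return json.loads(resp.read().decode('utf-8'))
```

For example, `scrapyd_get('listspiders', project='myproject')` would return the decoded JSON reply for that project.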

scrapyd-client

scrapyd-client is used to publish spiders. Install it first

pip install scrapyd-client

Using scrapyd-client: after installation, copy scrapyd-deploy into the spider project directory, at the same level as scrapy.cfg. Next, modify the generated scrapy.cfg file so that it reads as follows:

[settings]
default = sinlangspider.settings

[deploy:100]
url = http://localhost:6800/
project = sinlang

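The deploy section says that the spider is published to the spider server named 100. Named targets are generally used when a spider has to be published to several target servers at once.

Once this is configured, scrapyd-deploy can be used to publish the spider. The command is as follows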

scrapyd-deploy <target> -p sinlang --version ver2020426

If no version is set, the current timestamp is used by default.

Before deploying the spider, make sure scrapyd is running.

After deploying, start the spider.
In a terminal, enter:

curl http://localhost:6800/schedule.json -d project=sinlang -d spider=slan
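
The same POST request can be sent from Python with the standard library. This is a sketch equivalent to the curl command above; `build_schedule_request` and `schedule` are hypothetical helper names, and the base URL assumes scrapyd on localhost:6800.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(project, spider, base='http://localhost:6800'):
    """Build the POST request for schedule.json without sending it."""
    data = urlencode({'project': project, 'spider': spider}).encode('utf-8')
    return Request(base + '/schedule.json', data=data, method='POST')


def schedule(project, spider):
    """Send the request and decode the JSON reply (needs a running scrapyd)."""
    with urlopen(build_schedule_request(project, spider)) as resp:
        return json.loads(resp.read().decode('utf-8'))
```

Calling `schedule('sinlang', 'slan')` sends the same form fields as the two `-d` options of curl.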

That completes everything.
Checking the data shows 100 records.
This small practice project has many shortcomings and only meets the basic requirements. I will improve it when I get the chance.
