Python Crawler --- Scraping ITjuzi with Scrapy

目標(biāo):

This scrape targets the investment-event module of ITjuzi (itjuzi.com) and stores the scraped data in a MySQL database.



目標(biāo)分析:

通過(guò)瀏覽器瀏覽發(fā)現(xiàn)事件模塊需要登錄才能訪問(wèn),因此我們需要先登錄,抓取登錄接口:


The capture shows that ITjuzi's login endpoint is https://www.itjuzi.com/api/authorizations. The request method is POST, and the data is submitted as a request payload in JSON format (payload-style submissions are generally JSON). Now look at the response:

發(fā)現(xiàn)沒(méi)有響應(yīng)數(shù)據(jù),其實(shí)是有響應(yīng)數(shù)據(jù)的,只是F12調(diào)試看不到,我們可以用postman來(lái)查看響應(yīng)體:

可以發(fā)現(xiàn)響應(yīng)體是json數(shù)據(jù),我們先把它放到一邊,我們?cè)賮?lái)分析事件模塊,通過(guò)F12抓包調(diào)試發(fā)現(xiàn)事件模塊的數(shù)據(jù)其實(shí)是一個(gè)ajax請(qǐng)求得到的:

ajax請(qǐng)求得到的是json數(shù)據(jù),我們?cè)倏纯磆eaders:



可以發(fā)現(xiàn)headers里有一個(gè)Authorization參數(shù),參數(shù)的值恰好是我們登錄時(shí)登錄接口返回的json數(shù)據(jù)的token部分,所以這個(gè)參數(shù)很有可能是判斷我們是否登錄的憑證,我們可以用postman模擬請(qǐng)求一下:

通過(guò)postman的模擬請(qǐng)求發(fā)現(xiàn)如我們所料,我們只要在請(qǐng)求頭里加上這個(gè)參數(shù)我們就可以獲得對(duì)應(yīng)的數(shù)據(jù)了。
With data access solved, the next question is pagination. Comparing the JSON submitted for page 1 and page 2 reveals three key parameters: page, pagetotal, and per_page, which are the requested page number, the total record count, and the number of records per page. From pagetotal and per_page we can therefore compute the total number of pages (see the sketch below). That completes the analysis; time to write the code.
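
The page count itself is just a ceiling division. For example (the record count below is illustrative):

import math

pagetotal, per_page = 10263, 20          # illustrative record count
pages = math.ceil(pagetotal / per_page)  # 514 pages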

Writing the Scrapy code

1.創(chuàng)建scrapy項(xiàng)目和爬蟲:

E:\>scrapy startproject itjuzi
E:\itjuzi>scrapy genspider juzi itjuzi.com
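
The two commands produce the standard Scrapy layout (juzi.py is added by genspider):

itjuzi/
├── scrapy.cfg
└── itjuzi/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── juzi.py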

2. Write items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ItjuziItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    invse_des = scrapy.Field()
    com_des = scrapy.Field()
    invse_title = scrapy.Field()
    money = scrapy.Field()
    com_name = scrapy.Field()
    prov = scrapy.Field()
    round = scrapy.Field()
    invse_time = scrapy.Field()
    city = scrapy.Field()
    com_registered_name = scrapy.Field()
    com_scope = scrapy.Field()
    invse_company = scrapy.Field()

3. Write the spider:

import json

import scrapy

from itjuzi.items import ItjuziItem
from itjuzi.settings import JUZI_PWD, JUZI_USER


class JuziSpider(scrapy.Spider):
    name = 'juzi'
    allowed_domains = ['itjuzi.com']

    def start_requests(self):
        """
        先登錄桔子網(wǎng)
        """
        url = "https://www.itjuzi.com/api/authorizations"
        payload = {"account": JUZI_USER, "password": JUZI_PWD}
        # A JSON payload can't be sent with scrapy.FormRequest; use scrapy.Request with explicit method and headers arguments
        yield scrapy.Request(url=url,
                             method="POST",
                             body=json.dumps(payload),
                             headers={'Content-Type': 'application/json'},
                             callback=self.parse
                             )

    def parse(self, response):
        # The login response JSON carries the token used for the Authorization header
        token = json.loads(response.text)
        url = "https://www.itjuzi.com/api/investevents"
        payload = {
                    "pagetotal": 0, "total": 0, "per_page": 20, "page": 1, "type": 1, "scope": "", "sub_scope": "",
                    "round": [], "valuation": [], "valuations": "", "ipo_platform": "", "equity_ratio": [""],
                    "status": "", "prov": "", "city": [], "time": [], "selected": "", "location": "", "currency": [],
                    "keyword": ""
                }
        yield scrapy.Request(url=url,
                             method="POST",
                             body=json.dumps(payload),
                             meta={'token': token},
                             # Put the token into the Authorization header
                             headers={'Content-Type': 'application/json', 'Authorization': token['data']['token']},
                             callback=self.parse_info
                             )

    def parse_info(self, response):
        # Retrieve the token passed along via meta
        token = response.meta["token"]
        # Get the total record count
        total = json.loads(response.text)["data"]["page"]["total"]
        # 因?yàn)槊宽?yè)20條數(shù)據(jù),所以可以算出一共有多少頁(yè)
        if type(int(total)/20) is not int:
            page = int(int(total)/20)+1
        else:
            page = int(total)/20

        url = "https://www.itjuzi.com/api/investevents"
        for i in range(1, page + 1):
            payload = {
                "pagetotal": total, "total": 0, "per_page": 20, "page":i  , "type": 1, "scope": "", "sub_scope": "",
                "round": [], "valuation": [], "valuations": "", "ipo_platform": "", "equity_ratio": [""],
                "status": "", "prov": "", "city": [], "time": [], "selected": "", "location": "", "currency": [],
                "keyword": ""
            }
            yield scrapy.Request(url=url,
                                 method="POST",
                                 body=json.dumps(payload),
                                 headers={'Content-Type': 'application/json', 'Authorization': token['data']['token']},
                                 callback=self.parse_detail
                                 )

    def parse_detail(self, response):
        infos = json.loads(response.text)["data"]["data"]      
        for i in infos:
            item = ItjuziItem()
            item["invse_des"] = i["invse_des"]
            item["com_des"] = i["com_des"]
            item["invse_title"] = i["invse_title"]
            item["money"] = i["money"]
            item["com_name"] = i["name"]
            item["prov"] = i["prov"]
            item["round"] = i["round"]
            item["invse_time"] = str(i["year"])+"-"+str(i["year"])+"-"+str(i["day"])
            item["city"] = i["city"]
            item["com_registered_name"] = i["com_registered_name"]
            item["com_scope"] = i["com_scope"]
            invse_company = []
            for j in i["investor"]:
                invse_company.append(j["name"])
            item["invse_company"] = ",".join(invse_company)
            yield item

4. Write the pipeline:

from itjuzi.settings import DATABASE_DB, DATABASE_HOST, DATABASE_PORT, DATABASE_PWD, DATABASE_USER
import pymysql

class ItjuziPipeline(object):
    def __init__(self):
        host = DATABASE_HOST
        port = DATABASE_PORT
        user = DATABASE_USER
        passwd = DATABASE_PWD
        db = DATABASE_DB
        try:
            self.conn = pymysql.Connect(host=host, port=port, user=user, passwd=passwd, db=db, charset='utf8')
        except Exception as e:
            print("連接數(shù)據(jù)庫(kù)出錯(cuò),錯(cuò)誤原因%s"%e)
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        params = [item['com_name'], item['com_registered_name'], item['com_des'], item['com_scope'],
                  item['prov'], item['city'], item['round'], item['money'], item['invse_company'],
                  item['invse_des'], item['invse_time'], item['invse_title']]
        try:
            self.cur.execute(
                'insert into juzi(com_name, com_registered_name, com_des, com_scope, prov, city, round, money, invse_company, invse_des, invse_time, invse_title) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)', params)
            self.conn.commit()
        except Exception as e:
            print("插入數(shù)據(jù)出錯(cuò),錯(cuò)誤原因%s" % e)
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
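
The pipeline assumes a juzi table already exists. A minimal schema sketch matching the INSERT above, created here with pymysql (column types, lengths, and the connection values are assumptions; adjust them to your data):

import pymysql

DDL = """
CREATE TABLE IF NOT EXISTS juzi (
    id INT AUTO_INCREMENT PRIMARY KEY,
    com_name VARCHAR(255),
    com_registered_name VARCHAR(255),
    com_des TEXT,
    com_scope VARCHAR(255),
    prov VARCHAR(64),
    city VARCHAR(64),
    `round` VARCHAR(64),
    money VARCHAR(64),
    invse_company TEXT,
    invse_des TEXT,
    invse_time VARCHAR(32),
    invse_title VARCHAR(255)
) DEFAULT CHARSET=utf8
"""

# Use the same connection settings as the pipeline
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       passwd='your_password', db='your_db', charset='utf8')
with conn.cursor() as cur:
    cur.execute(DDL)
conn.commit()
conn.close()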

5. Write settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for itjuzi project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'itjuzi'

SPIDER_MODULES = ['itjuzi.spiders']
NEWSPIDER_MODULE = 'itjuzi.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'itjuzi (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.25
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'itjuzi.middlewares.ItjuziSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'itjuzi.middlewares.ItjuziDownloaderMiddleware': 543,
   'itjuzi.middlewares.RandomUserAgent': 102,
   'itjuzi.middlewares.RandomProxy': 103,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'itjuzi.pipelines.ItjuziPipeline': 100,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
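
# Custom settings for this project: ITjuzi login credentials (placeholders)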
JUZI_USER = "1871111111111"
JUZI_PWD = "123456789"

DATABASE_HOST = 'database host IP'
DATABASE_PORT = 3306
DATABASE_USER = 'database username'
DATABASE_PWD = 'database password'
DATABASE_DB = 'database name'

USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.29 Safari/537.36"
    ]

PROXIES = [
    {'ip_port': 'proxy_ip:proxy_port', 'user_passwd': 'proxy_username:proxy_password'},
    {'ip_port': 'proxy_ip:proxy_port', 'user_passwd': 'proxy_username:proxy_password'},
    {'ip_port': 'proxy_ip:proxy_port', 'user_passwd': 'proxy_username:proxy_password'},
]
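
The DOWNLOADER_MIDDLEWARES setting above enables two custom middlewares, RandomUserAgent and RandomProxy, whose source is not shown in this post. A minimal sketch of what middlewares.py might contain, driven by the USER_AGENTS and PROXIES lists:

import base64
import random

from itjuzi.settings import PROXIES, USER_AGENTS


class RandomUserAgent(object):
    """Pick a random User-Agent for every outgoing request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)


class RandomProxy(object):
    """Route every request through a random authenticated proxy."""
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = 'http://' + proxy['ip_port']
        # Proxy credentials go in the Proxy-Authorization header,
        # base64-encoded as "user:password"
        creds = base64.b64encode(proxy['user_passwd'].encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + creds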

6.讓項(xiàng)目跑起來(lái):

E:\itjuzi>scrapy crawl juzi

7.結(jié)果展示:


PS: Detail information is not scraped here. Details are fetched using each company's id from the list-page JSON above, and the detail endpoints can be accessed without logging in, e.g. https://www.itjuzi.com/api/investevents/10262327 and https://www.itjuzi.com/api/get_investevent_down/10262327 (a sketch follows). One important caveat: if your account is not a VIP member, you can only scrape the first 3 pages of data, which is a bit of a trap. The other information modules can be analyzed and scraped the same way if you need them.
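
For reference, a single detail record can be fetched without a session (a sketch using the event id from the URLs above):

import requests

# Detail endpoints work without logging in; 10262327 is the
# event id from the example URLs
resp = requests.get("https://www.itjuzi.com/api/investevents/10262327")
print(resp.json())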
