Analyzing Ajax requests to scrape Toutiao street-snap images

Key points:

Contents:
1. Analyze the code structure of the target page;
2. Write code to scrape the data;
3. Save or download the data.

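I. Analyzing the page

Searching Toutiao for "街拍" (street snaps) opens https://www.toutiao.com/search/?keyword=街拍. Pressing F12 to inspect the source shows that the page itself clearly does not contain the content we are after.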

So the page content is probably loaded via Ajax. Switching to the XHR tab shows that there is indeed a request there.

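The response to this request is exactly the image content shown on the page, so the index page to crawl is https://www.toutiao.com/search_content/?. The index response leads to image detail pages such as https://www.toutiao.com/a6579859429166940680/. Opening the detail page source with F12 and searching through it shows that the image URLs are embedded directly in the page.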
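To make the shape of that XHR response concrete, here is a minimal sketch of pulling detail-page URLs out of an index response. The `data` list and the `article_url` field are the ones the source code in this post reads; the sample payload and the function name `extract_article_urls` are made up for illustration, and a real response carries many more fields.

```python
import json


def extract_article_urls(index_json):
    """Return the article_url of every entry in an index response."""
    payload = json.loads(index_json)
    urls = []
    # "data" may be missing or null, so fall back to an empty list.
    for item in payload.get("data") or []:
        url = item.get("article_url")
        if url:  # skip entries that carry no detail-page link
            urls.append(url)
    return urls


# Hand-written stand-in for a real index response (assumption).
SAMPLE_INDEX = json.dumps({
    "data": [
        {"article_url": "https://www.toutiao.com/a6579859429166940680/"},
        {"title": "an entry without a detail link"},
    ]
})

if __name__ == "__main__":
    print(extract_article_urls(SAMPLE_INDEX))
```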

So we can request the detail page and then use a regular expression to extract the image URLs from it.
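As a rough illustration of that regex step, the sketch below pulls the argument of a `gallery: JSON.parse("...")` call out of a page. The HTML fragment is a hand-written stand-in for a real detail page, and `find_gallery_literal` is a hypothetical helper name; the pattern mirrors the one used in the source code at the end of this post.

```python
import re

# Captures the text between JSON.parse(" and "), without decoding it.
GALLERY_PATTERN = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)

# Hand-written stand-in for a detail page (assumption, not real site output).
SAMPLE_HTML = r'''
<script>
var BASE_DATA = {
    gallery: JSON.parse("{\"count\":1,\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/a.jpg\"}]}"),
    other: 1
};
</script>
'''


def find_gallery_literal(html):
    """Return the still-escaped JSON text, or None if the page has no gallery."""
    match = GALLERY_PATTERN.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(find_gallery_literal(SAMPLE_HTML))
```

The captured text still contains the backslash escapes, so it needs one more cleanup step before it can be parsed as JSON.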

II. Writing the crawler

A few points to note here:

1. The index page request to https://www.toutiao.com/search_content/? takes query parameters.

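These parameters need to be encoded with urlencode and appended to the request URL.
2. The site applies simple anti-scraping measures to the detail page, so the request must include request headers (headers); otherwise the content seen above is not returned.
3. Once the detail page is fetched, a regular expression extracts the JSON string that holds the image URLs. It has this format: {\"count\":6,\"sub_images\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/pgc-image\\/.......}. The backslashes \ are escape characters, so the string cannot be passed to json.loads directly; doing so raises json.decoder.JSONDecodeError. Clean the string first with re.sub() or the built-in str.replace(), then convert it with json.loads().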
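Points 1 and 3 above can be sketched as follows, under stated assumptions. `build_index_url` encodes the query parameters with `urlencode`, using the parameter names from the source code in this post. `decode_gallery` shows an alternative to the replace-based cleanup: it treats the captured text as the body of a JSON string literal and decodes it twice. Both function names are hypothetical, the sample literal is hand-written, and the double decoding assumes the page only uses escapes that are also valid inside JSON strings.

```python
import json
from urllib.parse import urlencode

INDEX_ENDPOINT = "https://www.toutiao.com/search_content/?"


def build_index_url(offset, keyword):
    """Append the URL-encoded query parameters to the index endpoint."""
    params = {
        "offset": offset,
        "format": "json",
        "keyword": keyword,  # non-ASCII text is percent-encoded as UTF-8
        "autoload": "true",
        "count": "20",
        "cur_tab": 3,
        "from": "gallery",
    }
    return INDEX_ENDPOINT + urlencode(params)


def decode_gallery(literal):
    """Decode the escaped text captured from JSON.parse("...")."""
    # First pass: undo the string-literal escaping (\" and \\).
    inner_json = json.loads('"' + literal + '"')
    # Second pass: parse the actual JSON document (\/ becomes /).
    return json.loads(inner_json)


# Hand-written sample in the escaped format described above (assumption).
SAMPLE_LITERAL = r'{\"count\":1,\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/a.jpg\"}]}'

if __name__ == "__main__":
    print(build_index_url(0, "街拍"))
    print(decode_gallery(SAMPLE_LITERAL))
```

The replace-based approach in the source code reaches the same result; decoding twice simply avoids listing each escape sequence by hand.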

III. Download and save

This step is fairly simple: insert the result into MongoDB and write the image files to disk. The code is easy to follow on its own.
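As a minimal sketch of the file-writing half of this step, the function below names each image after the MD5 hash of its bytes, so downloading the same picture twice does not create a second file. The function name `save_bytes`, the `directory` parameter, and the fixed `.jpg` extension are assumptions for illustration; the MongoDB insert is left out because it needs a running server.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(content, directory):
    """Write content to <directory>/<md5>.jpg unless that file already exists.

    Returns (path, written), where written is False for a duplicate.
    """
    os.makedirs(directory, exist_ok=True)  # open() fails if the folder is missing
    path = os.path.join(directory, md5(content).hexdigest() + ".jpg")
    if os.path.exists(path):
        return path, False
    with open(path, "wb") as f:
        f.write(content)
    return path, True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        print(save_bytes(b"fake image bytes", folder))
        print(save_bytes(b"fake image bytes", folder))  # duplicate, not rewritten
```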

Source code:

Configuration file --- config.py

MONGO_URL='localhost' # local MongoDB connection
MONGO_DB='toutiao' # database name
MONGO_TABLE='toutiao' # collection (table) name

GROUP_START=0 # first group
GROUP_END =20 # last group

KEYWORD = '街拍' # search keyword

Main script --- spider.py

# -*- coding: utf-8 -*-

import requests
from requests.exceptions import RequestException
from urllib.parse import urlencode
import json
from bs4 import BeautifulSoup
import re
from config import *
import pymongo
from hashlib import md5
import os
from multiprocessing import Pool

# client = pymongo.MongoClient(MONGO_URL, connect=False)  # crawling with multiple processes may cause errors when connecting to MongoDB; if so, add connect=False
client = pymongo.MongoClient(MONGO_URL)  # connect to MongoDB
db = client[MONGO_DB]  # select the database (MongoDB creates it on first write)

def get_page_index(offset, keyword):
    # Query parameters of the Ajax index request
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 3,
        'from': 'gallery',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Error requesting index page url")
        return None

def parse_page_index(html):
    if not html:  # the index request failed, nothing to parse
        return
    data = json.loads(html)  # convert the JSON string into a Python object
    if data and 'data' in data:
        for item in data.get('data') or []:
            article_url = item.get('article_url')
            if article_url:  # skip entries without a detail-page URL
                yield article_url  # generator: yields one url at a time

def get_page_detail(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)  # detail pages need request headers, otherwise no data comes back
        if response.status_code == 200:
            return response.text  # response.text is the decoded page source
        return None
    except RequestException:
        print("Error requesting detail page", url)
        return None

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    print(title)
    # Raw string so that \( and \. reach the regex engine unchanged
    images_pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
    result = re.search(images_pattern, html)
    if result:
        page_detail = result.group(1)
        # Unescape \" so json.loads accepts the string; otherwise it raises json.decoder.JSONDecodeError
        page_result = page_detail.replace('\\"', '"')
        # Unescape the slashes in the image URLs
        last_change = page_result.replace('\\/', '/')
        data = json.loads(last_change)
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]  # list of image urls on this detail page
            for image in images:
                download_image(image)  # download each image
            return {
                'title': title,
                'url': url,
                'images': images
            }
    return None  # the page has no gallery data

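def save_to_mongo(result):
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False

def download_image(url):
    print("Downloading:", url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)  # response.content is the raw bytes (files, images, etc.)
        return None
    except RequestException:
        print("Error downloading image", url)
        return None

def save_image(content):
    directory = os.path.join(os.getcwd(), 'download_image')
    os.makedirs(directory, exist_ok=True)  # open() fails if the folder does not exist
    # Name the file after the MD5 of its content so duplicates are skipped
    file_path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:  # the with block closes the file
            f.write(content)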

def main(offset):
    html = get_page_index(offset, KEYWORD)
    if not html:  # index request failed, skip this offset
        return
    for url in parse_page_index(html):
        print(url)
        detail_html = get_page_detail(url)
        if detail_html:
            result = parse_page_detail(detail_html, url)
            if result:  # None when the page has no gallery data
                save_to_mongo(result)

if __name__ == '__main__':
    # One offset per group: 0, 20, 40, ...
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    with Pool() as pool:
        pool.map(main, groups)
