Scraping Toutiao Image Galleries with Python

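Warm-up

I have been busy with a company project for a while, so the blog went without updates. Now that I have some free time, I went back to Python and used it to scrape image galleries from Toutiao (今日頭條).

The main story

Let's start our Python crawler journey together.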

Preparation

Development tools
PyCharm (Python IDE)
Required Python packages
pymongo (stores the scraped data in the database)

sudo pip install pymongo

BeautifulSoup (parses HTML text)

sudo pip install BeautifulSoup4

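Start scraping

Analyze the target site

The main task is to study how the target page makes its network requests and how it processes the returned data, then find the request that carries the data you want and note it down.

Create the Python project

Request the list page data

Assemble the query parameters in a data dictionary, send the request with the requests module, and get back the data for the search list page.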

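import requests
from requests.exceptions import RequestException
from urllib.parse import urlencode


def get_page_index(offset, keyword):
    # Query parameters of the search request found while analyzing the site
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 3,
    }
    # urllib.urlencode exists only in Python 2; Python 3 moved it to urllib.parse
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None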
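
To check the query string before sending any request, the URL construction can be pulled out on its own. This is only a minimal sketch for Python 3: build_index_url and SEARCH_ENDPOINT are hypothetical names introduced for illustration, and the parameter values are the ones used in get_page_index.

```python
from urllib.parse import urlencode

# Endpoint taken from get_page_index; the constant name is an assumption
SEARCH_ENDPOINT = 'http://www.toutiao.com/search_content/?'


def build_index_url(offset, keyword):
    """Return the list-page URL for one page of results (hypothetical helper)."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 3,
    }
    # urlencode percent-encodes the values, so non-ASCII keywords are safe in the URL
    return SEARCH_ENDPOINT + urlencode(params)


url = build_index_url(20, 'street')
print(url)
```

Printing the URL first makes it easy to compare against the request recorded in the browser's network panel.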

Parse the list page data

import json


def parse_page_index(html):
    # The list request returns JSON; each entry under 'data' describes one article
    data = json.loads(html)
    if data and 'data' in data:
        for item in data.get('data'):
            yield item.get('article_url')
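
Because item.get('article_url') returns None for an entry that has no such key, parse_page_index can yield None values that the caller has to skip. A possible variant that filters them out is sketched here; extract_article_urls is a hypothetical name, and the sample payload is made up for illustration (a real response carries many more fields).

```python
import json


def extract_article_urls(index_json):
    """Return the non-empty article_url values from a list-page response."""
    payload = json.loads(index_json)
    items = payload.get('data') or []
    # Keep only the entries that actually carry a link
    return [item['article_url'] for item in items if item.get('article_url')]


# Made-up payload, not a recorded Toutiao response
sample = json.dumps({
    'data': [
        {'article_url': 'http://example.com/a1'},
        {'title': 'entry without a link'},
        {'article_url': 'http://example.com/a2'},
    ]
})
urls = extract_article_urls(sample)
print(urls)
```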

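Fetch the detail page data

import requests
from requests.exceptions import RequestException


def get_page_detail(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the detail page')
        return None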

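Parse the detail page data

Note that BeautifulSoup needs the lxml parser installed before it can parse the page with "lxml".
json.loads() converts a string into key-value pairs. Whitespace around the JSON is fine, but the text captured by the regular expression also includes whatever follows the closing brace, and any such trailing character makes the parse fail. Trim the string first with strJson = "".join([jsonString.strip().rsplit("}", 1)[0], "}"]), which keeps everything up to the last closing brace.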

import json
import re

from bs4 import BeautifulSoup


def parse_page_detail(html, url):
    soup = BeautifulSoup(html, "lxml")
    title = soup.select('title')[0].get_text()
    # The gallery data sits in the page source between 'gallery:' and 'siblingList:'
    images_pattern = re.compile('gallery: (.*?)siblingList:', re.S)
    result = re.search(images_pattern, html)
    if result:
        jsonString = result.group(1)
        # Key step: drop everything after the last closing brace so json.loads() accepts it
        strJson = "".join([jsonString.strip().rsplit("}", 1)[0], "}"])
        data = json.loads(strJson)
        if data and "sub_images" in data:
            sub_images = data.get("sub_images")
            images = [item.get("url") for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
        return None
    return None
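
The trimming step in parse_page_detail can be tried in isolation. The sketch below uses a made-up page fragment (not real Toutiao markup) in which the gallery object is followed by a comma and a line break, and shows that json.loads() rejects the raw capture but accepts the trimmed one.

```python
import json
import re

# Made-up fragment shaped like the text the regex is written for
html = (
    'var data = { gallery: {"count": 2, "sub_images": ['
    '{"url": "http://example.com/a.jpg"}, {"url": "http://example.com/b.jpg"}]},\n'
    '    siblingList: [] }'
)

images_pattern = re.compile('gallery: (.*?)siblingList:', re.S)
captured = re.search(images_pattern, html).group(1)

# The raw capture ends with '},' plus whitespace, which is not valid JSON
try:
    json.loads(captured)
    raw_is_valid = True
except json.JSONDecodeError:
    raw_is_valid = False

# Keep everything before the last '}' and put that brace back
trimmed = "".join([captured.strip().rsplit("}", 1)[0], "}"])
data = json.loads(trimmed)
image_urls = [item.get("url") for item in data["sub_images"]]
print(raw_is_valid, image_urls)
```

The trim relies on the gallery object being the last thing in the capture that ends with a closing brace; if the page layout changes, the pattern and the trim both need revisiting.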

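Store in the database

Connect to the database

import pymongo

# MONGO_URL and MONGO_DB are configuration constants defined elsewhere in the project.
# The second argument is the port; MongoDB's default port is 27017.
client = pymongo.MongoClient(MONGO_URL, 8000)
db = client[MONGO_DB]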

Save the scraped data to the database

def save_to_mongo(result):
    # Collection.insert() is deprecated and was removed in PyMongo 4; insert_one() replaces it
    if db[MONGO_TABLE].insert_one(result):
        return True
    return False
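
Running the crawler twice with plain inserts stores every gallery twice. One way to avoid that, not taken from the original code, is an upsert keyed on the page URL. In this sketch save_unique is a hypothetical helper, and the collection argument is assumed to be a pymongo Collection such as db[MONGO_TABLE].

```python
def save_unique(collection, result):
    """Insert or refresh one scraped gallery, keyed by its URL.

    `collection` is assumed to be a pymongo Collection, e.g. db[MONGO_TABLE].
    Returns True when a new document was created, False when one was updated.
    """
    outcome = collection.update_one(
        {'url': result['url']},  # match an existing document for the same page
        {'$set': result},        # overwrite its fields with the fresh data
        upsert=True,             # insert when nothing matched
    )
    # upserted_id is set only when the upsert created a new document
    return outcome.upserted_id is not None


# Usage, assuming the db object from the connection step:
# save_unique(db[MONGO_TABLE], {'title': '...', 'url': '...', 'images': []})
```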

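Download the images

import requests
from requests.exceptions import RequestException


def download_image(url):
    print('Downloading image %s' % url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        print('Failed to download the image')
        return None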

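Save the images to disk

import os
from hashlib import md5


def save_image(content):
    # Images go into a folder named after the search keyword, under the current directory
    imagePath = os.path.join(os.getcwd(), KEYWORD)
    if not os.path.exists(imagePath):
        os.mkdir(imagePath)

    # Naming the file after the MD5 of its bytes means identical images are saved only once
    file_path = '{0}/{1}.{2}'.format(imagePath, md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        # The with block closes the file on exit, so no explicit close() is needed
        with open(file_path, 'wb') as f:
            f.write(content)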
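
The effect of naming files by their MD5 digest can be seen with a small standalone experiment. The sketch below writes into a temporary directory instead of the KEYWORD folder; save_image_to is a hypothetical variant that takes the target directory as a parameter and reports whether it wrote anything.

```python
import os
import tempfile
from hashlib import md5


def save_image_to(directory, content):
    """Write content to <directory>/<md5>.jpg unless that file already exists."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if os.path.exists(file_path):
        return file_path, False  # same bytes were saved before
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True


# Placeholder bytes stand in for a downloaded image
target = os.path.join(tempfile.mkdtemp(), 'gallery')
first_path, first_written = save_image_to(target, b'fake image bytes')
second_path, second_written = save_image_to(target, b'fake image bytes')
print(first_written, second_written, os.listdir(target))
```

The second call finds the file already present and skips the write, so downloading the same picture from two galleries leaves one copy on disk.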

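Enable multithreading

if __name__ == '__main__':
    # Pool is assumed to come from multiprocessing (processes) or multiprocessing.dummy (threads)
    # Each group is the offset of one page of 20 results
    groups = [x * 20 for x in range(GROUP_START + 1, GROUP_END + 1)]
    pool = Pool()
    pool.map(start_spider, groups)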
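
The article does not show where Pool is imported from. As one possibility, multiprocessing.dummy provides a thread-backed pool with the same map interface, which suits I/O-bound work such as HTTP requests. In this sketch the GROUP_START and GROUP_END values are placeholders, and start_spider is a stand-in that only reports its offset rather than crawling anything.

```python
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing.Pool API

# Placeholder values; the real ones come from the project's configuration
GROUP_START = 0
GROUP_END = 3


def start_spider(offset):
    """Stand-in for the real crawl of one result page."""
    return 'crawled offset %d' % offset


groups = [x * 20 for x in range(GROUP_START + 1, GROUP_END + 1)]

# map() blocks until every offset is processed and keeps the input order
with Pool(4) as pool:
    results = pool.map(start_spider, groups)

print(groups)
print(results)
```

With these placeholder values the offsets are 20, 40 and 60, so the page at offset 0 is only fetched if GROUP_START is set one lower.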

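Summary

These steps are enough to build a crawler that collects data. Different sites format and process their data differently, so the way we fetch and parse it has to change from site to site, but the principles and the overall approach stay the same.

Scraping with Python breaks down roughly into:

Analyze how the target page makes its requests and processes its data.
Filter the fetched data with Python (regular expressions).
Store the scraped data in the database.
Use multithreading to speed up the job.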
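
The steps above can be tied together in one driver function. The article calls start_spider but never shows it, so the following is only a guess at its shape: run_pipeline is a hypothetical name, and the fetch, parse and save functions are passed in as arguments so the flow can be exercised with stubs instead of live requests.

```python
def run_pipeline(offset, keyword, fetch_index, parse_index,
                 fetch_detail, parse_detail, save):
    """Crawl one list page and return how many galleries were saved."""
    index_html = fetch_index(offset, keyword)
    if not index_html:
        return 0
    saved = 0
    for url in parse_index(index_html):
        if not url:
            continue  # list entry without a link
        detail_html = fetch_detail(url)
        if not detail_html:
            continue  # request failed or returned a non-200 status
        result = parse_detail(detail_html, url)
        if result and save(result):
            saved += 1
    return saved


# Stubbed run: three links, one list entry without a link, one failed download
stored = []
count = run_pipeline(
    20, 'street',
    fetch_index=lambda offset, keyword: 'index page',
    parse_index=lambda html: ['u1', None, 'u2', 'u3'],
    fetch_detail=lambda url: None if url == 'u3' else 'detail of ' + url,
    parse_detail=lambda html, url: {'url': url, 'images': []},
    save=lambda result: stored.append(result) is None,
)
print(count, [doc['url'] for doc in stored])
```

With the article's functions, the call would pass get_page_index, parse_page_index, get_page_detail, parse_page_detail and save_to_mongo in those five slots.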

Closing words

We can assert that nothing great has ever been accomplished without passion. -- Hegel
