
Example 1

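Description: starting from the CSDN search home page, enter keywords to reach the list of search results for those keywords, manually extract the URL link of each result, send a new request to each URL, fetch the page source, extract the title, and save the page as an HTML file.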

Environment: Python 3.7 + PyCharm

Libraries: requests, re, BeautifulSoup

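Analyzing the page source: on the search page, locate the URL of each list item. Each item is wrapped in a <dl> tag, and its URL sits in that tag's data-track-view attribute. Follow the item's URL to the detail page, then find and extract the title, publication time, and category. Pay attention to decoding: pages are written in different encodings, so each response has to be decoded accordingly. The search results also span more than one page. Comparing the URLs of two consecutive pages shows that they differ in one place, the p parameter, so the params argument of the requests library's get method is used to change the URL and keep crawling.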
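
The two ideas in this analysis, paging through the p parameter and pulling the link out of data-track-view, can be sketched offline as follows. This is only an illustration: the helper names build_search_url and extract_link and the sample attribute value are assumptions made for this sketch, and urllib.parse stands in for requests so that nothing is downloaded. Passing the same dictionary as params to requests.get should yield an equivalent query string.

```python
import re
from urllib.parse import urlencode

SEARCH_URL = "https://so.csdn.net/so/search/s.do"


def build_search_url(page, keyword):
    """Return the search URL for one result page (p = page number, q = keyword)."""
    return SEARCH_URL + "?" + urlencode({"p": page, "q": keyword})


def extract_link(track_view):
    """Return the "con" value of a data-track-view attribute string, or None."""
    if not track_view:
        return None
    match = re.search(r'"con":"(.*?)"', track_view)
    return match.group(1) if match else None


# Hypothetical attribute value, shaped only to fit the regex used in this post.
sample_attr = '{"mod":"popu_000","con":"https://blog.csdn.net/example/article/details/1"}'

print(build_search_url(1, "python"))
print(build_search_url(2, "python"))
print(extract_link(sample_attr))
```

Only the value of p changes between the two printed URLs, which is the difference noticed when comparing consecutive result pages.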

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup

url = "https://so.csdn.net/so/search/s.do"
# s = input()
s = 'python'

# Fetch result pages 1 to 10; the page number is passed as the "p" query parameter
for p in range(1, 11):
    kv = {'p': '%d' % p, 'q': '%s' % s}
    r = requests.get(url, params=kv)
    r.encoding = 'utf-8'
    so_url = r.request.url
    html = r.text
    # print(requests.get(so_url).text)
    soup = BeautifulSoup(html, "html.parser")
    for dl in soup.find_all('dl'):
        text = dl.prettify()
        search_url = dl.get('data-track-view')
        # Skip <dl> tags that carry no data-track-view attribute
        if not search_url:
            continue
        search = re.findall(r'"con":"(.*?)"', search_url)[0]
        content = requests.get(search).text
        # print(content)
        title = re.findall(r'<div class="limit_width">\n.*?<a.*?>(.*?)</a>\n.*?<a', text, re.S)[0]
        # Strip highlight tags, spaces, and newlines from the title
        title = title.replace('<em>', '')
        title = title.replace('</em>', '')
        title = title.replace(' ', '')
        title = title.replace('\n', '')
        # Save the page; the with block closes the file after writing
        with open('%s.html' % title, 'w', encoding='utf-8') as fb:
            fb.write(content)
        print(search, title)
        # exit()
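
The title clean-up in the loop above could be factored into a small helper. The sketch below is one possible version and not part of the original script: the function names clean_title and safe_filename are made up here, and replacing characters that are illegal in file names is an extra step the original code does not take.

```python
import re


def clean_title(raw):
    """Strip <em> highlight tags and all whitespace from a search-result title."""
    title = re.sub(r"</?em>", "", raw)
    return re.sub(r"\s+", "", title)


def safe_filename(title, fallback="untitled"):
    """Replace characters that common file systems reject in file names."""
    name = re.sub(r'[\\/:*?"<>|]', "_", title)
    return name or fallback


# Made-up title text for illustration only.
raw_title = "<em>Python</em> crawler\n basics: part 1/2"
print(safe_filename(clean_title(raw_title)))
```

A title containing a slash or a colon would otherwise make open() fail or write to an unexpected path, which is why the second helper may be worth adding.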

The main part of this is

The result is as follows:

Example 2

Description: scrape the chapters of a web novel.

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

import requests

# Download the novel's chapter-list page
url = 'http://www.17k.com/list/2932117.html'
# Send the HTTP request the way a browser would
response = requests.get(url)
response.encoding = 'utf-8'
html = response.text
# Novel title
title = re.findall(r'<h1 class="Title">(.*?)</h1', html)[0]
# print(title)
dl = re.findall(r'<dl class="Volume">.*?</dl>', html, re.S)[0]
dd = re.findall(r'<dd>.*?</dd>', dl, re.S)[0]
# Regex pitfall: "." does not match a newline unless re.S is set
chapter_info_list = re.findall(r'href="(.*?)".*?>\n.*?<span class="ellipsis.*?">\n\s{60}(.*?)\s{52}<', dd)
# Create a file to hold the novel text; the with block closes it at the end
with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
    # Loop over the chapters and download each one
    for chapter_info in chapter_info_list:
        # chapter_title = chapter_info[1]
        # chapter_url = chapter_info[0]
        chapter_url, chapter_title = chapter_info
        chapter_url = "http://www.17k.com%s" % chapter_url
        # print(chapter_url, chapter_title)
        # Download the chapter page
        chapter_response = requests.get(chapter_url)
        chapter_response.encoding = 'utf-8'
        chapter_html = chapter_response.text
        # Extract the chapter text
        chapter_content = re.findall(r'<div class="p">(.*?)<div class="author-say"></div>', chapter_html, re.S)[0]
        # Clean the data
        chapter_content = chapter_content.replace(' ', '')
        chapter_content = chapter_content.replace('　', '')
        chapter_content = chapter_content.replace('<br/>', '')
        # Persist to disk
        fb.write(chapter_title)
        fb.write(chapter_content)
        fb.write('\n')
        print(chapter_url, chapter_title)
        # exit()
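
The chapter-list pattern above depends on exact whitespace counts (\s{60} and \s{52}), so it would stop matching if the site changed its indentation. A more tolerant variant might use \s* instead. The sketch below runs on a hypothetical fragment shaped like the markup that pattern implies; the real page may differ, and the name CHAPTER_RE is introduced only for this example.

```python
import re

# Hypothetical <dd> fragment; the real 17k.com markup and indentation may differ.
sample_dd = """<dd>
<a target="_blank" href="/chapter/2932117/1.html" title="one">
    <span class="ellipsis ">
            Chapter One
    </span>
</a>
<a target="_blank" href="/chapter/2932117/2.html" title="two">
    <span class="ellipsis ">
            Chapter Two
    </span>
</a>
</dd>"""

# \s* absorbs any amount of whitespace (newlines included), and [^>]* keeps
# each step inside a single tag, so the re.S flag is not needed here.
CHAPTER_RE = re.compile(
    r'href="([^"]*)"[^>]*>\s*<span class="ellipsis[^"]*">\s*(.*?)\s*<'
)

chapters = CHAPTER_RE.findall(sample_dd)
for chapter_url, chapter_title in chapters:
    print(chapter_url, chapter_title)
```

Each match is a (url, title) tuple, the same shape the download loop unpacks.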

Result:

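Problems encountered along the way: the key issue was the regular-expression matching. The "." metacharacter matches any character except the newline "\n". The HTML source contains newlines between tags, but they are not visible when reading the page. I did not notice this at first, so the pattern matched nothing and the returned list was empty.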
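
The pitfall is easy to reproduce on a small string. The snippet below is a minimal illustration with made-up HTML: the same pattern returns an empty list without re.S and a match with it, and matching the whitespace explicitly with \s* is another way around the problem.

```python
import re

# Made-up fragment with newlines between the tags.
html = '<dd>\n<a href="/chapter/1.html">Chapter 1</a>\n</dd>'

# "." stops at "\n", so nothing matches.
without_flag = re.findall(r"<dd>(.*?)</dd>", html)

# re.S (re.DOTALL) lets "." match newlines as well.
with_flag = re.findall(r"<dd>(.*?)</dd>", html, re.S)

# Alternative: consume the surrounding whitespace explicitly with \s*.
explicit = re.findall(r"<dd>\s*(.*?)\s*</dd>", html)

print(without_flag)
print(with_flag)
print(explicit)
```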
