Garbled text when scraping web pages with a Python crawler

I'm planting a flag here: I am going to work through this problem properly.

For example, when you fetch web pages with the requests library, the text often comes back garbled, especially on somewhat larger sites such as Baidu or Sina News.

# coding: utf-8
import requests
# import urllib.request

# Note on the garbled output: running this file produces garbled text,
# but running Html_download2 does not.
# Honestly...
class HtmlDownload(object):
    def download(self, url):
        if url is None:
            return None
        response = requests.get(url)
        if response.status_code != 200:
            return None
        # Get the full HTML content

        print(response.encoding)
        # response.encoding = 'utf8'
        # print(response.encoding)

        return response.text


hd = HtmlDownload()
url = 'https://baike.baidu.com/'
html_content = hd.download(url)
print(html_content)

If you print the fetched page, it comes out garbled, as in the screenshot below.

image.png

Why does this happen? When I was just getting started with Python, encoding problems left me with a lasting dread of anything to do with encodings.

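After reading the requests source code, I roughly found the cause: when requests cannot find a declared encoding, it has to pick one itself, and that pick can be wrong. More precisely, response.encoding is taken from the charset in the Content-Type response header. If the header has a text type but no charset, requests falls back to ISO-8859-1. The encoding guessed from the page bytes is available separately as response.apparent_encoding.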
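The header rule described above can be sketched with the standard library alone. This is a simplified illustration, not the code requests actually runs: the function name `encoding_from_content_type` is made up for this example, and the real library handles more cases.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Simplified sketch of a header-based encoding rule (not requests' own code)."""
    if not content_type:
        return None  # no Content-Type header: nothing is declared
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased charset, or None
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        # Text type without a charset: fall back to the HTTP default.
        return "ISO-8859-1"
    return None


for header in ("text/html; charset=utf-8", "text/html", None):
    print(repr(header), "->", encoding_from_content_type(header))
```

A page served as plain `text/html` with no charset lands in the ISO-8859-1 branch, which matches what the Baidu Baike example reports.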

# coding: utf-8
import requests
# import urllib.request

class HtmlDownload(object):
    def download(self, url):
        if url is None:
            return None
        response = requests.get(url)
        if response.status_code != 200:
            return None
        # Get the full HTML content
        print("ok")
        print(">>test")
        # Print both encodings reported for the response.
        # encoding: taken from the HTTP response headers
        print('encoding:', response.encoding)
        # apparent_encoding: guessed from the bytes of the page content
        print('apparent_encoding:', response.apparent_encoding)
        return response.text

hd = HtmlDownload()
url = 'https://baike.baidu.com/'
html_content = hd.download(url)
# print(html_content)

image.png

print('encoding:', response.encoding)
print('apparent_encoding:', response.apparent_encoding)
print ISO-8859-1 and utf-8 respectively. That mismatch is the problem: the page body is UTF-8 bytes, but response.text decodes it as ISO-8859-1.
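To see why that mismatch garbles the text, here is a small standalone sketch that needs no network access. The sample string is just an illustration; any non-ASCII text behaves the same way.

```python
# A short Chinese string, encoded the way a UTF-8 page would send it.
original = "百度百科"
raw_bytes = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 never raises an error, because every
# byte maps to some character, but the result is garbled.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the correct encoding recovers the text.
correct = raw_bytes.decode("utf-8")

print(len(original), len(garbled))  # 4 characters vs. 12 characters
print(garbled == original, correct == original)
```

Each of the four characters takes three bytes in UTF-8, and ISO-8859-1 turns every byte into its own character, so the wrong decoding silently produces twelve unrelated characters instead of failing.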

So the fix is simple: explicitly set the encoding of the response text to UTF-8.

Insert the following line before reading response.text:

response.encoding = 'utf8'

image.png

Fetch the page again:

image.png

Problem solved, hooray!

Here is the full source code:

# coding: utf-8
import requests
# import urllib.request

# Note on the garbled output: running this file used to produce garbled text,
# but running Html_download2 did not.
# Honestly...
class HtmlDownload(object):
    def download(self, url):
        if url is None:
            return None
        response = requests.get(url)
        if response.status_code != 200:
            return None
        # Get the full HTML content

        print("ok")
        print(">>test")
        # Encodings before the fix
        print('encoding:', response.encoding)
        print('apparent_encoding:', response.apparent_encoding)
        # The fix: decode the body as UTF-8
        response.encoding = 'utf8'
        print('encoding :', response.encoding)
        return response.text


hd = HtmlDownload()
url = 'https://baike.baidu.com/'
html_content = hd.download(url)
print(html_content)
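The source above hard-codes UTF-8, which is right for this page. For pages in other encodings, one possible variation is to keep a charset the server actually declares and otherwise fall back to the guessed encoding. The sketch below is only an illustration under that assumption: `choose_encoding` is a hypothetical helper, and the stand-in object merely imitates the attribute names of a requests response so the example runs offline.

```python
from types import SimpleNamespace


def choose_encoding(response):
    """Keep a charset declared in the header; otherwise use the guessed one."""
    content_type = response.headers.get("Content-Type", "")
    if "charset" not in content_type.lower():
        # No declared charset: use the encoding guessed from the body.
        response.encoding = response.apparent_encoding
    return response.encoding


# Stand-in with the same attribute names as a requests response,
# so the sketch runs without network access.
fake = SimpleNamespace(
    headers={"Content-Type": "text/html"},
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
)
print(choose_encoding(fake))  # utf-8
```

With a real response, the helper would be called before reading response.text. Keep in mind that apparent_encoding is a guess made from the bytes, so it can be wrong, especially on short pages.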

References on encoding:
http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html
http://blog.chinaunix.net/uid-13869856-id-5747417.html
http://blog.csdn.net/wyb199026/article/details/52562538
