2021-06-26

Introduction to Web Scraping and Comprehensive Applications

(Part 1)

1. Getting to know the Requests library

1) Requests is a very popular HTTP library written in Python. It makes fetching web pages straightforward, and it is the third-party library most commonly used to issue requests when scraping.

Install the Requests library

pip install requests

Using the Requests library

Example 1) Try requesting data from the Baidu homepage:

import requests
# Send an HTTP request
re = requests.get("https://www.baidu.com")
# Check the response status
print(re.status_code)
# Output: 200
# 200 is the response status code; it means the request succeeded
# We can use the value of re.status_code to decide whether a request succeeded

Test result: (screenshot of the output omitted)

Explanation: requests.get() returns a Response object, here named re.
re.status_code is the HTTP status code of the response.
Extension:
re.text is the response body as a string
re.content is the response body as bytes
re.encoding is the encoding used to decode the response body
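These attributes can be illustrated offline with a hand-built Response object. This is a sketch for demonstration only: `requests.models.Response` and its internal `_content` field are normally filled in by requests.get(), not by hand.

```python
import requests

# Build a Response by hand purely to illustrate the attributes;
# in real use, requests.get(...) returns one already filled in.
resp = requests.models.Response()
resp.status_code = 200
resp.encoding = 'utf-8'
resp._content = '百度一下'.encode('utf-8')  # raw body bytes, as received

print(resp.status_code)  # 200
print(resp.content)      # the raw bytes
print(resp.text)         # the bytes decoded using resp.encoding
```

Note how text is simply content decoded with encoding: `resp.text == resp.content.decode(resp.encoding)`.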
Example 2) Fetch the content of a specific page

import requests
# Send an HTTP request
re = requests.get('https://apiv3.shanbay.com/codetime/articles/mnvdu')
# Check the response status
print('The page returned status code %s' % re.status_code)
with open('魯迅文章4.txt', 'w') as file:
  # Write the string form of the response into the file
  print('Fetching the article')
  file.write(re.text)

Result: (screenshots omitted)

Explanation: re.text is used for fetching and saving text content.
Example 3) Fetching and downloading images, video, audio, and similar content


import requests
# Send an HTTP request
# Download an image
res = requests.get('https://img-blog.csdnimg.cn/20210424184053989.PNG')
# Open a file named datawhale.png in binary write mode
with open('datawhale.png', 'wb') as ff:
    # Write the bytes form of the response into the file
    ff.write(res.content)

Result: (screenshot omitted)

Explanation:
re.text is used for fetching and saving text content
re.content is used for fetching and saving images, video, audio, and other binary content
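A minimal offline sketch of why binary content needs 'wb' mode (the byte string below is made-up stand-in data, not a real image):

```python
import os
import tempfile

# Stand-in bytes for an image (made-up data, not a real PNG)
data = b'\x89PNG\r\n\x1a\n' + b'\x00' * 16

path = os.path.join(tempfile.gettempdir(), 'demo.png')

# Text mode ('w') rejects bytes, which is why example 2 ('w' + re.text)
# and example 3 ('wb' + res.content) open their files differently.
try:
    with open(path, 'w') as f:
        f.write(data)
except TypeError:
    print('text mode rejects bytes')

with open(path, 'wb') as f:  # binary mode accepts bytes
    f.write(data)
print(os.path.getsize(path))  # 24 bytes written
```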

(Part 2)

1. HTML parsing and extraction

1) How a browser works
When you enter a URL, the browser sends a request to the server, and the server responds. What the server returns to the browser is HTML code, and the browser parses that HTML into the page we normally see.
Application
Example 1) Look at Baidu's HTML page

import requests
res=requests.get('https://baidu.com')
print(res.text)

Result: (screenshot omitted)

Explanation: you will see a lot of information wrapped in tags. That is HTML (HyperText Markup Language, a markup language made up of tags).

(Part 3)

1. Introduction to BeautifulSoup

1) Installation

pip install bs4

2) Overview:

Import BeautifulSoup with the statement from bs4 import BeautifulSoup

Then the statement BeautifulSoup(res.text, 'lxml') parses the string form of the page source into a BeautifulSoup object
The object's find() method returns the first piece of data matching the condition
The object's find_all() method returns all pieces of data matching the condition
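A small self-contained sketch of find() versus find_all(), using an inline HTML snippet and Python's built-in 'html.parser' so it runs without network access (the article's examples use 'lxml', which behaves the same here):

```python
from bs4 import BeautifulSoup

# A tiny inline page, so the example needs no network access
html = '''
<ul>
  <li><a href="/book/1">Book One</a></li>
  <li><a href="/book/2">Book Two</a></li>
</ul>
'''
# 'html.parser' ships with Python; 'lxml' needs `pip install lxml`
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')           # first matching tag only
print(first.text)                # Book One

all_links = soup.find_all('a')   # every matching tag, as a list
print([a['href'] for a in all_links])  # ['/book/1', '/book/2']
```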
3) Application
Example 1) Parse Douban Books

import io
import sys
import requests
from bs4 import BeautifulSoup
### If the output is garbled, you can change the encoding:
#sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
###
headers = {
  'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
res = requests.get('https://book.douban.com/top250', headers=headers)
soup = BeautifulSoup(res.text, 'lxml')
print(soup.find('a'))

Result: (screenshot omitted)

(Part 4)

Source: the Datawhale office-automation course

1. Practice project: scraping Ziroom apartment data

import requests
from bs4 import BeautifulSoup
import random
import time
import csv

# A large pool of user agents is defined here;
# rotating them gives the scraper some protection
user_agent = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"]

def get_info():
    csvheader = ['名稱','面積','朝向','戶型','位置','樓層','是否有電梯','建成時間','門鎖','綠化']
    with open('wuhan_ziru.csv', 'a+', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(csvheader)
        for i in range(1, 51):  # 50 pages in total
            print('Scraping Ziroom page %s' % i)
            timelist = [1, 2, 3]
            print('Taking a short break (¬?¬)')
            time.sleep(random.choice(timelist))  # sleep 1-3 seconds to avoid overloading the server
            url = 'https://wh.ziroom.com/z/p%s/' % i
            headers = {'User-Agent': random.choice(user_agent)}
            r = requests.get(url, headers=headers)
            r.encoding = r.apparent_encoding
            soup = BeautifulSoup(r.text, 'lxml')
            all_info = soup.find_all('div', class_='info-box')
            print('Getting to work (?>?<?)')
            for info in all_info:
                href = info.find('a')
                if href is not None:
                    href = 'https:' + href['href']
                    try:
                        print('Scraping %s' % href)
                        house_info = get_house_info(href)
                        writer.writerow(house_info)
                    except Exception:
                        print('Something went wrong, cannot open %s ( ??? ? ??? )' % href)

def get_house_info(href):
    # Fetch the details of a single listing
    time.sleep(1)
    headers = {'User-Agent': random.choice(user_agent)}
    response = requests.get(url=href, headers=headers)
    response = response.content.decode('utf-8', 'ignore')
    soup = BeautifulSoup(response, 'lxml')
    name = soup.find('h1', class_='Z_name').text
    sinfo = soup.find('div', class_='Z_home_b clearfix').find_all('dd')
    area = sinfo[0].text
    orien = sinfo[1].text
    area_type = sinfo[2].text
    dinfo = soup.find('ul', class_='Z_home_o').find_all('li')
    location = dinfo[0].find('span', class_='va').text
    loucen = dinfo[1].find('span', class_='va').text
    dianti = dinfo[2].find('span', class_='va').text
    niandai = dinfo[3].find('span', class_='va').text
    mensuo = dinfo[4].find('span', class_='va').text
    lvhua = dinfo[5].find('span', class_='va').text
    # Column order matches csvheader: 名稱, 面積, 朝向, 戶型, 位置, 樓層, 是否有電梯, 建成時間, 門鎖, 綠化
    room_info = [name, area, orien, area_type, location, loucen, dianti, niandai, mensuo, lvhua]
    return room_info

if __name__ == '__main__':
    get_info()
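The csv.writer pattern used in get_info() can be checked in isolation against an in-memory buffer instead of wuhan_ziru.csv (the data row below is made up):

```python
import csv
import io

# Same csv.writer pattern as get_info(), but writing into an
# in-memory buffer instead of a file, so it runs standalone.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['名稱', '面積', '朝向'])    # header row
writer.writerow(['示例公寓', '20㎡', '南'])  # one made-up data row
rows = buf.getvalue().splitlines()
print(rows)  # ['名稱,面積,朝向', '示例公寓,20㎡,南']
```

The real script passes newline='' to open() for the same reason csv recommends it: the writer emits its own line endings.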

Practice project 2: scraping 36kr news flashes and sending them by email

import requests
import random
from bs4 import BeautifulSoup
import smtplib  # module for sending mail
from email.mime.text import MIMEText  # defines the mail body
from email.header import Header  # defines the mail subject

smtpserver = 'smtp.163.com'

# Username and password of the sending mailbox
user = ''
password = ''

# Sender and receiver addresses
sender = ''
receive = ''

user_agent = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"]

def main():
    print('Fetching data')
    url = 'https://36kr.com/newsflashes'
    headers = {'User-Agent': random.choice(user_agent)}
    response = requests.get(url, headers=headers)
    response = response.content.decode('utf-8', 'ignore')
    soup = BeautifulSoup(response, 'lxml')
    news = soup.find_all('a', class_='item-title')
    news_list = []
    for i in news:
        title = i.get_text()
        href = 'https://36kr.com' + i['href']  # links on the page are relative
        news_list.append(title + '<br>' + href)
    info = '<br></br>'.join(news_list)
    print('Sending the message')
    send_email(info)

def send_email(content):
    # Send via the 163 mailbox configured above
    title = '36kr快訊'
    subject = title
    msg = MIMEText(content, 'html', 'utf-8')
    msg['Subject'] = Header(subject, 'utf-8')
    msg['From'] = sender
    msg['To'] = receive
    # SSL connections use port 465
    smtp = smtplib.SMTP_SSL(smtpserver, 465)
    # EHLO identifies the client to the server
    smtp.ehlo()
    # Log in to the mail server with the username and password
    smtp.login(user, password)
    smtp.sendmail(sender, receive, msg.as_string())
    smtp.quit()

if __name__ == '__main__':
    main()
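The message-building half of send_email() can be tried offline, without connecting to smtp.163.com (the addresses below are placeholders, not real mailboxes):

```python
from email.header import Header
from email.mime.text import MIMEText

# Build the same kind of message send_email() sends, without
# actually connecting to any SMTP server.
msg = MIMEText('headline one<br>headline two', 'html', 'utf-8')
msg['Subject'] = Header('36kr快訊', 'utf-8')
msg['From'] = 'sender@example.com'       # placeholder address
msg['To'] = 'receiver@example.com'       # placeholder address

raw = msg.as_string()
print('Content-Type: text/html; charset="utf-8"' in raw)  # True
```

This is the string that sendmail() ultimately transmits; the HTML content type is what lets the `<br>` tags render as line breaks in the received mail.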
