R Web Scraping Essentials: Using the rvest Package

The previous post, "R Web Scraping Essentials: Static and Dynamic Web Pages", briefly introduced the types of web pages. In practice, different kinds of pages call for different scraping methods. Scraping in R usually relies on three packages: rvest, RCurl, and httr.

Scraping static web pages with rvest

A static web page is one where the data you want is visible as soon as you open the page, and the same data also appears in the HTML when you right-click and view the page source. For pages like this, rvest offers a fairly complete scraping workflow; combined with a few small tools, it lets you build a scraper quickly.

Scraping dynamically loaded data with RCurl/httr

For data that a page loads dynamically, rvest alone may no longer be suitable, and R offers other options for the job. RCurl is powerful but somewhat difficult for beginners. httr is roughly a streamlined counterpart to RCurl: lighter and easier to pick up, not as full-featured as RCurl, but much friendlier to use.

Today we start with the simplest option, rvest, the package for scraping static pages. Its scraping logic is very clear, so beginners can understand it and get started quickly. Here are the core rvest functions for data extraction:

read_html(): download and parse a web page
html_nodes(): locate and retrieve nodes
html_text(): extract the text of nodes
html_attr(): extract attribute values of nodes
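
To make the four functions concrete before the real example, here is a minimal sketch that runs offline. The HTML string, its class names, and its URLs are made up for illustration; read_html() is given the markup directly instead of a URL.

```r
library(rvest)

# A tiny hand-written page standing in for a real listing site (hypothetical markup).
page <- read_html('<html><body>
  <div class="title"><a href="/house/1.html">Sunny two-bedroom</a></div>
  <div class="title"><a href="/house/2.html">Quiet studio</a></div>
</body></html>')

links  <- html_nodes(page, "div.title a")  # locate the <a> nodes
titles <- html_text(links)                 # text inside each node
hrefs  <- html_attr(links, "href")         # value of each node's href attribute

print(titles)
print(hrefs)
```

The same three calls after read_html() are what the Lianjia script uses, only with selectors that match that site's markup.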

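How are these functions used? Let's look at a simple case: using rvest to scrape second-hand housing listings from Lianjia, including each listing's title, basic details (layout, area, floor, and so on), address, total price, and price per square meter.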

To keep things simple, and for demonstration only, we scrape just the first page: http://hz.lianjia.com/ershoufang/pg1. The code is as follows:

# Load the required packages
library("rvest")
library("stringr")
# Download and parse the page
web <- read_html("http://hz.lianjia.com/ershoufang/pg1", encoding = "UTF-8")
# Extract the listing titles
house_name <- web %>% html_nodes("div.info div.title a") %>% html_text()
# Extract the link to each listing's detail page (html_attr(), not html_attrs())
house_link <- web %>% html_nodes("div.info div.title a") %>% html_attr("href")
# Extract the basic listing info and remove spaces
house_basic_inf <- web %>% html_nodes(".houseInfo") %>% html_text()
house_basic_inf <- str_replace_all(house_basic_inf, " ", "")
# Extract the addresses, keeping the odd-numbered entries of the 60 matches
house_address <- web %>% html_nodes(".positionInfo a") %>% html_text()
house_address <- house_address[seq(1, 60, 2)]
# Extract the total prices
house_totalprice <- web %>% html_nodes(".totalPrice") %>% html_text()
# Extract the prices per square meter
house_unitprice <- web %>% html_nodes(".unitPrice span") %>% html_text()
# Store the extracted fields in a data frame
house <- data.frame(house_name, house_basic_inf, house_address, house_totalprice, house_unitprice)
# Write the data to a CSV file
write.csv(house, file = "../house.csv")
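
One fragile spot in a script like this is that every column is scraped independently, so data.frame() fails as soon as one selector returns a different number of matches. A possible alternative, sketched here on hypothetical inline markup (the li.item container and its contents are assumptions, not Lianjia's real structure), is to select each listing first and then call html_node() inside it, which yields exactly one result per listing and NA where a field is missing.

```r
library(rvest)

# Hypothetical markup: the second listing has no price element.
page <- read_html('<html><body><ul>
  <li class="item"><div class="title"><a href="/h/1.html">Flat A</a></div><div class="totalPrice">128</div></li>
  <li class="item"><div class="title"><a href="/h/2.html">Flat B</a></div></li>
</ul></body></html>')

items <- html_nodes(page, "li.item")  # one node per listing

# html_node() (singular) returns one match per input node, so columns stay aligned.
house <- data.frame(
  name  = html_text(html_node(items, "div.title a")),
  price = html_text(html_node(items, "div.totalPrice")),
  stringsAsFactors = FALSE
)

print(house)
```

Whether this is worth the extra step depends on how regular the target page is; on a page where every listing has every field, the column-by-column approach gives the same result.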

The logic of the code is as follows: read_html() fetches the whole page; html_nodes() locates the relevant nodes; html_text() extracts the text of those nodes; html_attr() extracts their attribute values. For example:

# read_html() fetches the whole page
web <- read_html("http://hz.lianjia.com/ershoufang/pg1", encoding = "UTF-8")
# html_nodes() gets the nodes that contain the listing titles
tmp <- html_nodes(web, "div.info div.title a")
# html_text() gets the text of those nodes
house_name <- html_text(tmp)
# html_attr() gets the href attribute of those nodes
house_link <- html_attr(tmp, "href")

For convenience, the extraction steps can be chained with the %>% pipe operator:

web <- read_html("http://hz.lianjia.com/ershoufang/pg1", encoding = "UTF-8")
house_name <- web%>%html_nodes("div.info div.title a")%>%html_text()
house_link <- web%>%html_nodes("div.info div.title a")%>%html_attr("href")

read_html(): fetch the whole page

For example, web <- read_html("http://hz.lianjia.com/ershoufang/pg1", encoding = "UTF-8") returns the parsed source of the page.

web <- read_html("http://hz.lianjia.com/ershoufang/pg1", encoding = "UTF-8")
web
#{xml_document}
#<html>
#[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n< ...
#[2] <body>\n<script type="application/ld+json">{"@context": "https://ziyuan.baidu. ...

read_html() takes the URL to scrape and the page encoding. The URL is straightforward, but how do you determine the encoding? Open the page, right-click and choose Inspect to view the source, find and expand the head tag, and look for "Content-Type"; it is usually followed by something like "charset=xxxxxx". For this page it reads charset=utf-8, which means the page is encoded in UTF-8.
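
The same lookup can also be done from R once a page has been parsed. The sketch below is only an illustration on a hand-written head section (the markup is an assumption modeled on the meta tag described above); it reads the content attribute of the Content-Type meta tag and keeps what follows "charset=".

```r
library(rvest)

# Hand-written page with the kind of meta tag described above (hypothetical markup).
page <- read_html('<html><head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body></body></html>')

# Locate the meta tag and read its content attribute.
meta_node    <- html_nodes(page, "meta[http-equiv='Content-Type']")
content_type <- html_attr(meta_node, "content")

# Keep only the part after "charset=".
charset <- sub(".*charset=", "", content_type)
print(charset)
```

Pages that declare their encoding differently, for example with a meta charset attribute or only in the HTTP headers, would need a different lookup.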

html_nodes(): locate the relevant nodes

For example, tmp <- html_nodes(web, "div.info div.title a") returns the nodes matched by "div.info div.title a"; these nodes contain each listing's title and detail link.

tmp <- html_nodes(web, "div.info div.title a")
tmp
# {xml_nodeset (30)}
# [1] <a class=""  target ...
# [2] <a class=""  target ...
# [3] <a class=""  target ...
# [4] <a class=""  target ...
# [5] <a class=""  target ...
# [6] <a class=""  target ...
# [7] <a class=""  target ...
# [8] <a class=""  target ...
# [9] <a class=""  target ...
# [10] <a class=""  target ...
# [11] <a class=""  target ...
# [12] <a class=""  target ...
# [13] <a class=""  target ...
# [14] <a class=""  target ...
# [15] <a class=""  target ...
# [16] <a class=""  target ...
# [17] <a class=""  target ...
# [18] <a class=""  target ...
# [19] <a class=""  target ...
# [20] <a class=""  target ...
# ...

html_nodes() is the key step: it returns the nodes that hold the data to be scraped. It takes two inputs: the parsed page web returned by read_html(), and an expression that locates the nodes, which can be either a CSS selector or an XPath expression. Here "div.info div.title a" is a CSS selector. How to build CSS selectors will be covered in detail in the next post.
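
As a small illustration of the two kinds of expression, the sketch below selects the same node once with a CSS selector and once with XPath. The page is hand-written for the example, and the XPath shown is just one possible translation of the CSS selector.

```r
library(rvest)

# Hypothetical markup with one link inside div.info > div.title and one elsewhere.
page <- read_html('<html><body>
  <div class="info"><div class="title"><a href="/h/1.html">Flat A</a></div></div>
  <div class="other"><a href="/about.html">About</a></div>
</body></html>')

# CSS selector: <a> inside div.title inside div.info.
by_css <- html_nodes(page, css = "div.info div.title a")

# XPath with the same intent. Note that @class='info' matches only an exact
# class attribute, while the CSS .info also matches class="info extra".
by_xpath <- html_nodes(page, xpath = "//div[@class='info']//div[@class='title']//a")

print(html_text(by_css))
print(html_text(by_xpath))
```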

html_text(): get the text of nodes

For example, house_name <- html_text(tmp) extracts the text of the nodes, which here is the listing title.

house_name <- html_text(tmp)
house_name
# [1] "新出！景溪北苑，精裝好房，看房 方便，"
# [2] "精裝修三房，滿五年且唯一，看房方便，房東誠售"
# [3] "濱江寶龍旁 精裝修單身公寓 拎包入住 房東誠心出售"
# [4] "滿兩年自住精裝  邊套大客廳 房東誠心出售"
# [5] "房子滿五年，精裝修，邊套，兩房加開放式書房"
# [6] "視野開闊，采光好，適合居住，小區新，D鐵近，教育好"
# [7] "此房視野開闊，一房朝南一房朝北，總價低，交通方便。"
# [8] "臨平山北 野風啟城 房東誠心出售 中裝修 價格實惠"
# [9] "滿2年，自住精裝修，看房方便，誠心賣"
# [10] "龍湖品質 地鐵口 精裝四房二衛  三陽臺 誠心出售"
# [11] "小區位置好樓層采光好，視野開闊，自住裝修。"
# [12] "國際城滿兩年剛需三房，雙地鐵，誠心賣可隨時簽約"
# [13] "投/資和過度首/選，小面積、低總價、地鐵口"
# [14] "贊成林風標準三房，標準的3房2衛2陽臺戶型"
# [15] "采光好 房東誠心賣 小區環境戶型好通透，一眼即中."
# [16] "荷塘極少戶型， 本房滿2，無增值稅，戶型方正，"
# [17] "融創品質 南北通透大陽臺 中間樓層 誠心出售看房方便"
# [18] "中間樓層 南面無樓幢 視野好 陽光足 裝修清爽"
# [19] "次新房、戶型好、業主自住精裝，隨時看房，近地鐵"
# [20] "此房房東自住精裝，拎包入住，夾邊套經典戶型"
# [21] "稅費少，總價低，剛需改善皆可。"
# [22] "香樟名苑 三室二廳二衛一廚   有車庫 價格面談 帶書房"
# [23] "新出！房東誠心出售。價格可談，看房隨時"
# [24] "房子 精裝修 滿倆年 帶露臺 阿里巴巴"
# [25] "開發商精裝修.未住人.戶型通透.帶小露臺.拎包即住"
# [26] "交通方便，南北通暢，配套設施齊全！"
# [27] "此房視野開闊，一房朝南一房朝北，總價低，交通方便。"
# [28] "梅堰小區 2室1廳 128萬"
# [29] "雙地鐵次新小區滿二中間樓層.總共軟裝花了18萬"
# [30] "精裝小戶型，滿二年，朝陽，70年產權，成熟地段。"

html_text() is where the actual extraction happens, and it targets the text content of the tags. Its input is the set of nodes returned by html_nodes(). Read as a whole, house_name <- web%>%html_nodes("div.info div.title a")%>%html_text() first fetches the whole page, then locates the nodes that hold the required information, and finally extracts the text from the returned nodes.
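
Scraped text often carries stray whitespace. Besides cleaning it afterwards with stringr, html_text() has a trim argument that removes leading and trailing whitespace; unlike str_replace_all(x, " ", ""), it leaves spaces inside the string alone. A minimal sketch on a hand-written snippet (the markup and values are made up for illustration):

```r
library(rvest)

# Hypothetical node whose text is padded with spaces.
page <- read_html('<html><body><div class="houseInfo">  2 rooms | 89 sqm  </div></body></html>')

info_nodes <- html_nodes(page, ".houseInfo")

# trim = TRUE strips whitespace at both ends but keeps the inner spaces.
trimmed <- html_text(info_nodes, trim = TRUE)
print(trimmed)
```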

html_attr(): get attribute values of nodes

For example, house_link <- html_attr(tmp,"href") retrieves the value of the href attribute of each node, which is the listing's detail link.

house_link <- html_attr(tmp, "href")
house_link
# [1] "https://hz.lianjia.com/ershoufang/103109543956.html"
# [2] "https://hz.lianjia.com/ershoufang/103109306746.html"
# [3] "https://hz.lianjia.com/ershoufang/103108174516.html"
# [4] "https://hz.lianjia.com/ershoufang/103109219801.html"
# [5] "https://hz.lianjia.com/ershoufang/103107527889.html"
# [6] "https://hz.lianjia.com/ershoufang/103109328178.html"
# [7] "https://hz.lianjia.com/ershoufang/103109405297.html"
# [8] "https://hz.lianjia.com/ershoufang/103109234723.html"
# [9] "https://hz.lianjia.com/ershoufang/103109496556.html"
# [10] "https://hz.lianjia.com/ershoufang/103109299194.html"
# [11] "https://hz.lianjia.com/ershoufang/103108655599.html"
# [12] "https://hz.lianjia.com/ershoufang/103108551516.html"
# [13] "https://hz.lianjia.com/ershoufang/103109485194.html"
# [14] "https://hz.lianjia.com/ershoufang/103108968651.html"
# [15] "https://hz.lianjia.com/ershoufang/103109357513.html"
# [16] "https://hz.lianjia.com/ershoufang/103109578551.html"
# [17] "https://hz.lianjia.com/ershoufang/103109454329.html"
# [18] "https://hz.lianjia.com/ershoufang/103108832328.html"
# [19] "https://hz.lianjia.com/ershoufang/103109578729.html"
# [20] "https://hz.lianjia.com/ershoufang/103109450043.html"
# [21] "https://hz.lianjia.com/ershoufang/103105364860.html"
# [22] "https://hz.lianjia.com/ershoufang/103109300256.html"
# [23] "https://hz.lianjia.com/ershoufang/103108452618.html"
# [24] "https://hz.lianjia.com/ershoufang/103107750847.html"
# [25] "https://hz.lianjia.com/ershoufang/103108517042.html"
# [26] "https://hz.lianjia.com/ershoufang/103109088590.html"
# [27] "https://hz.lianjia.com/ershoufang/103109554013.html"
# [28] "https://hz.lianjia.com/ershoufang/103109603726.html"
# [29] "https://hz.lianjia.com/ershoufang/103109415991.html"
# [30] "https://hz.lianjia.com/ershoufang/103109101369.html"

html_attr() is also an extraction step, but it targets attributes. It takes two inputs: the nodes returned by html_nodes(), and the name of the attribute (found inside the start tag). Read as a whole, house_link <- web%>%html_nodes("div.info div.title a")%>%html_attr("href") first fetches the whole page, then locates the nodes that hold the required information, and finally extracts the href attribute value from the returned nodes.
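
html_attr() is easy to confuse with the similarly named html_attrs(). The sketch below, on hand-written markup (an assumption for illustration), shows the difference: html_attr() takes an attribute name and returns one value per node, while html_attrs() takes no name and returns all attributes of each node as a list.

```r
library(rvest)

# Hypothetical links: the second one has no href attribute.
page <- read_html('<html><body>
  <a class="t" href="/h/1.html">Flat A</a>
  <a class="t">No link</a>
</body></html>')

links <- html_nodes(page, "a")

# One attribute by name: a character vector, NA where the attribute is absent.
hrefs <- html_attr(links, "href")

# All attributes: a list with one named character vector per node.
all_attrs <- html_attrs(links)

print(hrefs)
print(all_attrs)
```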

That covers the basic usage of rvest; see the package documentation for details. With these few functions you can scrape simple static pages. The somewhat harder part is building the node expression passed to html_nodes(), as either a CSS selector or an XPath expression, which the next post covers in detail.

For more content, follow the public account "YJY技能修煉".

Previous posts
A Handy Use of R Web Scraping at Work
R Web Scraping Essentials: A First Look at HTML and CSS
R Web Scraping Essentials: Static and Dynamic Web Pages
