Week 1_Practice 1.2_Crawling Item Information from One Page

In this practice, key item information is crawled from a single web page. The page looks like this:


The information we need is "item title", "image", "review number", "price", and "star". The result is shown here:


The general process for web crawling can be described as follows (from the course website):

1) The HTML file can be opened for reading ('r') or writing ('w') with the open() function. There are two ways:

(1) file = open('absolute or relative file path', 'r'); print(file.read()); file.close()

(2) with open('absolute or relative file path', 'r') as file: print(file.read())
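The second form is usually the safer choice, because the file is closed automatically even if reading fails. A minimal sketch follows; the helper name read_html, the sample markup, and the temporary file are assumptions made for the demonstration and are not part of the course material.

```python
import os
import tempfile


def read_html(path):
    """Return the full text of the HTML file at `path`."""
    # The with-statement closes the file even if read() raises an error.
    with open(path, 'r', encoding='utf-8') as file:
        return file.read()


# Hypothetical sample page, written to a temporary folder for the demo.
sample = '<html><body><h4>Item</h4></body></html>'
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, 'index.html')
    with open(demo_path, 'w', encoding='utf-8') as file:
        file.write(sample)
    html_text = read_html(demo_path)

print(html_text)
```

In practice, the path passed to read_html would be the location of the saved course page.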

2) A unique label (i.e., a CSS path) for each target element should be identified in the HTML file. In the browser, the relevant commands are Inspect and Copy selector.

One example of a CSS path looks like:

"body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)"

Note: "nth-child" in the copied selector should be changed to "nth-of-type(n)" for BeautifulSoup.
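To illustrate what "nth-of-type(n)" selects, here is a small sketch. The HTML fragment is made up for the example (it only imitates the "ratings" block named in the path above), and the built-in 'html.parser' is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a heading followed by two <p> tags.
html = """
<div class="ratings">
  <h5>Ratings</h5>
  <p class="pull-right">18 reviews</p>
  <p><span class="glyphicon glyphicon-star"></span></p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# nth-of-type counts only siblings with the same tag name, so the <h5>
# is ignored: (1) is the first <p>, (2) is the second <p>.
first_p = soup.select('div.ratings > p:nth-of-type(1)')
second_p = soup.select('div.ratings > p:nth-of-type(2)')

print(first_p[0].get_text())
print(second_p[0].find('span') is not None)
```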

3) The CSS path is then passed to soup.select('css path'), which returns a list of the matching tags:

"stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')"

Here, "stars" is a list.
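Because select() returns a list, the result supports len(), indexing, and iteration. The sketch below uses an invented two-item fragment and a shortened selector; both are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two items, each with its own ratings block.
html = """
<div class="ratings"><p>12 reviews</p><p>stars A</p></div>
<div class="ratings"><p>7 reviews</p><p>stars B</p></div>
"""

soup = BeautifulSoup(html, 'html.parser')
stars = soup.select('div.ratings > p:nth-of-type(2)')

# One entry per matching tag, in document order.
print(len(stars))
print(stars[0].get_text())

# Each entry is a tag object, so text is extracted element by element.
star_texts = [tag.get_text() for tag in stars]
print(star_texts)
```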

4) To get the individual results from the lists, we can use the zip() function with a "for ... in" loop to iterate through the "zipped" lists:

"for title,image,review,price,star in zip(titles,images,reviews,prices,stars):"
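The behaviour of zip() can be seen with plain strings standing in for the tags. The values below are invented for the example. One point worth keeping in mind: zip() stops at the shortest list, so if one selector matches fewer elements than the others, the extra items are dropped without any error.

```python
# Plain strings stand in for the tag lists returned by soup.select().
titles = ['Item A', 'Item B']
prices = ['$10.00', '$25.50']
stars = ['★★★', '★★★★★']

rows = []
for title, price, star in zip(titles, prices, stars):
    # Each pass yields the elements at the same position in every list.
    rows.append((title, price, star))

print(rows)

# zip() stops at the shortest input: only one pair is produced here.
short = list(zip(titles, ['$10.00']))
print(short)
```

Checking that all lists have the same length before zipping is a simple way to catch a selector that missed some items.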

5) Use get_text(), get('src'), or get('href') to retrieve the desired content from each tag:

data = {
    'title': title.get_text(),    # use get_text() to extract the text
    'image': image.get('src'),    # use get() to extract the image link from the src attribute
    'review': review.get_text(),
    'price': price.get_text(),
    # use find_all() to count how many spans carry the ★ (filled star) style;
    # find_all() returns a list, so len() gives the number of elements, i.e. the number of stars
    'star': len(star.find_all("span", class_='glyphicon glyphicon-star')) * '★'
}
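Putting the five steps together, the whole process might look like the sketch below. The HTML fragment, the shortened selectors, and the 'html.parser' choice are assumptions made so the example runs on its own; the real page needs the full CSS paths copied from the browser.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two items, imitating the structure described above.
html = """
<div class="thumbnail">
  <img src="img/a.jpg">
  <div class="caption">
    <h4 class="pull-right">$24.99</h4>
    <h4><a href="#">Item A</a></h4>
  </div>
  <div class="ratings">
    <p class="pull-right">15 reviews</p>
    <p><span class="glyphicon glyphicon-star"></span><span class="glyphicon glyphicon-star"></span><span class="glyphicon glyphicon-star"></span><span class="glyphicon glyphicon-star"></span><span class="glyphicon glyphicon-star-empty"></span></p>
  </div>
</div>
<div class="thumbnail">
  <img src="img/b.jpg">
  <div class="caption">
    <h4 class="pull-right">$64.99</h4>
    <h4><a href="#">Item B</a></h4>
  </div>
  <div class="ratings">
    <p class="pull-right">6 reviews</p>
    <p><span class="glyphicon glyphicon-star"></span><span class="glyphicon glyphicon-star"></span></p>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# One list per field; every list is in document order.
titles = soup.select('div.caption > h4 > a')
images = soup.select('div.thumbnail > img')
reviews = soup.select('div.ratings > p.pull-right')
prices = soup.select('div.caption > h4.pull-right')
stars = soup.select('div.ratings > p:nth-of-type(2)')

items = []
for title, image, review, price, star in zip(titles, images, reviews, prices, stars):
    data = {
        'title': title.get_text(),
        'image': image.get('src'),
        'review': review.get_text(),
        'price': price.get_text(),
        # Count only the filled-star spans, then repeat '★' that many times.
        'star': len(star.find_all('span', class_='glyphicon glyphicon-star')) * '★',
    }
    items.append(data)
    print(data)
```

The class_ value is matched against the exact class string, so the "glyphicon glyphicon-star-empty" span in the first item is not counted.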
