Intro

(Optional) Create a virtual environment

Prefer Python 3. The command below requires virtualenvwrapper:
mkvirtualenv --python=/usr/bin/python3 python3

Run pip --version to confirm that pip belongs to the Python 3 environment.

Steps

  • scrapy startproject name - create a new project called name
  • scrapy genspider botname url - generate a spider called botname for the site at url

Set ROBOTSTXT_OBEY = True in settings.py so that the spider only crawls pages permitted by robots.txt and behaves as a good web citizen.

  • inside the project folder, run scrapy crawl botname
  • test selectors in the shell
  • run scrapy crawl botname -o xx.json (or xx.csv) to save the results to a file and inspect them

Use the shell to debug and test

scrapy shell

  • test that the url is reachable - fetch(url)
  • check the HTML that Scrapy received - view(response) opens it in a browser

Alternative XPath testing tool
http://www.freeformatter.com/xpath-tester.html

XPath docs

XPath queries run on a selector built from the response.

A selector, as the name suggests, selects parts of the HTML content:
from scrapy.selector import Selector
Since this is a common operation, response.selector.xpath() is shortened to response.xpath()

Extra
CSS can also be used to select (response.css()); Scrapy translates CSS selectors to XPath internally, so XPath is the more general of the two.

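//name or //* - select every <name> element, or every element, anywhere in the document
text() - the text content of a node, returned as a unicode string
//name[1] - the first <name> child of each parent (XPath counts from 1); for the first match in the whole document use (//name)[1] or index the Python result with [0]. The two are not equivalent
. - makes the path relative to the current selector when .xpath() is called on a selector other than response (for example .//name); a path with no leading // (for example name) is also relative
@ - selects an attribute, for example //a/@href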

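If an element has an itemprop attribute, select on it in preference to class; itemprop describes the data itself, while class names tend to describe styling.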

Tools to get an XPath quickly:

https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl
