Crawler workflow. How a crawler works: url -> html -> model (clean the data) -> analysis
- Dependencies
requests // sends HTTP requests and fetches the page
pyquery // a Python implementation of jQuery; it turns the response body into a PyQuery object so that elements can be picked with CSS selectors (page parsing)
- Extract the page data
- Loop over the urls
import os
import requests
from pyquery import PyQuery as pq
The as clause gives the imported name a shorter alias.
class Model(object):
    def __repr__(self):
        name = self.__class__.__name__
        properties = ('{}=({})'.format(k, v) for k, v in self.__dict__.items())
        s = '\n<{} \n {}>'.format(name, '\n '.join(properties))
        return s
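- Base class, used to tidy up how the crawled data is displayed. Note the return at the end: it hands back the actual field values. Without __repr__, printing an instance only shows its type and memory address; there is a screenshot in chapter 3 of the socket notes.
- __repr__() does not need to be called explicitly. print() calls it automatically when the class defines no __str__, and a list always displays its elements through it. Methods like this are also called magic methods.
- Class attributes:
__class__.__name__: returns the class name
__dict__: returns the instance attributes as a dict
- Use of (): the parentheses on the properties line build a generator expression, which join() then consumes.
- The three \n: one before <, one after the class name, and one in the '\n ' separator passed to join(), so each attribute lands on its own line.
Every string has a join() method; its argument is the sequence of elements to join.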
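To make the notes on __repr__ concrete, here is a minimal self-contained sketch. The Model class repeats the one above; the Point subclass and its two fields are hypothetical and exist only for illustration.

```python
class Model(object):
    def __repr__(self):
        name = self.__class__.__name__
        # Generator expression: one 'key=(value)' string per instance attribute.
        properties = ('{}=({})'.format(k, v) for k, v in self.__dict__.items())
        return '\n<{} \n {}>'.format(name, '\n '.join(properties))


class Point(Model):  # hypothetical subclass, for illustration only
    def __init__(self):
        self.x = 1
        self.y = 2


p = Point()
text = repr(p)   # the string that print() will show
print(p)         # print() falls back to __repr__ because __str__ is not defined
print([p])       # a list shows its elements through __repr__
```

Each attribute ends up on its own line because '\n ' is the separator handed to join().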
class Movie(Model):
    def __init__(self):
        self.name = ''
        self.score = 0
        self.quote = ''
        self.cover_url = ''
        self.ranking = 0
Defines the attributes (fields) that store the data.
def movie_from_div(div):
    e = pq(div)
    m = Movie()
    m.name = e('.title').text()
    m.score = e('.rating_num').text()
    m.quote = e('.inq').text()
    m.cover_url = e('img').attr('src')
    m.ranking = e('.pic em').text()
    return m
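Before running a CSS selection, the element has to be wrapped with pq(). The pq(page) call in movies_from_url wraps the whole page; the one here only covers the elements inside a single div.
Text is read with the .text() method.
Attributes are read with the .attr() method.
If the target element has no class or id of its own, it can be reached by selecting downward from a parent element, as in '.pic em'.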
def movies_from_url(url):
    r = requests.get(url)
    page = r.content
    e = pq(page)
    items = e('.item')
    movies = [movie_from_div(i) for i in items]
    return movies
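requests.get() downloads the page behind the url, and the content attribute of the response holds the page body (html) as bytes. These two steps download the page.
pq(page) returns an object that supports CSS selector syntax.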
def main():
    url = 'https://movie.douban.com/top250'
    movies = movies_from_url(url)
    print('top250 movies', movies)


if __name__ == '__main__':
    main()
Once the pattern in the url is known, multiple pages can be crawled.
def main():
    # Click "next page" on the site, watch how the url changes, and find the pattern
    for i in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}'.format(i)
        movies = movies_from_url(url)
        print('top250 movies', movies)
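The url pattern can be checked on its own before any request is sent. This is a small sketch; page_urls and BASE_URL are hypothetical names introduced here, and the totals simply mirror the range(0, 250, 25) loop above.

```python
BASE_URL = 'https://movie.douban.com/top250?start={}'


def page_urls(total=250, page_size=25):
    """Return one url per page; `start` is the offset of the first entry on it."""
    return [BASE_URL.format(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))    # number of pages
print(urls[0])      # first page
print(urls[-1])     # last page
```

Building the list first makes it easy to confirm that the loop covers every page exactly once.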
Basic crawler: saving the data to a MongoDB database
import os
import requests
from pyquery import PyQuery as pq
from pymongo import MongoClient
class Model(object):
    db = MongoClient().web16_4_pachong

    def __repr__(self):
        name = self.__class__.__name__
        properties = ('{0} : ({1})'.format(k, v) for k, v in self.__dict__.items())
        s = '\n<{0} \n {1}>'.format(name, '\n '.join(properties))
        return s

    def save(self):
        # The collection is named after the class, e.g. "Movie"
        name = self.__class__.__name__
        # Collection.save() was removed in PyMongo 4, so insert_one() is used instead
        _id = self.db[name].insert_one(self.__dict__).inserted_id
class Movie(Model):
    @classmethod
    def valid_names(cls):
        names = [
            # (field name, type, default value)
            ('name', str, ''),
            ('score', int, 0),
            ('quote', str, ''),
            ('cover_url', str, ''),
            ('ranking', int, 0),
        ]
        return names
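The code in these notes declares valid_names() but never calls it. One possible use, sketched below under that assumption, is to convert the scraped strings to the declared types before saving. The coerce helper and the FIELDS table are hypothetical; FIELDS uses float for score because ratings look like '9.7', which int() cannot parse.

```python
def coerce(raw, fields):
    """Convert scraped strings to the declared types (hypothetical helper).

    `fields` is a list of (name, type, default) tuples, shaped like the
    return value of valid_names(). A missing or unconvertible value falls
    back to the default.
    """
    result = {}
    for name, type_, default in fields:
        value = raw.get(name)
        try:
            result[name] = default if value is None else type_(value)
        except (TypeError, ValueError):
            result[name] = default
    return result


FIELDS = [
    ('name', str, ''),
    ('score', float, 0.0),   # assumption: float instead of int, see the note above
    ('quote', str, ''),
    ('cover_url', str, ''),
    ('ranking', int, 0),
]

scraped = {'name': 'Example Movie', 'score': '9.7', 'ranking': '1'}
clean = coerce(scraped, FIELDS)
print(clean)
```

With the table exactly as declared in valid_names(), a score of '9.7' would fall back to the default 0, since int('9.7') raises ValueError.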
def movie_from_div(div):
    e = pq(div)
    m = Movie()
    m.name = e('.title').text()
    m.score = e('.rating_num').text()
    m.quote = e('.inq').text()
    m.cover_url = e('img').attr('src')
    m.ranking = e('.pic em').text()
    m.save()
    return m
def movies_from_url(url):
    r = requests.get(url)
    page = r.content
    e = pq(page)
    items = e('.item')
    movies = [movie_from_div(i) for i in items]
    return movies
def main():
    url = 'https://movie.douban.com/top250'
    movies = movies_from_url(url)
    print('top250 movies', movies)


if __name__ == '__main__':
    main()