Getting Started with Python Crawlers (1): Scraping Popular CSDN Blog Posts


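I started learning Python because I needed it for my graduation project. Most of it was learning while writing and googling problems as they came up, so parts of this may be poorly written; suggestions are welcome. What I mainly want to share, though, is the process for quickly solving a problem when you run into something unfamiliar.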

1. Purpose

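The goal is to scrape the title, link, author, and summary of articles on the CSDN blog home page (as shown in the figure below) and store them in my own database. For copyright reasons the article body is not scraped; the graduation project simply links to the original post.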

Paste_Image.png

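Next, analyze the CSDN blog home page, http://blog.csdn.net/. With the developer tools open (F12), you can see that the article content is returned directly in the HTML and that there is no separate data API, so the first step is to fetch the page itself. The overall flow is then clear, as follows: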

Paste_Image.png

2. Fetching the page content

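The CSDN blog home page is rendered from a template and has no corresponding data API, so fetching the page directly is enough, and the code is simple.
First create a Common module to hold shared settings, which also makes it easier to add other sites to scrape later.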
Common.py

# coding=utf-8
# Pretend to be a browser: without this header the request fails with 403 because it is not recognized as a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}

csdn_url = 'http://blog.csdn.net/?page='

Next, send the request and fetch the page to be parsed:

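    # x is assumed to be a 0-based page index from an enclosing loop (not shown).
    # Requires: from urllib import request; import Common
    url = Common.csdn_url + str(x + 1)
    print('Requesting URL: ' + url)
    req = request.Request(url, headers=Common.headers)
    result = request.urlopen(req).read().decode('utf-8')
    # result now holds the HTML of the requested CSDN page.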
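
As a minimal sketch of the request step, the snippet below builds the request for one page without sending it, so the URL and header can be checked offline. The helper name build_request is hypothetical and not part of the original code; the URL prefix and User-Agent are copied from Common.py above.

```python
from urllib import request

# Values copied from Common.py above.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
CSDN_URL = 'http://blog.csdn.net/?page='


def build_request(page_index):
    """Hypothetical helper: build (but do not send) the request for a 0-based page index."""
    url = CSDN_URL + str(page_index + 1)  # pages are numbered from 1
    return request.Request(url, headers=HEADERS)


req = build_request(0)
print(req.full_url)
# urllib stores header names in capitalized form, hence 'User-agent'.
print(req.get_header('User-agent'))
```

Passing the returned object to request.urlopen(req).read().decode('utf-8') performs the actual fetch, as in the fragment above.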

3. Parsing the page

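Parsing uses the beautifulSoup4 library (see its official documentation). It is the kind of library where working through the documentation once is enough to use it, so I will not go into detail here; after all, this post is only meant to show a process for dealing with problems.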

3.1 Model

Because the data is eventually saved to a database, we first need a mapping to the database table, that is, a Model layer.


class Article:
    'Article class mapped to the database table'

    def __init__(self, raw):
        self.id = raw.get('id')
        self.title = raw.get('title')
        self.user_id = raw.get('userId')
        self.summary = raw.get('summary')
        self.html_content = raw.get('htmlContent')
        self.content = raw.get('content')
        self.keyword = raw.get('keyword')
        self.viewcount = 0
        self.likecount = 0
        self.catelog_id = 0  # assigned manually during review
        self.is_top = raw.get('isTop')
        self.is_show = raw.get('isShow')
        self.createdate = raw.get('createdate')
        self.modifydate = raw.get('modifydate')
        self.source = raw.get('source')
        self.source_link = raw.get('sourceLink')
        self.source_author = raw.get('sourceAuthor')

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        # Only another Article with the same id counts as equal
        return isinstance(other, Article) and self.id == other.id

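Analysis:
self refers to the instance itself. I think of it as this in a Java class; it gives direct access to the instance's attributes.
def __init__(self, raw) is the equivalent of a Java constructor: calling Article(raw) invokes this init method automatically.
def __hash__(self) and def __eq__(self, other): according to the flow chart, instances of this class are stored in a set so that they are automatically deduplicated by article id, which means we need to know how a Python set deduplicates. It works much like Java: the hash value determines the bucket, and eq decides whether two elements are duplicates.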
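
The deduplication described above can be sketched with a cut-down version of the class. This is only an illustration: the class keeps just id and title, and the ids and titles are made up.

```python
class Article:
    """Cut-down stand-in for the model above: only id and title."""

    def __init__(self, raw):
        self.id = raw.get('id')
        self.title = raw.get('title')

    def __hash__(self):
        # Objects with the same id land in the same bucket
        return hash(self.id)

    def __eq__(self, other):
        # Within a bucket, equality decides whether it is a duplicate
        return isinstance(other, Article) and self.id == other.id


articles = set()
articles.add(Article({'id': '21001', 'title': 'first copy'}))
articles.add(Article({'id': '21001', 'title': 'second copy'}))  # same id: not added
articles.add(Article({'id': '21002', 'title': 'another post'}))

print(len(articles))
print(sorted(a.title for a in articles))
```

Note that a set keeps the element it already holds, so the first object seen for a given id is the one that survives.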

3.2 Parsing the page content

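Parsing generally means inspecting the page content and then walking the DOM tree to pull out what you need. Here the parsed values go into a dictionary (a basic key-value structure), which is then wrapped in an Article and added to the set.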

import datetime
import re

from bs4 import BeautifulSoup

import ArticleModel

# Set of parsed articles; duplicates are dropped via Article.__hash__/__eq__
articleList = set()


def dealwith(result):
    soup = BeautifulSoup(result, 'html.parser')
    blogs = soup.find_all('dl', class_='blog_list clearfix')
    # Articles on the current page
    for blog in blogs:
        article = dict()  # create the dictionary
        article['userId'] = -1  # -1 means the record was created by the bot
        article['isShow'] = 0  # hidden for now, pending review in the admin backend
        article['isTop'] = 0  # not pinned
        dt = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        article['createdate'] = dt  # creation time
        article['modifydate'] = dt  # last-modified time
        article['source'] = 2  # 2 means the article comes from CSDN

        # Parse the author
        nickname = blog.dt.contents[2].string
        article['sourceAuthor'] = nickname

        # Parse the article fields
        content = blog.dd
        for child in content.children:
            if re.search("tracking-ad", str(child)):
                sourceLink = child.a['href']
                # Everything after the last '/' is the CSDN article id
                articleId = sourceLink[sourceLink.rfind('/') + 1:]
                article['id'] = '2' + articleId  # source prefix + CSDN article id
                article['sourceLink'] = sourceLink  # link to the original post
                article['title'] = child.a.string  # original title
                continue
            if re.search("blog_list_c", str(child)):
                article['summary'] = child.string  # original summary
                continue
        # Add to the set
        articleList.add(ArticleModel.Article(article))
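
The id handling inside dealwith() is easy to check in isolation. The sketch below is a hypothetical helper that mirrors that logic; the function name and the sample URL are made up for illustration.

```python
def csdn_article_id(source_link):
    """Hypothetical helper mirroring the id logic in dealwith()."""
    # Everything after the last '/' is treated as CSDN's own article id.
    raw_id = source_link[source_link.rfind('/') + 1:]
    # The '2' prefix matches the source code used for CSDN (article['source'] = 2).
    return '2' + raw_id


link = 'http://blog.csdn.net/some_user/article/details/12345678'  # made-up URL
print(csdn_article_id(link))
```

One limitation of this approach: a link that ends with a trailing slash yields only the prefix '2', so such links would need to be normalized first.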

4. Persisting to the database

Saving to the database uses the pymysql library; for usage, see the Runoob tutorial (菜鳥教程).
Below, contextlib is used to build a context manager, mysqlTemplate(). How it executes is analyzed a little later.
ArticleService.py

import pymysql
import contextlib

host='127.0.0.1'
port=3306
user='root'
passwd='123456'
db='aust'
charset='utf8'

@contextlib.contextmanager
def mysqlTemplate():
    conn = pymysql.connect(host=host, port=port, user=user, passwd=passwd, db=db, charset=charset)
    cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)
    try:
        yield cursor
        # Commit only if the with block finished without an error
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cursor.close()
        conn.close()

Then write a function that performs the save:

# Save to the database; REPLACE INTO overwrites any existing row with the same key
def saveToDB():
    sql = "REPLACE INTO `article` VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    with ArticleService.mysqlTemplate() as cursor:
        # Iterate over the set of parsed articles
        for x in articleList:
            # Insert the row, replacing an existing one if present
            row_count = cursor.execute(sql,
                        (x.id, x.title, x.user_id, x.summary, x.html_content, x.content, x.keyword, x.viewcount, x.likecount,
                         x.catelog_id, x.is_top, x.is_show, x.createdate, x.modifydate, x.source, x.source_link, x.source_author))
            print(row_count)

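Analysis:
First, @contextlib.contextmanager defines a context manager, mysqlTemplate(). The function opens the database connection, does yield cursor, and finally closes the connection. The yield cursor is the point where the main logic runs, which here is the body of the with block in saveToDB().

The yield turns the function into a generator. Entering the with block calls next() once, which runs everything up to the yield; leaving the with block resumes the generator, which runs everything after the yield. The body of the with block executes in between.

So the whole flow is: the database connection is opened, execution reaches the yield and the function is suspended, and control passes to the body of the with block. The as clause of the with statement receives the value yielded, that is, the cursor object. When the with block finishes, the suspended function resumes and releases the database resources.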
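
The order of execution described above can be observed without a database. The following is a minimal sketch in which fake_template is a hypothetical stand-in for mysqlTemplate() and a plain string stands in for the cursor.

```python
import contextlib

events = []


@contextlib.contextmanager
def fake_template():
    events.append('open connection')       # runs when the with block is entered
    try:
        yield 'cursor'                     # this value is bound by "as"
    finally:
        events.append('close connection')  # runs when the with block is left


with fake_template() as cursor:
    events.append('body uses ' + cursor)

print(events)
```

The recorded events come out in the order open, body, close, which matches the flow of mysqlTemplate() and saveToDB().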

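The SQL statement here uses REPLACE INTO. If a row with the same primary key or unique index already exists, MySQL deletes it and inserts the new row, so an article that has already been collected does not produce a duplicate row. This relies on id being the table's primary key (or a unique index). Note that the existing row is overwritten rather than skipped, so values such as viewcount are reset.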
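
To see the overwrite behavior without a MySQL server, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in; SQLite also accepts REPLACE INTO. The three-column table, the ids, and the titles are assumptions for illustration, and SQLite uses ? placeholders where pymysql uses %s.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Simplified table: id is the primary key, as REPLACE INTO requires
conn.execute('CREATE TABLE article (id TEXT PRIMARY KEY, title TEXT, viewcount INTEGER)')

sql = 'REPLACE INTO article VALUES (?, ?, ?)'
conn.execute(sql, ('21001', 'old title', 0))
# Simulate the site updating the view counter after the first crawl
conn.execute('UPDATE article SET viewcount = 42 WHERE id = ?', ('21001',))
# Second crawl of the same article: the old row is replaced, not duplicated
conn.execute(sql, ('21001', 'new title', 0))

rows = conn.execute('SELECT id, title, viewcount FROM article').fetchall()
print(rows)
conn.close()
```

Only one row remains, and its viewcount is back to 0. If counters must survive a re-crawl, a statement that skips or selectively updates existing rows would be a better fit than REPLACE INTO.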

With that, my requirement is met. What remains are the exceptions that come up in practice, and I trust Google can help solve those.

