
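For the past week I have been studying how to use news data for quantitative investing. Before doing any text mining or strategy development, the first job is of course to get the data ready. Terms like "web crawler" and "data scraping" are everywhere by now: the basics are not hard, but doing it well is. If you ignore the cost of scraping, the data is always there; as long as the target site's server does not delete it, a request will eventually get it. It would be a mistake to get overconfident, though. Most servers have some anti-spider mechanism, but in most cases anti-spider measures must not interfere with normal use of the site, which means a site's functional requirements always take priority over its anti-crawler requirements.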

Let's look at how to scrape listed-company news from Sina (新浪网) and NBD (每经网).

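Before scraping, it is convenient to set up a database first. I prefer a non-relational database here and will not go into the pros and cons. I chose MongoDB; if you are used to managing data through a GUI, Robomongo is well worth a look.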

Next, let's look at the page structure of the two sites:

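Single-threaded scraping is certainly slower than multi-threaded scraping, but between coroutines and threads there is no clear-cut winner. Coroutines are lightweight threads, but past a certain number they still cause crashes and errors, for example the "cannot watch more than 1024 sockets" problem. The best fix is to limit the number of concurrent coroutines.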
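As a minimal sketch of capping concurrency, the example below uses the standard library's asyncio with a semaphore rather than gevent, so it runs without third-party packages; with gevent, `gevent.pool.Pool(size)` plays the analogous role. The `fetch` and `crawl` functions, the `stats` dictionary and the limit of 8 are illustrative assumptions, and the `asyncio.sleep` call stands in for a real HTTP request.

```python
import asyncio


async def fetch(page_id, semaphore, stats):
    # The semaphore caps how many fetches are in flight at the same time.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        stats["active"] -= 1
        return page_id


async def crawl(page_ids, limit):
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    tasks = [fetch(page_id, semaphore, stats) for page_id in page_ids]
    results = await asyncio.gather(*tasks)  # results keep the input order
    return results, stats["peak"]


results, peak = asyncio.run(crawl(range(1, 51), limit=8))
print(len(results), "pages fetched, peak concurrency", peak)
```

All 50 coroutines are created up front, but at most 8 hold a connection at once, which is what keeps the number of open sockets bounded.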

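Multiprocessing is not worth considering: it uses a lot of memory and is very slow to start. Sina responds quickly and each page is large in bytes, which means multi-threading has a clear advantage over single-threading in this case. Below is the code for scraping Sina's historical listed-company news: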

# -*- coding: utf-8 -*-
"""
Created on Mon Jan 22 10:01:40 2018

@author: Damon
"""

# Patch the standard library first, before requests is imported,
# so that its sockets cooperate with gevent.
from gevent import monkey, pool
monkey.patch_all()

import time
import re
import requests
import gevent
from concurrent import futures
from bs4 import BeautifulSoup
from pymongo import MongoClient

class WebCrawlFromSina(object):
    def __init__(self, *arg, **kwarg):
        self.totalPages = arg[0]  # total number of listing pages to crawl
        self.Range = arg[1]  # number of listing pages per batch
        self.ThreadsNum = kwarg['ThreadsNum']
        self.dbName = kwarg['dbName']
        self.colName = kwarg['collectionName']
        self.IP = kwarg['IP']
        self.PORT = kwarg['PORT']
        # Minimum share of Chinese characters for a <p> tag to count as body text.
        self.Porb = .5

    def countchn(self, string):
        # Count the characters in U+1100..U+FFFD (CJK text and full-width
        # punctuation) and compute their share of the whole string.
        pattern = re.compile(u'[\u1100-\uFFFD]+?')
        result = pattern.findall(string)
        chnnum = len(result)
        possible = chnnum / len(str(string))
        return (chnnum, possible)

    def getUrlInfo(self, url):  # get body text and key words
        respond = requests.get(url)
        respond.encoding = BeautifulSoup(respond.content, "lxml").original_encoding
        bs = BeautifulSoup(respond.text, "lxml")
        meta_list = bs.find_all('meta')
        span_list = bs.find_all('span')
        part = bs.find_all('p')
        article = ''
        date = ''
        summary = ''
        keyWords = ''
        stockCodeLst = ''
        # Summary and keywords come from the <meta> tags.
        for meta in meta_list:
            if 'name' in meta.attrs and meta['name'] == 'description':
                summary = meta['content']
            elif 'name' in meta.attrs and meta['name'] == 'keywords':
                keyWords = meta['content']
            if summary != '' and keyWords != '':
                break
        # The publication time sits in a <span> identified by class or by id,
        # depending on the page template.
        for span in span_list:
            if 'class' in span.attrs:
                if span['class'] == ['date'] or span['class'] == ['time-source']:
                    string = span.text.split()
                    for dt in string:
                        if dt.find('年') != -1:
                            date += dt.replace('年', '-').replace('月', '-').replace('日', ' ')
                        elif dt.find(':') != -1:
                            date += dt
                    break
            if 'id' in span.attrs and span['id'] == 'pub_date':
                string = span.text.split()
                for dt in string:
                    if dt.find('年') != -1:
                        date += dt.replace('年', '-').replace('月', '-').replace('日', ' ')
                    elif dt.find(':') != -1:
                        date += dt
                break
        # Related stock codes are carried in span ids containing 'stock_';
        # the first 8 characters of the id are dropped to leave the code.
        for span in span_list:
            if 'id' in span.attrs and span['id'].find('stock_') != -1:
                stockCodeLst += span['id'][8:] + ' '
        # Keep only the <p> tags that are mostly Chinese text.
        for paragraph in part:
            chnstatus = self.countchn(str(paragraph))
            possible = chnstatus[1]
            if possible > self.Porb:
                article += str(paragraph)
        # Strip the remaining HTML tags and full-width spaces, then
        # collapse runs of spaces and newlines.
        article = re.sub(r'<[^>]*>', '', article)
        article = article.replace('\u3000', '')
        article = ' '.join(re.split(' +|\n+', article)).strip()
        return summary, keyWords, date, stockCodeLst, article

    def GenPagesLst(self):
        PageLst = []
        k = 1
        while k + self.Range - 1 <= self.totalPages:
            PageLst.append((k, k + self.Range - 1))
            k += self.Range
        # Add the final, shorter batch when totalPages is not a multiple of Range.
        if k <= self.totalPages:
            PageLst.append((k, self.totalPages))
        return PageLst

    def CrawlCompanyNews(self, startPage, endPage):
        self.ConnDB()
        # Addresses already stored, so that a restarted run skips them.
        AddressLst = set(self.extractData(['Address'])[0])
        url_Part_1 = 'http://roll.finance.sina.com.cn/finance/zq1/ssgs/index_'
        url_Part_2 = '.shtml'
        urls = [url_Part_1 + str(pageId) + url_Part_2
                for pageId in range(startPage, endPage + 1)]
        for url in urls:
            print(url)
            resp = requests.get(url)
            resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
            bs = BeautifulSoup(resp.text, "lxml")
            a_list = bs.find_all('a')
            for a in a_list:
                if 'href' in a.attrs and a.string and \
                        a['href'].find('http://finance.sina.com.cn/stock/s/') != -1:
                    if a['href'] in AddressLst:
                        continue
                    summary, keyWords, date, stockCodeLst, article = self.getUrlInfo(a['href'])
                    # If no paragraph passed the filter, lower the threshold step by step.
                    while article == '' and self.Porb >= .1:
                        self.Porb -= .1
                        summary, keyWords, date, stockCodeLst, article = self.getUrlInfo(a['href'])
                    self.Porb = .5
                    if article != '':
                        data = {'Date': date,
                                'Address': a['href'],
                                'Title': str(a.string),
                                'Keywords': keyWords,
                                'Summary': summary,
                                'Article': article,
                                'RelevantStock': stockCodeLst}
                        self._collection.insert_one(data)

    def ConnDB(self):
        Conn = MongoClient(self.IP, self.PORT)
        db = Conn[self.dbName]
        self._collection = db.get_collection(self.colName)

    def extractData(self, tag_list):
        # For each field name in tag_list, collect its distinct stored values.
        data = []
        for tag in tag_list:
            data.append(self._collection.distinct(tag))
        return data

    def single_run(self):
        page_ranges_lst = self.GenPagesLst()
        for ind, page_range in enumerate(page_ranges_lst):
            self.CrawlCompanyNews(page_range[0], page_range[1])

    def coroutine_run(self):
        jobs = []
        page_ranges_lst = self.GenPagesLst()
        for page_range in page_ranges_lst:
            jobs.append(gevent.spawn(self.CrawlCompanyNews, page_range[0], page_range[1]))
        gevent.joinall(jobs)

    def multi_threads_run(self, **kwarg):
        page_ranges_lst = self.GenPagesLst()
        print(' Using ' + str(self.ThreadsNum) + ' threads for collecting news ... ')
        with futures.ThreadPoolExecutor(max_workers=self.ThreadsNum) as executor:
            future_to_url = {executor.submit(self.CrawlCompanyNews, page_range[0], page_range[1]): ind
                             for ind, page_range in enumerate(page_ranges_lst)}

if __name__ == '__main__':
    t1 = time.time()
    WebCrawl_Obj = WebCrawlFromSina(5000, 100, ThreadsNum=4, IP="localhost", PORT=27017,
                                    dbName="Sina_Stock", collectionName="sina_news_company")
    WebCrawl_Obj.coroutine_run()  # Obj.single_run()  # Obj.multi_threads_run()
    t2 = time.time()
    print(' running time:', t2 - t1)
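Two small pieces of the crawler can be tried in isolation: splitting the page numbers into batches, and measuring how much of a tag is Chinese text. The sketch below is a standalone approximation, not the class's own code. The names `page_ranges` and `chinese_ratio` are hypothetical, and the character class is narrowed to the CJK Unified Ideographs block (U+4E00 to U+9FFF) as an assumption, whereas the listing uses a wider range.

```python
import re


def page_ranges(total_pages, step):
    """Split 1..total_pages into inclusive (start, end) batches of at most `step` pages."""
    return [(start, min(start + step - 1, total_pages))
            for start in range(1, total_pages + 1, step)]


# Assumption: only CJK Unified Ideographs count as "Chinese" here.
CJK = re.compile(r'[\u4e00-\u9fff]')


def chinese_ratio(text):
    """Fraction of the characters in `text` that are CJK ideographs."""
    if not text:
        return 0.0
    return len(CJK.findall(text)) / len(text)


print(page_ranges(250, 100))
print(chinese_ratio('<p>新浪财经讯</p>'))
```

Because the ratio is computed over the whole tag string, markup included, a short paragraph wrapped in long attributes can fall below the threshold even when its visible text is entirely Chinese.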

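countchn computes the share of Chinese characters in a tag; tags whose share exceeds self.Porb (set to 0.5 here) are treated as body text (Article). getUrlInfo fetches a page's information: the time, keywords, summary, body text and related stock codes (the address and title are taken from the listing page). GenPagesLst builds a list of tuples, each of the form (index of the first page to scrape, index of the last page to scrape), which makes it easy to split the work across threads. single_run runs in a single thread, coroutine_run scrapes with gevent coroutines, and multi_threads_run scrapes with the futures library.

Scraping often stops because the remote server drops the connection or does not respond for a long time, and when restarting the program you do not want to blindly re-scrape data you already have. So at startup, first load the Address (or Date) values already in the database; then, before inserting newly scraped data, check whether its Address or Date is already present and decide whether to insert it. A sample run is shown in the figure below: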
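The restart logic can be sketched without a database. In the example below a plain list stands in for the MongoDB collection and a lambda stands in for the page fetch; `crawl_new` and the example.com URLs are hypothetical. With pymongo, the known addresses would come from `collection.distinct('Address')`, as in the listing.

```python
def crawl_new(candidates, known_addresses, fetch, store):
    """Fetch and store only the URLs that are not already in `known_addresses`."""
    seen = set(known_addresses)  # set membership checks are fast
    inserted = []
    for url in candidates:
        if url in seen:
            continue  # already stored by an earlier run
        store.append({'Address': url, 'Article': fetch(url)})
        seen.add(url)  # also guards against duplicates within this run
        inserted.append(url)
    return inserted


# Hypothetical data standing in for the collection and one listing page.
store = [{'Address': 'http://example.com/a', 'Article': 'old'}]
known = [doc['Address'] for doc in store]
links = ['http://example.com/a', 'http://example.com/b', 'http://example.com/b']
added = crawl_new(links, known, fetch=lambda url: 'body of ' + url, store=store)
print(added)
```

Checking the address before fetching, rather than before inserting, also saves the request itself, which matters when the remote server is rate limiting.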

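Scraping NBD ran into a small hiccup: re-scraping pages produced all sorts of dropped connections. Even when the connection was fine, much of the scraped data had only a title, with no body text or time. At first I thought my own code was failing to extract them; only later did I realize I was being blocked by anti-crawler measures.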

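So the code has to record the successfully scraped URLs and the failed ones separately, and then keep retrying the failures until they succeed. Hitting the same link over and over in a loop can easily get your IP banned by the remote server, and if that happens you will have to switch IPs. So after every so many iterations it is best to sleep for a while; if you do not mind the extra effort, sleep for a random interval (which looks more human) before continuing. Here I simply sleep for 1 second and carry on.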
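A retry loop with a random pause might look like the sketch below. `fetch_with_retry`, its parameters and their default values are illustrative assumptions rather than the author's code; the fetch function and the sleep function are passed in so that the loop can be exercised without touching the network.

```python
import random
import time


def fetch_with_retry(url, fetch, max_attempts=5, pause_every=2,
                     min_pause=1.0, max_pause=3.0, sleep=time.sleep):
    """Call fetch(url) until it returns non-empty text or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        text = fetch(url)
        if text:
            return text, attempt
        # Pause for a random interval after every `pause_every` failed attempts.
        if attempt < max_attempts and attempt % pause_every == 0:
            sleep(random.uniform(min_pause, max_pause))
    return '', max_attempts


# Simulated server: three empty responses, then the article.
responses = iter(['', '', '', 'article text'])
pauses = []
text, attempts = fetch_with_retry('http://example.com/news',
                                  lambda url: next(responses),
                                  sleep=pauses.append)
print(text, attempts, len(pauses))
```

Capping the number of attempts keeps a permanently blocked URL from looping forever; such URLs can be logged and revisited later.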

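First, CrawlCompanyNews is called from multiple threads and returns url_lst_withoutNews, the list of URLs for which no content was retrieved; that list is then passed to ReCrawlNews, which re-scrapes the URLs one by one in a single thread. The final result looks like the figure below: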
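The two-phase flow can be sketched with the standard library alone. The function `crawl_then_retry`, the simulated `fake_fetch` and the example.com URLs are hypothetical stand-ins for CrawlCompanyNews, ReCrawlNews and the real pages, whose code is not shown here.

```python
import threading
from concurrent import futures


def crawl_then_retry(urls, fetch, workers=4, max_retries=3):
    """Phase 1: fetch all URLs in a thread pool. Phase 2: retry failures one by one."""
    results, failed = {}, []
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the same order as `urls`.
        for url, text in zip(urls, executor.map(fetch, urls)):
            if text:
                results[url] = text
            else:
                failed.append(url)
    still_failed = []
    for url in failed:  # single-threaded second pass
        for _ in range(max_retries):
            text = fetch(url)
            if text:
                results[url] = text
                break
        else:
            still_failed.append(url)
    return results, still_failed


calls = {}
lock = threading.Lock()


def fake_fetch(url):
    # Simulated site: '/flaky' fails on its first request, '/dead' always fails.
    with lock:
        calls[url] = calls.get(url, 0) + 1
        n = calls[url]
    if url.endswith('/dead') or (url.endswith('/flaky') and n == 1):
        return ''
    return 'news from ' + url


urls = ['http://example.com/ok', 'http://example.com/flaky', 'http://example.com/dead']
results, still_failed = crawl_then_retry(urls, fake_fetch)
print(sorted(results), still_failed)
```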


Scraping Historical News Data for Listed Companies
