
I wrote two crawlers,
though in essence they work together as a single one.

Crawler 1: collects the links to all phone-number listings and stores them in the database.
There are 116 pages in total.

import pymongo
import requests
from bs4 import BeautifulSoup

client=pymongo.MongoClient('localhost',27017)
walden=client['walden']
phone_num_link=walden['phone_num_link']


def get_info_frompage(startpage,endpage):
    infos=[]
    for i in range(startpage,endpage+1):
        # e.g. http://bj.58.com/shoujihao/pn1/
        oriurl="http://bj.58.com/shoujihao/pn"+str(i)+"/"
        wb_data_ori = requests.get(oriurl)
        soup_ori = BeautifulSoup(wb_data_ori.text,'lxml')
        titles=soup_ori.select("strong.number")
        links=soup_ori.select("#infolist > div > ul > div.boxlist > ul > li > a.t")
        j=0  # number of links saved from this page
        for title,link in zip(titles,links) :
            href=link.get('href')
            # keep only links that do not contain "jump.zhineng"
            if href and "jump.zhineng" not in href:
                data={
                    "title":title.text,
                    "link":href
                }
                phone_num_link.insert_one(data)
                infos.append(data)
                j+=1
        print(j)

    return infos

get_info_frompage(1,116)
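
The two pure steps of crawler 1, building the page URLs and dropping links that contain "jump.zhineng", can be pulled out and checked without touching the network or MongoDB. This is only a minimal sketch: the helper names `page_urls` and `keep_link` are hypothetical and not part of the original script, and the second sample link is made up for illustration.

```python
BASE_URL = "http://bj.58.com/shoujihao/pn{}/"


def page_urls(startpage, endpage):
    """Return the list-page URLs for startpage..endpage (inclusive)."""
    return [BASE_URL.format(i) for i in range(startpage, endpage + 1)]


def keep_link(href):
    """True for links worth storing: present and without "jump.zhineng"."""
    return bool(href) and "jump.zhineng" not in href


urls = page_urls(1, 116)
print(len(urls), urls[0], urls[-1])

# Sample hrefs for illustration only; the second one is invented
samples = [
    "http://bj.58.com/shoujihao/26120773122616x.shtml",
    "http://jump.zhineng.58.com/jump?target=example",
    None,
]
print([keep_link(h) for h in samples])
```

Keeping these steps as small functions makes it easy to test the filter on a handful of links before running all 116 pages.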

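Crawler 2: reads the links back from the database, crawls the phone-number details behind each link,
and stores those details in the database.
I added a check for 404 pages, although I do not seem to have run into any.
There are 3480 records in total.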
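
One detail matters for a substring-based 404 check: `str.find` returns the index of the match, so a hit at the very start of the string gives 0, and only -1 means "not found". Below is a minimal sketch of the check as a pure function. `looks_like_404` is a hypothetical helper name, and it assumes, as the crawler does, that a 404 page can be recognised by the `src` of its first `<head>` script.

```python
MARKER = "http://j1.58cdn.com.cn/js/404/topbar404.js"


def looks_like_404(first_script_src):
    """True if the first <head> script src points at the 404 top bar script."""
    # src can be None for inline scripts, so guard before searching
    return first_script_src is not None and MARKER in first_script_src


# str.find gives the match index: 0 for a match at the start, -1 for no match
print(MARKER.find(MARKER))
print("other.js".find(MARKER))
print(looks_like_404(MARKER), looks_like_404("other.js"), looks_like_404(None))
```

Using the `in` operator avoids having to remember which comparison against the `find` result is the right one.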


import pymongo
import requests
from bs4 import BeautifulSoup

client=pymongo.MongoClient('localhost',27017)
walden=client['walden']
phone_num_link=walden['phone_num_link']
num_info=walden['num_info']


def get_one_info(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text,'lxml')
    scripts=soup.select("head > script")
    first_src=scripts[0].get('src') if scripts else None
    # a 404 page loads topbar404.js as its first <head> script
    if first_src and "http://j1.58cdn.com.cn/js/404/topbar404.js" in first_src:
        return None
    else:
        title=soup.select("#main > div.col.detailPrimary.mb15 > div.col_sub.mainTitle > h1")[0].text
        price=soup.select("#main > div.col.detailPrimary.mb15 > div.col_sub.sumary > ul > li:nth-of-type(1) > div.su_con > span")[0].text
        host=soup.select("#divContacter1 > ul > ul > li > a")[0].text
        # number_address=soup.select("span.c_999.f12")[0].text if soup.find_all('span','c_999') else None

        data={
            'title':title.strip()[0:11],  # the first 11 characters are the phone number
            'price':price.strip(),
            'host':host.strip(),
            # 'number_address':number_address.replace(" ","").replace("\t","").replace("\n","")[:-10][7:],# the three replace() calls remove spaces, tabs and newlines; the two slices then drop the last 10 and the first 7 characters

        }
        # print(data)
        return data

# get_one_info("http://bj.58.com/shoujihao/26120773122616x.shtml")
for item in phone_num_link.find():
    info=get_one_info(item['link'])
    # print(info)
    if info:  # skip pages that were detected as 404
        num_info.insert_one(info)
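
The commented-out `number_address` line removes spaces, tabs and newlines with three chained `replace()` calls. A minimal sketch of a shorter helper that does the same for these characters follows. `strip_whitespace` is a hypothetical name, and the sample string is made up rather than taken from a real page. Note that `split()` without arguments also drops other whitespace such as carriage returns, which the three `replace()` calls would leave in place.

```python
def strip_whitespace(text):
    """Remove all whitespace (spaces, tabs, newlines, ...) from text."""
    return "".join(text.split())


# Made-up sample, for illustration only
sample = " \t foo bar \n baz \n"
chained = sample.replace(" ", "").replace("\t", "").replace("\n", "")

print(strip_whitespace(sample))
print(strip_whitespace(sample) == chained)
```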

The results are shown below.

Paste_Image.png


Python hands-on assignment 2_2
