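Requirements
To begin with, province names, province codes, and their latitude/longitude had to be imported into the database to supply data for the APIs built later.
Page to scrape for the coordinates: (it was simply the first one I found)
Approach
First fetch the page with WebDriver, then inspect its structure and parse the table that is needed, and finally save the scraped data to Excel and import it into the database.
Preparation: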
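- Install Selenium WebDriver
pip install selenium
Selenium WebDriver provides programming interfaces in a range of languages for automating web browsers.
Once it is installed, start the Python interpreter and run import selenium; if no exception is raised, the installation succeeded.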
- Download the browser driver
The WebDriver for Chrome (chromedriver.exe) can be downloaded from:
http://npm.taobao.org/mirrors/chromedriver/
The WebDriver for Firefox (geckodriver.exe) is available here:
https://github.com/mozilla/geckodriver/releases
Drivers for other browsers are listed below:
Edge:https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver
Safari:https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Download the version that matches your installed browser.

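- Install BeautifulSoup
BeautifulSoup4 is an HTML/XML parser whose main job is parsing HTML/XML and extracting data from it, much like the lxml library.
BeautifulSoup4 makes HTML parsing fairly simple: its API is very user-friendly, it supports CSS selectors, and it can use either the HTML parser in the Python standard library or the lxml parser. The code below passes 'lxml', which requires lxml to be installed as well (pip install lxml).
pip install beautifulsoup4
- Install openpyxl
openpyxl is a Python module for working with Excel spreadsheets.
Installation:
pip install openpyxl is all it takes.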
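Before moving on, the import check described for selenium can be applied to all three packages at once. The following is a minimal sketch using only the standard library; the helper names is_installed and missing_modules are made up for this illustration. Note that the names checked are import names, so beautifulsoup4 appears as bs4.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

# Import names, not pip package names: beautifulsoup4 is imported as bs4
REQUIRED = ["selenium", "bs4", "openpyxl"]

def missing_modules(names=REQUIRED):
    """Return the subset of names that cannot be imported."""
    return [name for name in names if not is_installed(name)]

if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all required modules are installed")
```

Anything reported as missing can be installed with the corresponding pip command above.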
Implementation steps
- First import the required modules
from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl.workbook import Workbook
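- Point to the Chrome driver and maximize the window
# This is the path to my driver; change it to yours. The r prefix keeps the backslashes literal.
# Note: Selenium 4 takes the path through a Service object, i.e. webdriver.Chrome(service=Service(path)).
driver = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver.maximize_window()
- Open the target URL with the get method
driver.get(
    "https://blog.csdn.net/abcmaopao/article/details/79554904")
html = driver.execute_script("return document.documentElement.outerHTML")
- Parse with BeautifulSoup
Parse the HTML with the BeautifulSoup class to get a BeautifulSoup object, which can then be output in a normalized format.
soup = BeautifulSoup(html, 'lxml')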
- Build the city-level Excel sheet
if soup:
    # Create a workbook, take the active worksheet and give it a name
    wb = Workbook()
    ws = wb.active
    ws.title = '省份經緯度'
    # rows holds the scraped data
    rows = []
    # Walk every row of the table, turn the cells of each row into a list,
    # then append that list to rows
    for tr in soup.find_all('tr'):
        col = []
        for td in tr.find_all('td'):
            col.append(td.get_text())
        rows.append(col)
    print(rows)
    # Inspect the output, then write it into the Excel sheet
    i = 0
    for r in rows:
        if i == 0:
            # The first row is the header
            j = 0
            for c in r:
                ws.cell(row=i + 1, column=j + 1).value = c
                print(i, j, c)
                j += 1
            i += 1
        elif i > 0 and int(r[1]) % 10000 != 0:
            # Keep only rows whose code (second column) is not a province code
            j = 0
            for c in r:
                ws.cell(row=i + 1, column=j + 1).value = c
                print(i, j, c)
                j += 1
            i += 1
    # Save
    wb.save('市級.xlsx')
    print("Saved successfully!")
driver.quit()
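The row-collection step can be tried without launching a browser. The sketch below is not the BeautifulSoup code from the steps above: it swaps in the standard library's html.parser so that it runs with no third-party packages, and it feeds in a made-up sample table rather than the real page. The class name TableRows, the function parse_rows, and the sample values are all assumptions for illustration. Unlike td.get_text(), it also strips surrounding whitespace from each cell.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, one list of cell strings each
        self._row = None    # cells of the <tr> currently being read
        self._cell = None   # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def parse_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows

# Made-up sample data standing in for the scraped page
SAMPLE = """
<table>
  <tr><td>name</td><td>code</td><td>lng</td><td>lat</td></tr>
  <tr><td>Guangdong</td><td>440000</td><td>113.27</td><td>23.13</td></tr>
  <tr><td>Guangzhou</td><td>440100</td><td>113.26</td><td>23.13</td></tr>
</table>
"""
rows = parse_rows(SAMPLE)
```

The resulting rows value has the same list-of-lists shape that the loop over soup.find_all('tr') builds, so the filtering and Excel-writing logic can be exercised against it.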

Structure of the saved sheet
Complete code
from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl.workbook import Workbook
# Change the driver path to yours; the r prefix keeps the backslashes literal
driver = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver.maximize_window()
driver.get(
    "https://blog.csdn.net/abcmaopao/article/details/79554904")
html = driver.execute_script("return document.documentElement.outerHTML")
soup = BeautifulSoup(html, 'lxml')
if soup:
    wb = Workbook()
    ws = wb.active
    ws.title = '省份經緯度'
    # Collect every table row as a list of cell texts
    rows = []
    for tr in soup.find_all('tr'):
        col = []
        for td in tr.find_all('td'):
            col.append(td.get_text())
        rows.append(col)
    print(rows)
    i = 0
    for r in rows:
        if i == 0:
            # The first row is the header
            j = 0
            for c in r:
                ws.cell(row=i + 1, column=j + 1).value = c
                print(i, j, c)
                j += 1
            i += 1
        elif i > 0 and int(r[1]) % 10000 != 0:
            # Keep only rows whose code (second column) is not a province code
            j = 0
            for c in r:
                ws.cell(row=i + 1, column=j + 1).value = c
                print(i, j, c)
                j += 1
            i += 1
    wb.save('市級.xlsx')
    print("Saved successfully!")
driver.quit()
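Extracting DataV.GeoAtlas JSON map data for every province and city in China
The next step is to download the data behind the [national map JSON API](http://datav.aliyun.com/tools/atlas/#&lat=30.316551722910077&lng=104.20306438764393&zoom=3.5) to local files, but this time at the province level.
Using the same approach, scrape the province-level administrative codes.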

In other words, change the elif condition from int(r[1]) % 10000 != 0 to int(r[1]) % 10000 == 0, so that only the province rows are kept.
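The modulo test works because of how six-digit administrative codes are laid out: the first two digits identify the province, the middle two the city, and the last two the county. A minimal sketch of that classification follows, assuming every code in the table follows this layout; the names code_level and filter_rows are hypothetical helpers, not part of the original script. One difference worth noting is that the original != 0 filter keeps city and county rows together, while this sketch tells them apart.

```python
def code_level(code):
    """Classify a six-digit administrative code by its trailing zeros."""
    n = int(code)
    if n % 10000 == 0:
        return "province"   # e.g. 440000
    if n % 100 == 0:
        return "city"       # e.g. 440100
    return "county"         # e.g. 440103

def filter_rows(rows, level):
    """Keep the header (first row) plus every data row at the given level.

    The code is assumed to sit in the second column, as in the script above.
    """
    header, data = rows[0], rows[1:]
    return [header] + [r for r in data if code_level(r[1]) == level]
```

Under this scheme the nationwide code 100000 is also a multiple of 10000, so it would be classed as a province and may need to be handled separately.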
After that, read each province's code from the sheet, then fetch and save the JSON for each one dynamically.
Complete code
from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl import load_workbook
def getJson(code):
    # Fetch the GeoJSON for one administrative code and save it as <code>_full.json
    path = "https://geo.datav.aliyun.com/areas/bound/geojson?code="
    driver.get(path + code + '_full')
    html = driver.execute_script("return document.documentElement.outerHTML")
    soup = BeautifulSoup(html, 'lxml')
    text = soup.get_text()
    print(text)
    if text:
        with open(code + '_full.json', 'w', encoding='utf-8') as f:
            f.write(text)
        print("Saved successfully " + code)
# Change the driver path to yours; the r prefix keeps the backslashes literal
driver = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver.maximize_window()
# Fetch the nationwide map first
code = "100000"
getJson(code)
# Open the worksheet that holds the province codes
ws = load_workbook('test1.xlsx')["省份經緯度"]
i = 0
for row in ws.rows:
    # Skip the header row; the code is in the second column
    if i > 0:
        code = str(row[1].value)
        print(code)
        getJson(code)
    i += 1
driver.quit()
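One weakness of saving whatever text the page returns is that an error page would be written to disk as if it were map data. A possible safeguard, sketched here with the standard library only, is to confirm that the text parses as JSON before writing it. The names build_url and save_if_json and the out_dir parameter are assumptions made for this illustration; the URL prefix and file naming mirror getJson above.

```python
import json
import os

BASE = "https://geo.datav.aliyun.com/areas/bound/geojson?code="

def build_url(code):
    """Return the request URL for one code, built the same way as in getJson."""
    return BASE + str(code) + "_full"

def save_if_json(code, text, out_dir="."):
    """Write text to <code>_full.json only if it parses as JSON.

    Returns the file path, or None when the text is not valid JSON
    (for example an error page returned instead of map data).
    """
    try:
        json.loads(text)
    except (TypeError, ValueError):
        return None
    path = os.path.join(out_dir, f"{code}_full.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

Inside getJson, the text taken from soup.get_text() could be passed to save_if_json in place of the direct write, and a None return logged so that failed codes can be retried later.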
