Prerequisites
- Pandas basics: pd.read_csv
- Regular expression basics
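Automated reporting places higher demands on data quality, yet errors and omissions are entirely normal in real data. Rather than only fixing bugs after a problem shows up, we should set rules that are as strict as possible from the very beginning, and report and handle unexpected cases.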
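Reading a CSV file
CSV files are one of our most common data sources, so we use a CSV file as the example here.
First, we can take a look at the contents of the data we want to read.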
import pandas as pd
import numpy as np
# The header turns out to be on row 8 of the file, so set header=7 (0-based)
data = pd.read_csv('data.csv', header=7, index_col=0)
data.head()
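To find out which row holds the header before calling pd.read_csv, one option is to look at the first few raw lines of the file. The following is a minimal sketch of that step; the helper names preview_lines and find_header_row are hypothetical and not part of the original code.

```python
from itertools import islice


def preview_lines(path, n=10, encoding="utf-8"):
    """Return the first n raw lines of a text file without parsing it."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def find_header_row(lines, expected_columns):
    """Return the 0-based index of the first line that contains every
    expected column name, or None if no such line exists."""
    for i, line in enumerate(lines):
        if all(col in line for col in expected_columns):
            return i
    return None
```

The index returned by find_header_row can be passed directly as the header argument of pd.read_csv, since both are 0-based.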

Specifying formats for the target columns
data.dtypes
Product Name object
Brand object
Price object
Category object
Rank object
Sales object
Revenue object
Reviews int64
Rating object
Seller object
LQS object
ASIN object
Link object
dtype: object
We can see that the columns Price, Rank, Sales, Revenue, Reviews, Rating, and LQS should all be numeric, but only the Reviews column was read as numeric by default.
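Spotting such columns by eye does not scale to wider files. As a rough sketch, the check can be automated by testing which non-numeric columns mostly parse as numbers once thousands separators are removed. The function name numeric_looking_columns and the 0.8 threshold are assumptions made for illustration, and values carrying currency symbols or units would not be detected by this simple test.

```python
import pandas as pd


def numeric_looking_columns(df, threshold=0.8):
    """Return the names of non-numeric columns whose values mostly parse
    as numbers after commas are removed."""
    result = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # already numeric, nothing to report
        cleaned = df[col].astype(str).str.replace(",", "", regex=False)
        parsed = pd.to_numeric(cleaned, errors="coerce")
        # Share of values that could be converted to a number
        if parsed.notna().mean() >= threshold:
            result.append(col)
    return result
```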
Specifying formats with dtype
# Expected type of every column, including the index column '#'
dtype = {'#': int,
         'Product Name': str,
         'Brand': str,
         'Price': float,
         'Category': str,
         'Rank': int,
         'Sales': int,
         'Revenue': int,
         'Reviews': int,
         'Rating': float,
         'Seller': str,
         'LQS': int,
         'ASIN': str,
         'Link': str
         }
try:
    data = pd.read_csv('data.csv', dtype=dtype, header=7, index_col=0)
# Catch Exception rather than BaseException so Ctrl+C is not swallowed
except Exception as e:
    print(e)
invalid literal for int() with base 10: '1,067'
As we can see, dtype cannot simply skip non-numeric characters during conversion, so we need a stronger way to specify the format.
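For the specific failure above, where the only problem is a thousands separator such as '1,067', pd.read_csv also has a thousands parameter. The sketch below uses a small made-up in-memory CSV, not the original data.csv, to show the difference. It does not help with values that carry currency symbols or other text, which is why a more general approach is still needed.

```python
import io

import pandas as pd

# Made-up sample data for illustration only
csv_text = (
    'Product Name,Sales,Reviews\n'
    '"Widget","1,067",12\n'
    '"Gadget","250",3\n'
)

# Without thousands=',' the quoted "1,067" stays a string,
# so the Sales column is not numeric.
plain = pd.read_csv(io.StringIO(csv_text))

# With thousands=',' pandas removes the separator before parsing the number.
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(plain["Sales"].dtype)
print(parsed["Sales"].dtype)
```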
Converting formats with converters
import re
# Extract the first number from a string with a regular expression
def str2num(string):
    if not isinstance(string, str):
        string = str(string)
    # Drop thousands separators so '1,067' becomes '1067'
    string = string.replace(',', '')
    # Raw string avoids invalid-escape warnings for \d and \.
    regular_expression = r'\d+\.?\d*'
    pattern = re.compile(regular_expression)
    match = pattern.search(string)
    if match:
        return float(match.group())
    else:
        return float('nan')
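# Apply str2num to every column that should be numeric
converters = {'Price': str2num,
              'Rank': str2num,
              'Rating': str2num,
              'Sales': str2num,
              'Revenue': str2num,
              'Reviews': str2num
              }
try:
    data = pd.read_csv('data.csv', converters=converters, header=7, index_col=0)
# Catch Exception rather than BaseException so Ctrl+C is not swallowed
except Exception as e:
    print(e)
data.head()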
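Because str2num returns NaN instead of raising an error, values that could not be parsed are easy to miss. In keeping with the goal of reporting unexpected cases, a small check after reading can count them per column. This is only a sketch on made-up in-memory data; the compact str2num here mirrors the one above so that the block runs on its own.

```python
import io
import re

import pandas as pd

NUMBER = re.compile(r"\d+\.?\d*")


def str2num(value):
    """Strip commas and return the first number found, else NaN."""
    match = NUMBER.search(str(value).replace(",", ""))
    return float(match.group()) if match else float("nan")


# Made-up sample data for illustration only
csv_text = (
    'Product Name,Price,Sales\n'
    'Widget,$9.99,"1,067"\n'
    'Gadget,N/A,250\n'
)
converters = {"Price": str2num, "Sales": str2num}
data = pd.read_csv(io.StringIO(csv_text), converters=converters)

# Count the values per converted column that could not be parsed
unparsed = data[list(converters)].isna().sum()
print(unparsed[unparsed > 0])
```

In a reporting pipeline, a non-empty result could be logged or turned into a warning so that bad input is noticed rather than silently becoming NaN.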

Decouple the different data-processing functions: put str2num in the tools module and the data reading in the datapipeline module.
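One possible shape for that split is sketched below. Both modules are shown in a single block so that it runs as is; the function name load_data and the NUMERIC_COLUMNS list are assumptions for illustration, and in a real layout the second part would live in its own file and import str2num from tools.

```python
# --- tools.py (hypothetical layout): generic helpers, no pandas needed ---
import re

_NUMBER = re.compile(r"\d+\.?\d*")


def str2num(value):
    """Strip commas and return the first number found, else NaN."""
    match = _NUMBER.search(str(value).replace(",", ""))
    return float(match.group()) if match else float("nan")


# --- datapipeline.py (hypothetical layout): data reading ---
# In a separate file this part would start with: from tools import str2num
import pandas as pd

NUMERIC_COLUMNS = ["Price", "Rank", "Rating", "Sales", "Revenue", "Reviews"]


def load_data(source, header=7):
    """Read the CSV with the same options as the example above.

    source may be a file path or a file-like object.
    """
    converters = {col: str2num for col in NUMERIC_COLUMNS}
    return pd.read_csv(source, converters=converters,
                       header=header, index_col=0)
```

With this layout the report code only needs to call load_data('data.csv'), and str2num can be tested on its own without any CSV file.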