Pandas 學(xué)習(xí)
-
基本命令
- 導(dǎo)包
import pandas
- 數(shù)據(jù)的導(dǎo)入文件
pd.read_csv(filename) # 導(dǎo)入csv格式文件中的數(shù)據(jù)
pd.read_table(filename) # 導(dǎo)入有分隔符的文本 (如TSV) 中的數(shù)據(jù)
pd.read_excel(filename) # 導(dǎo)入Excel格式文件中的數(shù)據(jù)
pd.read_sql(query, connection_object) # 導(dǎo)入SQL數(shù)據(jù)表/數(shù)據(jù)庫中的數(shù)據(jù)
pd.read_json(json_string) # 導(dǎo)入JSON格式的字符,URL地址或者文件中的數(shù)據(jù)
pd.read_html(url) # 導(dǎo)入經(jīng)過解析的URL地址中包含的數(shù)據(jù)框 (DataFrame) 數(shù)據(jù)
pd.read_clipboard() # 導(dǎo)入系統(tǒng)粘貼板里面的數(shù)據(jù)
pd.DataFrame(dict) # 導(dǎo)入Python字典 (dict) 里面的數(shù)據(jù),其中key是數(shù)據(jù)框的表頭,value是數(shù)據(jù)框的內(nèi)容。
-
數(shù)據(jù)的檢查和查看
查看數(shù)據(jù)框的前幾行
chipo.head(n) # n為查看的大小,默認(rèn)為5查看數(shù)據(jù)框的后幾行
chipo.tail(n)查看數(shù)據(jù)的索引,數(shù)據(jù)類型及內(nèi)存信息
chipo.info()查看數(shù)據(jù)的行列大小
chipo.shape[1] # shape返回的是一個(gè)元祖,第0位為行大小,第1位為列大小查看所有列名 返回值為Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')chipo.columns查看索引 返回值為RangeIndex(start=0, stop=4622, step=1)
chipo.index查詢每個(gè)獨(dú)特?cái)?shù)據(jù)值出現(xiàn)次數(shù)統(tǒng)計(jì)
chipo.item_name.value_counts() # 每個(gè)商品出現(xiàn)的次數(shù) -
數(shù)據(jù)的分類
- groupby
drinks.groupby('continent').beer_servings.mean() # 將酒銷售量按照大洲分類,求每個(gè)大洲銷售的平均值 -
合并
- concat()
all_data_col = pd.concat([data1, data2], axis = 1) # axis = 1 則是按照列的維度合并,默認(rèn)按照行的維度合并- merge()
pd.merge(all_data, data3, on='subject_id')常用參數(shù)參考自官網(wǎng)
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
left: use only keys from left frame, similar to a SQL left outer join; preserve key order
right: use only keys from right frame, similar to a SQL right outer join; preserve key order
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
on : (label or list)
Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
left_on : (label or list, or array-like)
Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
right_on : (label or list, or array-like)
Field names to join on in right DataFrame or vector/list of vectors per left_on docs
left_index : (boolean, default False)
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
right_index : (boolean, default False)
Use the index from the right DataFrame as the join key. Same caveats as left_index
sort : (boolean, default False)
Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword) -
Apply 聚合函數(shù)
apply 是 pandas 庫的一個(gè)很重要的函數(shù),多和 groupby 函數(shù)一起用,也可以直接用于 DataFrame 和 Series 對(duì)象。主要用于數(shù)據(jù)聚合運(yùn)算,可以很方便的對(duì)分組進(jìn)行現(xiàn)有的運(yùn)算和自定義的運(yùn)算。
def fix_century(x): year = x.year - 100 if x.year > 1989 else x.year return datetime.date(year, x.month, x.day) data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century) # 數(shù)據(jù)中Yr_Mo_Dy 按照fix_century進(jìn)行替換 -
resample 重新采樣函數(shù)
Pandas中的resample,重新采樣,是對(duì)原樣本重新處理的一個(gè)方法,是一個(gè)對(duì)常規(guī)時(shí)間序列數(shù)據(jù)重新采樣和頻率轉(zhuǎn)換的便捷的方法。
常用參數(shù)rule : 偏移量(隔多長時(shí)間取一次值)
例如:'30s','3T','BM'