1.loc和iloc:
df = pd.Dataframe(numpy.arange(16).reshape((4,4)),
??index=['first','second','third','fourth'],columns=['a','b','c','d'])

取第一、第三行:
- loc取index的具體值,可以是df.loc[0,2] (未設置index時)也可以是df.loc['test1','third'];
- iloc為行號(從0開始)而非index值,只接受integer,df.iloc[0,2]
切片寫法取前幾行:
loc[0:2]、loc[:2]、loc['first':'third']都是取前三行,iloc[0:2]、iloc[:2]取前兩行
取第二列:df.loc[:,['b']],df.iloc[:,[1]]
取前三行、前兩列:df.loc['first':'third',['a','b']],df.iloc[:3,:2]
篩選a列大于4的行的c列值:df.loc[df['a']>4,'c']

dataframe取差集:A-B
先把B表append到A,再用去重函數(shù)刪除(選擇不保留模式),得到的結果就是兩表的差集A-B:
dfC=dfA.append(dfB).drop_duplicates(keep=False)df拼接:
pd.cnocat([df1,df2],axis=0),0-從下方縱向增加記錄;1-從右側增加字段datacompy的abs_tol和rel_tol
datacompy.Compare(..., ans_tol=0.5)表示abs(a-b)小于0.5;
datacompy.Compare(..., rel_tol=1e-10)表示abs(a / b - 1)小于10^-10groupby
- 分組并計算
df2 = df1.groupby(key, as_index=False).agg({'test1': ['mean'], 'test2': ['sum']})
注意:數(shù)據(jù)列內如果存在不同類型數(shù)據(jù)無法sum,該列會被丟棄,因此groupby之前最好fillna。 - 分組并count另一列distinct數(shù)
df2 = df1.groupby(['test1', 'test2'],as_index=False)['test3'].nunique()
6.條件判斷
- 篩選某些列值后,設置指定列值:
df.loc[(df['test1'] < a) & (df['test2'] == 'b'), ['test2']] = 0 - query查詢
status= 'Success'
df1 = df.query('(score == 99) and (result== @status)') - 獲取某列值不在某列表的數(shù)據(jù):
df=df.loc[~df['test1'].isin(listA)]
7.批量rename
第2列之后的所有列名,添加前綴
new_name = [(name, 'pre_' + name) for name in df.iloc[:, 2:].columns.values]
df.rename(column=dict(new_name), inplace=True)
8.判斷值不等的ne方法
?df.ne(other, anxis='columns', level=None)
?df1 = pd.Dataframe({"A":[14,4,5,4,1], "B":[5,2,54,3,2], "C":[20,20,7,3,8], "D":[14,3,6,2,6]})
?ser = pd.Series([3, 2, 4, 5, 6])
?df1.ne(ser, axis=0) 表示df1的ABCD每一列都跟ser比較,相等的為False不等的為True,
?以此為結果輸出結構和df1相同的df,值全為True、False
9.查找最后一個非0數(shù)據(jù)的index
idx = (df.shape[0] - df.ne(0).values[::-1].argmax(0)).tolist()[-1] - 1