Pandas_Select_Data_query()
查詢數(shù)據(jù)除了[ ]、loc、iloc、where外,還有另外一個(gè)簡(jiǎn)潔高效的查詢方法——query()。
了解了一下相關(guān)操作,簡(jiǎn)直驚呆了,完全就是英語白話,容易掌握,語句也短,稱為短小精悍一點(diǎn)兒都不為過。
學(xué)了之后,再也不用寫辣么長(zhǎng)的查詢語句了。
本文就簡(jiǎn)單介紹一下query()方法的相關(guān)操作。
import pandas as pd
import numpy as np
?
data = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))
data
out:
a b c d
0 -0.055029 1.376917 -0.228314 1.595987
1 -0.259330 -0.114194 1.252481 0.386451
2 0.873330 -1.279337 2.390891 -0.044016
3 -1.190554 -1.359401 -0.191798 1.742165
4 -0.750102 0.143094 0.742452 -1.577230
pandas一般做法
data[(data.a < data.b) & (data.c > data.d)]
out:
a b c d
1 -0.259330 -0.114194 1.252481 0.386451
4 -0.750102 0.143094 0.742452 -1.577230
data.loc[(data.a < data.b) & (data.c > data.d)]
out:
a b c d
1 -0.259330 -0.114194 1.252481 0.386451
4 -0.750102 0.143094 0.742452 -1.577230
使用query()
data.query('(a < b) and (c > d)')
out:
a b c d
1 -0.259330 -0.114194 1.252481 0.386451
4 -0.750102 0.143094 0.742452 -1.577230
data.query('a < b < c')
out:
a b c d
1 -0.259330 -0.114194 1.252481 0.386451
4 -0.750102 0.143094 0.742452 -1.577230
query()可以使用索引
直接使用'index':
data.query('index > 2')
out:
a b c d
i
3 -1.190554 -1.359401 -0.191798 1.742165
4 -0.750102 0.143094 0.742452 -1.577230
也可以先命名仔使用:
data.index.name= 'i'
data.query('i > 2')
out:
a b c d
i
3 -1.190554 -1.359401 -0.191798 1.742165
4 -0.750102 0.143094 0.742452 -1.577230
如果索引的名稱與列名稱重疊,則列名稱優(yōu)先。
data.index.name= 'a'
data.query('a > .5')
out:
a b c d
a
2 0.87333 -1.279337 2.390891 -0.044016
MultiIndex query()語法
companies = np.random.choice(['apple', 'huawei'], size = 6)
products = np.random.choice(['mobile', 'pad'], size = 6)
companies
out:
array(['apple', 'apple', 'huawei', 'huawei', 'huawei', 'huawei'],
dtype='<U6')
products
out:
array(['pad', 'mobile', 'mobile', 'mobile', 'pad', 'pad'], dtype='<U6')
index = pd.MultiIndex.from_arrays([companies, products], names=['company', 'product'])
data_mul = pd.DataFrame(np.random.randn(6, 3), index=index)
data_mul
out:
0 1 2
company product
apple pad -0.968198 -0.619911 1.134541
mobile -0.608740 -0.909656 -0.027498
huawei mobile 1.016521 1.882203 -1.001023
mobile 0.841595 0.065712 1.732188
pad 0.540195 -0.762794 0.549758
pad -2.613445 -0.387707 -0.780770
data_mul.query('product == "mobile"')
out:
0 1 2
company product
apple mobile -0.608740 -0.909656 -0.027498
huawei mobile 1.016521 1.882203 -1.001023
mobile 0.841595 0.065712 1.732188
data_mul.query('company == "huawei"')
out:
0 1 2
company product
huawei mobile 1.016521 1.882203 -1.001023
mobile 0.841595 0.065712 1.732188
pad 0.540195 -0.762794 0.549758
pad -2.613445 -0.387707 -0.780770
如果MultiIndex未命名的級(jí)別,您可以使用特殊名稱引用它們:
data_mul.index.names = [None, None]
data_mul.query('ilevel_0 == "huawei"')
out:
0 1 2
huawei mobile 1.016521 1.882203 -1.001023
mobile 0.841595 0.065712 1.732188
pad 0.540195 -0.762794 0.549758
pad -2.613445 -0.387707 -0.780770
data_mul.query('ilevel_1 == "pad"')
out:
0 1 2
apple pad -0.968198 -0.619911 1.134541
huawei pad 0.540195 -0.762794 0.549758
pad -2.613445 -0.387707 -0.780770
query()與pandas語法比較
- 完全類似numpy的語法;
- 可以直接刪除括號(hào);
- 使用英語而不是符號(hào);
- 最大程度接近自然語言。
有括號(hào)
data.query('(a < b) and (c > d)')
out:
a b c d
a
1 -0.259330 -0.114194 1.252481 0.386451
4 -0.750102 0.143094 0.742452 -1.577230
無括號(hào)
data.query('a < b and c > d')
out:
a b c d
a
1 -0.259330 -0.114194 1.252481 0.386451
4 -0.750102 0.143094 0.742452 -1.577230
使用符號(hào)
data.query('a < b & c > d')
out:
a b c d
a
1 -0.259330 -0.114194 1.252481 0.386451
4 -0.750102 0.143094 0.742452 -1.577230
使用英語
data.query('a < b or c > d')
out:
a b c d
a
0 -0.055029 1.376917 -0.228314 1.595987
1 -0.259330 -0.114194 1.252481 0.386451
2 0.873330 -1.279337 2.390891 -0.044016
4 -0.750102 0.143094 0.742452 -1.577230
接近自然語言
data.query('a < b < c')
out:
a b c d
a
1 -0.259330 -0.114194 1.252481 0.386451
4 -0.750102 0.143094 0.742452 -1.577230
使用 in 與 not in 操作符
df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
'c': np.random.randint(5, size=12),
'd': np.random.randint(9, size=12)})
df
out:
a b c d
0 a a 3 8
1 a a 2 6
2 b a 4 6
3 b a 4 8
4 c b 2 7
5 c b 3 3
6 d b 4 1
7 d b 1 1
8 e c 4 6
9 e c 4 3
10 f c 2 5
11 f c 0 0
in
df.query('a in b')
out:
a b c d
0 a a 3 8
1 a a 2 6
2 b a 4 6
3 b a 4 8
4 c b 2 7
5 c b 3 3
一般方法
df[df.a.isin(df.b)]
out:
a b c d
0 a a 3 8
1 a a 2 6
2 b a 4 6
3 b a 4 8
4 c b 2 7
5 c b 3 3
not in
df.query('a not in b')
out:
a b c d
6 d b 4 1
7 d b 1 1
8 e c 4 6
9 e c 4 3
10 f c 2 5
11 f c 0 0
一般方法:
df[~df.a.isin(df.b)]
out:
a b c d
6 d b 4 1
7 d b 1 1
8 e c 4 6
9 e c 4 3
10 f c 2 5
11 f c 0 0
與其他表達(dá)式結(jié)合
df.query('a in b and c < d')
out:
a b c d
0 a a 3 8
1 a a 2 6
2 b a 4 6
3 b a 4 8
4 c b 2 7
一般方法:
df[(df.a.isin(df.b)) & (df.c < df.d)]
out:
a b c d
0 a a 3 8
1 a a 2 6
2 b a 4 6
3 b a 4 8
4 c b 2 7
== 運(yùn)算符與list的特殊用法
==、!=相當(dāng)于in 、not in。
== 運(yùn)算符
df.query('a == ["a", "b", "c"]')
out:
a b c d
0 a a 3 8
1 a a 2 6
2 b a 4 6
3 b a 4 8
4 c b 2 7
5 c b 3 3
df.query('a in ["a", "b", "c"]')
out:
a b c d
0 a a 3 8
1 a a 2 6
2 b a 4 6
3 b a 4 8
4 c b 2 7
5 c b 3 3
!= 運(yùn)算符
df.query('a != ["a", "b", "c"]')
out:
a b c d
6 d b 4 1
7 d b 1 1
8 e c 4 6
9 e c 4 3
10 f c 2 5
11 f c 0 0
df.query('a not in ["a", "b", "c"]')
out:
a b c d
6 d b 4 1
7 d b 1 1
8 e c 4 6
9 e c 4 3
10 f c 2 5
11 f c 0 0
布爾運(yùn)算符
可以使用單詞not或~運(yùn)算符否定布爾表達(dá)式。
df['bools'] = df.d > df.d.mean()
df.query('bools')
out:
a b c d bools
0 a a 3 8 True
1 a a 2 6 True
2 b a 4 6 True
3 b a 4 8 True
4 c b 2 7 True
8 e c 4 6 True
10 f c 2 5 True
df.query('d > d.mean()')
out:
a b c d bools
0 a a 3 8 True
1 a a 2 6 True
2 b a 4 6 True
3 b a 4 8 True
4 c b 2 7 True
8 e c 4 6 True
10 f c 2 5 True
df.query('not bools')
out:
a b c d bools
5 c b 3 3 False
6 d b 4 1 False
7 d b 1 1 False
9 e c 4 3 False
11 f c 0 0 False
df.query('~bools')
out:
a b c d bools
5 c b 3 3 False
6 d b 4 1 False
7 d b 1 1 False
9 e c 4 3 False
11 f c 0 0 False
?