上次說到了pandas中常用的一些選擇以及切片的方法,總結(jié)一下,主要有以下幾種:
data['A’]
data[['A','B’]]
data.iloc[:,1] #截取第二列數(shù)據(jù),iloc只能用數(shù)字截取, Select row by integer location
data.loc[2] #截取index label為2的第二行數(shù)據(jù),loc只能用label來截取,Select row by label
data.loc[:,'A':'B’] #截取截取從header為’A’到’B’的列
data.ix[:, ‘A’:'B’] #截取從header為’A’到’B’的列
data.ix[:,0:2] #截取前兩列數(shù)據(jù)
有一點(diǎn)要注意:
In pandas version 0.20.0 and above, ix is deprecated and the use of loc and iloc is encouraged instead.
在pandas最新的documentation里,loc和iloc是比較推薦使用的,在stack overflow上有個(gè)問題專門問了這三種截取方法的區(qū)別,這里引用一下一遍我們深入了解他們的區(qū)別:
loc works on labels in the index.
iloc works on the positions in the index (so it only takes integers).
ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.
舉個(gè)栗子,如果一個(gè)data frame, 它的index label是mixed type,既包含數(shù)字類型,又包含文本類型,那么ix既可用位置數(shù)字去截取,也可以用label去截取, 但是一定要記住,ix是優(yōu)先label的,如果label不存在,就會(huì)用位置數(shù)字去截取 (僅僅在mixed type下適用,一旦我們的label是固定類型,那么ix和loc的作用就完全一樣了)
接下來我想說一說數(shù)據(jù)過濾的問題, 這個(gè)問題也是我在數(shù)據(jù)預(yù)處理的時(shí)候經(jīng)常遇到的問題,我們先來創(chuàng)建一個(gè)數(shù)據(jù)表, 包含一些城市的信息
import pandas as pd
data = pd.read_excel('rhythm.xlsx')
print data
Out[23]:
A B C D city house_price \
0 1 2 3 4 Beijing 70000
1 2 2 3 4 Shanghai 120000
2 3 2 3 4 NaN 5000
3 4 2 3 4 New York 140000
4 5 2 3 4 Brasilia 50000
5 6 2 3 4 Atlanta 20000
6 7 2 3 4 Tokyo 130000
7 8 2 3 4 #NAME 30000
information
0 page not found
1 Shanghai is a Chinese city located on the east...
2 Kunming is also called the Spring city due to ...
3 New York is a state in the northeastern United...
4 Brasília (Portuguese pronunciation: [b?a?zilj?...
5 404 not found
6 Tokyo (Japanese: [to?kjo?] ( listen), English ...
7 Mumbai (/m?m?ba?/; also known as Bombay, the o...
我們可以看到,這個(gè)dataset有很多問題,比如city列有NaN數(shù)據(jù)缺失的問題,information列有’page not found’的錯(cuò)誤信息,我們要怎么剔除這些信息呢?
第一問:如果我想把city列含有NaN的行去掉,要怎么做呢?
drop_nan = data.dropna(subset=['city'])
print drop_nan
output:
A B C D city house price \
0 1 2 3 4 Beijing 70000
1 2 2 3 4 Shanghai 120000
3 4 2 3 4 New York 140000
4 5 2 3 4 Brasilia 50000
5 6 2 3 4 Atlanta 20000
6 7 2 3 4 Tokyo 130000
information
0 page not found
1 Shanghai is a Chinese city located on the east...
3 New York is a state in the northeastern United...
4 Brasília (Portuguese pronunciation: [b?a?zilj?...
5 404 not found
6 Tokyo (Japanese: [to?kjo?] ( listen), English ...
可以看到,我們city缺失的那一行已經(jīng)完全消失了,dropna就是專門用于過濾空值的函數(shù),一般來說它有兩種形式:
data.dropna(how='any') #to drop if any value in the row has a nan 只要任何一列包含有na值,就會(huì)完全刪除那一行
data.dropna(how='all') #to drop if all values in the row are nan 如果所有列的值都是na,才刪除那一行
在這里,需要注意行和列的區(qū)別
第二問:我想把東京和亞特蘭大那兩行刪掉
drop_value = data[data.city.str.contains('Tokyo', 'Atlanta') == False]
#or
drop_value = data[~data.city.str.contains('Tokyo', 'Atlanta',na=False)]
output:
A B C D city house price \
0 1 2 3 4 Beijing 70000
1 2 2 3 4 Shanghai 120000
2 3 2 3 4 NaN 5000
3 4 2 3 4 New York 140000
4 5 2 3 4 Brasilia 50000
information
0 page not found
1 Shanghai is a Chinese city located on the east...
2 Kunming is also called the Spring city due to ...
3 New York is a state in the northeastern United...
4 Brasília (Portuguese pronunciation: [b?a?zilj?...
第三問,我想把house price 大于等于70000小于等于14000的行提取出來:
在做這一步之前,有個(gè)很重要的問題需要我們先解決了,仔細(xì)觀察我們這個(gè)數(shù)據(jù)集,就會(huì)發(fā)現(xiàn)“house price”這個(gè)header是包含有空格的,這對(duì)我們之后的數(shù)據(jù)處理會(huì)造成很大麻煩,所以我們需要先把這個(gè)空格替換為”_"
data.columns = [c.replace(' ', '_') for c in data.columns]
print data.columns.values
output:
[u'A' u'B' u'C' u'D' u'city' u'house_price' u'information']
接下來就可以進(jìn)行數(shù)字的范圍截取了:
select_range = data[(data.house_price >= 70000) & (data.house_price <= 140000)]
output:
A B C D city house_price \
0 1 2 3 4 Beijing 70000
1 2 2 3 4 Shanghai 120000
3 4 2 3 4 New York 140000
6 7 2 3 4 Tokyo 130000
information
0 page not found
1 Shanghai is a Chinese city located on the east...
3 New York is a state in the northeastern United...
6 Tokyo (Japanese: [to?kjo?] ( listen), English ...
第四問:我想把information那一列中,包含“page not found”, “404 not found”這些文本的行刪除,并且,如果city一列中,開頭的第一個(gè)文本是符號(hào)”#",那么也將這一行刪除
ignore_list = ['404 not found','page not found']
remove_start_with = data[~data['city'].str.startswith('#', na=False)
& ~data['information'].str.contains('|'.join(ignore_list), na=True)]
print remove_start_with
A B C D city house_price \
1 2 2 3 4 Shanghai 120000
2 3 2 3 4 NaN 5000
3 4 2 3 4 New York 140000
4 5 2 3 4 Brasilia 50000
6 7 2 3 4 Tokyo 130000
information
1 Shanghai is a Chinese city located on the east...
2 Kunming is also called the Spring city due to ...
3 New York is a state in the northeastern United...
4 Brasília (Portuguese pronunciation: [b?a?zilj?...
6 Tokyo (Japanese: [to?kjo?] ( listen), English ...
第五問,我想把截取information文本長(zhǎng)度大于30個(gè)詞的行
len_filter = data[data.information.str.len() >= 30]
print len_filter
從上面的各種例子來看,我們就會(huì)發(fā)現(xiàn),pandas在截取數(shù)據(jù),過濾數(shù)據(jù)的時(shí)候功能強(qiáng)大,且寫出的代碼可讀性極高。我會(huì)繼續(xù)復(fù)習(xí)pandas相關(guān)的知識(shí),在下一篇文章中說談?wù)勅绾斡胮andas內(nèi)聯(lián)結(jié)兩個(gè)數(shù)據(jù)集。