說說Python中的pandas模塊(2)

上次說到了pandas中常用的一些選擇以及切片的方法,總結(jié)一下,主要有以下幾種:

data['A’]

data[['A','B’]]

data.iloc[:,1] #截取第二列數(shù)據(jù),iloc只能用數(shù)字截取, Select row by integer location

data.loc[2] #截取index label為2的第二行數(shù)據(jù),loc只能用label來截取,Select row by label

data.loc[:,'A':'B’] #截取截取從header為’A’到’B’的列

data.ix[:, ‘A’:'B’] #截取從header為’A’到’B’的列
data.ix[:,0:2] #截取前兩列數(shù)據(jù)

有一點(diǎn)要注意:
In pandas version 0.20.0 and above, ix is deprecated and the use of loc and iloc is encouraged instead.

在pandas最新的documentation里,loc和iloc是比較推薦使用的,在stack overflow上有個(gè)問題專門問了這三種截取方法的區(qū)別,這里引用一下一遍我們深入了解他們的區(qū)別:

loc works on labels in the index. 
iloc works on the positions in the index (so it only takes integers).
ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.

舉個(gè)栗子,如果一個(gè)data frame, 它的index label是mixed type,既包含數(shù)字類型,又包含文本類型,那么ix既可用位置數(shù)字去截取,也可以用label去截取, 但是一定要記住,ix是優(yōu)先label的,如果label不存在,就會(huì)用位置數(shù)字去截取 (僅僅在mixed type下適用,一旦我們的label是固定類型,那么ix和loc的作用就完全一樣了)

接下來我想說一說數(shù)據(jù)過濾的問題, 這個(gè)問題也是我在數(shù)據(jù)預(yù)處理的時(shí)候經(jīng)常遇到的問題,我們先來創(chuàng)建一個(gè)數(shù)據(jù)表, 包含一些城市的信息

import pandas as pd

data = pd.read_excel('rhythm.xlsx')
print data

Out[23]:
   A  B  C  D      city  house_price  \
0  1  2  3  4   Beijing        70000   
1  2  2  3  4  Shanghai       120000   
2  3  2  3  4       NaN         5000   
3  4  2  3  4  New York       140000   
4  5  2  3  4  Brasilia        50000   
5  6  2  3  4   Atlanta        20000   
6  7  2  3  4     Tokyo       130000   
7  8  2  3  4     #NAME        30000   

                                         information  
0                                     page not found  
1  Shanghai is a Chinese city located on the east...  
2  Kunming is also called the Spring city due to ...  
3  New York is a state in the northeastern United...  
4  Brasília (Portuguese pronunciation: [b?a?zilj?...  
5                                      404 not found  
6  Tokyo (Japanese: [to?kjo?] ( listen), English ...  
7  Mumbai (/m?m?ba?/; also known as Bombay, the o...  

我們可以看到,這個(gè)dataset有很多問題,比如city列有NaN數(shù)據(jù)缺失的問題,information列有’page not found’的錯(cuò)誤信息,我們要怎么剔除這些信息呢?

第一問:如果我想把city列含有NaN的行去掉,要怎么做呢?
drop_nan = data.dropna(subset=['city'])
print drop_nan

output:
  A  B  C  D      city  house price  \
0  1  2  3  4   Beijing        70000   
1  2  2  3  4  Shanghai       120000   
3  4  2  3  4  New York       140000   
4  5  2  3  4  Brasilia        50000   
5  6  2  3  4   Atlanta        20000   
6  7  2  3  4     Tokyo       130000   

                                         information  
0                                     page not found  
1  Shanghai is a Chinese city located on the east...  
3  New York is a state in the northeastern United...  
4  Brasília (Portuguese pronunciation: [b?a?zilj?...  
5                                      404 not found  
6  Tokyo (Japanese: [to?kjo?] ( listen), English ...  

可以看到,我們city缺失的那一行已經(jīng)完全消失了,dropna就是專門用于過濾空值的函數(shù),一般來說它有兩種形式:

data.dropna(how='any') #to drop if any value in the row has a nan 只要任何一列包含有na值,就會(huì)完全刪除那一行

data.dropna(how='all') #to drop if all values in the row are nan 如果所有列的值都是na,才刪除那一行

在這里,需要注意行和列的區(qū)別

第二問:我想把東京和亞特蘭大那兩行刪掉
drop_value = data[data.city.str.contains('Tokyo', 'Atlanta') == False]

#or

drop_value = data[~data.city.str.contains('Tokyo', 'Atlanta',na=False)]

output:
A  B  C  D      city  house price  \
0  1  2  3  4   Beijing        70000   
1  2  2  3  4  Shanghai       120000   
2  3  2  3  4       NaN         5000   
3  4  2  3  4  New York       140000   
4  5  2  3  4  Brasilia        50000   

                                         information  
0                                     page not found  
1  Shanghai is a Chinese city located on the east...  
2  Kunming is also called the Spring city due to ...  
3  New York is a state in the northeastern United...  
4  Brasília (Portuguese pronunciation: [b?a?zilj?...  
第三問,我想把house price 大于等于70000小于等于14000的行提取出來:

在做這一步之前,有個(gè)很重要的問題需要我們先解決了,仔細(xì)觀察我們這個(gè)數(shù)據(jù)集,就會(huì)發(fā)現(xiàn)“house price”這個(gè)header是包含有空格的,這對(duì)我們之后的數(shù)據(jù)處理會(huì)造成很大麻煩,所以我們需要先把這個(gè)空格替換為”_"

data.columns = [c.replace(' ', '_') for c in data.columns]

print data.columns.values

output:
[u'A' u'B' u'C' u'D' u'city' u'house_price' u'information']

接下來就可以進(jìn)行數(shù)字的范圍截取了:

select_range = data[(data.house_price >= 70000) & (data.house_price <= 140000)]

output:
   A  B  C  D      city  house_price  \
0  1  2  3  4   Beijing        70000   
1  2  2  3  4  Shanghai       120000   
3  4  2  3  4  New York       140000   
6  7  2  3  4     Tokyo       130000   

                                         information  
0                                     page not found  
1  Shanghai is a Chinese city located on the east...  
3  New York is a state in the northeastern United...  
6  Tokyo (Japanese: [to?kjo?] ( listen), English ...  
第四問:我想把information那一列中,包含“page not found”, “404 not found”這些文本的行刪除,并且,如果city一列中,開頭的第一個(gè)文本是符號(hào)”#",那么也將這一行刪除

ignore_list = ['404 not found','page not found']

remove_start_with = data[~data['city'].str.startswith('#', na=False)
                       & ~data['information'].str.contains('|'.join(ignore_list), na=True)]

print remove_start_with

A  B  C  D      city  house_price  \
1  2  2  3  4  Shanghai       120000   
2  3  2  3  4       NaN         5000   
3  4  2  3  4  New York       140000   
4  5  2  3  4  Brasilia        50000   
6  7  2  3  4     Tokyo       130000   

                                         information  
1  Shanghai is a Chinese city located on the east...  
2  Kunming is also called the Spring city due to ...  
3  New York is a state in the northeastern United...  
4  Brasília (Portuguese pronunciation: [b?a?zilj?...  
6  Tokyo (Japanese: [to?kjo?] ( listen), English ...
第五問,我想把截取information文本長(zhǎng)度大于30個(gè)詞的行

len_filter = data[data.information.str.len() >= 30]

print len_filter

從上面的各種例子來看,我們就會(huì)發(fā)現(xiàn),pandas在截取數(shù)據(jù),過濾數(shù)據(jù)的時(shí)候功能強(qiáng)大,且寫出的代碼可讀性極高。我會(huì)繼續(xù)復(fù)習(xí)pandas相關(guān)的知識(shí),在下一篇文章中說談?wù)勅绾斡胮andas內(nèi)聯(lián)結(jié)兩個(gè)數(shù)據(jù)集。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容