(8) pandas Basics, Part 3: Python Data Analysis and Machine Learning in Practice (Study Notes)

Original article. Last updated: 2018-05-3

Introduction: about Series

To make it easy to follow this Series example, the file fandango_score_comparison.csv is shared through Baidu Netdisk. Link: https://pan.baidu.com/s/1U6z7OvXK75L1AGm1vYlN4w Password: qe1a

Course source: Python Data Analysis and Machine Learning in Practice, by Tang Yudi (唐宇迪)

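A DataFrame is comparable to a matrix, and a Series is comparable to a single row or column of that matrix. A Series consists of a set of data values together with the index labels associated with them.
Here is a small example: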

import pandas as pd
a=pd.Series([9,8,7,6])
a
Out[19]: 
0    9
1    8
2    7
3    6
dtype: int64

The data set below contains movie ratings and related data. Let's see what is special about the Series structure.

import pandas as pd

fandango=pd.read_csv('fandango_score_comparison.csv')

series_film = fandango['FILM']

type(series_film)
Out[85]: pandas.core.series.Series

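As shown above, fandango is a DataFrame. Selecting one of its columns with fandango['FILM'] returns a Series.

How does selecting data in a Series differ from selecting data in a DataFrame?

The usage is essentially the same: both support indexing and slicing.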

series_film = fandango['FILM']
series_film[0:5]
Out[84]: 
0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object

series_rt = fandango['RottenTomatoes']
series_rt[0:5]
Out[87]: 
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64

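How do you create a new Series?

First look at the type of series.values. The result is an ndarray: the values of a Series are stored in a NumPy array. In other words, a DataFrame is made up of Series, and a Series is built around an ndarray. pandas is built on top of NumPy.

Many pandas operations are convenient combinations of NumPy operations, and many operations work across both libraries.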

film_names=series_film.values

type(film_names)
Out[89]: numpy.ndarray

Next we create a Series. First import Series from pandas.

from pandas import Series

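The string representation of a Series shows the index on the left and the values on the right. If no index is specified for the data, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. You can get the array representation and the index object through the values and index attributes of the Series.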

Unlike a plain NumPy array, a Series lets you use index labels to select a single value or a set of values.
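As a small illustration of the two points above, the sketch below reads the values and index attributes of a Series that has the default integer index, then selects by label from a Series with string labels. The data and the names s and scores are made up for illustration and do not come from the course file.

```python
import pandas as pd

# Hypothetical data; no index is given, so pandas creates 0..N-1.
s = pd.Series([9, 8, 7, 6])
values = s.values              # NumPy array holding the data
index_labels = list(s.index)   # the automatically created labels

# Hypothetical labelled Series: labels on the left, values on the right.
scores = pd.Series([74, 85, 80], index=["film_a", "film_b", "film_c"])
one = scores["film_b"]                   # a single value
several = scores[["film_c", "film_a"]]   # a sub-Series, in the order requested

print(values, index_labels)
print(one)
print(several)
```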

Example: create a Series in which each movie name is an index label and the value is that movie's score from one of the review sites.

from pandas import Series
rt_scores = series_rt.values
series_custom = Series(rt_scores , index=film_names)

series_custom[['Minions (2015)', 'Leviathan (2014)']]
Out[96]: 
Minions (2015)      54
Leviathan (2014)    99
dtype: int64

How do you sort a Series?

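reindex does not rename the labels of a pandas object; it rearranges the object to follow a new label order. If a requested label does not exist in the original, that row is filled with the missing-value marker NaN. reindex does not modify the original object, so assign the result to a variable if you want to keep it.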
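The behaviour of reindex with a label that is not present can be seen in a minimal sketch. The Series scores and the label "film_x" are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical scores keyed by made-up film labels.
scores = pd.Series([74, 85, 80], index=["film_c", "film_a", "film_b"])

# Reorder to a new label order; "film_x" does not exist in scores.
reordered = scores.reindex(["film_a", "film_b", "film_x"])

print(reordered)               # film_x is filled with NaN
print(reordered.dtype)         # integers become floats once NaN is present
print(scores.index.tolist())   # the original Series is unchanged
```

Labels that are left out of the new order, such as "film_c" here, are simply absent from the result.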

First extract the movie names, that is, convert the index to a list.

original_index = series_custom.index.tolist()

original_index
Out[110]: 
['Avengers: Age of Ultron (2015)',
 'Cinderella (2015)',
 'Ant-Man (2015)',
 'Do You Believe? (2015)',
 'Hot Tub Time Machine 2 (2015)',
 ....
 'Mr. Holmes (2015)',
 "'71 (2015)",
 'Two Days, One Night (2014)',
 'Gett: The Trial of Viviane Amsalem (2015)',
 'Kumiko, The Treasure Hunter (2015)']

Sort the movie names. The sorted result is:

sorted_index = sorted(original_index)

sorted_index
Out[112]: 
["'71 (2015)",
 '5 Flights Up (2015)',
 'A Little Chaos (2015)',
 'A Most Violent Year (2014)',
 'About Elly (2015)',
....
 'What We Do in the Shadows (2015)',
 'When Marnie Was There (2015)',
 "While We're Young (2015)",
 'Wild Tales (2014)',
 'Woman in Gold (2015)']

Use reindex to reorder series_custom according to the sorted movie names:

sorted_by_index = series_custom.reindex(sorted_index)

sorted_by_index
Out[114]: 
'71 (2015)                                         97
5 Flights Up (2015)                                52
A Little Chaos (2015)                              40
A Most Violent Year (2014)                         90
About Elly (2015)                                  97
....
When Marnie Was There (2015)                       89
While We're Young (2015)                           83
Wild Tales (2014)                                  96
Woman in Gold (2015)                               52
Length: 146, dtype: int64

How do you sort a Series by its index and by its values?

Use sort_index() to sort by index, producing sc2.

sc2 = series_custom.sort_index()

sc2
Out[116]: 
'71 (2015)                                         97
5 Flights Up (2015)                                52
A Little Chaos (2015)                              40
A Most Violent Year (2014)                         90
About Elly (2015)                                  97
....
What We Do in the Shadows (2015)                   96
When Marnie Was There (2015)                       89
While We're Young (2015)                           83
Wild Tales (2014)                                  96
Woman in Gold (2015)                               52
Length: 146, dtype: int64

Use sort_values() to sort by value, producing sc3.

sc3 = series_custom.sort_values()

sc3
Out[118]: 
Paul Blart: Mall Cop 2 (2015)                    5
Hitman: Agent 47 (2015)                          7
Hot Pursuit (2015)                               8
Fantastic Four (2015)                            9
Taken 3 (2015)                                   9
....
Song of the Sea (2014)                          99
Phoenix (2015)                                  99
Selma (2014)                                    99
Seymour: An Introduction (2015)                100
Gett: The Trial of Viviane Amsalem (2015)      100
Length: 146, dtype: int64
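Both sorting methods sort in ascending order by default and return a new Series. A minimal sketch of the descending variants, using the ascending parameter on made-up data (the Series scores is hypothetical):

```python
import pandas as pd

# Hypothetical scores keyed by made-up film labels.
scores = pd.Series([74, 5, 100], index=["film_b", "film_c", "film_a"])

by_label_desc = scores.sort_index(ascending=False)    # labels from z to a
by_value_desc = scores.sort_values(ascending=False)   # highest score first

print(by_label_desc)
print(by_value_desc)
print(scores)   # the original order is preserved
```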

How do you add two Series?

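Adding two Series produces a new Series. pandas aligns the two operands by index label: values with the same label are added together, and a label that appears in only one of the two Series gets NaN in the result.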
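pandas matches the operands of an addition by index label rather than by position. The sketch below shows this on two small Series whose labels only partly overlap. The names critics and users and their data are hypothetical; Series.add with fill_value is shown as one way to avoid NaN, assuming a missing score should count as 0.

```python
import pandas as pd

# Hypothetical Series with partly overlapping labels.
critics = pd.Series([74, 85], index=["film_a", "film_b"])
users = pd.Series([90, 60], index=["film_b", "film_c"])

total = critics + users                     # aligned by label, not by position
filled = critics.add(users, fill_value=0)   # treat a missing side as 0

print(total)    # only film_b has a value on both sides
print(filled)
```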


Use NumPy's add function to add series_custom to itself.

series_custom
Out[123]: 
Avengers: Age of Ultron (2015)                     74
Cinderella (2015)                                  85
Ant-Man (2015)                                     80
Do You Believe? (2015)                             18
Hot Tub Time Machine 2 (2015)                      14
....
Mr. Holmes (2015)                                  87
'71 (2015)                                         97
Two Days, One Night (2014)                         97
Gett: The Trial of Viviane Amsalem (2015)         100
Kumiko, The Treasure Hunter (2015)                 87
Length: 146, dtype: int64

np.add(a, b) is equivalent to a + b. The result is:

import numpy as np
np.add(series_custom, series_custom)  # equivalent to series_custom + series_custom
Out[124]: 
Avengers: Age of Ultron (2015)                    148
Cinderella (2015)                                 170
Ant-Man (2015)                                    160
Do You Believe? (2015)                             36
Hot Tub Time Machine 2 (2015)                      28
....
Mr. Holmes (2015)                                 174
'71 (2015)                                        194
Two Days, One Night (2014)                        194
Gett: The Trial of Viviane Amsalem (2015)         200
Kumiko, The Treasure Hunter (2015)                174
Length: 146, dtype: int64

Use np.sin() to compute the sine of each value in the Series.

np.sin(series_custom)
Out[126]: 
Avengers: Age of Ultron (2015)                   -0.985146
Cinderella (2015)                                -0.176076
Ant-Man (2015)                                   -0.993889
Do You Believe? (2015)                           -0.750987
Hot Tub Time Machine 2 (2015)                     0.990607
....
Mr. Holmes (2015)                                -0.821818
'71 (2015)                                        0.379608
Two Days, One Night (2014)                        0.379608
Gett: The Trial of Viviane Amsalem (2015)        -0.506366
Kumiko, The Treasure Hunter (2015)               -0.821818
Length: 146, dtype: float64

Use np.max() to find the maximum value of series_custom.

np.max(series_custom)
Out[127]: 100

Test which values in series_custom are greater than 50.

series_custom > 50
Out[128]: 
Avengers: Age of Ultron (2015)                     True
Cinderella (2015)                                  True
Ant-Man (2015)                                     True
Do You Believe? (2015)                            False
Hot Tub Time Machine 2 (2015)                     False
....
Mr. Holmes (2015)                                  True
'71 (2015)                                         True
Two Days, One Night (2014)                         True
Gett: The Trial of Viviane Amsalem (2015)          True
Kumiko, The Treasure Hunter (2015)                 True
Length: 146, dtype: bool

Select the values in series_custom that are greater than 50.

series_greater_than_50 = series_custom[series_custom > 50]

series_greater_than_50
Out[130]: 
Avengers: Age of Ultron (2015)                                             74
Cinderella (2015)                                                          85
Ant-Man (2015)                                                             80
The Water Diviner (2015)                                                   63
Top Five (2014)                                                            86
....
Mr. Holmes (2015)                                                          87
'71 (2015)                                                                 97
Two Days, One Night (2014)                                                 97
Gett: The Trial of Viviane Amsalem (2015)                                 100
Kumiko, The Treasure Hunter (2015)                                         87
Length: 94, dtype: int64

Select the values in series_custom that are greater than 50 and less than 75.

criteria_one = series_custom > 50

criteria_two = series_custom < 75

both_criteria = series_custom[criteria_one & criteria_two]

both_criteria
Out[134]: 
Avengers: Age of Ultron (2015)                                            74
The Water Diviner (2015)                                                  63
Unbroken (2014)                                                           51
Southpaw (2015)                                                           59
Insidious: Chapter 3 (2015)                                               59
The Man From U.N.C.L.E. (2015)                                            68
....
Woman in Gold (2015)                                                      52
The Last Five Years (2015)                                                60
Jurassic World (2015)                                                     71
Minions (2015)                                                            54
Spare Parts (2015)                                                        52
dtype: int64
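Boolean masks can also be combined inline, and with | (or) and ~ (not) as well as & (and). A minimal sketch on made-up data; the Series scores is hypothetical:

```python
import pandas as pd

# Hypothetical scores keyed by made-up film labels.
scores = pd.Series([74, 18, 100, 51],
                   index=["film_a", "film_b", "film_c", "film_d"])

# Each comparison is parenthesised because & and | bind more tightly
# than > and < in Python.
mid_range = scores[(scores > 50) & (scores < 75)]
extremes = scores[(scores <= 50) | (scores >= 75)]
not_mid = scores[~((scores > 50) & (scores < 75))]

print(mid_range)
print(extremes)
print(not_mid)   # same rows as extremes
```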

How do you give two Series the same index, and how are they then combined?

When the indexes are the same, values with matching labels are combined, and the result is a new Series.

rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])

rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])

rt_mean = (rt_critics + rt_users)/2

rt_mean
Out[138]: 
FILM
Avengers: Age of Ultron (2015)                    80.0
Cinderella (2015)                                 82.5
Ant-Man (2015)                                    85.0
Do You Believe? (2015)                            51.0
Hot Tub Time Machine 2 (2015)                     21.0
....
Inside Out (2015)                                 94.0
Mr. Holmes (2015)                                 82.5
'71 (2015)                                        89.5
Two Days, One Night (2014)                        87.5
Gett: The Trial of Viviane Amsalem (2015)         90.5
Kumiko, The Treasure Hunter (2015)                75.0
Length: 146, dtype: float64

How do you set an index?

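More on set_index:
The DataFrame set_index method can set either a single index or a compound (multi-level) index.
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
append=True adds the new keys to the existing index instead of replacing it. drop=False keeps the key column in the DataFrame as an ordinary column as well. inplace=True modifies the DataFrame itself instead of returning a new one.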
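The effect of the drop and append parameters, and of passing several keys, can be sketched on a tiny DataFrame. The frame df and its columns are hypothetical and are not the fandango data.

```python
import pandas as pd

# Hypothetical frame with made-up film labels.
df = pd.DataFrame({
    "FILM": ["film_a", "film_b"],
    "YEAR": [2015, 2014],
    "SCORE": [74, 85],
})

dropped = df.set_index("FILM")                 # drop=True: FILM is no longer a column
kept = df.set_index("FILM", drop=False)        # FILM is both the index and a column
compound = df.set_index(["FILM", "YEAR"])      # two keys give a two-level index
appended = df.set_index("FILM", append=True)   # keeps 0..N-1 as the first level

print(dropped.columns.tolist())
print(kept.columns.tolist())
print(compound.index.nlevels, appended.index.nlevels)
```

None of these calls changes df, because inplace is left at its default of False.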

The index of fandango runs from 0 to 145 (a RangeIndex with stop=146).

fandango=pd.read_csv('fandango_score_comparison.csv')
fandango.index
Out[149]: RangeIndex(start=0, stop=146, step=1)

Use set_index to replace the integer index with the values of the 'FILM' column. The result:

fandango_films = fandango.set_index('FILM', drop=False)
fandango_films.index
Out[140]: 
Index(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)',
       'Do You Believe? (2015)', 'Hot Tub Time Machine 2 (2015)',
       'The Water Diviner (2015)', 'Irrational Man (2015)', 'Top Five (2014)',
       'Shaun the Sheep Movie (2015)', 'Love & Mercy (2015)',
       ...
       'The Woman In Black 2 Angel of Death (2015)', 'Danny Collins (2015)',
       'Spare Parts (2015)', 'Serena (2015)', 'Inside Out (2015)',
       'Mr. Holmes (2015)', ''71 (2015)', 'Two Days, One Night (2014)',
       'Gett: The Trial of Viviane Amsalem (2015)',
       'Kumiko, The Treasure Hunter (2015)'],
      dtype='object', name='FILM', length=146)

Slicing with the new index

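A numeric index can be sliced, and so can a string index: put a colon between two labels. A slice such as a:c selects every row from label a through label c, following the order in which the labels appear in the index, and the end label is included. All columns of the matching rows are returned. Otherwise it works much like slicing with numbers.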
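A minimal sketch of label slicing, assuming a small hypothetical DataFrame whose index is deliberately not in alphabetical order. It contrasts .loc, which slices by label and includes the end label, with .iloc, which slices by position and excludes the end.

```python
import pandas as pd

# Hypothetical frame; the index order is deliberately unsorted.
df = pd.DataFrame(
    {"SCORE": [74, 85, 80, 18]},
    index=["film_c", "film_a", "film_d", "film_b"],
)

by_label = df.loc["film_a":"film_d"]   # film_a through film_d, end included
by_position = df.iloc[1:3]             # positions 1 and 2, end excluded

print(by_label)      # film_c is not selected: it comes before film_a in the index
print(by_position)
```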

Example: slice the rows from "Avengers: Age of Ultron (2015)" through "Hot Tub Time Machine 2 (2015)".

fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"] is equivalent to fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"].

fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
Out[147]: 
                                                          FILM  \
FILM                                                             
Avengers: Age of Ultron (2015)  Avengers: Age of Ultron (2015)   
Cinderella (2015)                            Cinderella (2015)   
Ant-Man (2015)                                  Ant-Man (2015)   
Do You Believe? (2015)                  Do You Believe? (2015)   
Hot Tub Time Machine 2 (2015)    Hot Tub Time Machine 2 (2015)   


                                RT_user_norm         ...           IMDB_norm  \
FILM                                                 ...                       
Avengers: Age of Ultron (2015)           4.3         ...                3.90   
Cinderella (2015)                        4.0         ...                3.55   
Ant-Man (2015)                           4.5         ...                3.90   
Do You Believe? (2015)                   4.2         ...                2.70   
Hot Tub Time Machine 2 (2015)            1.4         ...                2.55   

                                RT_norm_round  RT_user_norm_round  \

                                Fandango_Difference  
FILM                                                 
Avengers: Age of Ultron (2015)                  0.5  
Cinderella (2015)                               0.5  
Ant-Man (2015)                                  0.5  
Do You Believe? (2015)                          0.5  
Hot Tub Time Machine 2 (2015)                   0.5  

[5 rows x 22 columns]

A similar short exercise:

# Look up the row for a single index label
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']
# Look up the rows for three index labels
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]

How do you work with data types?

Use the dtypes attribute to get the data type of each column of the DataFrame. The result:

import numpy as np

types = fandango_films.dtypes

types
Out[158]: 
FILM                           object
RottenTomatoes                  int64
RottenTomatoes_User             int64
Metacritic                      int64
Metacritic_User               float64
....
IMDB_norm_round               float64
Metacritic_user_vote_count      int64
IMDB_user_vote_count            int64
Fandango_votes                  int64
Fandango_Difference           float64
dtype: object

Get the index labels (the column names) whose data type is float64.

float_columns = types[types.values == 'float64'].index

float_columns
Out[160]: 
Index(['Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
       'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
       'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
       'Metacritic_norm_round', 'Metacritic_user_norm_round',
       'IMDB_norm_round', 'Fandango_Difference'],
      dtype='object')

Use those float64 column names to select the corresponding columns, with all their rows.

float_df = fandango_films[float_columns]

float_df
Out[162]: 
                                                Metacritic_User  IMDB  \
FILM                                                                    
Avengers: Age of Ultron (2015)                              7.1   7.8   
Cinderella (2015)                                           7.5   7.1   
Ant-Man (2015)                                              8.1   7.8   
Do You Believe? (2015)                                      4.7   5.4   
Hot Tub Time Machine 2 (2015)                               3.4   5.1   
The Water Diviner (2015)                                    6.8   7.2   
Irrational Man (2015)                                       7.6   6.9   
Top Five (2014)                                             6.8   6.5   
Shaun the Sheep Movie (2015)                                8.8   7.4   
Love & Mercy (2015)                                         8.5   7.8   
Far From The Madding Crowd (2015)                           7.5   7.2   
Black Sea (2015)                                            6.6   6.4   
Leviathan (2014)                                            7.2   7.7   
Unbroken (2014)                                             6.5   7.2   
The Imitation Game (2014)                                   8.2   8.1   
Taken 3 (2015)                                              4.6   6.1   
Ted 2 (2015)                                                6.5   6.6   
Southpaw (2015)                                             8.2   7.8   
Night at the Museum: Secret of the Tomb (2014)              5.8   6.3   
Pixels (2015)                                               5.3   5.6   
McFarland, USA (2015)                                       7.2   7.5   
Insidious: Chapter 3 (2015)                                 6.9   6.3   
The Man From U.N.C.L.E. (2015)                              7.9   7.6   
Run All Night (2015)                                        7.3   6.6   
Trainwreck (2015)                                           6.0   6.7   
Selma (2014)                                                7.1   7.5   
Ex Machina (2015)                                           7.9   7.7   
Still Alice (2015)                                          7.8   7.5   
Wild Tales (2014)                                           8.8   8.2   
The End of the Tour (2015)                                  7.5   7.9   
                                                            ...  
Clouds of Sils Maria (2015)                                     0.1  
Testament of Youth (2015)                                       0.1  
Infinitely Polar Bear (2015)                                    0.1  
Phoenix (2015)                                                  0.1  
The Wolfpack (2015)                                             0.1  
The Stanford Prison Experiment (2015)                           0.1  
Tangerine (2015)                                                0.1  
Magic Mike XXL (2015)                                           0.1  
Home (2015)                                                     0.1  
The Wedding Ringer (2015)                                       0.1  
Woman in Gold (2015)                                            0.1  
The Last Five Years (2015)                                      0.1  
Mission: Impossible – Rogue Nation (2015)                       0.1  
Amy (2015)                                                      0.1  
Jurassic World (2015)                                           0.0  
Minions (2015)                                                  0.0  
Max (2015)                                                      0.0  
Paul Blart: Mall Cop 2 (2015)                                   0.0  
The Longest Ride (2015)                                         0.0  
The Lazarus Effect (2015)                                       0.0  
The Woman In Black 2 Angel of Death (2015)                      0.0  
Danny Collins (2015)                                            0.0  
Spare Parts (2015)                                              0.0  
Serena (2015)                                                   0.0  
Inside Out (2015)                                               0.0  
Mr. Holmes (2015)                                               0.0  
'71 (2015)                                                      0.0  
Two Days, One Night (2014)                                      0.0  
Gett: The Trial of Viviane Amsalem (2015)                       0.0  
Kumiko, The Treasure Hunter (2015)                              0.0  

[146 rows x 15 columns]
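pandas also provides DataFrame.select_dtypes, which selects columns by type in one call. The sketch below compares it with the dtypes-based approach on a small hypothetical frame; df and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame mixing string, integer, and float columns.
df = pd.DataFrame({
    "FILM": ["film_a", "film_b"],
    "VOTES": [100, 250],
    "STARS": [4.5, 3.0],
    "NORM": [3.9, 2.7],
})

# Filter the dtypes Series, then select the matching columns.
float_cols = df.dtypes[df.dtypes == "float64"].index
float_df = df[float_cols]

# Built-in alternative that gives the same columns.
float_df_alt = df.select_dtypes(include=["float64"])

print(list(float_cols))
print(float_df_alt)
```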

Use apply with np.std() to compute the standard deviation of each column.

deviations = float_df.apply(lambda x: np.std(x))

deviations
Out[165]: 
Metacritic_User               1.505529
IMDB                          0.955447
Fandango_Stars                0.538532
Fandango_Ratingvalue          0.501106
RT_norm                       1.503265
RT_user_norm                  0.997787
Metacritic_norm               0.972522
Metacritic_user_nom           0.752765
IMDB_norm                     0.477723
RT_norm_round                 1.509404
RT_user_norm_round            1.003559
Metacritic_norm_round         0.987561
Metacritic_user_norm_round    0.785412
IMDB_norm_round               0.501043
Fandango_Difference           0.152141
dtype: float64

A similar short exercise:

rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_user.apply(lambda x: np.std(x), axis=1)
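The axis argument decides what apply passes to the function: with the default axis=0 each column is passed in turn, and with axis=1 each row is passed. A minimal sketch on a hypothetical two-by-two frame; it is written with the pandas method x.std(ddof=0) instead of np.std so that the degrees-of-freedom choice is explicit, which is an assumption of this sketch and not something the course specifies.

```python
import pandas as pd

# Hypothetical frame with made-up film labels.
df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]},
                  index=["film_x", "film_y"])

# axis=0 (default): the function receives one column at a time.
per_column = df.apply(lambda x: x.std(ddof=0))

# axis=1: the function receives one row at a time.
per_row = df.apply(lambda x: x.std(ddof=0), axis=1)

print(per_column)   # one value per column, indexed by column name
print(per_row)      # one value per row, indexed by film label
```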