A DataFrame represents a rectangular table of data and has both a row index and a column index.
Construction
In [43]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
...: 'year' : [2000, 2001, 2002, 2001, 2001, 2003],
...: 'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
In [44]: frame = pd.DataFrame(data)
In [45]: frame
Out[45]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2001 2.9
5 Nevada 2003 3.2
For a large DataFrame, the head method selects only the first five rows:
In [46]: frame.head()
Out[46]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2001 2.9
Passing columns arranges the DataFrame's columns in the specified order:
In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[47]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2001 Nevada 2.9
5 2003 Nevada 3.2
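If a column passed in columns is not contained in the dict, it appears in the result filled with missing values: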
In [49]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
...: index=['one', 'two', 'three', 'four', 'five', 'six'])
In [50]: frame2
Out[50]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2001 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
A column can be retrieved as a Series either by dict-like notation or by attribute:
In [51]: frame2['state']
Out[51]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
In [52]: frame2.year
Out[52]:
one 2000
two 2001
three 2002
four 2001
five 2001
six 2003
Name: year, dtype: int64
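Attribute access is a convenience that only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute. The sketch below, which uses a small two-row frame invented for illustration, shows that the pop column from the example is such a collision, because DataFrame already has a pop method:

```python
import pandas as pd

# A small illustrative frame (not the one from the session above).
data = {'state': ['Ohio', 'Nevada'],
        'year': [2000, 2001],
        'pop': [1.5, 2.4]}
frame = pd.DataFrame(data)

# 'year' does not clash with any DataFrame attribute, so both
# spellings return the same column.
same = frame['year'].equals(frame.year)

# 'pop' clashes with the DataFrame.pop method, so attribute access
# returns the bound method rather than the column.
pop_attr_is_method = callable(frame.pop)

# Bracket notation always returns the column.
pop_column = frame['pop']
```

For this reason bracket notation is the safer choice in scripts, and attribute access is best kept for interactive exploration.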
Rows can also be retrieved by label with the special loc attribute (or by integer position with iloc):
In [53]: frame2.loc['three']
Out[53]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
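To make the difference between label-based and position-based row selection concrete, here is a minimal sketch on a three-row frame invented for illustration. loc takes an index label, while iloc takes an integer position:

```python
import pandas as pd

# Illustrative frame with string row labels.
frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# Select the same row two ways: by its label and by its position.
by_label = frame.loc['three']
by_position = frame.iloc[2]
```

Both expressions return the row as a Series whose index is the frame's column labels.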
Columns can be modified by assignment, for example with a scalar value or an array of values:
In [54]: frame2['debt'] = 16.5
In [55]: frame2
Out[55]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2001 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
In [56]: frame2['debt'] = np.arange(6.)
In [57]: frame2
Out[57]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2001 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0
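When a Series is assigned to a column, its labels are realigned to the DataFrame's index, and missing values are inserted wherever the Series has no matching label: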
In [58]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [59]: frame2['debt'] = val
In [60]: frame2
Out[60]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2001 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
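The contrast between assigning a Series and assigning a plain list can be sketched as follows. The frame and the label 'zzz' are made up for illustration. A Series is aligned by label, so its length need not match, whereas a list or array must have exactly as many elements as the frame has rows:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# Series assignment aligns on labels: 'two' matches, 'zzz' has no
# counterpart in the frame and is ignored, and the rows without a
# matching label receive NaN.
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'zzz'])

# A list carries no labels, so its length must equal the number of rows.
try:
    frame['bad'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc
```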
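The del keyword deletes a column. As an example, first add a new boolean column, eastern, and then remove it: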
In [61]: frame2['eastern'] = frame2.state == 'Ohio'
In [62]: frame2
Out[62]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2001 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False
In [63]: del frame2['eastern']
In [64]: frame2.columns
Out[64]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
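As a complement to del, which modifies the DataFrame in place, the drop method returns a new object and leaves the original untouched. A minimal sketch on an illustrative two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop returns a new DataFrame without the column ...
trimmed = frame.drop(columns=['eastern'])
# ... while the original still has it at this point.
still_present = 'eastern' in frame.columns

# del removes the column from the original object itself.
del frame['eastern']
```

Which one to use is mostly a matter of whether the original frame is still needed with the column intact.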
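The column returned by indexing a DataFrame is a view on the underlying data rather than a copy, so in-place modifications to that Series are reflected in the DataFrame. To get an independent copy, call the Series's copy method explicitly.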
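Whether a write to a selected column actually propagates back to the DataFrame depends on the pandas version and on whether Copy-on-Write is enabled, so the sketch below only demonstrates the part that holds regardless: an explicit copy is independent of the frame it came from. The frame is invented for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]})

# Take an explicit, independent copy of the column.
pop_copy = frame['pop'].copy()

# Modifying the copy leaves the original DataFrame unchanged.
pop_copy.loc[0] = 99.0
```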
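Another common form of data is a nested dict of dicts. The outer keys become the columns and the inner keys become the row index: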
In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
...: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [66]: frame3 = pd.DataFrame(pop)
In [67]: frame3
Out[67]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
Transposing the DataFrame (swapping rows and columns):
In [68]: frame3.T
Out[68]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
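If an index is specified explicitly, the keys of the inner dicts are not combined and sorted to form the index; the given index is used instead: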
In [69]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[69]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
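For this example, passing an explicit index gives the same rows as building the frame first and then calling reindex with the same labels. The sketch below compares the two. That the two routes agree in general is an assumption here, not something the text above states.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Route 1: give the desired row labels at construction time.
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])

# Route 2: build from the dict, then conform to the same labels.
reindexed = pd.DataFrame(pop).reindex([2001, 2002, 2003])
```

In both cases the label 2000 is dropped because it was not requested, and 2003 is filled with missing values because no inner dict contains it.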
A dict of Series can also be used to construct a DataFrame:
In [70]: pdata = {'Ohio': frame3['Ohio'][: -1],
...: 'Nevada': frame3['Nevada'][: 2]}
In [71]: pd.DataFrame(pdata)
Out[71]:
Ohio Nevada
2000 1.5 NaN
2001 1.7 2.4
The index and the columns each have a name attribute; when set, these names are displayed as well:
In [72]: frame3.index.name = 'year'
In [73]: frame3.columns.name = 'state'
In [74]: frame3
Out[74]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [75]: frame3.values
Out[75]:
array([[nan, 1.5],
[2.4, 1.7],
[2.9, 3.6]])
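If the columns have different dtypes, the dtype of the values array is chosen automatically to accommodate all of the columns: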
In [77]: frame2.values
Out[77]:
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2001, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
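The dtype selection can be checked directly. In this sketch, on an illustrative two-row frame, to_numpy is used as the method form of the values attribute. Selecting only the numeric columns before converting avoids falling back to the generic object dtype:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001],
                      'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]})

# Integer and float columns share a common floating-point dtype.
numeric = frame[['year', 'pop']].to_numpy()

# Adding the string column forces the generic object dtype.
mixed = frame.to_numpy()
```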
Index objects
Any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index object:
In [78]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [79]: index = obj.index
In [80]: index
Out[80]: Index(['a', 'b', 'c'], dtype='object')
In [81]: index[1:]
Out[81]: Index(['b', 'c'], dtype='object')
In [82]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-82-a452e55ce13b> in <module>
----> 1 index[1] = 'd'
c:\users\a\appdata\local\programs\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
3881
3882 def __setitem__(self, key, value):
-> 3883 raise TypeError("Index does not support mutable operations")
3884
3885 def __getitem__(self, key):
TypeError: Index does not support mutable operations
In [83]: labels = pd.Index(np.arange(3))
In [84]: labels
Out[84]: Int64Index([0, 1, 2], dtype='int64')
In [85]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
In [86]: obj2
Out[86]:
0 1.5
1 -2.5
2 0.0
dtype: float64
In [87]: obj2.index is labels
Out[87]: True
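Because an Index cannot be modified in place, one Index can safely back several data structures. A minimal sketch of that idea, with variable names chosen for illustration: the in-place write fails, and changing labels means assigning a whole new Index to one object, which does not affect the others.

```python
import numpy as np
import pandas as pd

labels = pd.Index(np.arange(3))
s1 = pd.Series([1.5, -2.5, 0.0], index=labels)
s2 = pd.Series(['a', 'b', 'c'], index=labels)

# In-place modification of an Index raises TypeError.
try:
    labels[1] = 10
    mutated = True
except TypeError:
    mutated = False

# To relabel, assign a new Index to the object that should change.
s2.index = pd.Index(['x', 'y', 'z'])
```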
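Index objects are immutable. In addition to being array-like, an Index also behaves like a fixed-size set, so membership can be tested with in: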
In [88]: frame3
Out[88]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [89]: frame3.columns
Out[89]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [90]: 'Ohio' in frame3.columns
Out[90]: True
In [91]: 2003 in frame3.columns
Out[91]: False
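Beyond membership tests, an Index offers set-style methods. The sketch below goes slightly past what the session above shows and uses a second, hypothetical Index (other) to illustrate intersection, union and difference:

```python
import pandas as pd

cols = pd.Index(['Nevada', 'Ohio'], name='state')
other = pd.Index(['Ohio', 'Texas'])  # hypothetical second set of labels

common = cols.intersection(other)    # labels present in both
combined = cols.union(other)         # labels present in either
only_cols = cols.difference(other)   # labels in cols but not in other
```

Each method returns a new Index, consistent with Index objects being immutable.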
In [92]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
In [93]: dup_labels
Out[93]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
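Unlike a Python set, an Index may hold duplicate labels, as dup_labels shows. A short sketch of what this means in practice, using a Series invented for illustration: selecting a duplicated label returns every matching entry, and is_unique reports whether duplicates exist.

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
s = pd.Series([1, 2, 3, 4], index=dup_labels)

# False here, because both 'foo' and 'bar' appear twice.
unique = dup_labels.is_unique

# Selecting a duplicated label returns all matching entries as a Series.
foo_rows = s['foo']
```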