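Accessing data is the first step in using most of the tools described in this book. I will focus on data input and output with pandas, although many tools in other libraries serve the same purpose.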

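Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network resources through web APIs.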

6.1 Reading and Writing Data in Text Format

pandas provides a number of functions for reading tabular data into a DataFrame object. Table 6-1 summarizes them; read_csv and read_table are likely to be the ones you use most.

[Table 6-1: data loading functions in pandas]

Here is a rough overview of what these functions do, which is to convert text data into a DataFrame. Their optional arguments fall into a few categories:

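Indexing
Can treat one or more columns as the index of the returned DataFrame, and controls whether column names come from the file or are not read from it at all.
Type inference and data conversion
Includes user-defined value conversions and custom missing value markers.
Datetime parsing
Includes a combining capability, so that date and time information spread over multiple columns can be merged into a single column.
Iterating
Supports iterating over chunks of very large files.
Unclean data issues
Skipping rows or a footer, comments, or other minor things such as numeric data with thousands separated by commas.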
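
A few of these option categories can be seen together in one call. The following is a minimal sketch, not taken from the original text: the file contents, the column names, and the choice of options are all made up for illustration, and the data is read from an in-memory string rather than a file.

```python
import io

import pandas as pd

# Hypothetical file contents: a comment line, a header row,
# and sales figures written with a thousands separator.
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "date,city,sales\n"
    '2020-01-01,Austin,"1,200"\n'
    '2020-01-02,Boston,"3,450"\n'
)

df = pd.read_csv(
    raw,
    comment="#",            # unclean data: ignore comment lines
    thousands=",",          # unclean data: "1,200" is parsed as 1200
    parse_dates=["date"],   # datetime parsing
    index_col="date",       # indexing: use the date column as the index
)
print(df)
print(df.index)
```

Each keyword here maps onto one of the categories in the list above, which is usually a helpful way to narrow down which of the many read_csv parameters to look up.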

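Because real-world data is messy, some of the data loading functions (especially read_csv) have accumulated more and more options over time. It is normal to feel overwhelmed by the number of parameters (read_csv has more than 50). The pandas online documentation has examples showing how each of them works.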

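Some of these functions, such as pandas.read_csv, perform type inference, because the column data types are not part of the data format. This means you do not have to specify which columns are floating point, integer, or string. In other data formats, such as HDF5, the data types are stored in the format itself.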
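
To make type inference concrete, here is a small sketch (the column names and values are invented for illustration). It reads the same text twice: once letting read_csv infer the types, and once overriding the inferred type of one column with the dtype argument.

```python
import io

import pandas as pd

raw = "a,b,c\n1,2.5,x\n3,4.5,y\n"

# Let read_csv infer the type of every column
inferred = pd.read_csv(io.StringIO(raw))
print(inferred.dtypes)   # a is an integer column, b is floating point

# Override the inference for column "a" only
forced = pd.read_csv(io.StringIO(raw), dtype={"a": "float64"})
print(forced.dtypes)
```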
Let's warm up with a small CSV file (a CSV file is a text file whose values are separated by commas):

In [8]: !cat examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Note: Here I used the Unix cat shell command to print the raw contents of the file to the screen. On Windows, you can use type to the same effect.

Since the file is comma-delimited, we can use read_csv to read it into a DataFrame:

In [9]: df = pd.read_csv('examples/ex1.csv')

In [10]: df
Out[10]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

We could also have used read_table and specified the delimiter:

In [11]: pd.read_table('examples/ex1.csv', sep=',')
Out[11]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

A file will not always have a header row. Consider this file:

In [12]: !cat examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

There are two ways to read this file. You can let pandas assign default column names, or you can specify the names yourself:

In [13]: pd.read_csv('examples/ex2.csv', header=None)
Out[13]: 
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

In [14]: pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Out[14]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

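Suppose you want the message column to be the index of the returned DataFrame. You can either indicate the column by position (index 4) or pass its name, 'message', through the index_col argument: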

In [15]: names = ['a', 'b', 'c', 'd', 'message']

In [16]: pd.read_csv('examples/ex2.csv', names=names, index_col='message')
Out[16]: 
         a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

To form a hierarchical index from multiple columns, pass a list of column numbers or column names:

In [17]: !cat examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

In [18]: parsed = pd.read_csv('examples/csv_mindex.csv',
   ....:                      index_col=['key1', 'key2'])

In [19]: parsed
Out[19]: 
           value1  value2
key1 key2                
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16

In some cases, a table does not use a fixed delimiter and instead separates fields with whitespace or some other pattern. Consider this text file:

In [20]: list(open('examples/ex3.txt'))
Out[20]: 
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

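While you could clean this up by hand, the fields here are separated by a variable amount of whitespace. In such cases, you can pass a regular expression as the delimiter for read_table. The pattern here is \s+ (written as a raw string so that the backslash is not treated as a string escape), so we have:

In [21]: result = pd.read_table('examples/ex3.txt', sep=r'\s+')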

In [22]: result
Out[22]: 
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

Because the header row has one fewer field than each data row, read_table infers that the first column should be the DataFrame's index.

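The parser functions have many additional arguments to help you handle files with irregular formats (see the table of options later in this section). For example, you can skip the first, third, and fourth rows of a file with skiprows: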

In [23]: !cat examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [24]: pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
Out[24]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

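Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas recognizes a set of commonly occurring sentinels, such as NA and NULL: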

In [25]: !cat examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
In [26]: result = pd.read_csv('examples/ex5.csv')

In [27]: result
Out[27]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [28]: pd.isnull(result)
Out[28]: 
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False

The na_values option takes a list or set of strings to treat as missing values:

In [29]: result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])

In [30]: result
Out[30]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

Different NA sentinels can also be specified for each column by passing a dict:

In [31]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

In [32]: pd.read_csv('examples/ex5.csv', na_values=sentinels)
Out[32]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN

Table 6-2 lists some frequently used options in pandas.read_csv and pandas.read_table.

[Table 6-2: frequently used options in pandas.read_csv and pandas.read_table]

1. Reading Text Files in Pieces

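When processing very large files, or when working out the right set of arguments to process a large file correctly, you may want to read only a small piece of the file or iterate through it in smaller chunks.
Before we look at a large file, we make the pandas display settings more compact: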

In [33]: pd.options.display.max_rows = 10
# Then read the file:
In [34]: result = pd.read_csv('examples/ex6.csv')

In [35]: result
Out[35]: 
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ...  ..
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0
[10000 rows x 5 columns]

If you want to read only a small number of rows (and avoid reading the entire file), specify that with nrows:

In [36]: pd.read_csv('examples/ex6.csv', nrows=5)
Out[36]: 
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q

To read a file in pieces, specify a chunksize as a number of rows:

In [874]: chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

In [875]: chunker
Out[875]: <pandas.io.parsers.TextParser at 0x8398150>

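The TextParser object returned by read_csv (named TextFileReader in recent pandas versions) lets you iterate over the file in parts according to the chunksize. For example, we can iterate over ex6.csv and aggregate the value counts of the key column: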

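chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

# Start from an empty float Series; the explicit dtype avoids the
# empty-Series dtype warning raised by newer pandas versions
tot = pd.Series([], dtype='float64')
for piece in chunker:
    # Add this chunk's counts to the running total; absent keys count as 0
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

# Most frequent keys first
tot = tot.sort_values(ascending=False)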

We then have:

In [40]: tot[:10]
Out[40]: 
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

TextParser also has a get_chunk method, which lets you read pieces of an arbitrary size:

chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

chunker.get_chunk(10)
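
The chunked-reading pattern can be tried without the example file. The sketch below builds a small CSV in memory (the data and column names are invented), pulls an arbitrary-sized piece with get_chunk, and then iterates over the rest in chunks. As a variation on the running-total loop shown earlier, it collects the per-chunk counts in a list and combines them at the end with concat and groupby.

```python
import io

import pandas as pd

# Ten rows of made-up data whose keys cycle through A, B, C
csv_text = "key,value\n" + "".join(f"{'ABC'[i % 3]},{i}\n" for i in range(10))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Read an arbitrary number of rows first (rows 0 to 2)
first = reader.get_chunk(3)

# Iterate over what is left of the file, four rows at a time
pieces = [piece["key"].value_counts() for piece in reader]

# Combine the per-chunk counts into one total per key
counts = pd.concat(pieces).groupby(level=0).sum()
print(first)
print(counts)
```

Collecting the pieces and combining them once keeps the loop body trivial; the running-total form uses less memory when there are very many chunks.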

2. Writing Data to Text Format

Data can also be exported to a delimited text format. Let's look again at one of the CSV files read before:

In [41]: data = pd.read_csv('examples/ex5.csv')

In [42]: data
Out[42]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

Using DataFrame's to_csv method, we can write the data out to a comma-separated file:

In [43]: data.to_csv('examples/out.csv')

In [44]: !cat examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Other delimiters can be used, of course (here the output goes to sys.stdout, so the text result is simply printed to the console):

In [45]: import sys

In [46]: data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:

In [47]: data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo

With no other options specified, both the row and column labels are written. Both of these can be disabled:

In [48]: data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

You can also write only a subset of the columns, in an order of your choosing:

In [49]: data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0

Series also has a to_csv method:

In [50]: dates = pd.date_range('1/1/2000', periods=7)

In [51]: ts = pd.Series(np.arange(7), index=dates)

In [52]: ts.to_csv('examples/tseries.csv')

In [53]: !cat examples/tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
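
The writing options above can be checked with a round trip. This is a small sketch with invented data: it writes a DataFrame to an in-memory buffer instead of a file, using na_rep to mark missing values, and then reads the text back with read_csv, which treats NULL as missing by default.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, np.nan], "message": ["x", None]})

# Write to a text buffer rather than a file on disk
buf = io.StringIO()
df.to_csv(buf, index=False, na_rep="NULL")
text = buf.getvalue()
print(text)

# Read the same text back; "NULL" is one of the default NA sentinels
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip)
```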

3. Working with Delimited Formats

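Most forms of tabular data on disk can be loaded with functions like pandas.read_table. In some cases, however, some manual processing is necessary.

It is not uncommon to receive a file with one or more malformed lines that trip up read_table. To illustrate the basic tools, consider a small CSV file: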

In [54]: !cat examples/ex7.csv
"a","b","c"
"1","2","3"
"1","2","3"

For any file with a single-character delimiter, you can use Python's built-in csv module. To use it, pass any open file to csv.reader:

import csv
f = open('examples/ex7.csv')

reader = csv.reader(f)

Iterating through the reader yields a list of values for each line, with any quote characters removed:

In [56]: for line in reader:
   ....:     print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']

From here, it is up to you to do the wrangling needed to put the data into the form you need. Let's take this step by step. First, we read the file into a list of lines:

In [57]: with open('examples/ex7.csv') as f:
   ....:     lines = list(csv.reader(f))

Then, we split the lines into the header line and the data lines:

In [58]: header, values = lines[0], lines[1:]

Then we can create a dictionary of data columns using a dictionary comprehension and the expression zip(*values), which transposes rows to columns:

In [59]: data_dict = {h: v for h, v in zip(header, zip(*values))}

In [60]: data_dict
Out[60]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

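CSV files come in many different flavors. To define a new format (with a different delimiter, string quoting convention, or line terminator), define a simple subclass of csv.Dialect: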

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
reader = csv.reader(f, dialect=my_dialect)

Individual CSV dialect parameters can also be given as keyword arguments to csv.reader without having to define a subclass:

reader = csv.reader(f, delimiter='|')
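
The same dialect machinery also works for writing. The sketch below is an illustration rather than part of the original text: the class name PipeDialect and the rows are made up, and an in-memory buffer stands in for a file. It writes two rows with csv.writer and reads them back with csv.reader using the same dialect.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Hypothetical dialect: pipe-delimited, minimal quoting."""
    delimiter = "|"
    quotechar = '"'
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


buf = io.StringIO()
writer = csv.writer(buf, dialect=PipeDialect)
writer.writerow(["one", "two", "three"])
# The second field contains the delimiter, so the writer quotes it
writer.writerow(["1", "2|x", "3"])
written = buf.getvalue()
print(written)

# Reading with the same dialect recovers the original fields
buf.seek(0)
rows = list(csv.reader(buf, dialect=PipeDialect))
print(rows)
```

With QUOTE_MINIMAL, only fields that contain special characters such as the delimiter are quoted, which keeps the output readable while staying unambiguous.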

The possible options (attributes of csv.Dialect) and what they do are listed in Table 6-3.

[Table 6-3: CSV dialect options]

4. JSON Data

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data in HTTP requests.
JSON is a lightweight data interchange format. It is based on a subset of ECMAScript (the JavaScript specification published by Ecma International) and uses a text format that is completely independent of any programming language to store and represent data. Its concise, clear hierarchical structure makes it well suited to data exchange: it is easy for people to read and write, easy for machines to parse and generate, and efficient to send over a network.
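
As a brief illustration of how JSON maps onto Python objects, here is a minimal sketch using the standard library's json module. The JSON string and its field names are invented for the example; json.loads converts JSON text to Python objects, json.dumps converts back, and a list of dicts can be passed to the DataFrame constructor.

```python
import json

import pandas as pd

obj = """
{"name": "Wes",
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]}
"""

# JSON text -> Python objects (dict, list, str, int, None)
result = json.loads(obj)

# Python objects -> JSON text
as_json = json.dumps(result)

# A list of dicts converts naturally into a DataFrame
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])
print(siblings)
```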

5. XML and HTML: Web Scraping

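Python has many libraries for reading and writing data in the HTML and XML formats. Examples include lxml, Beautiful Soup, and html5lib. lxml is comparatively fast, while the other libraries cope better with malformed HTML or XML files.

pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML into DataFrame objects. You must install these libraries before you can use read_html: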

conda install lxml
pip install beautifulsoup4 html5lib

The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within <table> tags. The result is a list of DataFrame objects:

In [73]: tables = pd.read_html('examples/fdic_failed_bank_list.html')

In [74]: len(tables)
Out[74]: 1

In [75]: failures = tables[0]

In [76]: failures.head()
Out[76]: 
                      Bank Name             City  ST   CERT  \
0                   Allied Bank         Mulberry  AR     91   
1  The Woodbury Banking Company         Woodbury  GA  11297   
2        First CornerStone Bank  King of Prussia  PA  35312   
3            Trust Company Bank          Memphis  TN   9956   
4    North Milwaukee State Bank        Milwaukee  WI  20364   
                 Acquiring Institution        Closing Date       Updated Date  
0                         Today's Bank  September 23, 2016  November 17, 2016  
1                          United Bank     August 19, 2016  November 17, 2016  
2  First-Citizens Bank & Trust Company         May 6, 2016  September 6, 2016  
3           The Bank of Fayette County      April 29, 2016  September 6, 2016  
4  First-Citizens Bank & Trust Company      March 11, 2016      June 16, 2016

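Because failures has many columns, pandas inserts a line break character \.

From here, we can do some data cleaning and analysis (covered in more detail in later chapters), such as computing the number of bank failures by year: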

In [77]: close_timestamps = pd.to_datetime(failures['Closing Date'])

In [78]: close_timestamps.dt.year.value_counts()
Out[78]: 
2010    157
2009    140
2011     92
2012     51
2008     25
       ... 
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, Length: 15, dtype: int64

Parsing XML with lxml.objectify
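
No example is given under this heading. As a rough sketch of the general idea, the code below parses a small XML document and turns its records into a DataFrame. Note the assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in instead of lxml.objectify, and the XML content and tag names are invented for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical XML document with two records
xml_text = """<INDICATORS>
  <INDICATOR><NAME>on_time</NAME><YEAR>2011</YEAR><VALUE>96.9</VALUE></INDICATOR>
  <INDICATOR><NAME>on_time</NAME><YEAR>2012</YEAR><VALUE>95.0</VALUE></INDICATOR>
</INDICATORS>"""

root = ET.fromstring(xml_text)

# One dict per record: child tag name -> text content
records = [
    {child.tag: child.text for child in indicator}
    for indicator in root.findall("INDICATOR")
]

# Everything in XML is text, so convert numeric fields explicitly
perf = pd.DataFrame(records).astype({"YEAR": int, "VALUE": float})
print(perf)
```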

6.2 Binary Data Formats

One of the easiest ways to store data efficiently in a binary format is Python's built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format:

In [87]: frame = pd.read_csv('examples/ex1.csv')

In [88]: frame
Out[88]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [89]: frame.to_pickle('examples/frame_pickle')

You can read pickled data with the built-in pickle module directly, or more conveniently with pandas.read_pickle:

In [90]: pd.read_pickle('examples/frame_pickle')
Out[90]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

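Caution: pickle is recommended only as a short-term storage format. The format is not guaranteed to be stable over time; a file pickled today may not be readable after a library is upgraded.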
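
Both ways of reading a pickled object can be compared side by side. This is a minimal sketch with made-up data; it writes to a temporary directory (an assumption made so that the example cleans up after itself) and loads the file once with the pickle module and once with pandas.read_pickle.

```python
import os
import pickle
import tempfile

import pandas as pd

frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)

    # Option 1: the built-in pickle module
    with open(path, "rb") as fh:
        via_pickle = pickle.load(fh)

    # Option 2: the pandas convenience function
    via_pandas = pd.read_pickle(path)

print(via_pandas)
```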

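pandas has built-in support for two more binary data formats: HDF5 and MessagePack (MessagePack support was removed in later pandas releases). An HDF5 example follows below, but I encourage you to try different file formats to see how fast they are and how well they fit your analysis. Other available storage formats include bcolz and Feather.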

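1. Using HDF5 Format

HDF5 is a file format intended for storing large quantities of scientific array data. It has interfaces in many other languages. The "HDF" in HDF5 stands for hierarchical data format. Each HDF5 file can store multiple datasets and supporting metadata.

Metadata is "data about data". It is usually structured data (like data stored in a database, with defined field lengths, types, and so on). Metadata is structured data extracted from an information resource to describe its characteristics and content (for example title, version, publication data, related notes, and retrieval points), and it is used to organize, describe, retrieve, preserve, and manage information and knowledge resources.

HDF5 supports on-the-fly compression with a variety of compression modes, so data with repeated patterns can be stored more efficiently. HDF5 is a good choice for working with very large datasets, because it does not need to load all of the data into memory at once; you can efficiently read small sections of much larger arrays.
While HDF5 files can be accessed directly with either the PyTables or h5py libraries, pandas also provides a high-level interface. The HDFStore class works like a dict and handles the low-level details: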

In [92]: frame = pd.DataFrame({'a': np.random.randn(100)})

In [93]: store = pd.HDFStore('mydata.h5')

In [94]: store['obj1'] = frame

In [95]: store['obj1_col'] = frame['a']

In [96]: store
Out[96]: 
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
/obj1                frame        (shape->[100,1])
/obj1_col            series       (shape->[100])
/obj2                frame_table  (typ->appendable,nrows->100,ncols->1,indexers->[index])
/obj3                frame_table  (typ->appendable,nrows->100,ncols->1,indexers->[index])

Objects contained in the HDF5 file can then be retrieved with the same dict-like API:

In [97]: store['obj1']
Out[97]: 
           a
0  -0.204708
1   0.478943
2  -0.519439
3  -0.555730
4   1.965781
..       ...
95  0.795253
96  0.118110
97 -0.748532
98  0.584970
99  0.152677
[100 rows x 1 columns]

HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:

In [98]: store.put('obj2', frame, format='table')

In [99]: store.select('obj2', where=['index >= 10 and index <= 15'])
Out[99]: 
           a
10  1.007189
11 -1.296221
12  0.274992
13  0.228913
14  1.352917
15  0.886429

In [100]: store.close()

The pandas.read_hdf function gives you a shortcut to these tools:

In [101]: frame.to_hdf('mydata.h5', 'obj3', format='table')

In [102]: pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])
Out[102]: 
          a
0 -0.204708
1  0.478943
2 -0.519439
3 -0.555730
4  1.965781

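2. Reading Microsoft Excel Files

pandas also supports reading tabular data stored in Excel 2003 (and higher) files, using either the ExcelFile class or the pandas.read_excel function. These tools rely on the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. You can install them with pip or conda.

To use ExcelFile, create an instance by passing a path to an xls or xlsx file: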

In [7]:  xlsx = pd.ExcelFile(r'C:\Users\yujiawen\Desktop\新建 Microsoft Excel 工作表.xlsx')

In [8]: pd.read_excel(xlsx, 'Sheet1')
Out[8]:
         J       JL         JK  Unnamed: 3     固定     斜率      轉(zhuǎn)移電子
0  4.41204  4.78128  57.131401         NaN  28.15  11.55  2.437229

If you are reading multiple sheets from one file, it is faster to create the ExcelFile, but you can also simply pass the filename to pandas.read_excel:

In [9]: frame = pd.read_excel(r'C:\Users\yujiawen\Desktop\新建 Microsoft Excel 工作表.xlsx', 'Sheet1')

In [10]: frame
Out[10]:
         J       JL         JK  Unnamed: 3     固定     斜率      轉(zhuǎn)移電子
0  4.41204  4.78128  57.131401         NaN  28.15  11.55  2.437229

To write pandas data to Excel format, you must first create an ExcelWriter, then write data to it using the to_excel method of pandas objects:

In [108]: writer = pd.ExcelWriter('examples/ex2.xlsx')

In [109]: frame.to_excel(writer, 'Sheet1')

In [110]: writer.close()  # writer.save() in older pandas; save() was removed in pandas 2.0

You can also pass a file path to to_excel and avoid the ExcelWriter:

In [111]: frame.to_excel('examples/ex2.xlsx')

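6.3 Interacting with Web APIs

An API (Application Programming Interface) is a set of predefined functions that gives applications and developers access to a group of routines in some software or hardware, without having to read the source code or understand the details of its internal workings.
Many websites have public APIs that provide data as JSON or in some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package (http://docs.python-requests.org).
To find the latest 30 GitHub issues for pandas, we can make a GET HTTP request using the requests package: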

In [11]: import pandas as pd
    ...: import numpy as np
    ...:
In [12]: import requests

In [13]: url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [14]: resp = requests.get(url)

In [15]: resp
Out[15]: <Response [200]>

The Response object's json method returns the JSON content parsed into native Python objects (here, a list of dicts):

In [16]: data = resp.json()

In [17]:  data[0]['title']
Out[17]: 'API: Remove CalendarDay'
In [18]: data[0]
Out[18]:
{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/24330',
 'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
 'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/24330/labels{/name}',
 'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/24330/comments',
 'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/24330/events',
 'html_url': 'https://github.com/pandas-dev/pandas/pull/24330',
 'id': 392011696,
 'node_id': 'MDExOlB1bGxSZXF1ZXN0MjM5Mzc2MDk4',
 'number': 24330,
 'title': 'API: Remove CalendarDay',
 'user': {'login': 'mroeschke',
  'id': 10647082,
  'node_id': 'MDQ6VXNlcjEwNjQ3MDgy',
  'avatar_url': 'https://avatars0.githubusercontent.com/u/10647082?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/mroeschke',
  'html_url': 'https://github.com/mroeschke',
  'followers_url': 'https://api.github.com/users/mroeschke/followers',
  'following_url': 'https://api.github.com/users/mroeschke/following{/other_user}',
  'gists_url': 'https://api.github.com/users/mroeschke/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/mroeschke/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/mroeschke/subscriptions',
  'organizations_url': 'https://api.github.com/users/mroeschke/orgs',
  'repos_url': 'https://api.github.com/users/mroeschke/repos',
  'events_url': 'https://api.github.com/users/mroeschke/events{/privacy}',
  'received_events_url': 'https://api.github.com/users/mroeschke/received_events',
  'type': 'User',
  'site_admin': False},
 'labels': [],
 'state': 'open',
 'locked': False,
 'assignee': None,
 'assignees': [],
 'milestone': {'url': 'https://api.github.com/repos/pandas-dev/pandas/milestones/55',
  'html_url': 'https://github.com/pandas-dev/pandas/milestone/55',
  'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/milestones/55/labels',
  'id': 3228419,
  'node_id': 'MDk6TWlsZXN0b25lMzIyODQxOQ==',
  'number': 55,
  'title': '0.24.0',
  'description': '',
  'creator': {'login': 'jorisvandenbossche',
   'id': 1020496,
   'node_id': 'MDQ6VXNlcjEwMjA0OTY=',
   'avatar_url': 'https://avatars2.githubusercontent.com/u/1020496?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/jorisvandenbossche',
   'html_url': 'https://github.com/jorisvandenbossche',
   'followers_url': 'https://api.github.com/users/jorisvandenbossche/followers',
   'following_url': 'https://api.github.com/users/jorisvandenbossche/following{/other_user}',
   'gists_url': 'https://api.github.com/users/jorisvandenbossche/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/jorisvandenbossche/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/jorisvandenbossche/subscriptions',
   'organizations_url': 'https://api.github.com/users/jorisvandenbossche/orgs',
   'repos_url': 'https://api.github.com/users/jorisvandenbossche/repos',
   'events_url': 'https://api.github.com/users/jorisvandenbossche/events{/privacy}',
   'received_events_url': 'https://api.github.com/users/jorisvandenbossche/received_events',
   'type': 'User',
   'site_admin': False},
  'open_issues': 33,
  'closed_issues': 1611,
  'state': 'open',
  'created_at': '2018-03-29T12:00:12Z',
  'updated_at': '2018-12-18T06:22:35Z',
  'due_on': '2018-12-31T08:00:00Z',
  'closed_at': None},
 'comments': 3,
 'created_at': '2018-12-18T06:16:42Z',
 'updated_at': '2018-12-18T06:56:49Z',
 'closed_at': None,
 'author_association': 'MEMBER',
 'pull_request': {'url': 'https://api.github.com/repos/pandas-dev/pandas/pulls/24330',
  'html_url': 'https://github.com/pandas-dev/pandas/pull/24330',
  'diff_url': 'https://github.com/pandas-dev/pandas/pull/24330.diff',
  'patch_url': 'https://github.com/pandas-dev/pandas/pull/24330.patch'},
 'body': '- [x] tests added / passed\r\n- [x] passes `git diff upstream/master -u -- "*.py" | flake8 --diff`\r\n\r\nGiven that we want to ship 0.24.0 soon and that converting `\'D\'` and `Day` to always act as calendar day warrants more game-planning, this PR just simply removes `CalendarDay` and reverts `\'D\'` and `Day` to their prior behavior. '}

Each element in data is a dict containing the information found on a GitHub issue page. We can pass data directly to DataFrame and extract the fields of interest:

In [19]: issues = pd.DataFrame(data, columns=['number', 'title',
    ...:                                     'labels', 'state'])
    ...: issues
    ...:
Out[19]:
    number                                              title                                             labels state
0    24330                            API: Remove CalendarDay                                                 []  open
1    24329  BUG: Timestamp(Timestamp(Ambiguous time)) modi...  [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...  open
2    24328   Fixed PEP8 errors in doc/source/whatsnew/v0.15.*                                                 []  open
3    24327  ERR: Improve error message for cut with infini...  [{'id': 42670965, 'node_id': 'MDU6TGFiZWw0MjY3...  open
4    24326  Fixtures making IntNA tests difficult to run i...                                                 []  open
5    24325         pandas.Series docstrings dtype information  [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...  open
6    24324  REF/TST: Add more pytest idiom to resample/tes...                                                 []  open
7    24323  API: Remove 'codes' parameter from MultiIndex ...                                                 []  open
8    24322  DOC: Fixes flake8 issues in whatsnew v0.13.* #...  [{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...  open
9    24321  Doc on pandas.read_parquet says path is a stri...  [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...  open
10   24319     BUG: Ensure incomplete stata files are deleted                                                 []  open
11   24318  hash_pandas_object fails on empty dataframe wi...  [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...  open
12   24314        Numpy 'inf' values cause pandas.cut to fail  [{'id': 42670965, 'node_id': 'MDU6TGFiZWw0MjY3...  open
13   24312  Segmentation fault while loading csv into data...  [{'id': 31932467, 'node_id': 'MDU6TGFiZWwzMTkz...  open
14   24310       BUG: Series.__repr__ crashing with tzlocal()                                                 []  open
15   24305  DOC/API: PeriodIndex sub/add with Integers: up...  [{'id': 35818298, 'node_id': 'MDU6TGFiZWwzNTgx...  open
16   24303       DOC: Fix flake8 issues with whatsnew v0.18.*  [{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...  open
17   24302    Series.value_counts: Preserve original ordering                                                 []  open
18   24301                    EA ops alignment with DataFrame                                                 []  open
19   24295  Elegant way to generate indexable sliding wind...                                                 []  open
20   24294  BUG: Fix index bug due to parse_time_string GH...  [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...  open
21   24293  standardize signature for Index reductions, im...  [{'id': 35818298, 'node_id': 'MDU6TGFiZWwzNTgx...  open
22   24288  DOC: Fix docstrings with the sections in the w...  [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...  open
23   24282                 WIP: decorator for ops boilerplate  [{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...  open
24   24280  DOC: Fix docstrings with the sections in the w...  [{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...  open
25   24279                Added extra info section to EA repr  [{'id': 849023693, 'node_id': 'MDU6TGFiZWw4NDk...  open
26   24278  Add an "extra info" section to the base Extens...  [{'id': 849023693, 'node_id': 'MDU6TGFiZWw4NDk...  open
27   24275        GH24241 make Categorical.map transform nans  [{'id': 78527356, 'node_id': 'MDU6TGFiZWw3ODUy...  open
28   24274  BLD: for C extension builds on mac, target mac...  [{'id': 129350, 'node_id': 'MDU6TGFiZWwxMjkzNT...  open
29   24271  sort_index not sorting when multi-index made b...                                                 []  open

With a bit of elbow grease, you can build higher-level interfaces to common web APIs that return DataFrame objects directly for easy analysis.
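
One possible shape for such a higher-level interface is sketched below. The helper name issues_to_frame and the sample records are hypothetical; the sample only mimics the structure of the issue dicts shown above (each label is a dict with a "name" key) so that the function can be run without a network connection.

```python
import pandas as pd


def issues_to_frame(records, fields=("number", "title", "labels", "state")):
    """Hypothetical helper: turn a list of issue dicts into a tidy DataFrame."""
    frame = pd.DataFrame(records, columns=list(fields))
    # Replace each list of label dicts with a plain list of label names
    frame["labels"] = frame["labels"].apply(
        lambda labels: [label["name"] for label in labels]
    )
    return frame


# Made-up records shaped like the API response; with a live connection this
# could instead be issues_to_frame(requests.get(url).json())
sample = [
    {"number": 1, "title": "first", "labels": [{"name": "Bug"}], "state": "open", "id": 10},
    {"number": 2, "title": "second", "labels": [], "state": "open", "id": 11},
]

issues = issues_to_frame(sample)
print(issues)
```

Keeping the network call outside the helper makes the conversion step easy to test on saved or hand-written responses.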

6.4 Interacting with Databases

In a business setting, most data is not stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use. The choice of database usually depends on the performance, data integrity, and scalability needs of the application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, here we create a SQLite database using Python's built-in sqlite3 driver:

In [20]: import sqlite3
    ...: import pandas as pd
    ...:
    ...:

In [21]: query = """
    ...: CREATE TABLE test
    ...: (a VARCHAR(20), b VARCHAR(20),
    ...:  c REAL,        d INTEGER
    ...: );"""

In [22]: con = sqlite3.connect('mydata.sqlite')

In [23]: con.execute(query)
Out[23]: <sqlite3.Cursor at 0x1aa05e626c0>

In [24]: con.commit()

Then, insert a few rows of data:

In [25]: data = [('Atlanta', 'Georgia', 1.25, 6),
    ...:         ('Tallahassee', 'Florida', 2.6, 3),
    ...:         ('Sacramento', 'California', 1.7, 5)]
    ...:

In [26]: stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

In [27]: con.executemany(stmt, data)
Out[27]: <sqlite3.Cursor at 0x1aa02d1f340>

In [28]: con.commit()

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, and so on) return a list of tuples when selecting data from a table:

In [29]: cursor = con.execute('select * from test')

In [30]: rows = cursor.fetchall()

In [31]: rows
Out[31]:
[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

You can pass the list of tuples to the DataFrame constructor, but you also need the column names, which are contained in the cursor's description attribute:

In [32]: cursor.description
Out[32]:
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [33]: pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
Out[33]:
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5

This is quite a bit of munging that you would rather not repeat every time you query the database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the differences between SQL databases. pandas has a read_sql function that lets you read data from a SQLAlchemy connection. Here, we connect to the same SQLite database with SQLAlchemy and read data from the table created before:

In [135]: import sqlalchemy as sqla

In [136]: db = sqla.create_engine('sqlite:///mydata.sqlite')

In [137]: pd.read_sql('select * from test', db)
Out[137]: 
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
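
For quick experiments, read_sql also accepts a plain sqlite3 connection, so SQLAlchemy is not strictly required for SQLite. The sketch below makes a few assumptions for the sake of a self-contained example: it uses an in-memory database instead of mydata.sqlite, recreates the same test table, and adds a parameterized filter that is not part of the original walkthrough.

```python
import sqlite3

import pandas as pd

# In-memory database, so nothing is written to disk
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
)
rows = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
con.commit()

# The query result, including column names, arrives as a DataFrame
frame = pd.read_sql(
    "SELECT * FROM test WHERE d >= ? ORDER BY d DESC", con, params=(5,)
)
con.close()
print(frame)
```

Passing values through params rather than formatting them into the SQL string lets the driver handle quoting.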
