Overview

This series collects my study notes on Python for Data Analysis. This post covers the section Introduction > 1.usa.gov data from bit.ly.

1.usa.gov data from bit.ly

Note: this section uses the data sets that accompany Python for Data Analysis, which can be downloaded from the pydata-book repository on GitHub. I ran the experiments in Spyder; remember to set the working directory to your local pydata-book folder.

The following code loads the data so we can look at its format:

# Path to the data set
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
# Show the first line of the data set
open(path).readline()
# This should print one long string in JSON format

The following code uses Python's built-in json module to convert each JSON string into a Python dict:

import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]  # one dict per line
records[0]  # inspect the first record
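To try the line-by-line parsing without the data file, here is a minimal sketch. The two JSON lines are made up for illustration and only mimic the shape of the real records; the names sample and sample_records are hypothetical.

```python
import io
import json

# Two made-up lines standing in for the real data file (hypothetical sample).
sample = io.StringIO(
    '{"a": "Mozilla/5.0", "tz": "America/New_York"}\n'
    '{"a": "GoogleMaps/RochesterNY", "tz": ""}\n'
)

# Each line is a complete JSON document, so each line is parsed on its own.
sample_records = [json.loads(line) for line in sample]

print(len(sample_records))       # 2
print(sample_records[0]["tz"])   # America/New_York
print(type(sample_records[0]))   # <class 'dict'>
```

The same list comprehension works on a real file object, because iterating over a file also yields one line at a time.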

Counting time zones (pure Python)

This part counts the values of the time zone field (the tz field) in the data set above.

# Assumes records has already been loaded as a list of dicts (see the previous step)

# The following line raises a KeyError, because not every record has a 'tz' field
# time_zones = [rec['tz'] for rec in records]

time_zones = [rec['tz'] for rec in records if 'tz' in rec]
# Inspect the first ten time zone values
time_zones[:10]
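The difference between the failing and the working comprehension can be seen on a tiny example. This is only a sketch; the three records below are invented and the names sample_records and sample_zones are hypothetical.

```python
# Invented records: one normal, one without 'tz', one with an empty 'tz'.
sample_records = [
    {"tz": "America/New_York"},
    {"a": "Mozilla/5.0"},   # no 'tz' key at all
    {"tz": ""},             # key present, value empty
]

# Indexing every record directly fails on the record that lacks 'tz'.
try:
    [rec["tz"] for rec in sample_records]
    failed = False
except KeyError:
    failed = True

# The 'in' test skips records without the key but keeps empty strings.
sample_zones = [rec["tz"] for rec in sample_records if "tz" in rec]

print(failed)        # True
print(sample_zones)  # ['America/New_York', '']
```

Note that the filter only removes records where the key is absent. Empty strings survive, which is why they show up later as their own category.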

Next, count the entries in time_zones with pure Python. The idea is to loop over the sequence and keep a running tally for each value in a dict:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

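Alternatively, use defaultdict from the standard library's collections module:

from collections import defaultdict

def get_counts(sequence):
    counts = defaultdict(int)  # a missing key starts out at int(), i.e. 0
    for x in sequence:
        counts[x] += 1
    return counts

Note: see The Python Standard Library documentation for the details of defaultdict. In short, the first time a key is accessed, it is initialized with the default value produced by the given factory (0 for int).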
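A short sketch of that default-value behaviour, using made-up keys. The name counts_demo is hypothetical.

```python
from collections import defaultdict

counts_demo = defaultdict(int)

# The key is missing on first use, so int() supplies 0 before the += 1.
counts_demo["Asia/Tokyo"] += 1
counts_demo["Asia/Tokyo"] += 1

print(counts_demo["Asia/Tokyo"])  # 2
print(dict(counts_demo))          # {'Asia/Tokyo': 2}

# Caveat: merely reading a missing key also inserts it with the default.
_ = counts_demo["Europe/Paris"]
print("Europe/Paris" in counts_demo)  # True
```

The last two lines are worth keeping in mind: looking up a key that was never counted silently adds it with a count of 0.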

Then test it:

counts = get_counts(time_zones)
counts['America/New_York']  # check the count for the New York time zone

To get the ten most common time zones together with their counts, use the following code:

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

top_counts(counts)
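top_counts returns (count, tz) pairs in ascending order, so the largest count comes last. If a descending list of (tz, count) pairs is more convenient, one possible variant is sketched below. The helper name top_counts_desc and the sample dict are hypothetical, not part of the book's code.

```python
def top_counts_desc(count_dict, n=10):
    # Sort (tz, count) pairs by count, largest first, and keep the first n.
    pairs = sorted(count_dict.items(), key=lambda kv: kv[1], reverse=True)
    return pairs[:n]

# Made-up counts for illustration only.
demo_counts = {"America/New_York": 3, "": 2, "Asia/Tokyo": 1}

print(top_counts_desc(demo_counts, n=2))
# [('America/New_York', 3), ('', 2)]
```

Sorting with a key function avoids building the intermediate list of swapped tuples, at the cost of ties no longer being ordered by time zone name.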

Of course, you can also use collections.Counter from the standard library for the same task:

from collections import Counter
counts = Counter(time_zones)  # count
counts.most_common(10)  # ten most common

Note: see The Python Standard Library documentation for more details on Counter.
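As a quick illustration of Counter on a small, made-up list (the names zones_demo and counter_demo are hypothetical):

```python
from collections import Counter

# Invented time zone values; the empty string is a valid, countable value.
zones_demo = ["America/New_York", "", "America/New_York",
              "Asia/Tokyo", "", "America/New_York"]

counter_demo = Counter(zones_demo)

print(counter_demo.most_common(2))   # [('America/New_York', 3), ('', 2)]

# Unlike defaultdict, looking up a missing key returns 0 without inserting it.
print(counter_demo["Europe/Paris"])        # 0
print("Europe/Paris" in counter_demo)      # False
```

most_common already returns the pairs in descending order of count, so no separate sorting step is needed.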

Counting time zones with pandas

Counting the time zones takes only the following code:

from pandas import DataFrame, Series
import pandas as pd
import numpy as np

frame = DataFrame(records)
print(frame)  # inspect frame; the long output is called the summary view

# frame['tz'] returns a Series object
# The value_counts method of a Series counts its values
tz_counts = frame['tz'].value_counts()

Now let's turn this data into a plot:

clean_tz = frame['tz'].fillna('Missing')  # replace missing values with `Missing`
clean_tz[clean_tz == ''] = 'Unknown'  # replace empty strings with `Unknown`

tz_counts = clean_tz.value_counts()
print(tz_counts[:10])  # inspect the ten most common values
tz_counts[:10].plot(kind='barh', rot=0)  # plot the top ten as a horizontal bar chart

[Figure: tz_counts.png, horizontal bar chart of the ten most common time zones]
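The cleaning steps before the plot can be checked on a small Series. This is a sketch with invented values, assuming pandas is installed; the names tz_demo, clean_demo and demo_counts are hypothetical.

```python
import pandas as pd

# Invented values: one missing entry (None) and two empty strings.
tz_demo = pd.Series(["America/New_York", None, "", "America/New_York", ""])

clean_demo = tz_demo.fillna("Missing")      # None/NaN becomes 'Missing'
clean_demo[clean_demo == ""] = "Unknown"    # empty strings become 'Unknown'

demo_counts = clean_demo.value_counts()

print(demo_counts["America/New_York"])  # 2
print(demo_counts["Unknown"])           # 2
print(demo_counts["Missing"])           # 1
```

Without these two steps, value_counts would drop the missing entry by default and list the empty string under a blank label, which is hard to read on a chart.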

Note: a small settings change in Spyder avoids having to import NumPy and Matplotlib by hand every time. Go to Spyder > Tools > Preferences > IPython console > Graphics and, under Support for graphics, tick Automatically load Pylab and NumPy modules.

To get more familiar with pandas, let's look at the field a, which holds information about the browser, device, or application that performed the URL shortening:

frame['a'][1]
# 'GoogleMaps/RochesterNY'
frame['a'][59]
# 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B176'

Now let's break the time zone counts down into Windows and non-Windows users. For simplicity, we treat a user as a Windows user if field a contains "Windows", and as a non-Windows user otherwise.

cframe = frame[frame.a.notnull()]  # drop rows where field `a` is missing
# Label each row as Windows or Not Windows based on field `a`
operating_system = np.where(cframe['a'].str.contains('Windows'),
                            'Windows', 'Not Windows')
# Group by time zone and operating system
by_tz_os = cframe.groupby(['tz', operating_system])
# size counts the rows in each group
# unstack reshapes the counts into a table, one column per operating system
agg_counts = by_tz_os.size().unstack().fillna(0)
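To see what the group, size, unstack, fillna chain produces, here is a sketch on a four-row frame. The rows are invented and the names df_demo, os_demo and table_demo are hypothetical; it assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Invented rows: New York has one user of each kind, Tokyo has Windows only.
df_demo = pd.DataFrame({
    "a": ["Mozilla/5.0 (Windows NT 6.1)", "Mozilla/5.0 (Macintosh)",
          "Mozilla/5.0 (Windows NT 5.1)", "Opera/9.80 (Windows NT 6.1)"],
    "tz": ["America/New_York", "America/New_York", "Asia/Tokyo", "Asia/Tokyo"],
})

# One label per row, aligned with df_demo by position.
os_demo = np.where(df_demo["a"].str.contains("Windows"),
                   "Windows", "Not Windows")

# size() counts rows per (tz, label) pair; unstack() moves the labels into
# columns; fillna(0) fills combinations that never occurred.
table_demo = df_demo.groupby(["tz", os_demo]).size().unstack().fillna(0)

print(table_demo)
```

The Tokyo row has no non-Windows users, so unstack leaves a missing value there and fillna(0) turns it into a count of zero.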

Finally, let's select the most common time zones:

# Row positions that sort the total counts in ascending order
indexer = agg_counts.sum(axis=1).argsort()
count_subset = agg_counts.take(indexer)[-10:]
# Draw a stacked bar chart
count_subset.plot(kind='barh', stacked=True)

# Draw the corresponding proportions chart
normed_subset = count_subset.div(count_subset.sum(axis=1), axis=0)
normed_subset.plot(kind='barh', stacked=True)
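The selection and normalization steps can be followed by hand on a three-row table. This is a sketch with invented counts; the names counts_demo, order, top2 and shares are hypothetical, and the argsort is taken on the underlying NumPy array rather than on the Series.

```python
import pandas as pd

# Invented counts per time zone and operating system.
counts_demo = pd.DataFrame(
    {"Not Windows": [3.0, 1.0, 0.0], "Windows": [1.0, 1.0, 5.0]},
    index=["America/New_York", "Asia/Tokyo", "Europe/London"],
)

totals = counts_demo.sum(axis=1)        # row totals: 4, 2, 5
order = totals.to_numpy().argsort()     # positions in ascending order: 1, 0, 2
top2 = counts_demo.take(order)[-2:]     # the two rows with the largest totals

# Divide each row by its own total so that every row sums to 1.
shares = top2.div(top2.sum(axis=1), axis=0)

print(list(top2.index))  # ['America/New_York', 'Europe/London']
print(shares)
```

Because the rows are sorted in ascending order, the most common time zone ends up last, which places it at the top of a horizontal bar chart.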
