1. Extract the time zones from a file and count them
There are three ways to write this. pandas is the usual choice, but collections gets the job done quickly too.
1.1 Pure Python: extract and count the time zone information
1.2 Pure Python, shortened with collections.Counter
1.3 Process with pandas and plot with matplotlib.pyplot
1.1 Pure Python: extract and count the time zone information
- Extract the time zone information from the file into a list
- Count how many times each time zone appears
- Sort and print the n most frequent time zones.
# Uses Python 3.6
import json

# Extract the time zones from the file (one JSON record per line)
path = 'usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
# Some records have no 'tz' field, so keep only those that do
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

# Count how many times each time zone appears
def get_counts(sequence):
    counts = dict()
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1
    return counts

counts = get_counts(time_zones)

# Return the n most frequent time zones with their counts, in ascending order
def top_counts(count_dict, n):
    n = int(n)
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

print(top_counts(counts, 3))
#output
[(400, 'America/Chicago'), (521, ''), (1251, 'America/New_York')]
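Note that top_counts returns (count, tz) pairs in ascending order, so the most frequent time zone comes last. If you would rather see the largest count first, one possible variation is sketched below. The name top_counts_desc and the sample list are made up for illustration and are not from the book or the bitly file.

```python
def get_counts(sequence):
    """Count how many times each item appears in sequence."""
    counts = {}
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1
    return counts


def top_counts_desc(count_dict, n=3):
    """Return the n most frequent (key, count) pairs, largest count first."""
    # Sort by the count (second element of each pair), descending
    pairs = sorted(count_dict.items(), key=lambda kv: kv[1], reverse=True)
    return pairs[:n]


# Hypothetical sample data, not taken from the bitly file
sample_zones = ['America/New_York', '', 'America/New_York',
                'Asia/Tokyo', '', 'America/New_York']
sample_top = top_counts_desc(get_counts(sample_zones), 2)
print(sample_top)
```

Sorting on the count alone also avoids comparing the time zone strings, which the tuple sort in top_counts does whenever two counts are equal.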
1.2 Pure Python, shortened with collections.Counter
collections.Counter does the counting in a single call, which is very convenient.
import json
from collections import Counter
# extract the timezones from the file
path = 'usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
# Count how many times each time zone appears
counts = Counter(time_zones)
# Print the 3 most frequent time zones with their counts
print(counts.most_common(3))
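To try Counter without the data file, here is a small sketch on a hand-written list (the sample values are invented for illustration). Unlike top_counts in 1.1, most_common returns (element, count) pairs with the largest count first.

```python
from collections import Counter

# Hypothetical sample data, not taken from the bitly file
sample_zones = ['America/New_York', '', 'America/New_York',
                'Asia/Tokyo', '', 'America/New_York']

sample_counts = Counter(sample_zones)

# The two most frequent entries, largest count first
sample_common = sample_counts.most_common(2)
print(sample_common)

# A key that never appeared has a count of 0 instead of raising KeyError
print(sample_counts['Europe/London'])
```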
1.3 Process with pandas and plot with matplotlib.pyplot
# Input, uses python 3.6
import json
import pandas as pd
import matplotlib.pyplot as plt
path = 'usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
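# Count how many times each time zone appears
frame = pd.DataFrame(records)
# Records with no 'tz' field become NaN; label them 'Missing'
clean_tz = frame['tz'].fillna('Missing')
# Records whose 'tz' is an empty string are labeled 'Unknown'
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
print(tz_counts[:10])
# Plot the top 10 as a horizontal bar chart and show it
tz_counts[:10].plot(kind='barh', rot=0)
plt.show()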
# Output
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64

[Figure: pandas-timezone.png, horizontal bar chart of the 10 most frequent time zones]
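Takeaways:
- To pull information into a list, use a list comprehension: [ ] with a simple loop and condition inside.
- Write reusable code as functions so it is easy to call.
- If you have not used collections before, see my summary: How to use the Python 3 collections module/library, Container datatypes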
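As a quick illustration of the first takeaway, the sketch below builds the same kind of list twice: once with a list comprehension and once with an explicit loop. The three sample lines are invented for illustration and only mimic the one-JSON-object-per-line layout of the data file.

```python
import json

# Hypothetical records, one JSON object per line like the bitly file
sample_lines = [
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
    '{"a": "GoogleMaps/RochesterNY"}',
    '{"tz": "", "a": "Mozilla/4.0"}',
]
sample_records = [json.loads(line) for line in sample_lines]

# List comprehension: loop and condition inside the brackets
sample_tz = [rec['tz'] for rec in sample_records if 'tz' in rec]

# The equivalent explicit loop
sample_tz_loop = []
for rec in sample_records:
    if 'tz' in rec:
        sample_tz_loop.append(rec['tz'])

print(sample_tz)
```

The record without a 'tz' field is skipped, while the empty string is kept, which is why an empty time zone shows up in the counts in 1.1.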
References:
Python for Data Analysis, Wes McKinney
The example code is on GitHub.
https://github.com/wesm/pydata-book
You can download a zip archive to read locally, or get it with git clone.
pydata-book-2nd-edition.zip