1. Load the data
Read the three files below and print the first row of each:
enrollments.csv
daily_engagement.csv
project_submissions.csv
import unicodecsv

def read_csv(filename):
    # unicodecsv expects the file to be opened in binary mode (Python 2)
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)

# File names use underscores, matching the list above
enrollments = read_csv('enrollments.csv')
daily_engagement = read_csv('daily_engagement.csv')
project_submissions = read_csv('project_submissions.csv')

print enrollments[0]
print daily_engagement[0]
print project_submissions[0]
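The code in these notes is Python 2 (print statements, unicodecsv). As a rough sketch of the same loading step on Python 3, assuming the standard csv module is acceptable in place of unicodecsv, DictReader can be given a text-mode file directly. The in-memory sample rows and the helper name read_csv_text are made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file on disk.
SAMPLE = (
    "account_key,status,join_date\n"
    "448,canceled,2014-11-10\n"
    "700,current,2015-01-05\n"
)

def read_csv_text(text_file):
    """Return every row of an open text-mode CSV file as a dict."""
    return list(csv.DictReader(text_file))

rows = read_csv_text(io.StringIO(SAMPLE))
print(rows[0]["account_key"])  # every value is still a string at this point
```

With a real file, the call would be wrapped in `with open(filename, newline='') as f:` instead of `io.StringIO`.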
2. Fix the data types
Function that converts a string to a datetime
from datetime import datetime as dt

def parse_date(date):
    # An empty string means the value is missing
    if date == '':
        return None
    else:
        return dt.strptime(date, '%Y-%m-%d')

Function that converts a string to an integer
def parse_maybe_int(i):
    # An empty string means the value is missing
    if i == '':
        return None
    else:
        return int(i)
Convert the data types in enrollments
for enrollment in enrollments:
    enrollment['join_date'] = parse_date(enrollment['join_date'])
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
    enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])
    enrollment['is_udacity'] = enrollment['is_udacity'] == 'True'
    enrollment['is_canceled'] = enrollment['is_canceled'] == 'True'

Convert the data types in daily_engagement
for engagement_record in daily_engagement:
    engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])
    engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))
    engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
    engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))
    engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))

Convert the data types in project_submissions
for project_submission in project_submissions:
    project_submission['creation_date'] = parse_date(project_submission['creation_date'])
    project_submission['completion_date'] = parse_date(project_submission['completion_date'])
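A small sketch of what the conversions above do to a single row. The row below is invented for illustration, and the two helpers are condensed copies of parse_date and parse_maybe_int. The int(float(...)) pattern is presumably there because the engagement counts are stored as strings such as '1.0', which int() alone rejects.

```python
from datetime import datetime as dt

def parse_date(date):
    return None if date == '' else dt.strptime(date, '%Y-%m-%d')

def parse_maybe_int(i):
    return None if i == '' else int(i)

# Hypothetical raw row, shaped like one entry of enrollments
raw = {
    'join_date': '2014-11-10',
    'cancel_date': '',
    'days_to_cancel': '',
    'is_canceled': 'False',
}

clean = {
    'join_date': parse_date(raw['join_date']),
    'cancel_date': parse_date(raw['cancel_date']),
    'days_to_cancel': parse_maybe_int(raw['days_to_cancel']),
    # Comparing with 'True' turns the text flag into a real bool
    'is_canceled': raw['is_canceled'] == 'True',
}

# int('1.0') raises ValueError, so a count written as '1.0' goes through float first
lessons = int(float('1.0'))
```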
3. Count the rows and the unique students in each CSV
For each of the three files loaded above, find the total number of rows and the number of unique students.
Total rows in each file
enrollment_num_rows = len(enrollments)          # 1640
engagement_num_rows = len(daily_engagement)     # 136240
submission_num_rows = len(project_submissions)  # 3642

Unique students in enrollments
unique_enrolled_students = set()
for enrollment in enrollments:
    unique_enrolled_students.add(enrollment['account_key'])
enrollment_num_unique_students = len(unique_enrolled_students)

Unique students in daily_engagement (in this table the key column is named 'acct')
unique_engaged_students = set()
for engagement_record in daily_engagement:
    unique_engaged_students.add(engagement_record['acct'])
engagement_num_unique_students = len(unique_engaged_students)

Unique students in project_submissions
unique_submission_students = set()
for project_submission in project_submissions:
    unique_submission_students.add(project_submission['account_key'])
submission_num_unique_students = len(unique_submission_students)
4. Problems in the data
- more unique students in the enrollment table than in the engagement table
- the account column is named account_key in two tables and acct in the third
fix: rename the acct column to account_key in the daily_engagement table
for engagement_record in daily_engagement:
    engagement_record['account_key'] = engagement_record['acct']
    del engagement_record['acct']
5. Write a function that finds the unique students in each of the three CSV files
def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students

unique_enrolled_students = get_unique_students(enrollments)
unique_engaged_students = get_unique_students(daily_engagement)
unique_project_submissions = get_unique_students(project_submissions)
print len(unique_enrolled_students)    # 1302
print len(unique_engaged_students)     # 1237
print len(unique_project_submissions)  # 743
6. Missing engagement records
Investigate the first problem:
why are students missing from daily_engagement?
1. identify surprising data points
-- any enrollment record with no corresponding engagement data
2. print out one or a few surprising data points
for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_engaged_students:
        print enrollment
        break
Conclusion: in the printed record, join_date equals cancel_date and days_to_cancel is 0
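Because the unique students are already held in sets, the accounts that appear in enrollments but not in the engagement data can also be listed with a set difference. This is only a sketch; the account keys below are invented stand-ins for the real tables.

```python
# Hypothetical account keys standing in for the real tables
unique_enrolled_students = {'1', '2', '3', '4'}
unique_engaged_students = {'1', '2', '4'}

# Accounts that enrolled but have no engagement rows at all
missing_from_engagement = unique_enrolled_students - unique_engaged_students
print(sorted(missing_from_engagement))
```

The loop above is still useful, since it prints the full enrollment record rather than only the account key.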
7. Checking for more problem records
Investigating data problems
1. identify surprising data points
2. print out one or a few surprising data points
3. fix any problems you find
-- more investigation may be necessary
-- or there might not be a problem
Above, we found that some students cancel within a day of enrolling. That is not a data problem; it explains why those students have no rows in the engagement table. In later analysis we may want to exclude such students, or at least keep them in mind so that the code handles this edge case.
Count the enrollment records of students who are missing from the engagement table even though they did not cancel on the day they joined
num_problem_students = 0
for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_engaged_students and enrollment['days_to_cancel'] != 0:
        num_problem_students += 1
        print enrollment
print num_problem_students  # 3
Printing these records shows that all three belong to Udacity test accounts, which are not guaranteed to appear in daily_engagement. That answers the question.
8. Exclude the Udacity test accounts
Find the test accounts in enrollments
udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])
len(udacity_test_accounts)  # 6

Function that removes every record belonging to a test account
def remove_udacity_accounts(data):
    non_udacity_accounts = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_accounts.append(data_point)
    return non_udacity_accounts

Call the function on all three tables and check how many records remain in each
non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagement = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)
print len(non_udacity_enrollments)  # 1622
print len(non_udacity_engagement)   # 135656
print len(non_udacity_submissions)  # 3634
9. Refining the question
only look at engagement from the first week, and exclude students who cancel within a week
create a dictionary of students who either:
- haven't canceled yet (days_to_cancel is None)
- stayed enrolled more than 7 days (days_to_cancel > 7)
key: account_key, value: enrollment date
name: paid_students
paid_students = {}
for enrollment in non_udacity_enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        paid_students[account_key] = enrollment_date
len(paid_students)  # 995
Because a student can enroll more than once, the enrollment_date saved above is effectively arbitrary: it is whichever qualifying record happens to come last in the file. We want the most recent enrollment date instead, so the loop is changed as follows
paid_students = {}
for enrollment in non_udacity_enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        # Keep only the latest join date seen for this account
        if account_key not in paid_students or enrollment_date > paid_students[account_key]:
            paid_students[account_key] = enrollment_date
len(paid_students)
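To see what the extra condition changes, here is a minimal sketch on three invented enrollment records. The function name latest_paid_join_dates and the sample rows are assumptions made for this example; the filtering logic mirrors the loop above.

```python
from datetime import datetime

# Invented records: account A enrolled twice, account B canceled during the trial
enrollments_sample = [
    {'account_key': 'A', 'join_date': datetime(2015, 3, 1),
     'is_canceled': False, 'days_to_cancel': None},
    {'account_key': 'A', 'join_date': datetime(2015, 1, 1),
     'is_canceled': True, 'days_to_cancel': 30},
    {'account_key': 'B', 'join_date': datetime(2015, 2, 1),
     'is_canceled': True, 'days_to_cancel': 3},
]

def latest_paid_join_dates(enrollments):
    """Map each paid account to its most recent join date."""
    paid = {}
    for enrollment in enrollments:
        if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
            key = enrollment['account_key']
            join_date = enrollment['join_date']
            if key not in paid or join_date > paid[key]:
                paid[key] = join_date
    return paid

paid_students = latest_paid_join_dates(enrollments_sample)
```

Without the date comparison, account A would end up with the January date simply because that record comes later in the list.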
10. Getting the first-week data
For the students in paid_students, keep the engagement records whose utc_date is less than one week after the enrollment date
Store them in the list paid_engagement_in_first_week
Function that checks whether two dates are less than one week apart
def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7

Remove students who canceled during the free trial
def remove_free_trial_cancels(data):
    new_data = []
    for data_point in data:
        if data_point['account_key'] in paid_students:
            new_data.append(data_point)
    return new_data

Call the function on the three non-Udacity data sets to get the paid enrollments, paid engagement and paid submissions
paid_enrollments = remove_free_trial_cancels(non_udacity_enrollments)
paid_engagement = remove_free_trial_cancels(non_udacity_engagement)
paid_submissions = remove_free_trial_cancels(non_udacity_submissions)
print len(paid_enrollments)  # 1293
print len(paid_engagement)   # 134549
print len(paid_submissions)  # 3618

Collect the paid engagement records from the first week
paid_engagement_in_first_week = []
for engagement_record in paid_engagement:
    account_key = engagement_record['account_key']
    join_date = paid_students[account_key]
    engagement_record_date = engagement_record['utc_date']
    if within_one_week(join_date, engagement_record_date):
        paid_engagement_in_first_week.append(engagement_record)
len(paid_engagement_in_first_week)  # 21580
11. Exploring student engagement
Explore the average time students spent in the classroom during their first week
1. Group the engagement records so that each group holds all the records of one student
from collections import defaultdict

# Looking up a missing key in a defaultdict(list) yields an empty list
engagement_by_account = defaultdict(list)
for engagement_record in paid_engagement_in_first_week:
    account_key = engagement_record['account_key']
    engagement_by_account[account_key].append(engagement_record)

2. Sum the engagement minutes of each student
total_minutes_by_account = {}
for account_key, engagement_for_student in engagement_by_account.items():
    total_minutes = 0
    for engagement_record in engagement_for_student:
        total_minutes += engagement_record['total_minutes_visited']
    total_minutes_by_account[account_key] = total_minutes

3. Compute the mean and other summary statistics of those totals
import numpy as np
total_minutes = total_minutes_by_account.values()
print 'mean:', np.mean(total_minutes)               # 647.590173826
print 'standard deviation:', np.std(total_minutes)  # 1129.27121042
print 'Maximum:', np.max(total_minutes)             # 10568.1008673
print 'Minimum:', np.min(total_minutes)             # 0.0
12. Debugging the analysis code
The maximum is larger than the number of minutes in a week (7 x 24 x 60 = 10080)
The student with the most minutes has anomalous data, so first find that student
student_with_max_minutes = None
max_minutes = 0
for student, total_minutes in total_minutes_by_account.items():
    if total_minutes > max_minutes:
        student_with_max_minutes = student
        max_minutes = total_minutes
max_minutes

Print every engagement record of this student
for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] == student_with_max_minutes:
        print engagement_record

This prints more than 7 records, and the dates do not all fall within one week
So the within_one_week function is at fault: engagement_date must also be on or after join_date
def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7 and time_delta.days >= 0

After fixing within_one_week, the maximum becomes 3564, which is plausible
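The effect of the missing lower bound can be reproduced with a few dates. In this sketch the dates are invented, and the name within_one_week_buggy is used only to keep the first version next to the corrected one.

```python
from datetime import datetime

def within_one_week_buggy(join_date, engagement_date):
    # First version: no lower bound, so negative differences pass
    time_delta = engagement_date - join_date
    return time_delta.days < 7

def within_one_week(join_date, engagement_date):
    # Corrected version: the engagement must not precede the join date
    time_delta = engagement_date - join_date
    return 0 <= time_delta.days < 7

join = datetime(2015, 3, 10)       # hypothetical most recent enrollment
earlier = datetime(2015, 1, 5)     # engagement from a previous enrollment
same_week = datetime(2015, 3, 12)  # two days after joining
week_later = datetime(2015, 3, 17) # exactly seven days after joining

print(within_one_week_buggy(join, earlier), within_one_week(join, earlier))
```

Since paid_students keeps the most recent join date, every record from an earlier enrollment has a negative difference, which is how the first version let in far more than a week of data.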
13. Lessons completed in the first week
total_lessons_by_account = {}
for account_key, engagement_for_student in engagement_by_account.items():
    total_lessons = 0
    for engagement_record in engagement_for_student:
        # The field in daily_engagement is 'lessons_completed'
        total_lessons += engagement_record['lessons_completed']
    total_lessons_by_account[account_key] = total_lessons

total_lessons = total_lessons_by_account.values()
print 'mean:', np.mean(total_lessons)
print 'standard deviation:', np.std(total_lessons)
print 'Maximum:', np.max(total_lessons)
print 'Minimum:', np.min(total_lessons)
Refactor these steps into functions
1. Function that groups records by account
from collections import defaultdict

def group_data(data, key_name):
    grouped_data = defaultdict(list)
    for data_point in data:
        key = data_point[key_name]
        grouped_data[key].append(data_point)
    return grouped_data

engagement_by_account = group_data(paid_engagement_in_first_week, 'account_key')

2. Function that sums one field over the records of each account
def sum_grouped_items(grouped_data, field_name):
    summed_data = {}
    for key, data_points in grouped_data.items():
        total = 0
        for data_point in data_points:
            total += data_point[field_name]
        summed_data[key] = total
    return summed_data

total_minutes_by_account = sum_grouped_items(engagement_by_account, 'total_minutes_visited')

3. Function that prints the summary statistics
def describe_data(data):
    print 'mean:', np.mean(data)
    print 'standard deviation:', np.std(data)
    print 'Maximum:', np.max(data)
    print 'Minimum:', np.min(data)

describe_data(total_minutes_by_account.values())

Lessons completed in the first week
total_lessons_by_account = sum_grouped_items(engagement_by_account, 'lessons_completed')
describe_data(total_lessons_by_account.values())
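A quick way to check the two helpers is to run them on a handful of records whose totals are easy to verify by hand. The records below are invented, and sum_grouped_items is written in a condensed form that should behave the same as the version above.

```python
from collections import defaultdict

def group_data(data, key_name):
    grouped_data = defaultdict(list)
    for data_point in data:
        grouped_data[data_point[key_name]].append(data_point)
    return grouped_data

def sum_grouped_items(grouped_data, field_name):
    # One total per key, summing the chosen field over that key's records
    return {key: sum(point[field_name] for point in points)
            for key, points in grouped_data.items()}

# Invented engagement records for two accounts
records = [
    {'account_key': 'A', 'total_minutes_visited': 30.0, 'lessons_completed': 1},
    {'account_key': 'B', 'total_minutes_visited': 5.0, 'lessons_completed': 0},
    {'account_key': 'A', 'total_minutes_visited': 12.5, 'lessons_completed': 2},
]

by_account = group_data(records, 'account_key')
minutes = sum_grouped_items(by_account, 'total_minutes_visited')
lessons = sum_grouped_items(by_account, 'lessons_completed')
```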
14. Number of days each student visited the classroom
Add a has_visited field to the data
for engagement_record in paid_engagement:
    if engagement_record['num_courses_visited'] > 0:
        engagement_record['has_visited'] = 1
    else:
        engagement_record['has_visited'] = 0

Total number of days visited per student
days_visited_by_account = sum_grouped_items(engagement_by_account, 'has_visited')
describe_data(days_visited_by_account.values())
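The loop above writes has_visited into the records of paid_engagement, while the sum is taken over engagement_by_account. This works because the filtered lists and groups hold references to the same dict objects rather than copies. A minimal sketch with invented records:

```python
# Invented records; first_week is a new list that holds the same dict objects
paid_engagement = [
    {'account_key': 'A', 'num_courses_visited': 2},
    {'account_key': 'A', 'num_courses_visited': 0},
]
first_week = [record for record in paid_engagement]

# Adding the field through one list...
for record in paid_engagement:
    record['has_visited'] = 1 if record['num_courses_visited'] > 0 else 0

# ...makes it visible through the other, because no dict was copied
days_visited = sum(record['has_visited'] for record in first_week)
```

If any step had copied the dicts (for example with dict(record)), the new field would be missing from the copies and the sum would raise a KeyError.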
15. Splitting out the passing students
Build the set of students who passed the project
subway_lessons_key = ['746169184', '3176718735']
pass_project_students = set()
for project_submission in paid_submissions:
    lesson_key = project_submission['lesson_key']
    assigned_rating = project_submission['assigned_rating']
    if lesson_key in subway_lessons_key and (assigned_rating == 'PASSED' or assigned_rating == 'DISTINCTION'):
        pass_project_students.add(project_submission['account_key'])
len(pass_project_students)  # 647

Split the engagement records into students who passed the project and students who did not
passing_engagement = []
non_passing_engagement = []
for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] in pass_project_students:
        passing_engagement.append(engagement_record)
    else:
        non_passing_engagement.append(engagement_record)
print len(passing_engagement)      # 4527
print len(non_passing_engagement)  # 2392
16. Comparing the two groups
Metrics:
total_minutes_visited
lessons_completed
has_visited
Group the records of both groups by account_key
passing_engagement_by_account = group_data(passing_engagement, 'account_key')
non_passing_engagement_by_account = group_data(non_passing_engagement, 'account_key')

Comparison on total_minutes_visited
passing_minutes = sum_grouped_items(passing_engagement_by_account, 'total_minutes_visited')
non_passing_minutes = sum_grouped_items(non_passing_engagement_by_account, 'total_minutes_visited')
print "passing engagement"
describe_data(passing_minutes.values())
print "non passing engagement"
describe_data(non_passing_minutes.values())

Comparison on lessons_completed
passing_lessons = sum_grouped_items(passing_engagement_by_account, 'lessons_completed')
non_passing_lessons = sum_grouped_items(non_passing_engagement_by_account, 'lessons_completed')
print "passing engagement"
describe_data(passing_lessons.values())
print "non passing engagement"
describe_data(non_passing_lessons.values())

Comparison on has_visited
passing_visits = sum_grouped_items(passing_engagement_by_account, 'has_visited')
non_passing_visits = sum_grouped_items(non_passing_engagement_by_account, 'has_visited')
print "passing engagement"
describe_data(passing_visits.values())
print "non passing engagement"
describe_data(non_passing_visits.values())
17. Creating histograms
Visualizing the data
Although you know the mean, standard deviation, maximum and minimum of each metric, there is more to say about each one. Are more values close to the minimum or to the maximum? What is the median? And so on.
Visualizing the data with histograms here is more informative than printing more statistics.
Creating histograms in Python
To create a histogram in Python you can use the matplotlib library that ships with Anaconda. The following code creates a histogram from an example list of data points called data.
data = [1, 2, 1, 3, 3, 1, 4, 2]
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(data)
The line %matplotlib inline is specific to IPython notebooks; it makes the plot appear inside the notebook rather than in a new window. If you are not using an IPython notebook, leave that line out and add plt.show() at the bottom so that the plot appears in a new window.
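For a plain script with no notebook, the same example might look like the sketch below. The Agg backend, the choice of bins=4, the label texts and the in-memory PNG buffer are all assumptions made so the script also runs on a machine without a display; in an interactive session, plt.show() would replace the savefig call.

```python
import io

import matplotlib
matplotlib.use('Agg')  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

data = [1, 2, 1, 3, 3, 1, 4, 2]

# plt.hist returns the bin counts, the bin edges and the drawn patches
counts, bin_edges, patches = plt.hist(data, bins=4)
plt.xlabel('Value')
plt.title('Example histogram')

# Write the figure to an in-memory PNG instead of opening a window
buffer = io.BytesIO()
plt.savefig(buffer, format='png')
plt.close()

print(list(counts))
```

Looking at the returned counts is a cheap way to confirm what the bars show before worrying about how the figure looks.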
Creating histograms of the student data
We used three metrics when comparing students who passed the subway project with those who did not; now create a histogram of each metric for each group. That is, you should create 6 histograms. Do the histograms of passing and non-passing students look very different?
describe_data() now prints the summary statistics and draws a histogram
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt

def describe_data(data):
    print 'mean:', np.mean(data)
    print 'standard deviation:', np.std(data)
    print 'Maximum:', np.max(data)
    print 'Minimum:', np.min(data)
    plt.hist(data)
    plt.show()  # finish this figure so the next call starts a new one

The six histograms
describe_data(passing_minutes.values())
describe_data(non_passing_minutes.values())
describe_data(passing_lessons.values())
describe_data(non_passing_lessons.values())
describe_data(passing_visits.values())
describe_data(non_passing_visits.values())
Adjusting the number of bins
To change the number of bins in each histogram, pass the bins argument to the hist function. The matplotlib documentation for hist describes this and the other arguments.
18. Are your results just noise?
tentative conclusion:
students who pass the subway project spend more minutes in the classroom during their first week
but is this a true difference, or is it due to noise in the data?
you can check this using statistics
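One statistical check that needs nothing beyond the standard library is a permutation test: shuffle the group labels many times and count how often the gap between the two means is at least as large as the observed one. This is a swapped-in illustration, not the method used in the course, and the minute values below are made up rather than taken from the data.

```python
import random

def mean(values):
    return sum(values) / float(len(values))

def permutation_p_value(group_a, group_b, num_rounds=2000, seed=0):
    """Estimate how often random relabeling matches or beats the observed gap in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(num_rounds):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (extreme + 1) / float(num_rounds + 1)

# Made-up first-week minutes, NOT taken from the course data
passing_sample = [410, 520, 390, 610, 480, 450]
non_passing_sample = [120, 90, 200, 150, 60, 180]

p_value = permutation_p_value(passing_sample, non_passing_sample)
print(p_value)
```

A small p-value suggests the gap is unlikely to come from random labeling alone; it says nothing about why the gap exists.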
19. Correlation does not imply causation
correlation does not imply causation
correlation:
students who pass the first project are more likely to visit the classroom multiple times in their first week
causation:
does visiting the classroom multiple times cause students to pass their project?
third factors that could cause both visiting the classroom and passing projects:
-- level of interest
-- background knowledge
or this correlation could be because of causation
we just don't know
to find out, run an A/B test
20. Making predictions based on many features
which students are likely to pass their first project?
could take a first pass using heuristics, but getting a really good prediction this way could be difficult
-- lots of different pieces of information to look at
-- these features can interact
machine learning can make predictions automatically
21. Communication
what findings are most interesting?
-- difference in total minutes
-- difference in days visited
how will you present them?
-- report average minutes
-- show histograms (polish any visualizations)
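22. Improving plots and sharing findings
Adding labels and titles
In matplotlib you can add axis labels with plt.xlabel("Label for x axis") and plt.ylabel("Label for y axis"). For a histogram you usually need only an x-axis label, but other kinds of plot may need a y-axis label as well. You can also add a title with plt.title("Title of plot").
Making plots look nicer with seaborn
You can use the seaborn library to restyle matplotlib plots automatically. The library is not included in Anaconda automatically, but Anaconda's package manager makes it easy to add new libraries. To use this package manager, called "conda", open a command prompt (on a PC) or a terminal (on Mac or Linux) and type the command conda install seaborn.
If you installed Python with something other than Anaconda, your package manager may be different. The most common ones are pip and easy_install, used as pip install seaborn and easy_install seaborn respectively.
Once seaborn is installed, you can import it anywhere in your code with import seaborn as sns. Plots created after that are restyled automatically (in seaborn 0.8 and later you also need to call sns.set() first). Give it a try!
seaborn also includes additional functions for building complex plots that would be hard to draw in matplotlib. This course does not cover them, but the seaborn documentation lists the functions it offers.
Passing extra arguments to a plot
You will also frequently pass arguments that adjust how a plot looks. The documentation page of the hist function lists the available arguments. One common argument is bins, which sets the number of bins the histogram uses. For example, plt.hist(data, bins=20) makes sure the histogram has 20 bins.
Improve one of your plots
Use these techniques to improve at least one of the plots you made earlier.
Sharing findings
Finally, decide which takeaway from this lesson you most want to share with others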
Solution code
import seaborn as sns
sns.set(color_codes=True)

plt.hist(non_passing_visits.values(), bins=8)
plt.xlabel('Number of days')
plt.title('Distribution of classroom visits in the first week ' +
          'for students who do not pass the subway project')
plt.show()  # finish the first figure before drawing the second

plt.hist(passing_visits.values(), bins=8)
plt.xlabel('Number of days')
plt.title('Distribution of classroom visits in the first week ' +
          'for students who pass the subway project')
plt.show()

Note:
If your seaborn version is 0.8 or later, you also need to call the seaborn.set function to enable the seaborn styling (see the seaborn documentation). For example:
sns.set(color_codes=True)
23. Data analysis and related terms
data analysis and related terms
data science
-- similar to data analysis
-- more focused on building systems
-- may require more experience
data engineering
-- more focused on data wrangling
-- involves data storage and processing
big data
-- fuzzy term for 'a lot' of data
-- data analysts, scientists, and engineers can all work with big data