手機(jī)中的相機(jī)是深受大家喜愛的應(yīng)用之一,現(xiàn)在該手機(jī)廠商想要分析手機(jī)中的應(yīng)用(相機(jī))的活躍情況,需統(tǒng)計(jì)如下數(shù)據(jù):
某日活躍用戶(用戶id)在后續(xù)的一周內(nèi)的留存情況(計(jì)算次日留存用戶數(shù),3日留存用戶數(shù),7日留存用戶數(shù))
指標(biāo)定義:
?某日活躍用戶數(shù),某日活躍的去重用戶數(shù)。
?N日活躍用戶數(shù),某日活躍的用戶數(shù)在之后的第N日活躍用戶數(shù)。
?N日活躍留存率,N日留存用戶數(shù)/某日活躍用戶數(shù)
?例:登陸時(shí)間(20180501日)去重用戶數(shù)10000,這批用戶在20180503日仍有7000人活躍,則3日活躍留存率為7000/10000=70%


1、分別用Python、sql導(dǎo)入附件(相機(jī).xlsx)的數(shù)據(jù)進(jìn)入自己的Mysql當(dāng)中,要求字段符合、表名符合(可加后綴區(qū)分)。
import pandas as pd
import numpy as np
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
excel_tab = pd.read_excel('camera_app.xlsx', sheetname='Sheet1')
excel_tab.head()
engine=create_engine("mysql://root@localhost/camera_app_ana?charset=gbk")
excel_tab.to_sql("camear_app_ana_table", con = engine)
2、分析pd.to_sql中的if_exist參數(shù)可用哪些,分別是什么,有什么細(xì)節(jié)區(qū)別?認(rèn)為用哪種更好。
- fail 直接插入失敗
- append 在原來數(shù)據(jù)上面追加
- replace 刪除原來的數(shù)據(jù),插入新數(shù)據(jù)。
一般情況下append好一些
3、總結(jié)MYSQL的date_format、concat的用法,并以此數(shù)據(jù)舉例說明。
concat('a', 1, 'b')---->a1b
4、利用MYSQL語句解題,分步驟達(dá)到目的再一次性運(yùn)行,達(dá)到題目要求。
?(1)求活躍度計(jì)算
?(2)次日留存
?(3)延伸 次日、三日、七日留存
with tab_temp
as( select uid, dayno from camear_app_ana_table
where cast(app_name as char(64))='相機(jī)'
GROUP BY uid, dayno) ,-- 去重,只保留相機(jī)
temp1
as( select uid as uid1 from tab_temp where day(dayno)=1 ),
temp2
as( select uid as uid2 from tab_temp where day(dayno)=1+1 ),
temp3
as( select uid as uid3 from tab_temp where day(dayno)=1+2 ),
temp7
as( select uid as uid7 from tab_temp where day(dayno)=1+6 )
select count(uid1) as 活躍用戶數(shù),
count(uid2) as 次日留存數(shù) ,
count(uid3) as 三日留存數(shù),
count(uid7) as 七日留存數(shù),
count(uid2)/count(uid1) as 次日留存率,
count(uid3)/count(uid1) as 三日留存率,
count(uid7)/count(uid1) as 七日留存率
from ((temp1 left join temp2 on uid1=uid2) --注意連環(huán)join的寫法
left join temp3 on uid1=uid3)
left join temp7 on uid1=uid7;

5、用Python解題,要求函數(shù)封裝,利用調(diào)用函數(shù)計(jì)算。
方法一
目標(biāo)是轉(zhuǎn)為這種樣式

# 先是一堆預(yù)處理
cam_data = excel_tab[ excel_tab.app_name=='相機(jī)'].copy() # 篩選
cam_data = cam_data.groupby(['uid', 'dayno']).count().reset_index() # 去重
cam_data['day']=cam_data.dayno.apply(lambda x:x.day) # 抽取天 更直觀
cam_pure_data = cam_data[ ['uid', 'day']] # 只保留最核心的兩列
# groupby
grouped_uid = cam_pure_data.groupby('uid')
grouped_uid
# 把groupby 行轉(zhuǎn)列,搞成新的df
ret_df = pd.DataFrame(columns=range(1,11))
for uid, day_df in grouped_uid:
day_df_indexed = day_df.set_index('day')
# 轉(zhuǎn)化為天--》uid的映射df
#day
#3 3
#4 3
#10 3
day_df_indexed['uid'] = 1 # 只需要區(qū)分0 1 即可
day_series_indexed = day_df_indexed.uid # 轉(zhuǎn)為series
org_zero = pd.Series(range(1, 11), index=range(1, 11)) # 一共就10天 1日到10日
#print(day_series_indexed)
maped = org_zero.map(day_series_indexed) # 最神奇的一步,用了攝像機(jī)的天將會(huì)變?yōu)?, 沒用的不會(huì)變還是0
#print(maped)
ret_df.loc[uid] = maped;
ret_df.fillna(0, inplace=True)
print((ret_df))
結(jié)果如下:

有了這個(gè)表再計(jì)算 活躍留存都是輕輕松松。
def cal_from_df(day):
sub_df = ret_df[ [day, day+1, day+2, day+6] ].copy()
sub_df[day+1]=sub_df[day]*sub_df[day+1]
sub_df[day+2]=sub_df[day]*sub_df[day+2]
sub_df[day+6]=sub_df[day]*sub_df[day+6]
c1 = sub_df[day].sum()
c2 = sub_df[day+1].sum()
c3 = sub_df[day+2].sum()
c7 = sub_df[day+6].sum()
return c1, c2, c3, c7, c2/c1, c3/c1, c7/c1
print(cal_from_df(1))

方法二
def cal(day):
cam_data = excel_tab[ excel_tab.app_name=='相機(jī)'].copy() # 篩選
cam_data = cam_data.groupby(['uid', 'dayno']).count().reset_index() # 去重
cam_data['day']=cam_data.dayno.apply(lambda x:x.day) # 抽取天 更直觀
cam_pure_data = cam_data[ ['uid', 'day']] # 只保留最核心的兩列
#print(cam_pure_data)
set1 = cam_pure_data.loc[cam_pure_data['day']==day]
set2 = cam_pure_data.loc[cam_pure_data['day']==day+1]
set3 = cam_pure_data.loc[cam_pure_data['day']==day+2]
set7 = cam_pure_data.loc[cam_pure_data['day']==day+6]
c1 = set1.uid.count()
merged = pd.merge(set1, set2, how='left', on='uid')
c2 = merged.day_y.count()
merged = pd.merge(set1, set3, how='left', on='uid')
c3 = merged.day_y.count()
# print(merged)
# print(set1, set3)
merged = pd.merge(set1, set7, how='left', on='uid')
# print(merged)
# print(set1, set7)
c7 = merged.day_y.count()
return c1, c2, c3, c7, c2/c1, c3/c1, c7/c1
print(cal(1))
