數(shù)據(jù)解讀獨(dú)角獸企業(yè)“猿輔導(dǎo)”(第二部分)

上一次的內(nèi)容中,我們已經(jīng)把確定目標(biāo)數(shù)據(jù)獲取的部分完成了,接下來我們來講一下如何進(jìn)行數(shù)據(jù)處理數(shù)據(jù)展示吧。

第三部分:數(shù)據(jù)處理

數(shù)據(jù)處理是數(shù)據(jù)分析的核心部分,我也給它打了50%的權(quán)重。之前的數(shù)據(jù)獲取部分,我們拿到了1615條原始的課程數(shù)據(jù)。但是,這個(gè)數(shù)據(jù)中可能有無法使用和重復(fù)等情況,而且我們還要根據(jù)之后展示數(shù)據(jù)的要求進(jìn)行數(shù)據(jù)提取??梢?,我們的數(shù)據(jù)處理又可以分成兩步:數(shù)據(jù)清洗和數(shù)據(jù)提取。

一、數(shù)據(jù)清洗

我們還是使用python來對(duì)數(shù)據(jù)進(jìn)行清洗。在之前的數(shù)據(jù)獲取中我們拿到了三張表:總課程表、課程詳情表和教師表。

我們先來看一下總課程表:

可以看出,我們之前為了爬取課程詳情在課程表中增加了lessonurl列,在清洗時(shí)可以去掉。同時(shí)根據(jù)年級(jí)(grade)我們還可以增加一列學(xué)段(小學(xué)、初中、高中),可以為之后的展示提供更多的角度。

代碼如下

#導(dǎo)入模塊
import numpy as np
import pandas as pd

#打印出前5行,以確保數(shù)據(jù)運(yùn)行正常
lessonDf.head()

#選擇子集
lessonDf = lessonDf[['grade','channelid','lessonid','lessonname']]

#通過條件判斷篩選
#構(gòu)建查詢條件
querySer = lessonDf['lessonid'] != 'groups/'
#根據(jù)查詢條件篩選
lessonDf = lessonDf.loc[querySer, : ]
lessonDf.shape

#通過條件判斷篩選
#構(gòu)建查詢條件
querySer = lessonDf['lessonid'] != 'groups/'
#根據(jù)查詢條件篩選
lessonDf = lessonDf.loc[querySer, : ]
lessonDf.shape

#給channelid賦值
for i in range(0, 1389):
    channel = lessonDf.iloc[i,1]
    if channel == 1:
        lessonDf.iloc[i,1]='語文'
    elif channel == 2:
        lessonDf.iloc[i,1]='數(shù)學(xué)'
    elif channel == 3:
        lessonDf.iloc[i,1]='英語'
    elif channel == 201:
        lessonDf.iloc[i,1]='編程'
    elif channel == 4:
        lessonDf.iloc[i,1]='物理'
    elif channel == 5:
        lessonDf.iloc[i,1]='化學(xué)'
    elif channel == 6:
        lessonDf.iloc[i,1]='生物'
    elif channel == 7:
        lessonDf.iloc[i,1]='歷史'
    elif channel == 8:
        lessonDf.iloc[i,1]='地理'
    elif channel == 9:
        lessonDf.iloc[i,1]='政治'
    elif channel == 14:
        lessonDf.iloc[i,1]='道德與法制'
    else :
        pass

#創(chuàng)建studyphase列
lessonDf['studyphase'] = lessonDf['grade']

#給studyphase賦值
for i in range(0, 1389):
    phase = lessonDf.iloc[i,4]
    if 0< phase<=6:
        lessonDf.iloc[i,4]='小學(xué)'
    elif 6< phase <=9:
        lessonDf.iloc[i,4]='初中'
    elif phase > 9:
        lessonDf.iloc[i,4]='高中'
    else :
        pass

#去重
lessonDf = lessonDf.drop_duplicates()

#導(dǎo)出數(shù)據(jù)
lessonDf.to_excel('lesson.xls')

清洗結(jié)果如下:


同樣的,我們也對(duì)課程詳細(xì)表和教師表進(jìn)行去除臟數(shù)據(jù)、去重等操作,導(dǎo)出teacher和lessondetail兩張表:

二、數(shù)據(jù)提取

處理好數(shù)據(jù)之后,下一步就是根據(jù)實(shí)際的數(shù)據(jù)展示需求來提取數(shù)據(jù)了。參考下面的腦圖我們可以更好的理解應(yīng)當(dāng)提取哪些數(shù)據(jù):

這幅腦圖和第一部分爬取數(shù)據(jù)時(shí)的腦圖相比更加清晰,每個(gè)子主題都可以對(duì)應(yīng)一個(gè)提數(shù)需求。這里我將上面數(shù)據(jù)清洗得到的三張表導(dǎo)入Sequel Pro,使用SQL語句來提取所需的數(shù)據(jù)。

1、學(xué)科、學(xué)段、年級(jí)下課程數(shù)排名

(1)不同學(xué)科的課程數(shù)量
select channelid '學(xué)科類型', count(distinct lessonid) '課程數(shù)量' from lesson
group by channelid;

(2)不同學(xué)段的課程數(shù)量
select studyphase '學(xué)段', count(distinct lessonid) '課程數(shù)量' from lesson
group by studyphase;

(3)不同年級(jí)的課程數(shù)量
select grade '年級(jí)', count(distinct lessonid) '課程數(shù)量' from lesson
group by grade;

2、學(xué)科、學(xué)段、年級(jí)下課程報(bào)名人數(shù)排名

(1)不同年級(jí)、學(xué)科下報(bào)名人數(shù)
select grade '年級(jí)', channelid '學(xué)科', sum(signup_number) '報(bào)名人數(shù)' from lessondetail d
left join
(select distinct grade, channelid, lessonid
from lesson) t
on d.lesson_id = t.lessonid
group by grade, channelid

(2)不同學(xué)段下報(bào)名人數(shù)
select studyphase '學(xué)段', sum(signup_number) '報(bào)名人數(shù)' from lessondetail d
left join
(select distinct studyphase, lessonid
from lesson) t
on d.lesson_id = t.lessonid
group by studyphase

(3)不同年級(jí)下課程報(bào)名人數(shù)
select grade '年級(jí)', sum(signup_number) '報(bào)名人數(shù)' from lessondetail d
left join
(select distinct grade, lessonid from lesson) t
on d.lesson_id = t.lessonid
group by grade

3、學(xué)科、學(xué)段、年級(jí)下老師數(shù)量排名

老師總數(shù)量
select count(distinct teacher_id) from teacher;

(1)不同學(xué)科的老師數(shù)量
select channelid '學(xué)科', count(teacher_id) '教師數(shù)'
from (select distinct channelid, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by channelid

(2)不同學(xué)段的老師數(shù)量
select studyphase '學(xué)段', count(teacher_id) '教師數(shù)'
from (select distinct studyphase, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by studyphase

(3)不同年級(jí)的老師數(shù)量
select grade '年級(jí)' , count(teacher_id) '教師數(shù)'
from (select distinct grade, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by grade

(4)課程配師的分布情況
select teacher_num '課程配備教師個(gè)數(shù)', count(t.lesson_id) '課程數(shù)量' from(
select lesson_id, count(1) as teacher_num from teacher
group by lesson_id) t
group by teacher_num

(5)老師上的課程數(shù)分布
select teacher_id, ct '上課數(shù)' from(
select teacher_id , count(1) ct from teacher
group by teacher_id) t
order by t.ct desc

4、學(xué)科、學(xué)段、年級(jí)下營收情況排名

(1)不同學(xué)科的營收
select channelid '學(xué)科' , sum(profit) '營收' from
(#對(duì)學(xué)科下的課程id去重
select channelid, lesson_id, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by channelid, lesson_id) t
group by channelid

(2)不同學(xué)段的營收
select studyphase '學(xué)段' , sum(profit) '營收' from
(#對(duì)學(xué)段下的課程id去重
select studyphase, lesson_id, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by studyphase, lesson_id) t
group by studyphase

(3)不同年級(jí)+學(xué)科的營收
select grade '年級(jí)' ,channelid '學(xué)科', sum(profit) '營收' from
(#對(duì)年級(jí)下的課程id去重
select grade, lesson_id, channelid, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by grade, lesson_id, channelid) t
group by grade, channelid

(4)老師對(duì)營收的貢獻(xiàn)排名
select teacher_id '教師id' , studyphase '學(xué)段' , channelid '學(xué)科' , sum(signup_num) '學(xué)生數(shù)',
count(1) '代課數(shù)' , sum(profit) '營收' from
(select lesson_id, signup_num, profit from
(#對(duì)多老師的課程營收分配成相應(yīng)價(jià)格
select l.lesson_id, count(l.lesson_id), sum(signup_number)/count(l.lesson_id) signup_num,
price*signup_number/count(l.lesson_id) profit
from lessondetail l
join teacher t
on l.lesson_id = t.lesson_id
group by l.lesson_id) t1) t2
join teacher t
join lesson l
on t.lesson_id = t2.lesson_id
and t2.lesson_id = l.lessonid
group by teacher_id, studyphase, channelid

(5)課程對(duì)營收的貢獻(xiàn)排名
select lesson_id, sum(price*signup_number) profit
from lessondetail
group by lesson_id


以上SQL語句基本上滿足了大部分的提數(shù)需求。此外,還有一張表可以用來計(jì)算相關(guān)性矩陣:

合并表

select d.lesson_id, price, signup_number,
studyphase, grade, channelid,
price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid

第四部分、數(shù)據(jù)展示

經(jīng)過以上步驟的數(shù)據(jù)提取,我們可以根據(jù)需求來展示數(shù)據(jù)了。一個(gè)比較經(jīng)典的圖是這樣的:

考慮到我們需要展示的數(shù)據(jù)主要目的為比較和構(gòu)成,因此我選擇條形圖、柱狀體、餅圖等基本圖形來進(jìn)行展示。

由于篇幅有限,這里只展示一下相關(guān)性矩陣:

可見,課程的營收情況主要和報(bào)名人數(shù)以及年級(jí)成正相關(guān)。其他具體數(shù)據(jù)展示我會(huì)通過一個(gè)具體的報(bào)告進(jìn)行呈現(xiàn),敬請(qǐng)期待。

小結(jié)

本篇文章主要討論了對(duì)猿輔導(dǎo)網(wǎng)站獲取的數(shù)據(jù)進(jìn)行清洗和提數(shù)的過程。在清洗數(shù)據(jù)時(shí)利用python簡化流程,在提數(shù)過程中先使用Xmind確定提數(shù)需求,再使用SQL語句完成提數(shù)。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容