0. Keywords
Classroom, Sensing, Teacher, Instructor, Pedagogy, Computer Vision, Audio, Speech Detection, Machine Learning
1. Links
This paper comes from the Human-Computer Interaction Institute at Carnegie Mellon University (CMU); the main work was done by first-author PhD student Karan Ahuja, whose personal website lists other related work.
Paper link: https://dl.acm.org/doi/abs/10.1145/3351229
Paper/system homepage: https://theedusense.io/ or https://www.edusense.io/
Paper code: https://github.com/edusense/edusense or https://github.com/edusense/ClassroomDigitialTwins
Self-made public mind map: https://www.processon.com/mindmap/602fa53ce401fd48f2ae17e6
The paper's EduSense "represents the first real-time, in-the-wild evaluated and practically-deployable classroom sensing system at scale that produces a plethora of theoretically-motivated visual and audio features correlated with effective instruction." Editor's note: our lab actually began similar work as early as 2016, albeit along a different technical route; unfortunately we never published a comparable systems paper.

2. Summary of Main Content
※ Abstract
1) High-quality opportunities for professional development of university teachers need classroom data.
2) Currently, there is no effective mechanism to give personalized formative feedback except manually.
3) This paper shows the culmination of two years of research: EduSense (with visual and audio features)
4) EduSense is the first to unify previously isolated features into a cohesive, real-time, and practically-deployable system
※ Introduction
> Increasing students' engagement and participation in class has been shown to improve learning outcomes;
> Compared with K-12 teachers, university instructors are generally only domain experts, not specialists in how to teach students;
> Regular feedback on teaching is important for instructors to improve their craft; pedagogical skill is not easy to acquire;
> acquiring regular, accurate data on teaching practice is currently not scalable
> Today's teaching feedback relies heavily on professional human observers, which is very expensive

> EduSense captures a wide variety of classroom facets shown to be actionable in the learning science literature, at a scale and temporal fidelity many orders of magnitude beyond what a traditional human observer in a classroom can achieve.
> EduSense captures both audio and video streams using low-cost commodity hardware that views both the instructor and students
> Detection: hand raises, body pose, body accelerometry, and speech acts. See Table 1 for details.
> EduSense is the first system to fuse the many previously separate classroom features into a single system
> EduSense strives to do two things:
    1) provide instructors with pedagogically relevant classroom data to practice and grow on
    2) serve as an extensible open platform
※ Related Systems
> There is an extensive learning science literature on methods to improve instruction through training and feedback. [15] [26] [27] [32] [37] [38] [77] [78] (PS: the enumerated papers appear to be almost entirely CMU work)
2.1 Instrumented Classrooms
> Use sensors (e.g., pressure sensors [2][58]) to collect data on students in class, or instrument the physical structure of the classroom.
    ● adding computing to the tabletop (e.g., buttons, touchscreens, etc.) or with response systems like "clickers" [1][12][20][21][68]
    ● low-cost printed responses using color markers [25], QR Codes [17] or ARTags [57]
> Use wearables to directly collect precise signals about students or teachers
    ● Affectiva's wrist-worn Q sensor [62] senses the wearer's skin conductance, temperature and motion (via accelerometers)
    ● EngageMeter [32] used electroencephalography headsets to detect shifts in student engagement, alertness, and workload
    ● Instrument just the teacher, with e.g., microphones [19].
> Drawback: these approaches carry a social, aesthetic and practical cost.
2.2 Non-Invasive Class Sensing
> The aim is to maximize value with as few invasive devices as possible. Among the many non-invasive sensors, acoustic and visual channels are all but indispensable for classroom sensing
> Speech
    ● [19] used an omnidirectional room microphone and head-mounted teacher microphone to automatically segment teacher and student speech events, as well as intervals of silence (such as after teacher questions).
    ● AwareMe [11], Presentation Sensei [46] and RoboCOP [75] (oral presentation practice systems) compute speech quality metrics, including pitch variety, pauses and fillers, and speaking rate.
> Cameras and computer vision
    ● Early systems, such as [23], targeted coarse tracking of people in the classroom, in this case using background subtraction and color histograms.
    ● Movement of students has also been tracked with optical flow algorithms, as was demonstrated in [54][63]
    ● Computer vision has also been applied to automatic detection of hand raises, including classic methods such as skin tone and edge detection [41], as well as newer deep learning techniques [51] (our lab's paper: Lin Jiaojiao's hand-raise detection)
> Face Detection
    ● It can not only be used to find and count students, but also estimate their head orientation, coarsely signaling their area of focus [63][73][80].
    ● Facial landmarks can offer a wealth of information about students' affective state, such as engagement [76] and frustration [6][31][43], as well as detection of off-task behavior [7]
    ● The Computer Expression Recognition Toolbox (CERT) [52] is most widely used in these educational technology applications, though it is limited to videos of single students.
2.3 System Contribution
> As is customary, the paper first takes a swipe at the classroom sensing systems listed above:
    1) each published its metrics in isolation, and none was tested or validated in real, large-scale classroom settings
    2) each system pairs one server with one classroom, and so cannot be rolled out campus-wide
    3) few of these systems targeted teaching and pedagogy, and so none brought the recent wave of breakthrough computer vision and deep learning techniques to bear on complex classroom scenes
> Thus, we believe EduSense is unique in putting together disparate advances from several fields into a comprehensive and scalable system, paired with a holistic evaluation combining both controlled studies and months-long, real-world deployments.
※ EduSense System

3.1 Sensing
> Early system: depth cameras
> Current system: Lorex LNE8950AB cameras offer a 112° field of view and feature an integrated microphone, costing around $150 in single unit retail prices. They can capture 3840x2160 video (i.e., 4K) at 15 FPS with 16 kHz mono audio.

3.2 Compute
> Early system:
    ● small Intel NUCs. However, this hardware approach was expensive to scale, deploy and maintain
    ● The earlier version was a large, monolithic C++ application. It was prone to software-engineering problems such as dependency conflicts and overload when new modules were added, and remote deployment was an equal headache.
    ● Moreover, the C++ codebase was hard to combine with Python, the dominant language of computer vision; even forced integration was time-consuming and highly unstable. The old system also crashed easily because its component modules were not isolated from one another.
> Current system:
    ● The new system uses more stable IP cameras paired with servers hosted centrally on campus, streaming audio and video between them in real time over RTSP.
    ● The custom GPU-equipped EduSense server has 28 physical cores (56 with SMT), 196GB of RAM and nine NVIDIA 1080Ti GPUs
    ● The new system uses Docker containers (container-based virtualization) to isolate each module and run it independently; Docker's advantages need no elaboration.
3.3 Scene Parsing (Techniques)
> Multi-person body keypoint (joints) detection: OpenPose (tested and tuned OpenPose parameters)
> Difficult environment: high, wall-mounted (i.e., non-frontal) and slightly fish-eyed view.
> Algorithm: additional logic to reduce false positive bodies (e.g., bodies too large or small); interframe persistent person IDs with hysteresis (tracking) using a combination of Euclidean distance and body inter-keypoint distance matching
> Speech: predict only silence and speech (Laput et al. [48]) + an adaptive background noise filter
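The interframe ID-persistence step above can be sketched as a greedy nearest-neighbour match. This is only a minimal sketch: the paper additionally matches on body inter-keypoint distances and applies hysteresis, both omitted here, and the `match_ids` name and 80-pixel gate are my own hypothetical choices.

```python
import numpy as np

def match_ids(prev_centroids, prev_ids, curr_centroids, next_id, max_dist=80.0):
    """Greedily assign each current-frame body the ID of its nearest
    previous-frame body within max_dist pixels; otherwise mint a new ID."""
    assigned = {}
    used = set()
    for j, c in enumerate(curr_centroids):
        best, best_d = None, max_dist
        for i, p in enumerate(prev_centroids):
            if i in used:
                continue
            d = float(np.linalg.norm(np.asarray(c) - np.asarray(p)))
            if d < best_d:
                best, best_d = i, d
        if best is None:            # no close match: treat as a new person
            assigned[j] = next_id
            next_id += 1
        else:                       # carry the old ID forward
            assigned[j] = prev_ids[best]
            used.add(best)
    return assigned, next_id
```

Running this frame-to-frame keeps IDs stable as long as people move less than the gate distance between processed frames.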

3.4 Featurization Modules
> 見(jiàn)圖1和圖3,特征化模塊主要利用檢測(cè)和識(shí)別算法的結(jié)果,將其按照教室中的指標(biāo)可視化,便于調(diào)用或debug時(shí)查看
> For details:open source code repository (http://www.EduSense.io).
? ??● Sit vs. Stand Detection:relative geometry of body keypoints(neck (1), hips (2), knees (2), and feet (2).)+?MLP classifier
? ??●?Hand Raise Detection:Use?eight body keypoints per body(neck (1), chest (1), shoulder (2), elbow (2), and wrist (2).)+?MLP classifier
? ??●?Upper Body Pose:eight body keypoints + multiclass MLP model(預(yù)測(cè)arms at rest, arms closed (e.g., crossed), and hands on face 見(jiàn)上圖5)
? ??●?Smile Detection:use ten mouth landmarks on the outer lip and ten landmarks on the inner lip + SVM for binary classification
? ??●?Mouth Open Detection:(As a potential, future way to identify speakers.) two features from [71] (left and right/mouth_width) + Binary SVM
? ??●?Head Orientation & Class Gaze:perspective-n-point algorithm [50] +?anthropometric face data [53] + OpenCV's calib3d module [8]
? ??●?Body Position & Classroom Topology:借助前面提到的人臉關(guān)鍵點(diǎn)和相機(jī)標(biāo)定,估測(cè)學(xué)生的位置,并將投影合成俯視視角(top-down view)圖像(PS:類似我們系統(tǒng)中的學(xué)生定位,這里更粗略,不檢測(cè)行列,也不涉及學(xué)生行為匹配)
? ??●?Synthetic Accelerometer:simply track the motion of bodies across frames +?3D head position +?delta X/Y/Z normalized by the elapsed time
? ??●?Student vs. Instructor Speech:sound and speech detector including 1) the?RMS?of the student-facing camera’s microphone (closest to the instructor), 2) the?RMS?of the instructor-facing camera’s microphone (closest to the students), and the ratio between the latter two values +?random forest classifier (目的是區(qū)分當(dāng)前的說(shuō)話聲來(lái)自學(xué)生還是老師,PS:區(qū)分教師音和學(xué)生音?)
? ??●?Speech Act Delimiting:Use?per-frame speech detection results???(PS:這里是要檢測(cè)不同的語(yǔ)音片段嗎?)
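Several of the modules above share one recipe: normalized body keypoints fed to a small MLP. Below is a minimal sketch on toy synthetic skeletons; the feature normalization, network size, and data are all my own assumptions, since the paper only states "eight upper-body keypoints + MLP classifier".

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def pose_features(kp):
    """Make the 8 upper-body keypoints position/scale invariant:
    translate so the neck is the origin, divide by shoulder width."""
    kp = np.asarray(kp, dtype=float)   # (8, 2): neck, chest, shoulders, elbows, wrists
    scale = np.linalg.norm(kp[2] - kp[3]) or 1.0
    return ((kp - kp[0]) / scale).ravel()

rng = np.random.default_rng(0)

def synth_pose(raised):
    """Toy skeleton in image coordinates (y grows downward)."""
    base = np.array([[0, 0], [0, 20], [-15, 5], [15, 5],
                     [-20, 25], [20, 25], [-22, 45], [22, 45]], float)
    if raised:
        base[6] = [-25, -35]           # left wrist lifted well above the neck
    return base + rng.normal(0, 2, base.shape)

X = np.array([pose_features(synth_pose(i % 2 == 1)) for i in range(200)])
y = np.array([i % 2 for i in range(200)])
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, y)
```

On this cleanly separable toy data the classifier fits essentially perfectly; in the real system, accuracy is bounded by the quality of the upstream keypoints, as the studies below show.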

3.5 Training Data Capture
> First, the various metrics require large amounts of labeled training data, which raises two problems:
    1) many annotators must be recruited for labeling, e.g., hand raises
    2) diverse data from different viewpoints must be collected, so the authors had to set up their own capture hardware and scenes

3.6 Datastore
1) non-image classroom data (ASCII JSON): ~250MB for one class lasting around 80 minutes with 25 students
2) infilled data (full class video): about 16GB per class at 15 FPS, with every frame in 4K, for both front and back cameras
3) a backend server built from a web interface (a Go app) and MongoDB, plus a REST API over Transport Layer Security (TLS) (a different technical route and set of details from ours)
4) "We do not save these frames long-term to mitigate obvious privacy concerns" (data is deleted rather than retained long-term, sidestepping privacy issues)
5) secure Network Attached Storage (NAS)
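A back-of-envelope check of the storage figures above; mapping the 250MB of JSON to real-time-rate (0.5 FPS) processed frames is my assumption.

```python
seconds = 80 * 60                          # one 80-minute class
video_frames = seconds * 15 * 2            # 15 FPS, front + back cameras
kb_per_video_frame = 16e9 / video_frames / 1e3   # compressed 4K, KB per frame

json_frames = seconds * 0.5                # real-time mode processes 0.5 FPS
kb_per_json_frame = 250e6 / json_frames / 1e3    # JSON output, KB per processed frame
print(round(kb_per_video_frame), round(kb_per_json_frame))  # 111 104
```

Roughly 111 KB of compressed video per frame and 104 KB of JSON per processed frame, both plausible orders of magnitude.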
3.7 Automated Scheduling & Classroom Processing Instances
> scheduler: SOS JobScheduler (a different technical route; we use apscheduler, an open-source scheduler for Python)
> FFMPEG instances: record the front and back camera streams (a different route; we use OpenCV)
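An FFMPEG recording instance could be launched roughly like this. This is a sketch only: the RTSP URL, filename, and flag choices are hypothetical, not taken from the EduSense code.

```python
def build_record_cmd(rtsp_url, out_path, duration_s):
    """Build an ffmpeg argv that records an RTSP camera stream to disk
    without re-encoding ('-c copy'), stopping after duration_s seconds."""
    return ["ffmpeg", "-rtsp_transport", "tcp", "-i", rtsp_url,
            "-c", "copy", "-t", str(duration_s), out_path]

# To actually record (requires ffmpeg on PATH), pass the list to, e.g.:
#   subprocess.run(build_record_cmd("rtsp://camera.example/stream",
#                                   "front.mp4", 80 * 60), check=True)
```

A scheduler then only needs to spawn one such process per camera at class start.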
3.8 High Temporal Resolution Infilling
EduSense has two data-processing modes: a real-time mode (0.5 FPS) and an infilling mode (15 FPS video)
> The real-time mode, as its name implies, must surface the various analysis metrics while class is in session; current throughput is one frame every two seconds
> The infilling mode performs non-real-time analysis during or after class, providing high temporal resolution as a complement to the real-time pipeline. This finer-grained analysis can also feed subsequent end-of-day reports or semester-long analytics
3.9 Privacy Preservation
> Measures already taken: EduSense does not store class video per se; when infilling is needed, video sits in a temporary cache and is deleted once analysis completes; role- and permission-based access control keeps classroom data from leaking; individual students are tracked, but no private information is used, and the tracking IDs assigned in each session are unlinked across sessions; video temporarily retained for further development (testing, validation, and dataset expansion after labeling) is deleted promptly after use
> Measures planned: display only high-level aggregate metrics (class aggregates)
3.10 Debug and Development Interface
QT5 GUI + RTSP/local filesystem + many widgets

3.11 Open Source and Community Involvement
● hope that others will deploy the system
● serve as a comprehensive springboard
● cultivate a community
※ Controlled Study
4.1 Overall Procedure
> five exemplary classrooms, 5 instructors and 25 student participants
> Following a pre-supplied instruction sheet, participants performed the requested actions in sequence while the debug system simultaneously recorded each action's timestamp, type, and image data

4.2 Body Keypointing
> OpenPose is used for pose estimation, but it is not robust in classroom scenes, so the authors tuned some of its parameters and added pose-based sanity checks, improving stability and accuracy (much like my own approach to improving OpenPose?)
> The authors give no rigorous evaluation of the improved OpenPose, only keypoint statistics on a small dataset (is that a sound methodology?)
> As shown in the figure below, the authors then report detection accuracy for nine body keypoints; the upper body is clearly more accurate than the lower body (though it is unknown how much data these accuracies were computed over)

4.3 Phase A: Hand Raises & Upper Body Pose
> The authors define seven upper-body pose classes: arms resting, left hand raised, left hand raised partial, right hand raised, right hand raised partial, arms closed, and hands on face
> Student participants were asked to perform each of these classes three times during a session, for 21 instances in total
> Instructor participants were asked to perform arms resting and arms closed three times each, at different classroom positions (left front, center front, right front), for 6 instances in total
> We only studied frames where participants' upper bodies were captured (head, chest, shoulder, elbow, and wrist keypoints; without these eight keypoints, the hand raise classifier returns null).
> The paper reports hand-raise detection accuracy as high as 94.6%, and accuracy on the other three upper-body poses as high as 98.6% (students) and 100% (instructors), but it never states the training or test set sizes, and all results come from a purpose-built experimental setup. How persuasive is that?
4.4 Phase B: Mouth State
> 作者設(shè)定了4種嘴部狀體:neutral (mouth closed), mouth open (teeth apart, as if talking), closed smile (no teeth showing),?teeth smile?(with teeth showing)
> 參與學(xué)生被要求每種狀態(tài)執(zhí)行三次,共計(jì)12個(gè)實(shí)驗(yàn)樣例;
> 參與教師被要求每種狀態(tài)執(zhí)行三次,且在教室前面的不同位置,共計(jì)12個(gè)實(shí)驗(yàn)樣例
> 基于以上人臉landmarks檢測(cè),作者做了微笑分類(準(zhǔn)確率78.6%和87.2%),以及張嘴分類(準(zhǔn)確率83.6%和82.1%)。但是仍舊沒(méi)提數(shù)據(jù)量
> 作者坦承,由于分辨率問(wèn)題,后排的學(xué)生人臉幾乎不可準(zhǔn)確檢測(cè)landmarks,并樂(lè)觀地認(rèn)為高分辨率相機(jī)可以解決該問(wèn)題。(實(shí)際上我們測(cè)試,即使是4K相機(jī),仍舊存在低分辨率問(wèn)題,且landmarks還有大角度和遮擋的問(wèn)題)

4.5 Phase C: Sit vs. Stand
> The task here is simply distinguishing standing from sitting.
> As before, student participants were asked to perform the two postures three times each at random points in the test, for 6 instances per participant; instructors remained standing throughout and sat this phase out
> Sit/stand classification accuracy was about 84.4% (the authors again omit the size of the test set, though from the error rates given in this section the total appears to be roughly 143 instances)
> Because classification relies only on 2D keypoint detections, the authors note the method is strongly affected by camera viewpoint. (Of course; it is still less accurate, and less robust, than our direct standing detection)
> Finally, the authors suggest depth data could improve matters in the future. (I doubt depth cameras would necessarily help, and depth data is not easy to capture or to train on)
4.6 Phase D: Head Orientation
> 作者設(shè)定了8種頭部朝向:three possible pitches (“down” -15°, “straight” 0°, “up” +15°) × three possible yaws (“l(fā)eft” -20°, “straight” 0°, “right” +20°), omitting directly straight ahead (i.e., 0°/0°) (仍舊是將檢測(cè)和估計(jì)問(wèn)題,轉(zhuǎn)化成了分類問(wèn)題)
> 為了讓參與者做出相應(yīng)的head pose,作者設(shè)計(jì)使用運(yùn)行位姿估計(jì)APP的智能手機(jī),以及打印出來(lái)操作表格貼在課桌上。相關(guān)流程請(qǐng)閱讀論文
> 同樣,學(xué)生參與者被要求分別執(zhí)行8種頭部方向2次,這樣每個(gè)人會(huì)產(chǎn)生16個(gè)實(shí)驗(yàn)樣例
> Unfortunately, in many frames we collected, ~20% of landmarks were occluded by the smartphones we gave participants - an experimental design error in hindsight.(果不其然,這種依靠人臉landmarks的頭部姿態(tài)估計(jì)方式,即使是在實(shí)驗(yàn)場(chǎng)景下,結(jié)果也并不靠譜)
> Which should be sufficient for coarse estimation of attention.(作者刪除掉一些landmarks檢測(cè)不好的樣例,僅僅剩下了1/4的數(shù)據(jù),在這種情況下測(cè)試的結(jié)果,還要說(shuō)sufficient,有點(diǎn)勉強(qiáng)了,甚至睜眼說(shuō)瞎話了)
> 作者最后提到,主要問(wèn)題還是出在landmarks的檢測(cè),將來(lái)能檢測(cè)出來(lái)充足的landmarks點(diǎn),就能解決頭部朝向問(wèn)題。(我對(duì)這種技術(shù)路線持保守態(tài)度)

4.7 Phase E: Speech Procedure
> This phase only detects whether anyone is speaking, teacher or student, without distinguishing between them
> The protocol had each of 30 participants speak once, yielding 30 five-second speech clips plus 30 non-speech clips, which were then classified. In the end, no-speech was recognized with 100% accuracy, and speech with a single error, 98.3% accuracy
> My only comment: this speech metric and its pipeline are far too simple, and the test data far too small, to be convincing
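The speech/no-speech split tested here essentially reduces to energy thresholding; below is a minimal sketch. EduSense actually uses a learned model after Laput et al. [48] plus an adaptive background-noise filter; the percentile-based noise floor here is my own simplification of that idea.

```python
import numpy as np

def detect_speech(samples, sr=16000, win_s=0.5, margin_db=10.0):
    """Label each window as speech (True) when its RMS energy exceeds
    an adaptively estimated noise floor by margin_db decibels."""
    win = int(sr * win_s)
    n = len(samples) // win
    rms = np.array([np.sqrt(np.mean(samples[i * win:(i + 1) * win] ** 2))
                    for i in range(n)])
    noise_floor = np.percentile(rms, 10)        # quiet windows set the floor
    threshold = noise_floor * (10 ** (margin_db / 20))
    return rms > threshold
```

On clips like the ones described above (clear speech against room noise), even this baseline separates the two classes cleanly, which is partly why the reported 98.3% is unsurprising.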
4.8 Face Landmarks Results
> Face landmark detection directly uses public algorithms, e.g., [4][13][44]; most likely [13] (CMU's OpenPose)
> Again in the controlled setting, this section presents landmark detection accuracies that carry little persuasive weight
> "poor registration of landmarks was due to limited resolution" (the low-resolution problem once more)
4.9 Classroom Position & Sensing Accuracy vs. Distance
> We manually recorded the distance of all participants from the camera using a surveyors' rope
> Computer-vision-driven modules are sensitive to image resolution and vary in accuracy as a function of distance from the camera.
> A question here: won't teacher and student detections collide? That is, won't each side appear in the other's camera view? If they do, the paper never considers how to tell the two apart.

4.10 Framerate and Latency
> 測(cè)試階段,只考慮處理已保存的視頻數(shù)據(jù),暫不考慮實(shí)時(shí)系統(tǒng)
> 不出意外,關(guān)鍵點(diǎn)檢測(cè)(body keypointing)和人臉關(guān)鍵點(diǎn)檢測(cè)(face landmarking)兩種基礎(chǔ)映射函數(shù)占據(jù)了大部分時(shí)間。尤其是人臉關(guān)鍵點(diǎn)定位算法的耗時(shí), 和圖像中的人物數(shù)量呈正相關(guān)函數(shù)增長(zhǎng).(這里有點(diǎn)疑問(wèn),姿態(tài)估計(jì)使用的是Bottom-up的openpose算法,所以檢測(cè)耗時(shí)不隨人數(shù)增長(zhǎng)而簡(jiǎn)單地線性增長(zhǎng),但上圖中,人數(shù)從0增加到54,檢測(cè)耗時(shí)完全沒(méi)有增加,這顯然是假的。因?yàn)槲覍?shí)測(cè)過(guò),openpose在joints grouping環(huán)節(jié),也會(huì)占據(jù)部分CPU時(shí)間。另外,openpose算法本身的檢測(cè)耗時(shí)只有約幾十毫秒,也不可信,輸入圖像即使只有1K圖像的0.5倍大小,也需要1秒左右的時(shí)間。)
> 其他處理流程的耗時(shí),暫看不出問(wèn)題

※ Real-world Classrooms Study
5.1 Deployment and Procedure
> We deployed EduSense in 13 classrooms at our institution and recruited 22 courses for an "in-the-wild" evaluation (with a total student enrollment of 687).
> 360.8 hours of classroom data
> 438,331 student-facing frames and 733,517 instructor-facing frames were processed live, with a further 18.3M frames infilled after class to bring the entire corpus up to a 15 FPS temporal resolution.
> We randomly pulled 100 student-view frames (containing 1797 student body instances) and 300 instructor-view frames (containing 291 instructor body instances; i.e., nine frames did not contain instructors) from our corpus.
> "This subset is sufficiently large and diverse" (I beg to differ...)
> To provide the ground truth labels, we hired two human coders, who were not involved in the project. (Compared with our own annotation effort, this amount of labeling is rather thin)
> It was not possible to accurately label head orientation and classroom position (many metrics are rough estimates; position, if evaluated with our row/column representation, could be measured and judged more precisely)
5.2 Body Keypointing Results
> EduSense found 92.2% of student bodies and 99.6% of instructor bodies. (Even the real-classroom tests remain confined to a small data sample, weakening the claims)
> 59.0% of student and 21.0% of instructor body instances were found to have at least one visible keypoint misalignment (so real-world quality is not necessarily good)
> We were surprised that our real-world results were comparable to our controlled study, despite operating in seemingly much more challenging scenes (the authors' analysis: compared with the deliberately contrived poses and head orientations of the controlled study, real scenes, though more chaotic, mostly show students looking straight ahead and leaning on their desks, which is easier to recognize)
5.3 Face Landmarking Results
> Face detection accuracy, and the corresponding landmark localization accuracy, are again reported separately for students and instructors on a partial dataset (no results on a large-scale labeled corpus)
> The authors note that despite the more complex real scenes, face detection remained quite robust (credit that belongs to the off-the-shelf algorithm; what is the point of claiming it here?)
5.4 Hand Raise Detection & Upper Body Pose Classification
> Hand raises in our real-world dataset were exceedingly rare (no surprise: with only the 22 courses above as test volume, and a university classroom setting, hand-raise examples were bound to be scarce)
> Out of our 1797 student body instances, we found only 6 body instances with a hand raised (representing less than 0.3% of total body instances). Of those six, EduSense correctly labeled three, incorrectly labeled three, and missed zero, for an overall true positive accuracy of 50.0%. There were also 58 false positive hand-raise instances (3.8% of total body instances). (The hand-raise results are dismal)
> The other postures also fared poorly in the wild, with the same shortcomings of scant data and weak persuasiveness
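Recomputing from the counts quoted above (assuming the 3 correct detections and 58 false positives exhaust the predicted positives), the reported 50% "true positive accuracy" hides a far lower precision:

```python
true_hand_raises = 6       # real raises among 1797 student body instances
correct = 3                # of those six, correctly labeled
false_positives = 58       # spurious hand-raise detections

recall = correct / true_hand_raises                 # 0.5
precision = correct / (correct + false_positives)   # 3/61 ≈ 0.049
print(f"recall={recall:.1%}, precision={precision:.1%}")
```

This prints recall=50.0%, precision=4.9%: fewer than one in twenty detections is a real raise, which is why these results read as dismal.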
5.5 Mouth Smile and Open Detection
> "Only 17.1% of student body instances had the requisite mouth landmarks present for EduSense's smile detector to execute." (Even less usable data.) Student smile vs. no smile classification accuracy was 77.1%
> Only 21.0% of instructor body instances had the required facial landmarks (again, far less test data). Instructor smile vs. no smile classification accuracy was 72.6%
> For mouth open/closed detection, accuracy was stronger: 96.5% (students) and 82.3% (instructors). (Note that the great majority of samples, about 94.8%, were closed-mouth.) (The authors' explanation: compared with a smile, an open mouth is harder to perceive)
> Finally, the authors return to resolution: open/closed detection still depends strongly on the resolution of the mouth, and annotators' judgments of "open" are also somewhat subjective, so this metric is only preliminary
5.6 Sit vs. Stand Classification
> We found that a vast majority of student lower bodies were occluded, which did not permit our classifier to produce a sit/stand classification, and thus we omit these results (so the deployed evaluation drops the student sit/stand metric)
> Even for instructors, lower-body keypoints were detected in only 66.3% of frames; within those, sitting and standing were recognized with 90.5% and 95.2% accuracy respectively (a small sample again; how credible is this?)
5.7 Speech/Silence & Student/Instructor Detection
> 關(guān)于Speech/Silence分類,作者分別選擇了"50段5秒長(zhǎng)的有聲"和"50段5秒長(zhǎng)的無(wú)聲",用來(lái)測(cè)試準(zhǔn)確率,最終結(jié)果是82%
> 關(guān)于Student/Instructor Detection,作者的方法是選擇”25段10秒長(zhǎng)的教師聲“和”25段10秒長(zhǎng)的學(xué)生音“,結(jié)果只有60%的準(zhǔn)確率能分別說(shuō)話者(意料之中,接近50%的隨機(jī)猜測(cè)概率)
> 作者認(rèn)為,現(xiàn)階段的說(shuō)話人識(shí)別受到教室的結(jié)構(gòu)和麥克風(fēng)采集位置的影響很大,而僅有兩個(gè)語(yǔ)音采集設(shè)備也是不夠的。想解決該問(wèn)題只能引入更復(fù)雜的方法:說(shuō)話人識(shí)別 speaker identification
5.8 Framerate & Latency
> 詳細(xì)的耗時(shí)分析參見(jiàn)Figure 15
> We achieve a mean student view processing framerate of between 0.3 and 2.0 FPS. (現(xiàn)階段線下視頻的處理速度有這么快嗎?)教師路2~3 times faster
> 根據(jù)耗時(shí)分析,實(shí)時(shí)系統(tǒng)的處理延時(shí)為3~5秒,其中各個(gè)部分耗時(shí)長(zhǎng)短依次是:IP cameras > backend processing > storing results > transmission (wired network)
> 作者認(rèn)為,未來(lái)更高端的 IP cameras將會(huì)減少時(shí)延,促進(jìn)實(shí)時(shí)系統(tǒng)的大規(guī)模應(yīng)用(5G + 高端嵌入式攝像頭處理芯片?)
※ End-user Applications
> Our future goal with EduSense is to power a suite of end-user, data-driven applications.
> 如何設(shè)計(jì)前端的展示頁(yè)面,也很講究,作者提出了幾種可能的選擇
? ? ● tracking the elapsed time of continuous speech, to help instructors inject lectures with pauses, as well as opportunities for student questions and discussion. (教師音檢測(cè)+計(jì)時(shí)?)
????●?automatically generated include suggestions to increase movement at the front of the class (教師軌跡?)
????●?and modify the ratio of facing the board vs. facing?the class. (教師朝向比例?)
????●?a cumulative heatmap of all student hand raised thus far in the lecture, which could facilitate selecting a student who has yet to contribute (舉手熱力圖?)
????●?a histogram of the instructor's gaze could highlight areas of the classroom receiving less visual attention (教師視線追蹤+統(tǒng)計(jì)?)
> 除了課上提供實(shí)時(shí)反饋的系統(tǒng)設(shè)計(jì),課下和每個(gè)學(xué)期末的分析總結(jié)報(bào)告,也很重要(制作成PDFs,并email給特定人群)
> 緊接著,作者繼續(xù)重申EduSense檢測(cè)教師指標(biāo)并提供實(shí)時(shí)意見(jiàn)可能起到的積極作用(包括gaze direction [65], gesticulation though hand movement [81], smiling [65], and moving around the classroom [55][70])
> A?web-based data visualizer (Figure 16): Node.js + ECharts + React (前端框架)

※?Discussion
> "Taken together, our controlled and real classroom studies offer the first comprehensive evaluation of a holistic audio- and computer-vision-driven classroom sensing system, offering new insights into the feasibility of automated class analytics" (a long sentence making grand claims)
> From the experiments and deployment, the authors distill some placement advice: for instance, do not choose overly large rooms (no more than about 8 m front to back), and mount cameras where they give a good view of the classroom
> The authors point out that algorithmic errors in the system cascade; the paper has characterized the upper and lower bounds of each module, stage by stage
> They go on to note that much work remains: maturing the system will need continued help from community research, along with frequent contact and communication with end users in universities and high schools
> "We also envision EduSense as a stepping stone towards the furthering of a university culture that values professional development for teaching" (a fine vision, and the one we hold for our own system)
※?Conclusion
1. We have presented our work on EduSense, a comprehensive classroom sensing system that produces a wide variety of theoretically-motivated features, using a distributed array of commodity cameras. (Contribution)
2. We deployed and tested our system in a controlled study, as well as real classrooms, quantifying the accuracy of key system features in both settings. (Evaluation)
3. We believe EduSense is an important step towards the vision of automated classroom analytics, which holds the promise of a fidelity, scale and temporal resolution that are impractical with the current practice of in-class observers. (Vision)
4. To further our goal of an extensible platform for classroom sensing that others can also build on, EduSense is open sourced and available to the community. (Call to action)
3. Novelty
Without question, EduSense is a remarkable intelligent classroom sensing system. Although my analysis throughout has deliberately set it against our own technical route and questioned many of its technical details, nothing changes the fact that it is the first complete system to apply modern AI across the board to classroom video analysis. Compared with the various earlier systems of its kind, the following points secure EduSense's position:
1) It genuinely brings GPU-powered AI algorithms into visual understanding of imagery, rather than the pseudo-AI that came before (relying entirely on hardware sensing, on poorly generalizing and non-robust traditional machine learning, or even on manual tallying);
2) It analyzes students' classroom state from as many dimensions and modalities as possible, spanning behavior, speech, and facial expression, instead of studying and sensing the classroom one-sidedly through a single isolated feature as earlier systems did;
3) The paper's structure deserves praise and imitation. Unlike conventional CV algorithm papers, UbiComp seems to prize the description of a complete, working system, asking authors to present their method through framework design, algorithmic detail, engineering deployment, application cases, and more. This may not count as novelty per se, but it was my strongest impression on first encountering UbiComp papers, and this style of writing suits systems-engineering articles well.
4. Summary
To be honest, EduSense is not optimal on some of its technical routes, and many of its technical details harbor holes or contradictions. Yet the flaws do not eclipse the merits, and the review committee still published it; being the first to eat the crab is a rare achievement. Since then, my attempts to publish our own work (called at various times AIClass, EduMaster, or StuArt) have all failed. Beyond the unavoidable reviewer comparisons with EduSense, which concluded the differences were minor (there are in fact many technical differences, but no reviewer cared to look into them anymore), there were also assorted student-privacy objections. That is partly because we took privacy and security too lightly, but the larger variable is likely the ever-tightening climate around privacy protection, in China and the West alike. So perhaps we will no longer agonize over publishing an EduSense 2.0 at UbiComp; as long as we keep digging deep into intelligent classroom analysis and evaluation, we are bound to produce differentiated results, and it will not be too late to take on UbiComp again then.