3D human pose reconstruction generally refers to recovering the three-dimensional pose of the human body with external devices. Compared with dense human geometry, the human skeleton is a compact representation of body pose. This installment focuses on skeleton-based pose reconstruction.

Industry already offers relatively mature 3D pose reconstruction solutions in the form of contact-based motion capture systems, such as the well-known optical system Vicon (Figure 1). Specially made optical markers are first attached to key locations on the body (e.g., the joints), and multiple dedicated motion capture cameras detect the markers in real time from different viewpoints. The spatial coordinates of each marker are then computed precisely by triangulation, and an inverse kinematics (IK) algorithm solves for the joint angles of the skeleton. Because of scene and equipment constraints and the high price, contact-based motion capture is hard for ordinary consumers to use, so researchers have turned to low-cost, non-contact, markerless motion reconstruction. This installment focuses on recent work on pose reconstruction with a single RGB-D camera or a single RGB camera.
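The triangulation step can be sketched in a few lines of NumPy: given the projection matrices of two calibrated cameras and the pixel coordinates of the same marker in both views, linear (DLT) triangulation recovers the marker's 3D position. This is a minimal sketch of the principle; the matrices and coordinates are illustrative, not taken from any particular capture system.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one marker from two calibrated views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: 2D pixel coordinates of the same marker in each view.
    Returns the 3D point in world coordinates.
    """
    # Each view contributes two linear constraints on the homogeneous 3D point.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of A with the smallest
    # singular value, i.e. the (approximate) null space of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

A real system solves this for every marker in every frame, typically with more than two views and a nonlinear refinement on top of the linear estimate.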
Pose reconstruction with a monocular RGB-D camera
RGB-D based 3D pose reconstruction methods fall into two categories [1]: discriminative methods and generative methods. Discriminative methods typically try to infer the 3D human pose directly from the depth image. Some of this work extracts features from the depth map that correspond to joint locations.
For example, Plagemann et al. [47] use geodesic extrema to identify salient points on the body and then detect 3D joint positions with local shape descriptors. Other discriminative methods rely on classifiers or regressors trained offline.
Shotton et al. [48] first train a random forest classifier on a large number of samples to segment the depth map into different body-part regions, and then estimate joint positions with the mean shift algorithm. Prediction requires little computation and runs in real time; the method was later integrated into the Kinect SDK for real-time 3D pose reconstruction.
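The two-stage pipeline of [48] can be illustrated with a toy sketch: classify pixels into body parts with a random forest, then run mean shift on each part's pixels to propose a joint position. The synthetic (x, y, depth) blobs below stand in for the depth-comparison features of the original paper; none of the numbers correspond to the real system.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)

# Toy stand-in for depth features: pixels from two "body parts" (head, hand),
# each a Gaussian blob in (x, y, depth) space.
head = rng.normal([0.0, 1.8, 2.0], 0.05, size=(200, 3))
hand = rng.normal([0.4, 1.1, 1.9], 0.05, size=(200, 3))
pixels = np.vstack([head, hand])
labels = np.array([0] * 200 + [1] * 200)

# Stage 1: per-pixel body-part classification with a random forest.
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(pixels, labels)
pred = clf.predict(pixels)

# Stage 2: mean shift over the pixels of each predicted part; the densest
# mode serves as the joint-position proposal for that part.
joints = {}
for part in (0, 1):
    ms = MeanShift(bandwidth=0.2).fit(pixels[pred == part])
    joints[part] = ms.cluster_centers_[0]
```

Because both stages are cheap at test time (tree traversals plus a local mode search), the approach lends itself to the real-time setting the paper targets.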
Taylor et al. [49] use a random forest to predict the depth-pixel regions belonging to each body joint, which are then used for pose optimization. Discriminative methods do not depend on tracking, which reduces accumulated error, and they handle fast motion naturally.
Unlike discriminative methods, generative methods match the observed data by deforming a parametric or non-parametric template. Ganapathi et al. [50] model the motion state with a dynamic Bayesian network (DBN) and infer the 3D pose in a maximum a posteriori (MAP) framework. The method requires the subject's body shape to be known in advance and cannot handle fast motion effectively. Ganapathi et al. [51] later improved on [50] with an extended ICP measurement model and free-space constraints; the new method dynamically adjusts the size of the parametric body template to fit the captured depth data.
Because of hardware limitations, RGB-D based methods are susceptible to depth-map noise and are only applicable at close range.
Pose reconstruction with a monocular RGB camera
Thanks to the emergence of large-scale video datasets with 3D human pose annotations (e.g., Human3.6M [52] and HumanEva [53]), deep learning based 3D pose reconstruction has developed rapidly. These methods use deep models to extract 3D joint positions directly from images or video [54-60].
Li et al. [54] were among the first to bring deep learning to 3D pose estimation. They designed a multi-task convolutional neural network combining detection and regression that learns features automatically from the image to regress 3D joint positions, outperforming earlier methods built on hand-crafted features.
Pavlakos et al. [56] proposed a voxel heatmap that describes the likelihood of each joint at every location in a 3D voxel space, together with a coarse-to-fine cascade that progressively refines the voxel heatmap predictions, achieving strong reconstruction accuracy. However, this voxel representation carries a large memory and computation cost; recently, [61] largely resolved this with an encoder-decoder design.
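Reading a joint coordinate out of a voxel heatmap is commonly done with a soft-argmax, i.e. the probability-weighted average of voxel indices. The sketch below shows that step in isolation (it is in the spirit of integral regression [58], not the exact coarse-to-fine pipeline of [56]):

```python
import numpy as np

def soft_argmax_3d(heatmap):
    """Extract a joint coordinate from a voxel heatmap as the
    softmax-weighted average of voxel indices (soft-argmax)."""
    # Softmax over all voxels, stabilized by subtracting the maximum.
    p = np.exp(heatmap - heatmap.max())
    p /= p.sum()
    # One index grid per axis; the expected index is the coordinate.
    grids = np.meshgrid(*[np.arange(s) for s in heatmap.shape], indexing="ij")
    return np.array([(g * p).sum() for g in grids])
```

Unlike a hard argmax, this readout is differentiable and yields sub-voxel coordinates, which is why it is widely used on top of volumetric predictions.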
Beyond predicting 3D joint positions directly, other work predicts bone orientations [64, 65], joint angles [66], or bone vectors [67, 68]. All of the above are trained in a fully supervised manner, and because the training data are collected in controlled environments, the resulting models usually generalize poorly to in-the-wild scenes.
To improve generalization, some work applies weak supervision to in-the-wild images, for example with a domain discriminator [69] or a bone-length prior [70].
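A bone-length prior is attractive as weak supervision because it needs no 3D labels. One simple form of such a prior, a penalty on left/right bone-length asymmetry, can be written as follows; the two-bone skeleton and the joint indexing are hypothetical, and [70] uses a more elaborate geometric constraint.

```python
import numpy as np

# Hypothetical skeleton: (parent, child) joint indices for one left/right
# bone pair; a real skeleton has many more bones.
LEFT_BONES = [(0, 1)]
RIGHT_BONES = [(0, 2)]

def bone_length_symmetry_loss(joints_3d):
    """Penalize left/right bone-length asymmetry: a label-free prior that
    can supervise predictions on images without 3D annotations."""
    loss = 0.0
    for (pa, ca), (pb, cb) in zip(LEFT_BONES, RIGHT_BONES):
        la = np.linalg.norm(joints_3d[ca] - joints_3d[pa])
        lb = np.linalg.norm(joints_3d[cb] - joints_3d[pb])
        loss += (la - lb) ** 2
    return loss
```

During training, such a term is added to the supervised loss computed on the annotated (in-lab) portion of the batch.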
Another family of 3D pose estimation methods uses the 2D human pose as an intermediate representation: 2D joints are first obtained in the image by manual annotation or automatic detection [71-74], and then lifted to 3D by regression [57, 62, 75] or model fitting [76].
Martinez et al. [62] designed a simple but effective fully connected network that takes 2D joint positions as input and outputs 3D joint positions, as shown in Figure 2.
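The forward pass of such a lifting network is short enough to sketch. The sketch below uses random weights and omits the batch normalization and dropout of [62], and the layer count and hidden width are simplified assumptions; it only shows the input/output shapes and the residual fully-connected structure.

```python
import numpy as np

rng = np.random.default_rng(0)
J = 17  # number of joints, following the Human3.6M convention

def linear(x, w, b):
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical untrained weights; [62] uses 1024-unit layers with two
# residual blocks, plus batch norm and dropout, omitted here.
d = 1024
W_in, b_in = rng.normal(0, 0.01, (2 * J, d)), np.zeros(d)
W1, b1 = rng.normal(0, 0.01, (d, d)), np.zeros(d)
W2, b2 = rng.normal(0, 0.01, (d, d)), np.zeros(d)
W_out, b_out = rng.normal(0, 0.01, (d, 3 * J)), np.zeros(3 * J)

def lift_2d_to_3d(pose_2d):
    """Map a flattened 2D pose (2J,) to a flattened 3D pose (3J,)."""
    h = relu(linear(pose_2d, W_in, b_in))
    # One residual block: two linear+ReLU layers with a skip connection.
    h = h + relu(linear(relu(linear(h, W1, b1)), W2, b2))
    return linear(h, W_out, b_out)

pose_3d = lift_2d_to_3d(rng.normal(size=2 * J))
```

The appeal of this design is that the entire 3D inference reduces to a tiny multilayer perceptron once 2D detection is solved, which is why it remains a standard baseline.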
Zhao et al. [75] subsequently proposed a semantic graph convolution module that captures the topological relations between body joints (such as body symmetry), further improving 3D reconstruction accuracy. However, lifting a 2D pose to 3D is inherently ambiguous, because many 3D poses can project to the same 2D pose [77]. Recent work tries to reduce this ambiguity by injecting additional prior knowledge [78-80].

All of the above are discriminative models, and the predicted 3D joint positions may violate anatomical constraints (e.g., asymmetric or implausibly proportioned bone lengths) or kinematic constraints (joint angles beyond their limits). Mehta et al. [63] fit a skeleton template to the predicted 2D and 3D joint positions and built VNect, the first real-time 3D pose reconstruction system based on an RGB camera, obtaining fairly accurate reconstruction results, as shown in Figure 3.
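The simplest way a skeleton fit enforces kinematic constraints is to restrict each joint angle to its anatomical range. A minimal sketch, with made-up limits and a plain clipping step (the actual limits and the optimization in [63] are different):

```python
import numpy as np

# Hypothetical per-joint angle limits in radians; a kinematic-skeleton fit
# keeps each estimated angle inside its valid range.
LIMITS = {"elbow": (0.0, 2.6), "knee": (0.0, 2.4)}

def clamp_joint_angles(angles):
    """Project estimated joint angles onto their anatomical ranges."""
    return {name: float(np.clip(theta, *LIMITS[name]))
            for name, theta in angles.items()}
```

In a full system this constraint is imposed inside the IK objective rather than as a post-hoc projection, so the rest of the pose can compensate for the clipped degrees of freedom.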
References
Continued from the reference list in the previous installment.
[47] PLAGEMANN C, GANAPATHI V, KOLLER D, et al. Real-time identification and localization of body parts from depth images[C]//2010 IEEE International Conference on Robotics and Automation. IEEE, 2010: 3108-3113.
[48] SHOTTON J, FITZGIBBON A, COOK M, et al. Real-time human pose recognition in parts from single depth images[C]//CVPR 2011. 2011: 1297-1304.
[49] TAYLOR J, SHOTTON J, SHARP T, et al. The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation[C]//2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012: 103-110.
[50] GANAPATHI V, PLAGEMANN C, KOLLER D, et al. Real time motion capture using a single time-of-flight camera[C]//2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010: 755-762.
[51] GANAPATHI V, PLAGEMANN C, KOLLER D, et al. Real-time human pose tracking from range data[C]//European Conference on Computer Vision. Springer, 2012: 738-751.
[52] IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 36(7): 1325-1339.
[53] SIGAL L, BALAN A O, BLACK M J. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion[J]. International Journal of Computer Vision, 2010, 87(1-2): 4.
[54] LI S, CHAN A B. 3D human pose estimation from monocular images with deep convolutional neural network[C]//Asian Conference on Computer Vision. Springer, 2014: 332-347.
[55] POPA A I, ZANFIR M, SMINCHISESCU C. Deep multitask architecture for integrated 2D and 3D human sensing[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6289-6298.
[56] PAVLAKOS G, ZHOU X, DERPANIS K G, et al. Coarse-to-fine volumetric prediction for single-image 3D human pose[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2017: 7025-7034.
[57] FANG H S, XU Y, WANG W, et al. Learning pose grammar to encode human body configuration for 3D pose estimation[C]//Proceedings of the AAAI Conference on Artificial Intelligence: volume 32. 2018.
[58] SUN X, XIAO B, WEI F, et al. Integral human pose regression[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 529-545.
[59] LEE K, LEE I, LEE S. Propagating LSTM: 3D pose estimation based on joint interdependency[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 119-135.
[60] HABIBIE I, XU W, MEHTA D, et al. In the wild human pose estimation using explicit 2D features and intermediate 3D representations[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10905-10914.
[61] FABBRI M, LANZI F, CALDERARA S, et al. Compressed volumetric heatmaps for multi-person 3D pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 7204-7213.
[62] MARTINEZ J, HOSSAIN R, ROMERO J, et al. A simple yet effective baseline for 3D human pose estimation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2640-2649.
[63] MEHTA D, SRIDHAR S, SOTNYCHENKO O, et al. VNect: Real-time 3D human pose estimation with a single RGB camera[J]. ACM Transactions on Graphics (TOG), 2017, 36(4): 44.
[64] LUO C, CHU X, YUILLE A. OriNet: A fully convolutional network for 3D human pose estimation[J]. arXiv preprint arXiv:1811.04989, 2018.
[65] JOO H, SIMON T, SHEIKH Y. Total capture: A 3D deformation model for tracking faces, hands, and bodies[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8320-8329.
[66] HABERMANN M, XU W, ZOLLHOEFER M, et al. DeepCap: Monocular human performance capture using weak supervision[J]. arXiv: Computer Vision and Pattern Recognition, 2020.
[67] SUN X, SHANG J, LIANG S, et al. Compositional human pose regression[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2602-2611.
[68] SUN X, LI C, LIN S. Explicit spatiotemporal joint relation learning for tracking human pose[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 2019: 0-0.
[69] YANG W, OUYANG W, WANG X, et al. 3D human pose estimation in the wild by adversarial learning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5255-5264.
[70] ZHOU X, HUANG Q, SUN X, et al. Towards 3D human pose estimation in the wild: a weakly-supervised approach[C]//IEEE International Conference on Computer Vision. 2017: 398-407.
[71] WEI S E, RAMAKRISHNA V, KANADE T, et al. Convolutional pose machines[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4724-4732.
[72] NEWELL A, YANG K, DENG J. Stacked hourglass networks for human pose estimation[C]//European Conference on Computer Vision. 2016: 483-499.
[73] CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
[74] XIAO B, WU H, WEI Y. Simple baselines for human pose estimation and tracking[C]//The European Conference on Computer Vision (ECCV). 2018.
[75] ZHAO L, PENG X, TIAN Y, et al. Semantic graph convolutional networks for 3D human pose regression[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 3425-3435.
[76] CHEN C H, RAMANAN D. 3D human pose estimation = 2D pose estimation + matching[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 7035-7043.
[77] HOSSAIN M R I, LITTLE J J. Exploiting temporal information for 3D human pose estimation[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 68-84.
[78] TEKIN B, MÁRQUEZ-NEILA P, SALZMANN M, et al. Learning to fuse 2D and 3D image cues for monocular body pose estimation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3941-3950.
[79] WANG J, HUANG S, WANG X, et al. Not all parts are created equal: 3D pose estimation by modeling bi-directional dependencies of body parts[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 7771-7780.
[80] PAVLAKOS G, ZHOU X, DANIILIDIS K. Ordinal depth supervision for 3D human pose estimation[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7307-7316.