SoccerDB: A Large-Scale Database for Comprehensive Video Understanding
ABSTRACT
Soccer videos can serve as a perfect research object for video understanding because soccer games are played under well-defined rules while being complex and intriguing enough for researchers to study. In this paper, we propose a new soccer video database named SoccerDB, comprising 171,191 video segments from 346 high-quality soccer games. The database contains 702,096 bounding boxes, 37,709 essential event labels with time boundaries, and 17,115 highlight annotations for the object detection, action recognition, temporal action localization, and highlight detection tasks. To our knowledge, it is the largest database for comprehensive sports video understanding, covering various analysis aspects. We further survey a collection of strong baselines on SoccerDB, which have demonstrated state-of-the-art performance on the independent tasks. Our evaluation suggests that we can benefit significantly from jointly considering the inner correlations among those tasks. We believe the release of SoccerDB will tremendously advance research on comprehensive video understanding. Our dataset and code are published at https://github.com/newsdata/SoccerDB.
1 INTRODUCTION
Comprehensive video understanding is a challenging task in computer vision. It has been explored through action recognition, temporal action localization, object detection, object tracking, and so on. However, most works on video understanding focus on isolated aspects of video analysis and ignore the inner correlations among those tasks.
There are many obstacles for researchers conducting such correlation studies: first, manually annotating multiple tasks' labels on a large-scale video database is extremely time-consuming; second, different approaches lack a fair and uniform benchmark, free of interfering factors, for rigorous quantitative analysis; third, some datasets focus on areas that are not challenging or valuable enough to attract researchers' attention. We need research objects that are challenging, with clear rules and constrained conditions, so that we can accurately study the questions we are interested in. In this paper, we choose soccer matches as our research object and construct a dataset with multiple visual understanding tasks covering various analysis aspects, aiming at building algorithms that can comprehensively understand videos like a human.
1.1 Soccer Video Understanding
Soccer video understanding is not only valuable to academic communities but also lucrative in the commercial world. The European soccer market generates annual revenue of $28.7 billion [6]. Regarding soccer content production, automatic soccer video analysis can help editors produce match summaries, visualize key players' performance for tactical analysis, and so on. Some pioneering companies like GameFace and SportLogiq apply this technology to match statistics to analyze strategies and players' performance. However, automatic video analysis has not fully met the market's needs. The CEO of Wyscout claims the company employs 400 people on soccer data, each of whom takes over 8 hours to provide up to 2,000 annotations per game [6].
1.2 Object Detection
Object detection has seen huge development over the past few years and has reached human-level performance in applications including face detection, pedestrian detection, etc. Localizing instances of semantic objects in images is a fundamental task in computer vision. In soccer video analysis, a detection system can help us find the positions of the ball, players, and goalposts on the field. With this position information, we can produce engaging visualizations, as shown in Figure 1, for tactical analysis or to enhance the fan experience. Though many advanced detection systems can output reliable results under various conditions, many challenges remain when the object is small, fast-moving, or blurred. In this work, we construct a soccer game object detection dataset and benchmark two state-of-the-art detection models under different frameworks: RetinaNet [11], a "one-stage" detection algorithm, and Faster R-CNN [15], a "two-stage" detection algorithm.


1.3 Action Recognition
Action recognition is also a core video understanding problem and has achieved a lot over the past few years. Large-scale datasets such as Kinetics [3], Sports-1M [9], and YouTube-8M [1] have been published, and many state-of-the-art deep learning-based algorithms, such as I3D [3], Non-local Neural Networks [20], and SlowFast Networks [5], have been proposed for this task. While supervised learning has shown its power on large-scale recognition datasets, it fails when training data are scarce. In soccer games, key events such as penalty kicks are rare, which means many state-of-the-art recognition models cannot output convincing results when facing these tasks. We hope this problem can be further investigated by considering multiple objects' relationships as a whole in the dataset.
In this paper, we also provide our insight into the relationship between object detection and action recognition. We observe that, since soccer matches involve a single type of scene and a small set of object classes, it is extraordinarily crucial to model the spatial relationships of objects and their change over time. Imagine that you could only see the players, the ball, and the goalposts in a game screenshot: could you still understand what is happening on the field? Looking at the left picture in Figure 2, you may have guessed right: that is the moment of a shot. Although modeling human-object or object-object interactions to improve action recognition has been explored in recent years [7] [21], we still need a closer look at how to use detection knowledge to boost action recognition more efficiently. Our experiments show that the performance of a state-of-the-art action recognition algorithm can be increased by a large margin when combined with object class and location knowledge.


1.4 Temporal Action Localization
Temporal action localization is a significant and more complicated problem than action recognition in video understanding because it requires recognizing both the action category and the temporal boundary of an event. The definition of an event's temporal boundary is ambiguous and subjective; for instance, annotations in some famous databases like Charades and MultiTHUMOS are not consistent among different human annotators [17]. This also increased the difficulty of labeling SoccerDB. To overcome the ambiguity challenge, we define soccer events with a particular emphasis on time boundaries, based on the events' actual meaning under soccer rules. For example, we define a red/yellow card event as starting when the referee shows the card and ending when the game resumes. This definition helps us obtain more consistent action localization annotations.
1.5 Highlight Detection
The purpose of highlight detection is to distill interesting content from a long video. Because of the subjectivity problem, constructing a highlight detection dataset usually requires multiple annotators labeling the same video, which greatly increases costs and limits the scale of the dataset [18]. We find that in soccer TV broadcasts, video segments containing highlight events are usually replayed many times, which can be taken as an important clue for soccer video highlight detection. Many works have explored highlight detection while considering replays: Zhao Zhao et al. proposed a highlight summarization system by modeling the Event-Replay (ER) structure [22], and A. Ravents et al. used audio-visual descriptors for automatic summarization, introducing replays to improve robustness [14]. SoccerDB provides a playback label and revisits this problem by considering the relationship between actions and highlight events.
1.6 Contributions
• We introduce a challenging database for comprehensive soccer video understanding, covering object detection, action recognition, temporal action localization, and highlight detection. These tasks, crucial to video analysis, can be investigated in closed form under a constrained environment.
• We provide strong baseline systems for each task, which are not only meant for academic research but are also valuable for automatic soccer video analysis in industry.
• We discuss the benefits of considering the inner connections among different tasks: we demonstrate that modeling objects' spatial-temporal relationships from detection results provides a representation complementary to the convolution-based model learned from RGB, increasing action recognition performance by a large margin, and that joint training on action recognition and highlight detection boosts the performance of both tasks.
2 RELATED WORK
2.1 Sports Analytics
Automated sports analytics, particularly for soccer and basketball, is popular around the world. The topic has been profoundly researched by the computer vision community over the past few years. Vignesh Ramanathan et al. introduced a new attention mechanism on RNNs to identify the key player of an event in basketball games [13]. Silvio Giancola et al. focused on temporal soccer event detection for finding highlight moments in soccer TV broadcast videos [6]. Rajkumar Theagarajan et al. presented an approach that generates visual analytics and player statistics for solving the talent identification problem in soccer match videos [19]. Huang-Chia Shih surveyed 251 sports video analysis works from a content-based viewpoint to advance broadcast sports video understanding [16]. The above works are only the tip of the iceberg among the numerous research achievements in the sports analytics area.
2.2 Datasets
Many datasets have contributed to sports video understanding. Vignesh Ramanathan et al. provided 257 basketball games with 14K event annotations corresponding to 10 event classes for event classification and detection [13]. Karpathy et al. collected one million sports videos from YouTube belonging to 487 sports classes, greatly promoting deep learning research on action recognition [9]. Datasets for video classification in the wild have played a vital role in related research. Two famous large-scale datasets, YouTube-8M [1] and Kinetics [3], have been widely investigated and have inspired most of the state-of-the-art methods of the last few years. Google proposed the AVA dataset to tackle the dense activity understanding problem, containing 57,600 clips of 3 seconds duration taken from feature films [8]. ActivityNet explored general activity understanding by providing 849 video hours of 203 activity classes, with an average of 137 untrimmed videos per class and 1.41 activity instances per video [2]. Although ActivityNet considers video understanding from multiple aspects, including semantic ontology, trimmed and untrimmed video classification, and spatial-temporal action localization, we argue that it is still far from human-comparable general activity understanding in an unconstrained environment. Part of the source videos in our dataset were collected from SoccerNet [6], a benchmark with a total of 6,637 temporal annotations on 500 complete soccer games from six main European leagues. A comparison between different databases is shown in Table 3.


3 CREATING SOCCERDB
3.1 Object Detection Dataset Collection
To train a detector robust to different scenes, we increase the diversity of the dataset by collecting data from both images and videos. We crawl 24,475 images of soccer matches from the Internet covering as many different scenes as possible, then use them to train a detector for bootstrapping the labeling process. For the video part, we collect 103 hours of soccer match videos, including 53 full matches and 18 half matches, whose sources are described in Section 3.2. To increase the difficulty of the dataset, we auto-label each frame of the videos with the detector trained on the image set, then select the keyframes with poor predictions as dataset proposals. Finally, we select 45,732 frames from the videos for the object detection task. As shown in Table 1, the total number of bounding box labels for the image part is 142,579, with 117,277 player boxes, 19,072 ball boxes, and 6,230 goal boxes; the total number for the video part is 702,096, with 643,581 player boxes, 45,160 ball boxes, and 13,355 goal boxes. We also compute box scales following the COCO definition [12]. The image part is randomly split into 21,985 images for training and 2,490 for testing. For the video part, we randomly select 18 half-matches for testing and use the other matches for training, yielding 38,784 frames for training and 6,948 for testing.
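The hard-frame mining step above (auto-labeling video frames with the image-trained detector and keeping keyframes where its predictions are poor) can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the confidence threshold, the use of per-frame maximum detection confidence, and the minimum spacing between selected frames are all assumptions.

```python
def select_hard_frames(frame_scores, conf_thresh=0.5, min_gap=25):
    """Pick frames where the image-trained detector is least confident.

    frame_scores: one float per frame, the maximum detection confidence
    on that frame (0.0 if nothing was detected).
    min_gap: minimum spacing in frames so proposals are not near-duplicates.
    Returns the frame indices proposed for manual annotation.
    """
    proposals, last = [], -min_gap
    for idx, score in enumerate(frame_scores):
        if score < conf_thresh and idx - last >= min_gap:
            proposals.append(idx)
            last = idx
    return proposals

# toy confidence trace over six frames
hard = select_hard_frames([0.9, 0.2, 0.95, 0.1, 0.3, 0.85],
                          conf_thresh=0.5, min_gap=2)
```

Spacing the selected frames avoids annotating many nearly identical consecutive frames from the same difficult moment.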



3.2 Video Dataset Collection
We adopt 346 high-quality full soccer match videos, including 270 matches from SoccerNet [6] covering six main European leagues across the three seasons from 2014 to 2017, 76 match videos from the China Football Association Super League from 2017 to 2018, and the 18th, 19th, and 20th FIFA World Cups. The whole dataset consumes 1.4 TB of storage, with a total duration of 668.6 hours. We randomly split the games into 226 for training, 63 for validation, and 57 for testing. None of the videos used for object detection are included in this video dataset.
3.3 Event Annotations
We define ten different soccer events, which are usually the highlights of a soccer game, with standard rules for their definition. We define the event boundaries as clearly as possible and annotate all of them densely in long soccer videos. The annotation system records the start/end time of an event, its category, and whether the event is a playback. An annotator takes about three hours to label one match, and another experienced annotator reviews those annotations to ensure the outcome's quality.
3.4 Video Segmentation Processing
We split the dataset into segments of 3 to 30 seconds for easier processing. We make sure an event is never divided into two segments, keeping each event's temporal boundary within one segment. Video without any event is randomly split into 145,473 clips with durations from 3 to 20 seconds. All processed segments are checked again by humans to avoid annotation mistakes; some confusing segments are discarded during this process. Finally, we obtain a total of 25,719 video segments with event annotations (the core dataset) and 145,473 background segments. There are 1.47 labels per segment in the core dataset.
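The random splitting of event-free footage into 3-to-20-second background clips can be sketched as below. This is a hypothetical illustration of the cutting rule, assuming uniform random clip lengths and that a leftover tail shorter than the minimum is absorbed into the previous clip; the paper does not specify these details.

```python
import random

def split_background(duration, min_len=3.0, max_len=20.0, seed=0):
    """Cut an event-free stretch of video into random clips of 3-20 s.

    Returns a list of contiguous (start, end) tuples covering
    [0, duration); a final remainder shorter than min_len is merged
    into the previous clip so no clip falls below the minimum length.
    """
    rng = random.Random(seed)
    clips, t = [], 0.0
    while duration - t >= min_len:
        length = rng.uniform(min_len, min(max_len, duration - t))
        clips.append((t, t + length))
        t += length
    if clips and t < duration:          # absorb a too-short tail
        start, _ = clips[-1]
        clips[-1] = (start, duration)
    return clips

clips = split_background(65.0)
```

Keeping clips contiguous means no background footage is lost between consecutive clips, matching the stated total of 145,473 background segments drawn from all event-free video.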
3.5 Dataset Analysis
Detailed SoccerDB statistics are shown in Table 2. A total of 14,358 segments have shot labels, accounting for 38.07% of all events excluding the background. In contrast, we collected only 156 segments for penalty kick and 1,160 for red and yellow card, accounting for 0.41% and 3.07%, respectively. Since the dataset has an extreme class imbalance problem, it is difficult for existing state-of-the-art supervised methods to produce convincing results. We also explored the distribution of playbacks and found it relevant to the event type: every goal event has playbacks, in contrast with only a 1.6% proportion of substitution events. In Section 5.5 we verify this relevance. As noted in Section 2.2, we also compare many aspects of other popular datasets with ours. Our dataset supports a wider variety of tasks and more detailed soccer class labels for constrained video understanding.


4 THE BASELINE SYSTEM
To evaluate the capability of current video understanding technologies, and also to understand the challenges posed by the dataset, we developed algorithms that have shown strong performance on various datasets, providing strong baselines for future work to compare with. In our baseline system, the action recognition sub-module plays an essential role by providing the basic visual representation for both the temporal action detection and highlight detection tasks.
4.1 Object Detection
We adopt two representative object detection algorithms as baselines. One is Faster R-CNN, developed by Shaoqing Ren et al. [15]; the algorithm and its variants have been widely used in many detection systems in recent years. Faster R-CNN is a two-stage detector: an RPN proposes a set of regions of interest (RoIs), then a classifier and a regressor process only those region candidates to obtain each RoI's category and precise bounding box coordinates. The other is RetinaNet, a well-known one-stage detector. Its authors, Tsung-Yi Lin et al., identified the extreme foreground-background class imbalance as the central cause of one-stage detectors' inferior accuracy and introduced the focal loss to solve this problem [11].
4.2 Action Recognition
We treat each class as a binary classification problem and adopt a cross-entropy loss for each class. Two state-of-the-art action recognition algorithms are explored: the SlowFast Networks and the Non-local Neural Networks. The SlowFast networks contain two pathways: a slow pathway, operating at a low frame rate to capture spatial semantics, and a fast pathway, operating at a high frame rate to capture motion patterns. We use ResNet-50 as the backbone of the network. The Non-local Neural Networks, proposed by Xiaolong Wang et al. [20], can capture long-range dependencies in a video sequence; the non-local operator is a generic building block that can be plugged into many deep architectures. We adopt a ResNet-50 I3D backbone with non-local blocks inserted.
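The per-class binary formulation above amounts to an independent sigmoid plus binary cross-entropy for every event class. A minimal NumPy sketch of that loss (shapes and the clipping epsilon are illustrative choices, not taken from the paper):

```python
import numpy as np

def per_class_bce(logits, labels):
    """Independent sigmoid + binary cross-entropy per event class.

    logits, labels: arrays of shape (batch, num_classes), labels in {0, 1}.
    Returns the mean loss over all per-class binary decisions.
    """
    p = 1.0 / (1.0 + np.exp(-logits))          # per-class sigmoid
    p = np.clip(p, 1e-7, 1 - 1e-7)             # numerical safety
    loss = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(loss.mean())

# one clip, three hypothetical classes; only the first label is present
loss = per_class_bce(np.array([[2.0, -1.5, 0.0]]),
                     np.array([[1.0, 0.0, 0.0]]))
```

Treating classes independently, rather than with a softmax, matches the dataset's multi-label segments (1.47 labels per core segment on average), since several events can co-occur in one clip.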
4.3 Transfer Knowledge from Object Detection to Action Recognition
We survey the relationship between object detection and action recognition based on the Faster R-CNN and SF-32 networks (the slowFast framework sampling 32 frames per video segment) described in Sections 4.1 and 4.2. First, we use Faster R-CNN to detect objects in each sampled frame. Then, as shown in Figure 3, we add a new branch to SF-32 that explicitly models object spatial-temporal interactions, to examine whether object detection can provide complementary object interaction knowledge that a convolution-based model cannot learn from the RGB sequence.

Figure 3: Structure of the Mask and RGB Two-Stream (MRTS) approach.
The Mask and RGB Two-Stream (MRTS) approach works as follows. We generate object masks of the same spatial size as the RGB frame; the number of mask channels equals the number of object classes. In each channel, representing one object class, areas containing objects of that class are set to 1 and all others to 0. We set up a two-stream ConvNet architecture: one stream takes the mask as input, and the other takes the original RGB frame. The two streams are merged by concatenating their last fully-connected layers. We suppose that if spatial-temporal modeling of object locations provides a complementary representation, this approach should exceed the baseline SF-32 network's performance by a large margin.
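The mask construction for the MRTS mask stream can be sketched as below: one binary channel per object class (e.g. player / ball / goal), rasterized from detection boxes at the RGB frame's resolution. The function name and box format are illustrative assumptions.

```python
import numpy as np

def boxes_to_mask(boxes, num_classes, height, width):
    """Rasterize detections into the MRTS mask input.

    One channel per object class; pixels inside any box of that class
    are 1, all others 0, matching the RGB frame's spatial size.
    boxes: iterable of (class_id, x1, y1, x2, y2) in pixel coordinates.
    """
    mask = np.zeros((num_classes, height, width), dtype=np.float32)
    for cls, x1, y1, x2, y2 in boxes:
        mask[cls, y1:y2, x1:x2] = 1.0   # fill the box region of its class channel
    return mask

# two toy detections: a class-0 box and a class-1 box on a 224x224 frame
m = boxes_to_mask([(0, 10, 20, 30, 40), (1, 0, 0, 5, 5)],
                  num_classes=3, height=224, width=224)
```

Stacking these masks over the sampled frames gives the mask stream a purely positional view of the scene, which is exactly the signal the MRTS experiment isolates from appearance.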
4.4 Temporal Action Detection
We explore temporal action detection with a two-stage method: first, a class-agnostic algorithm generates potential event proposals; then, classifying those proposals yields the final temporal boundary localization. In the first stage, we utilize the Boundary-Matching Network (BMN), a "bottom-up" temporal action proposal generation method, to generate high-quality proposals [10]. The BMN model is composed of three modules: (1) the Base module processes the extracted feature sequence of the original video and outputs a video embedding shared by the Temporal Evaluation Module (TEM) and the Proposal Evaluation Module (PEM); (2) the TEM evaluates the starting and ending probabilities of each location in the video to generate boundary probability sequences; (3) the PEM transfers the features to a boundary-matching feature map containing the proposals' confidence scores. In the second stage, an action recognition model from Section 4.2 predicts the classification score of each of the top-K proposals, and we choose the highest prediction score of each class as the final detection result.
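The second-stage fusion rule, keeping for each class the proposal with the highest classification score, can be sketched as below. The interfaces are hypothetical: `classify` stands in for the Section 4.2 recognition model, and real proposals would also carry BMN confidence scores.

```python
def detect_events(proposals, classify):
    """Second-stage fusion: classify each proposal and keep, per class,
    the proposal with the highest classification score.

    proposals: list of (start, end) time pairs from the proposal network.
    classify: callable mapping (start, end) -> {class_name: score}.
    Returns {class_name: (start, end, score)}.
    """
    best = {}
    for start, end in proposals:
        for cls, score in classify((start, end)).items():
            if cls not in best or score > best[cls][2]:
                best[cls] = (start, end, score)
    return best

# toy classifier: a fixed score table keyed by the proposal boundaries
scores = {(0.0, 4.0): {"shot": 0.9, "corner": 0.2},
          (5.0, 9.0): {"shot": 0.4, "corner": 0.7}}
result = detect_events(list(scores), scores.get)
```

Under this rule each class contributes at most one detection per video, so proposal quality (stage one) and classification quality (stage two) can be analyzed separately.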
4.5 Highlight Detection
In this section, we formalize the highlight detection task as a binary classification problem: recognizing which video segments are playback videos. The structures of the highlight detection models are presented in Figure 4. We select the SF-32 network as the basic classifier and consider four scenarios:
• The fully-connected-only (fc-only) approach extracts features from the final fc layer of a model pre-trained on the action recognition task of Section 4.2, then trains a logistic regressor for highlight detection. This approach evaluates the strength of the representation learned by action recognition, indicating the internal correlation between the highlight detection and action recognition tasks.
• The fully fine-tuning (full-ft) approach fine-tunes a binary classification network whose weights are initialized from the action recognition model.
• The multi-task (mt) approach trains a multi-label classification network for both action recognition and highlight detection. We adopt a per-label sigmoid output followed by a logistic loss at the end of the slowFast-32 network. This approach treats highlight segments as another action label in the action recognition framework. The advantage of this setting is that it forces the network to learn the relevance among tasks; the disadvantage is that the new label may introduce noise that confuses the learning procedure.
• The multi-task with highlight detection branch (mt-hl-branch) approach adds a new two-layer 3x3x3 convolution branch for playback recognition, sharing the same backbone with the recognition task. We first train only the highlight detection branch, freezing the parameters initialized from the action recognition pre-trained model, then fine-tune all parameters for multi-task learning.
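The fc-only scenario, fitting only a logistic regressor on frozen action recognition features, can be sketched as below. The features here are random stand-ins for the real fc activations, and the plain-gradient-descent trainer is an illustrative assumption, not the authors' exact setup.

```python
import numpy as np

def train_logistic(features, labels, lr=0.5, steps=500):
    """Fit a logistic regressor on frozen backbone features.

    features: (n, d) array of fc-layer activations (frozen, not updated).
    labels: (n,) array of 0/1 playback labels.
    Returns the learned weights and bias.
    """
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = p - labels                   # d(BCE)/d(logit)
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

# toy separable "features": playbacks have a larger first activation
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)
w, b = train_logistic(X, y)
acc = (((X @ w + b) > 0) == (y == 1)).mean()
```

Because only the linear head is trained, this scenario's accuracy directly measures how much playback information the frozen action recognition representation already carries, which is exactly the correlation the fc-only experiment probes.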


5 EXPERIMENTS
In this section, we focus on the performance of our baseline system on SoccerDB for object detection, action recognition, temporal action detection, and highlight detection tasks.
5.1 Object Detection
We choose ResNeXt-101 with FPN as the backbone of both RetinaNet and Faster R-CNN. We use models pre-trained on the MS-COCO dataset and train on 8 NVIDIA 2080 Ti GPUs, with an initial learning rate of 0.01 for RetinaNet and 0.02 for Faster R-CNN. The MS-COCO-style evaluation method is applied to benchmark the models. The training data from the video part and the image part are mixed to train each model. We present AP with IoU=0.5:0.95 over multiple scales in Table 4 and report the AP of each class in Table 5. RetinaNet performs better than Faster R-CNN, and large objects are easier for both methods than small ones. The ball detection result is lower than those of the player and goal due to the ball's small scale and motion blur. All detection experiments are powered by the mmdetection toolbox, developed by the winners of the 2018 COCO detection challenge [4].
5.2 Action Recognition
We set up the experiments with the open-source tool PySlowFast and initialize all recognition networks from Kinetics pre-trained models. Since some labels are rare in the dataset, we adjust the distribution of labels appearing in each training batch to balance their proportions. We resize the original video frames to 224x224 pixels and apply random horizontal flips during training. At inference time, we only resize frames to 224x224 without flipping. We compare sampling 32 and 64 frames to investigate the influence of the sample rate. The average precision (AP) score for each class is reported in Table 6.
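The label balancing can be approximated with inverse-frequency sample weights; a hypothetical sketch (the exact rebalancing scheme used in batch construction may differ):

```python
import random
from collections import Counter

def balanced_weights(labels):
    """Per-sample weights inversely proportional to class frequency, so
    rare classes appear more often when sampling a training batch."""
    freq = Counter(labels)
    return [1.0 / freq[y] for y in labels]

def sample_batch(labels, weights, batch_size, rng=random):
    # draw indices with replacement, proportionally to the weights
    idx = list(range(len(labels)))
    return rng.choices(idx, weights=weights, k=batch_size)
```

With these weights, each class contributes equal total mass, so a rare class such as penalty kick is sampled as often as a frequent one such as shot.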


Dense frame sampling surpasses sparse sampling for both methods. Classes with more instances, such as shot, perform better than classes with fewer instances. Substitution and corner, whose visual features are discriminative relative to other classes, also obtain high AP scores. The AP of penalty kick fluctuates over a wide range because there are only 30 instances in the validation set.
5.3 Transfer Knowledge from Object Detection to Action Recognition
To make the results comparable, all basic experiment settings in this section are the same as described in Section 5.2. The average precision results of the MRST approach introduced in Section 4.3 are shown in Table 6.
From the experimental results, we can conclude that understanding the spatial-temporal interactions of basic objects is critical for action recognition. MRST improves on SF-32 by 15%, which demonstrates that object relationship modeling provides a complementary representation that a 3D ConvNet cannot capture from the RGB sequence alone.
5.4 Temporal Action Detection
In this section, we evaluate the performance of temporal action proposal generation and detection, and give a quantified analysis of how the action recognition task affects temporal action localization. For a fair comparison of different action detection algorithms, we benchmark our baseline system on the core dataset rather than on the outputs of the Section 4.2 models. We adopt the fc layer of the action classifier as a feature extractor over contiguous 32-frame windows, yielding features of length 2304. We slide a 32-frame window with a stride of 5 frames, which produces overlapping segments for each video. The feature sequence is rescaled to a fixed length D=100 by zero-padding or average pooling. To evaluate proposal quality, Average Recall (AR) under multiple IoU thresholds [0.5:0.05:0.95] is calculated. We report AR at different average numbers of proposals (AN) as AR@AN, and the area under the AR-AN curve (AUC) as in the ActivityNet-1.3 metric, where AN ranges from 0 to 100. To show the influence of the feature extractor on the detection task, we compare two SlowFast-32 pre-trained models: one trained on the SoccerDB action recognition task described in Section 4.2, the other trained on Kinetics. Table 7 reports the results of the two extractors.
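The fixed-length rescaling of the feature sequence can be sketched as follows, assuming features are given as plain Python lists of vectors; names are illustrative:

```python
def rescale_features(seq, target_len=100):
    """Rescale a sequence of feature vectors to a fixed length:
    zero-pad short sequences, average-pool long ones (a sketch of the
    fixed-D preprocessing described above)."""
    dim = len(seq[0])
    n = len(seq)
    if n <= target_len:
        # pad with zero vectors up to the target length
        pad = [[0.0] * dim for _ in range(target_len - n)]
        return seq + pad
    out = []
    # split into target_len nearly equal chunks and average each chunk
    for i in range(target_len):
        lo = i * n // target_len
        hi = max(lo + 1, (i + 1) * n // target_len)
        chunk = seq[lo:hi]
        out.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return out
```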


The feature extractor trained on SoccerDB exceeds the Kinetics extractor by 0.7% on the AUC metric. This means we benefit from training the feature encoder on the same dataset at the temporal action proposal generation stage, but the gain is limited. We use the same SF-32 classifier to produce the final detection results from the temporal proposals; the detection metric is mAP with IoU thresholds {0.3:0.1:0.7}. With Kinetics proposals the mAP is 52.35%, while with SoccerDB proposals it is 54.30%. The two feature encoders achieve similar performance for the following reasons: first, Kinetics is a very large-scale action recognition database that contains ample patterns for training a good general feature encoder; second, the algorithm we adopt at the proposal stage is strong enough to model the temporal locations of important events.
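A simplified sketch of the AR computation over IoU thresholds, with segments as (start, end) pairs; this omits the per-AN ranking used in the full AR@AN curve:

```python
def temporal_iou(p, g):
    """IoU between two temporal segments (start, end)."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = max(p[1], g[1]) - min(p[0], g[0])
    return inter / union if union > 0 else 0.0

def average_recall(proposals, ground_truth, thresholds):
    """Fraction of ground-truth segments matched by any proposal,
    averaged over the IoU thresholds (a simplified AR)."""
    recalls = []
    for t in thresholds:
        hit = sum(
            1 for g in ground_truth
            if any(temporal_iou(p, g) >= t for p in proposals)
        )
        recalls.append(hit / len(ground_truth))
    return sum(recalls) / len(recalls)
```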
5.5 Highlight Detection
We run the experiments on the whole SoccerDB dataset. The average precision of our four baseline models is shown in Table 8. The fc-only model achieves 68.72% AP, demonstrating that the action recognition model provides a strong representation for highlight detection and indicating a close relationship between our defined events and the highlight segments. The mt model decreases the AP of the full-ft model by 2.33%, which means highlight segments behave quite differently from action labels when sharing the same features. The mt-hl-branch model gives the highest AP by better utilizing the correlation between the two tasks while distinguishing their differences. We also find the mt model is harmful to recognition, decreasing mAP by 1.85% compared with the baseline model, whereas mt-hl-branch increases action recognition mAP by 1.46% while providing the highest highlight detection score. The detailed action recognition mAP of the three models is shown in Table 9. A better way of utilizing the connection between action recognition and highlight detection should be able to boost performance on both.




6 CONCLUSION
In this paper, we introduce SoccerDB, a new benchmark for comprehensive video understanding. It lets us study object detection, action recognition, temporal action detection, and video highlight detection together in a restricted but challenging environment. We explore many state-of-the-art methods on the different tasks and discuss the relationships among them. The quantified results show that different visual understanding tasks are closely connected, and algorithms can benefit substantially from considering these connections. We release the benchmark to the video understanding community in the hope of driving researchers toward building a human-comparable video understanding system.
7 ACKNOWLEDGMENTS
This work is supported by the State Key Laboratory of Media Convergence Production Technology and Systems, and Xinhua Zhiyun Technology Co., Ltd.
REFERENCES
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016).
[2] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 961–970.
[3] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[4] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. 2019. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv preprint arXiv:1906.07155 (2019).
[5] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision. 6202–6211.
[6] Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. 2018. Soccernet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1711–1721.
[7] Georgia Gkioxari, Ross Girshick, and Jitendra Malik. 2015. Contextual action recognition with r* cnn. In Proceedings of the IEEE international conference on computer vision. 1080–1088.
[8] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6047–6056.
[9] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Suk- thankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1725–1732.
[10] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. 2019. BMN: Boundary-Matching Network for Temporal Action Proposal Generation. In Proceedings of the IEEE International Conference on Computer Vision. 3889–3898.
[11] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision. 2980–2988.
[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755.
[13] Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy, and Li Fei-Fei. 2016. Detecting events and key actors in multi-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3043–3053.
[14] Arnau Raventos, Raul Quijada, Luis Torres, and Francesc Tarrés. 2015. Automatic summarization of soccer highlights using audio-visual descriptors. SpringerPlus 4, 1 (2015), 1–19.
[15] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
[16] Huang-Chia Shih. 2017. A survey of content-aware video analysis for sports. IEEE Transactions on Circuits and Systems for Video Technology 28, 5 (2017), 1212–1231.
[17] Gunnar A Sigurdsson, Olga Russakovsky, and Abhinav Gupta. 2017. What actions are needed for understanding human actions in videos?. In Proceedings of the IEEE International Conference on Computer Vision. 2137–2146.
[18] Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5179–5187.
[19] Rajkumar Theagarajan, Federico Pala, Xiu Zhang, and Bir Bhanu. 2018. Soccer: Who has the ball? Generating visual analytics and player statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1749–1757.
[20] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7794–7803.
[21] Xiaolong Wang and Abhinav Gupta. 2018. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV). 399–417.
[22] Zhao Zhao, Shuqiang Jiang, Qingming Huang, and Guangyu Zhu. 2006. Highlight summarization in sports video based on replay detection. In 2006 IEEE international conference on multimedia and expo. IEEE, 1613–1616.