This grand challenge aims to advance large-scale human-centric video analysis in complex events using multimedia techniques. We propose the largest existing dataset (named Human-in-Events, or HiEve) for understanding human motion, pose, and action in a variety of realistic events, especially crowded and complex events. Four challenging tasks are established on our dataset, encouraging researchers to address realistic and difficult problems in human-centric analysis. Our challenge will benefit research in a wide range of multimedia and computer vision areas, including multimedia content analysis.
We start by selecting several crowded places with complex and diverse events for video collection. In total, our video sequences are collected from 9 different scenes: airport, dining hall, indoor, jail, mall, square, school, station, and street. Most of these videos are selected from our own private sequences and contain complex interactions between persons. Then, to further increase the variety and complexity of behaviors in the videos, we searched YouTube for videos recording unusual scenes (e.g., jail, factory) and anomalous events (e.g., fighting, earthquake, robbery). For each scene, we keep several videos captured at different sites and with different types of events to ensure the diversity of scenarios. Moreover, data redundancy is avoided through manual checking. To protect the privacy of the people and organizations involved, we blurred faces and key text in the videos. Finally, 32 real-world video sequences in different scenes are collected, each containing one or more complex events. These video sequences are carefully split into a training set of 19 videos and a testing set of 13 videos, so that both sets cover all the scenes but with different camera angles or sites.
In our dataset, the bounding boxes, keypoint-based poses, human identities, and human actions are all manually annotated. The annotation procedure is as follows: First, similar to the MOT dataset, we annotate bounding boxes for all moving pedestrians (e.g., running, walking, fighting, riding) and static people (e.g., standing, sitting, lying). A unique track ID is assigned to each person and is kept until the person moves out of the camera's field of view.
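
As a rough illustration, the tracking annotations described above can be thought of as per-frame records carrying a bounding box and a track ID, similar to MOT-style CSV lines. The field layout and the helper function below are assumptions made for illustration, not the official HiEve annotation format.

from dataclasses import dataclass

@dataclass
class TrackedBox:
    """One person's bounding box in one frame; field names are illustrative."""
    frame: int     # frame index within the video
    track_id: int  # ID kept for a person until they leave the camera view
    x: float       # top-left corner (pixels)
    y: float
    w: float       # box width (pixels)
    h: float       # box height (pixels)

def parse_mot_style_line(line: str) -> TrackedBox:
    """Parse an assumed MOT-style CSV line 'frame,id,x,y,w,h,...'."""
    frame, tid, x, y, w, h = line.strip().split(",")[:6]
    return TrackedBox(int(frame), int(tid), float(x), float(y), float(w), float(h))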
Second, we annotate poses for each person throughout the entire video. Different from PoseTrack and COCO, our annotated pose for each body contains 14 keypoints (Figure 2a): nose, chest, shoulders, elbows, wrists, hips, knees, and ankles. Specifically, we skip the pose annotation when either of the following conditions holds: (1) the person is heavily occluded, or (2) the area of the bounding box is less than 500 pixels. Figure 2b presents some pose and bounding-box annotation examples.
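
For concreteness, a minimal sketch of the 14-keypoint layout and the pose-skipping rule follows; the keypoint ordering and the occlusion flag are assumptions made for illustration, not taken from the released annotation files.

# The 14 annotated keypoints (cf. Figure 2a); this particular ordering is an
# assumption for illustration, not the official index mapping.
KEYPOINTS = [
    "nose", "chest",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

MIN_BOX_AREA = 500  # bounding-box area (pixels) below which poses are skipped

def should_annotate_pose(box_w: float, box_h: float, heavily_occluded: bool) -> bool:
    """Apply the skip rule: no pose if heavily occluded or the box is too small."""
    return (not heavily_occluded) and (box_w * box_h >= MIN_BOX_AREA)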
Third, we annotate the actions of all individuals every 20 frames in a video. For group actions, we assign the action label to each group member involved in the group activity. In total, we define 14 action categories: walking-alone, walking-together, running-alone, running-together, riding, sitting-talking, sitting-alone, queuing, standing-alone, gathering, fighting, fall-over, walking-up-down-stairs, and crouching-bowing. Finally, all annotations are double-checked to ensure their quality.
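
The action vocabulary and the 20-frame labeling stride can be summarized as in the sketch below; the integer label IDs and the choice to start indexing at frame 0 are assumptions for illustration, not an official label mapping.

# The 14 action categories; the integer IDs come from list order and are an
# assumption, not an official label mapping.
ACTION_CLASSES = [
    "walking-alone", "walking-together", "running-alone", "running-together",
    "riding", "sitting-talking", "sitting-alone", "queuing", "standing-alone",
    "gathering", "fighting", "fall-over", "walking-up-down-stairs",
    "crouching-bowing",
]
ACTION_TO_ID = {name: i for i, name in enumerate(ACTION_CLASSES)}

ANNOTATION_STRIDE = 20  # actions are labeled every 20 frames

def annotated_frames(num_frames: int, stride: int = ANNOTATION_STRIDE):
    """Frame indices at which action labels are assumed to be provided."""
    return range(0, num_frames, stride)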
@misc{lin2020human,
  title={Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events},
  author={Weiyao Lin and Huabin Liu and Shizhan Liu and Yuxi Li and Guo-Jun Qi and Rui Qian and Tao Wang and Nicu Sebe and Ning Xu and Hongkai Xiong and Mubarak Shah},
  year={2020},
  eprint={2005.04490},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}