PANDA is the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. The videos in PANDA were captured by a gigapixel camera and cover real-world large-scale scenes with both a wide field of view (~1 km^2 area) and high-resolution details (~gigapixel-level per frame). The scenes may contain 4k head counts with over 100× scale variation. PANDA provides rich, hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups, and 2.9k interactions.
For PANDA-Image, the training images and test images are stored in two separate compressed packages. After decompression, the directory contains one folder per scene, named after the scene; each folder contains the images belonging to that scene.
For PANDA-Video, each video sequence is stored in a separate compressed package. The decompressed folder is named after the scene and contains the frame images of the video sequence.
The files human_bbox_train.json and vehicle_bbox_train.json contain the annotations of the pedestrians and vehicles, respectively, in the training-set images. human_bbox_test.json and vehicle_bbox_test.json contain only the image_filepath, image id, and image size for the test set. Please note that in results submitted for the test set, the image id must be the same as in the annotation file.
The pedestrian annotation file (human_bbox_train.json) is organized as follows:
{
image_filepath : image_dict,
...
}
image_dict{
"image id" : int,
"image size" : image_size,
"objects list" : [object_dict],
}
image_size{
"height" : int,
"width" : int,
}
If the object is a person:
object_dict{
"category" : "person",
"pose" : "standing" or "walking" or "sitting" or "riding" or "held" (a baby in the
arms) or "unsure",
"riding type" : "bicycle rider" or "motorcycle rider" "tricycle
rider" or "null" (when "pose" is not "riding"),
"age" : "adult" or "child" or "unsure",
"rects" : rects,
}
rects{
"head" : rect_dict,
"visible body" : rect_dict,
"full body" : rect_dict,
}
If the box marks a crowd, a reflection, a person-like object, etc., and needs to be ignored:
object_dict{
"category" : "ignore" (someone who
is heavily occluded) or "fake person" or "crowd" (extremely dense crowd),
"rect" : rect_dict,
}
rect_dict{
"tl" : {
"x" : float,
"y" : float,
},
"br" : {
"x" : float,
"y" : float,
}
}
- image_filepath is the relative path of the image.
- "category" is the key that determines whether the target box is a pedestrian or a special area that needs to be ignored. A pedestrian can only be "person".
- "riding type" is not "null" only if "pose" is "riding".
- "x" and "y" are floating point numbers between 0 and 1, representing the ratio of the coordinates to the width and height of the image, respectively (see the parsing sketch below).
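To make the schema concrete, here is a minimal Python sketch that loads a pedestrian annotation file and converts the normalized "full body" rectangles to pixel coordinates. The function name and the example path are illustrative assumptions, not part of an official toolkit.

import json

def load_person_boxes(anno_path):
    """Yield (image path, pixel-space full-body box) pairs from an
    annotation file structured as documented above."""
    with open(anno_path, "r") as f:
        annos = json.load(f)  # {image_filepath: image_dict, ...}
    for image_path, image_dict in annos.items():
        width = image_dict["image size"]["width"]
        height = image_dict["image size"]["height"]
        for obj in image_dict["objects list"]:
            if obj["category"] != "person":  # skip "ignore"/"fake person"/"crowd"
                continue
            rect = obj["rects"]["full body"]
            # "x"/"y" are ratios of the image width/height; scale to pixels.
            yield image_path, (rect["tl"]["x"] * width, rect["tl"]["y"] * height,
                               rect["br"]["x"] * width, rect["br"]["y"] * height)

# Example usage (path hypothetical):
# for path, (x1, y1, x2, y2) in load_person_boxes("human_bbox_train.json"):
#     print(path, x1, y1, x2, y2)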
The vehicle annotation file (vehicle_bbox_train.json) is organized as follows:
{
image_filepath : image_dict,
...
}
image_dict{
"image id" : int,
"image size" : image_size,
"objects list" : [object_dict],
}
image_size{
"height" : int,
"width" : int,
}
object_dict{
"category" : "small car" or "midsize car" or "large car" or "bicycle"
or "motorcycle" or "tricycle" or "electric car" or "baby carriage" or "vehicles" or "unsure",
"rect" : rect_dict,
}
rect_dict{
"tl" : {
"x" : float,
"y" : float,
},
"br" : {
"x" : float,
"y" : float,
}
}
- image_filepath is the relative path of the image.
- "vehicles" refers to a dense vehicle group and should be ignored.
- "small car", "midsize car" and "large car" are motor vehicles with four or more wheels, distinguished by vehicle size.
- "electric car" refers to an electric sightseeing car, patrol car, etc.
- "x" and "y" are floating point numbers between 0 and 1, representing the ratio of the coordinates to the width and height of the image, respectively (a filtering sketch follows below).
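As with the pedestrian file, the dense "vehicles" groups should be filtered out before use. Here is a minimal Python sketch (function name and path are hypothetical) that counts the remaining vehicle categories:

import json
from collections import Counter

def vehicle_category_counts(anno_path):
    """Count annotated vehicle categories, skipping the "vehicles"
    dense-group boxes that should be ignored."""
    with open(anno_path, "r") as f:
        annos = json.load(f)
    counts = Counter()
    for image_dict in annos.values():
        for obj in image_dict["objects list"]:
            if obj["category"] == "vehicles":  # dense vehicle group: ignore
                continue
            counts[obj["category"]] += 1
    return counts

# print(vehicle_category_counts("vehicle_bbox_train.json"))  # hypothetical path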
Each video sequence in PANDA-Video has two annotation files: tracks.json contains the pedestrian trajectory annotations, and seqinfo.json contains the basic information of the video sequence. The annotation files for each video sequence are stored in a folder named after the scene.
tracks.json is organized as follows:
[
track_dict,
...
]
track_dict{
"track id" : int,
"frames" : [frame_dict],
}
frame_dict{
"frame id" : int,
"rect" : rect_dict,
"face orientation" : "back" or "front" or "left" or "left back" or
"left front" or "right" or "right back" or "right front" or "unsure",
"occlusion" : "normal" or "hide" or "serious hide" or "disappear",
}
rect_dict{
"tl" : {
"x" : float,
"y" : float,
},
"br" : {
"x" : float,
"y" : float,
}
}
"frame id"
and "track id"
count from 1"face orientation"
, "front"
means facing the camera"occlusion"
,
"normal"
means the occlusion rate is less than 10%, "hide"
means the occlusion rate is
between 10% and 50%, "serious hide"
means the occlusion rate is greater than 50%, and "disappear"
means the object completely disappears"x"
and "y"
are floating point numbers between
0 and 1, representing the ratio of the coordinates to the width and height of the image, respectivelyJSON{
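Because tracks.json is organized per track, a common first step for evaluation is to regroup the annotations per frame. A minimal Python sketch, assuming tracks.json is a JSON list of track_dict entries as the schema above suggests (helper name is hypothetical):

import json
from collections import defaultdict

def tracks_to_frames(tracks_path):
    """Regroup per-track annotations into a frame-indexed dict:
    frame id -> list of (track id, rect_dict)."""
    with open(tracks_path, "r") as f:
        tracks = json.load(f)  # assumed: a list of track_dict entries
    per_frame = defaultdict(list)
    for track in tracks:
        for frame in track["frames"]:
            if frame["occlusion"] == "disappear":  # object not visible at all
                continue
            per_frame[frame["frame id"]].append((track["track id"], frame["rect"]))
    return per_frame

# Remember that "frame id" and "track id" both count from 1.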
"name" : scene_name,
"frameRate" : int,
"seqLength" : int,
"imWidth" : int,
"imHeight" : int,
"imExt" : file_extension,
"imUrls" : [image_url]
}
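The imWidth and imHeight fields are what map the normalized track coordinates back to pixels. A minimal sketch (helper names and the example path are hypothetical):

import json

def load_seqinfo(seqinfo_path):
    """Read seqinfo.json and return frame rate, sequence length,
    and the frame size in pixels."""
    with open(seqinfo_path, "r") as f:
        info = json.load(f)
    return info["frameRate"], info["seqLength"], info["imWidth"], info["imHeight"]

def rect_to_pixels(rect, im_width, im_height):
    """Convert a normalized rect_dict ("tl"/"br" ratios) to a pixel-space box."""
    return (rect["tl"]["x"] * im_width, rect["tl"]["y"] * im_height,
            rect["br"]["x"] * im_width, rect["br"]["y"] * im_height)

# Example (path hypothetical):
# frame_rate, seq_length, w, h = load_seqinfo("scene_name/seqinfo.json")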
Please use the following citation when referencing the dataset:
@inproceedings{wang2020panda,
title={PANDA: A Gigapixel-level Human-centric Video Dataset},
author={Wang, Xueyang and Zhang, Xiya and Zhu, Yinheng and Guo, Yuchen and Yuan, Xiaoyun
and Xiang, Liuyu and Wang, Zerun and Ding, Guiguang and Brady, David and Dai, Qionghai and
others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={3268--3278},
year={2020}
}