COCO2017 is the version of the COCO dataset released in 2017. It is mainly used for the object detection, keypoint detection, and panoptic segmentation tasks that COCO has run since 2017.
The annotations of COCO2017 are stored in JSON files or PNG images. Note that the COCO API described on the download page can be used to access and manipulate all annotations. All annotations share the same basic data structure below:
{
"info" : info,
"images" : [image],
"annotations" : [annotation],
"licenses" : [license],
}
info{
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime,
}
image{
"id" : int,
"width" : int,
"height" : int,
"file_name" : str,
"license" : int,
"flickr_url" : str,
"coco_url" : str,
"date_captured" : datetime,
}
license{
"id" : int,
"name" : str,
"url" : str,
}
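As a rough illustration, these top-level fields can be inspected by loading one of the annotation JSON files with plain Python; the file path below is an assumption based on the standard COCO2017 download layout:
import json

# Assumed path from the standard COCO2017 download layout.
with open("annotations/instances_val2017.json") as f:
    dataset = json.load(f)

print(dataset["info"]["description"], dataset["info"]["year"])
print("images:", len(dataset["images"]))
print("annotations:", len(dataset["annotations"]))
print("licenses:", len(dataset["licenses"]))

# Each image record carries the fields listed above.
img = dataset["images"][0]
print(img["id"], img["file_name"], img["width"], img["height"], img["license"])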
Each object instance annotation contains a series of fields, including the category id and segmentation mask of the object. The segmentation format depends on whether the instance represents a single object (iscrowd=0, in which case polygons are used) or a collection of objects (iscrowd=1, in which case RLE is used). Note that a single object (iscrowd=0) may require multiple polygons, for example if occluded. Crowd annotations (iscrowd=1) are used to label large groups of objects (e.g. a crowd of people). In addition, an enclosing bounding box is provided for each object (box coordinates are measured from the top left image corner and are 0-indexed). Finally, the categories field of the annotation structure stores the mapping of category id to category and supercategory names. See also the object detection task.
annotation{
"id" : int,
"image_id" : int,
"category_id" : int,
"segmentation" : RLE or [polygon],
"area" : float,
"bbox" : [x,y,width,height],
"iscrowd" : 0 or 1,
}
categories[{
"id" : int,
"name" : str,
"supercategory" : str,
}]
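As a sketch of how these fields fit together (reusing the dataset dict loaded above, so the same assumptions apply), the snippet below groups instance annotations by image and distinguishes polygon from RLE segmentations via iscrowd:
from collections import defaultdict

# Map category id -> (supercategory, name) from the categories field.
cat_names = {c["id"]: (c["supercategory"], c["name"]) for c in dataset["categories"]}

anns_by_image = defaultdict(list)
for ann in dataset["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

img = dataset["images"][0]
for ann in anns_by_image[img["id"]]:
    x, y, w, h = ann["bbox"]                               # 0-indexed, from the top-left image corner
    sup, name = cat_names[ann["category_id"]]
    if ann["iscrowd"] == 0:
        seg = "%d polygon(s)" % len(ann["segmentation"])   # single object: list of polygons
    else:
        seg = "RLE"                                        # crowd: run-length encoding
    print(name, sup, ann["area"], (x, y, w, h), seg)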
A keypoint annotation contains all the data of the object annotation (including id, bbox, etc.) and two additional fields. First, keypoints is a length 3k array where k is the total number of keypoints defined for the category. Each keypoint has a 0-indexed location x,y and a visibility flag v defined as v=0: not labeled (in which case x=y=0), v=1: labeled but not visible, and v=2: labeled and visible. A keypoint is considered visible if it falls inside the object segment. num_keypoints indicates the number of labeled keypoints (v>0) for a given object (many objects, e.g. crowds and small objects, will have num_keypoints=0). Finally, for each category, the categories struct has two additional fields: keypoints, which is a length k array of keypoint names, and skeleton, which defines connectivity via a list of keypoint edge pairs and is used for visualization. Currently, keypoints are only labeled for the person category (for most medium/large non-crowd person instances). See also the keypoint detection task.
annotation{
"keypoints" : [x1,y1,v1,...],
"num_keypoints" : int,
"[cloned]" : ...,
}
categories[{
"keypoints" : [str],
"skeleton" : [edge],
"[cloned]" : ...,
}]
"[cloned]": denotes fields copied from object detection annotations defined above.
For the panoptic task, each annotation struct is a per-image annotation rather than a per-object annotation. Each per-image annotation has two parts: (1) a PNG that stores the class-agnostic image segmentation and (2) a JSON struct that stores the semantic information for each image segment. In more detail:
1. To match an annotation with an image, use the image_id field (that is, annotation.image_id==image.id).
2. For each annotation, per-pixel segment ids are stored as a single PNG at annotation.file_name. The PNGs are in a folder with the same name as the JSON, i.e., annotations/name/ for annotations/name.json. Each segment (whether it's a stuff or thing segment) is assigned a unique id. Unlabeled pixels (void) are assigned a value of 0. Note that when you load the PNG as an RGB image, you will need to compute the ids via ids=R+G*256+B*256^2.
3. For each annotation, per-segment info is stored in annotation.segments_info. segment_info.id stores the unique id of the segment and is used to retrieve the corresponding mask from the PNG (ids==segment_info.id). category_id gives the semantic category and iscrowd indicates the segment encompasses a group of objects (relevant for thing categories only). The bbox and area fields provide additional info about the segment.
4. Finally, each category struct has two additional fields: isthing, which distinguishes stuff and thing categories, and color, which is useful for consistent visualization.
annotation{
"image_id" : int,
"file_name" : str,
"segments_info" : [segment_info],
}
segment_info{
"id" : int,
"category_id" : int,
"area" : int,
"bbox" : [x,y,width,height],
"iscrowd" : 0 or 1,
}
categories[{
"id" : int,
"name" : str,
"supercategory" : str,
"isthing" : 0 or 1,
"color" : [R,G,B],
}]
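A minimal sketch of reading one panoptic annotation, assuming the standard panoptic_val2017.json file and an extracted annotations/panoptic_val2017/ PNG folder; it recovers segment ids from the PNG via ids=R+G*256+B*256^2 and matches them against segments_info:
import json
import numpy as np
from PIL import Image

with open("annotations/panoptic_val2017.json") as f:
    panoptic = json.load(f)

cats = {c["id"]: c for c in panoptic["categories"]}
ann = panoptic["annotations"][0]

rgb = np.array(Image.open("annotations/panoptic_val2017/" + ann["file_name"]), dtype=np.uint32)
ids = rgb[:, :, 0] + rgb[:, :, 1] * 256 + rgb[:, :, 2] * 256 ** 2   # id 0 = unlabeled (void)

for seg in ann["segments_info"]:
    mask = ids == seg["id"]                        # per-segment binary mask from the PNG
    cat = cats[seg["category_id"]]
    kind = "thing" if cat["isthing"] else "stuff"
    print(cat["name"], kind, seg["area"], int(mask.sum()), seg["iscrowd"])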
The COCO API assists in loading, parsing, and visualizing annotations in COCO. The API supports multiple annotation formats (please see the data format page). For additional details see: CocoApi.m, coco.py, and CocoApi.lua for Matlab, Python, and Lua code, respectively, and also the Python API demo.
Throughout the API "ann"=annotation, "cat"=category, and "img"=image.
getAnnIds: Get ann ids that satisfy given filter conditions.
getCatIds: Get cat ids that satisfy given filter conditions.
getImgIds: Get img ids that satisfy given filter conditions.
loadAnns: Load anns with the specified ids.
loadCats: Load cats with the specified ids.
loadImgs: Load imgs with the specified ids.
loadRes: Load algorithm results and create API for accessing them.
showAnns: Display the specified annotations.
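For example, with the Python API (pycocotools), the usual filter-then-load pattern looks roughly like this; the annotation file path is an assumption:
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")    # builds indexes over anns, cats, and imgs

cat_ids = coco.getCatIds(catNms=["person", "dog"])   # filter categories by name
img_ids = coco.getImgIds(catIds=cat_ids)             # images containing all of these categories
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids, iscrowd=None)

anns = coco.loadAnns(ann_ids)
img = coco.loadImgs(img_ids[0])[0]
print(img["file_name"], len(anns), "annotations")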
COCO provides segmentation masks for every object instance. This creates two challenges: storing masks compactly and performing mask computations efficiently. We solve both challenges using a custom Run Length Encoding (RLE) scheme. The size of the RLE representation is proportional to the number of boundary pixels of a mask, and operations such as area, union, or intersection can be computed efficiently directly on the RLE. Specifically, assuming fairly simple shapes, the RLE representation is O(√n) where n is the number of pixels in the object, and common computations are likewise O(√n). Naively computing the same operations on the decoded masks (stored as an array) would be O(n).
The MASK API provides an interface for manipulating masks stored in RLE format. The API is defined below; for additional details see: MaskApi.m, mask.py, or MaskApi.lua. Finally, we note that a majority of ground truth masks are stored as polygons (which are quite compact); these polygons are converted to RLE when needed.
encode: Encode binary masks using RLE.
decode: Decode binary masks encoded via RLE.
merge: Compute union or intersection of encoded masks.
iou: Compute intersection over union between masks.
area: Compute area of encoded masks.
toBbox: Get bounding boxes surrounding encoded masks.
frBbox: Convert bounding boxes to encoded masks.
frPoly: Convert polygon to encoded mask.
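As an illustration with the Python mask module (pycocotools.mask), where the polygon-to-RLE conversion is exposed as frPyObjects; the polygon coordinates below are made up:
import numpy as np
from pycocotools import mask as maskUtils

h, w = 240, 320
poly = [[50, 60, 150, 60, 150, 160, 50, 160]]         # made-up polygon [x1,y1,x2,y2,...]

rles = maskUtils.frPyObjects(poly, h, w)              # polygon -> RLE (frPoly in the Matlab/Lua APIs)
rle = maskUtils.merge(rles)                           # union of the per-polygon RLEs
print(maskUtils.area(rle), maskUtils.toBbox(rle))     # area and enclosing bbox computed on the RLE

m = maskUtils.decode(rle)                             # RLE -> binary mask (h x w uint8)
rle2 = maskUtils.encode(np.asfortranarray(m))         # binary mask -> RLE (needs Fortran order)
print(maskUtils.iou([rle2], [rle], [0]))              # IoU computed directly on the RLEs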
Please use the following citation when referencing the dataset:
@misc{lin2015microsoft,
title={Microsoft COCO: Common Objects in Context},
author={Tsung-Yi Lin and Michael Maire and Serge Belongie and
Lubomir Bourdev and Ross Girshick and James Hays and
Pietro Perona and Deva Ramanan and C. Lawrence Zitnick
and Piotr Dollár},
year={2015},
eprint={1405.0312},
archivePrefix={arXiv},
primaryClass={cs.CV}
}