

COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labeled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints

Data Annotation

COCO has several annotation types: object detection, keypoint detection, stuff segmentation, panoptic segmentation, DensePose, and image captioning. The annotations are stored using JSON. Please note that the COCO API described on the download page can be used to access and manipulate all annotations. All annotations share the same basic data structure below:

"info" : info,
"images" : [image],
"annotations" : [annotation],
"licenses" : [license],

"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime,

"id" : int,
"width" : int,
"height" : int,
"file_name" : str,
"license" : int,
"flickr_url" : str,
"coco_url" : str,
"date_captured" : datetime,

"id" : int,
"name" : str,
"url" : str,

The data structures specific to the various annotation types are described below.

Object Detection

Each object instance annotation contains a series of fields, including the category id and segmentation mask of the object. The segmentation format depends on whether the instance represents a single object (iscrowd=0 in which case polygons are used) or a collection of objects (iscrowd=1 in which case RLE is used). Note that a single object (iscrowd=0) may require multiple polygons, for example if occluded. Crowd annotations (iscrowd=1) are used to label large groups of objects (e.g. a crowd of people). In addition, an enclosing bounding box is provided for each object (box coordinates are measured from the top left image corner and are 0-indexed). Finally, the categories field of the annotation structure stores the mapping of category id to category and supercategory names. See also the detection task.

  "id"             : int,
  "image_id"       : int,
  "category_id"    : int,
  "segmentation"   : RLE or [polygon],
  "area"           : float,
  "bbox"           : [x,y,width,height],
  "iscrowd"        : 0 or 1,

  "id"             : int,
  "name"           : str,
  "supercategory"  : str,

Keypoint Detection

A keypoint annotation contains all the data of the object annotation (including id, bbox, etc.) and two additional fields. First, "keypoints" is a length 3k array where k is the total number of keypoints defined for the category. Each keypoint has a 0-indexed location x,y and a visibility flag v defined as v=0: not labeled (in which case x=y=0), v=1: labeled but not visible, and v=2: labeled and visible. A keypoint is considered visible if it falls inside the object segment. "num_keypoints" indicates the number of labeled keypoints (v>0) for a given object (many objects, e.g. crowds and small objects, will have num_keypoints=0). Finally, for each category, the categories struct has two additional fields: "keypoints," which is a length k array of keypoint names, and "skeleton", which defines connectivity via a list of keypoint edge pairs and is used for visualization. Currently keypoints are only labeled for the person category (for most medium/large non-crowd person instances). See also the keypoint task.

  "keypoints"        : [x1,y1,v1,...],
  "num_keypoints"    : int,
  "[cloned]"         : ...,

  "keypoints"        : [str],
  "skeleton"         : [edge],
  "[cloned]"         : ...,

"[cloned]": denotes fields copied from object detection annotations defined above.

Stuff Segmentation

The stuff annotation format is identical to and fully compatible with the object detection format above (except that iscrowd is unnecessary and set to 0 by default). We provide annotations in both JSON and PNG format for easier access, as well as conversion scripts between the two formats. In the JSON format, each category present in an image is encoded with a single RLE annotation (see the Mask API for more details). The category_id represents the id of the current stuff category. For more details on stuff categories and supercategories, see the stuff evaluation page. See also the stuff task.
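As background for the RLE encoding mentioned above, here is a minimal decoder for the uncompressed RLE variant, in which counts alternate runs of background and foreground pixels (starting with background) over the mask flattened in column-major order. This is a sketch for illustration; real annotations often use the compressed string form, which the Mask API handles:

```python
def decode_uncompressed_rle(rle):
    """Decode uncompressed RLE into a flat binary mask (column-major order).

    `counts` alternates run lengths of 0s and 1s, starting with 0s.
    """
    h, w = rle["size"]
    flat = []
    value = 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    assert len(flat) == h * w  # runs must cover the whole mask
    return flat  # reshape to (h, w) column-major for a 2D mask

# Hypothetical 3x3 mask: 2 background pixels, 4 foreground, 3 background.
mask = decode_uncompressed_rle({"size": [3, 3], "counts": [2, 4, 3]})
print(mask)  # [0, 0, 1, 1, 1, 1, 0, 0, 0]
```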

Panoptic Segmentation

For the panoptic task, each annotation struct is a per-image annotation rather than a per-object annotation. Each per-image annotation has two parts: (1) a PNG that stores the class-agnostic image segmentation and (2) a JSON struct that stores the semantic information for each image segment. In more detail:

  1. To match an annotation with an image, use the image_id field (that is, annotation.image_id == image.id).
  2. For each annotation, per-pixel segment ids are stored as a single PNG at annotation.file_name. The PNGs are in a folder with the same name as the JSON, i.e., annotations/name/ for annotations/name.json. Each segment (whether it's a stuff or thing segment) is assigned a unique id. Unlabeled pixels (void) are assigned a value of 0. Note that when you load the PNG as an RGB image, you will need to compute the ids via ids = R + G*256 + B*256^2.
  3. For each annotation, per-segment info is stored in annotation.segments_info. Each segment_info struct stores the unique id of the segment, which is used to retrieve the corresponding mask from the PNG, while category_id gives the semantic category and iscrowd indicates whether the segment encompasses a group of objects (relevant for thing categories only). The bbox and area fields provide additional info about the segment.
  4. The COCO panoptic task has the same thing categories as the detection task, whereas the stuff categories differ from those in the stuff task (for details see the panoptic evaluation page). Finally, each category struct has two additional fields: isthing that distinguishes stuff and thing categories and color that is useful for consistent visualization.
  "image_id"         : int,
  "file_name"        : str,
  "segments_info"    : [segment_info],

  "id"               : int,
  "category_id"      : int,
  "area"             : int,
  "bbox"             : [x,y,width,height],
  "iscrowd"          : 0 or 1,

  "id"               : int,
  "name"             : str,
  "supercategory"    : str,
  "isthing"          : 0 or 1,
  "color"            : [R,G,B],

Image Captioning

These annotations are used to store image captions. Each caption describes the specified image and each image has at least 5 captions (some images have more). See also the captioning task.

  "id"               : int,
  "image_id"         : int,
  "caption"          : str,


DensePose

For the DensePose task, each annotation contains a series of fields, including category id, bounding box, body part masks, and parametrization data for selected points, which are detailed below.

Crowd annotations (iscrowd=1) are used to label large groups of objects (e.g. a crowd of people).

An enclosing bounding box is provided for each person (box coordinates are measured from the top left image corner and are 0-indexed).

The categories field of the annotation structure stores the mapping of category id to category and supercategory names.

DensePose annotations are stored in dp_* fields:

Annotated masks:

  • dp_masks: RLE encoded dense masks. All part masks are of size 256x256. They correspond to 14 semantically meaningful parts of the body: Torso, Right Hand, Left Hand, Left Foot, Right Foot, Upper Leg Right, Upper Leg Left, Lower Leg Right, Lower Leg Left, Upper Arm Left, Upper Arm Right, Lower Arm Left, Lower Arm Right, Head;

Annotated points:

  • dp_x, dp_y: spatial coordinates of collected points on the image. The coordinates are scaled such that the bounding box size is 256x256;
  • dp_I: The patch index that indicates which of the 24 surface patches the point is on. Patches correspond to the body parts described above. Some body parts are split into 2 patches: 1, 2 = Torso, 3 = Right Hand, 4 = Left Hand, 5 = Left Foot, 6 = Right Foot, 7, 9 = Upper Leg Right, 8, 10 = Upper Leg Left, 11, 13 = Lower Leg Right, 12, 14 = Lower Leg Left, 15, 17 = Upper Arm Left, 16, 18 = Upper Arm Right, 19, 21 = Lower Arm Left, 20, 22 = Lower Arm Right, 23, 24 = Head;
  • dp_U, dp_V: Coordinates in the UV space. Each surface patch has a separate 2D parameterization.
  "id"               : int,
  "image_id"         : int,
  "category_id"      : int,
  "is_crowd"         : 0 or 1,
  "area"             : int,
  "bbox"             : [x,y,width,height],
  "dp_I"             : [float],
  "dp_U"             : [float],
  "dp_V"             : [float],
  "dp_x"             : [float],
  "dp_y"             : [float],
  "dp_masks"         : [RLE],



COCO API

The COCO API assists in loading, parsing, and visualizing annotations in COCO. The API supports multiple annotation formats (see the data format page). For additional details, see CocoApi.m and CocoApi.lua for the Matlab and Lua code, the Python API, and the Python API demo.

Throughout the API "ann"=annotation, "cat"=category, and "img"=image.
  • getAnnIds — Get ann ids that satisfy given filter conditions.
  • getCatIds — Get cat ids that satisfy given filter conditions.
  • getImgIds — Get img ids that satisfy given filter conditions.
  • loadAnns — Load anns with the specified ids.
  • loadCats — Load cats with the specified ids.
  • loadImgs — Load imgs with the specified ids.
  • loadRes — Load algorithm results and create API for accessing them.
  • showAnns — Display the specified annotations.
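To show the lookup pattern these methods implement, here is a tiny in-memory sketch over the JSON structure. This is not the real pycocotools implementation, just an illustration of what getAnnIds/loadAnns-style methods do:

```python
class MiniCOCO:
    """A toy sketch of the COCO API's index-and-lookup pattern (not pycocotools)."""

    def __init__(self, dataset):
        # Index annotations and images by id for O(1) lookup.
        self.anns = {a["id"]: a for a in dataset["annotations"]}
        self.imgs = {i["id"]: i for i in dataset["images"]}

    def get_ann_ids(self, img_ids=None):
        """Return ann ids, optionally filtered to the given image ids."""
        return [a["id"] for a in self.anns.values()
                if img_ids is None or a["image_id"] in img_ids]

    def load_anns(self, ids):
        """Load the annotation structs with the specified ids."""
        return [self.anns[i] for i in ids]

# Hypothetical two-annotation dataset.
data = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 2, "category_id": 3},
    ],
}
coco = MiniCOCO(data)
print(coco.get_ann_ids(img_ids=[1]))           # [10]
print(coco.load_anns([11])[0]["category_id"])  # 3
```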


Mask API

COCO provides segmentation masks for every object instance. This creates two challenges: storing masks compactly and performing mask computations efficiently. We solve both challenges using a custom Run Length Encoding (RLE) scheme. The size of the RLE representation is proportional to the number of boundary pixels of a mask, and operations such as area, union, or intersection can be computed efficiently directly on the RLE. Specifically, assuming fairly simple shapes, the RLE representation is O(√n), where n is the number of pixels in the object, and common computations are likewise O(√n). Naively computing the same operations on the decoded masks (stored as an array) would be O(n).

The Mask API provides an interface for manipulating masks stored in RLE format. The API is defined below; for additional details, see MaskApi.m and MaskApi.lua. Finally, note that the majority of ground-truth masks are stored as polygons (which are quite compact); these polygons are converted to RLE when needed.

  • encode — Encode binary masks using RLE.
  • decode — Decode binary masks encoded via RLE.
  • merge — Compute union or intersection of encoded masks.
  • iou — Compute intersection over union between masks.
  • area — Compute area of encoded masks.
  • toBbox — Get bounding boxes surrounding encoded masks.
  • frBbox — Convert bounding boxes to encoded masks.
  • frPoly — Convert polygon to encoded mask.
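As an example of why these operations are cheap, the area of a mask falls straight out of the RLE counts, without decoding, in time proportional to the number of runs. A sketch for the uncompressed counts form, with made-up run lengths:

```python
def rle_area(counts):
    """Area of an RLE mask: the sum of the foreground runs.

    Uncompressed RLE counts alternate background/foreground runs, starting
    with background, so the foreground runs sit at odd indices.
    """
    return sum(counts[1::2])

# Hypothetical RLE with two foreground runs of 4 and 6 pixels.
print(rle_area([2, 4, 3, 6, 1]))  # 10
```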
🎉 Many thanks to Graviti Open Datasets for contributing the dataset.

License: CC BY 4.0