MS-Celeb-1M
c77ba84a-8cd1-11eb-88ae-0e1f58d5e9a9 · bc3b7a3 · Jun 20, 2021 9:58 AM · 1 commit

Overview

We select one million celebrities, who are real people and have or had public attention. The selection steps are described in detail in the following paragraphs. First, we select a subset of entities from a knowledge base called Freebase [11], based on the information within Freebase. In Freebase, each entity is identified by a unique key (called machine identifier, mid, in [11]) and is associated with rich properties. More specifically, we select the entities whose properties satisfy all three of the following conditions.

The object type of the entity is defined as “people.person” in Freebase. This condition means that we select entities which are claimed (by Freebase) to be real people. We do not include movie characters, since their appearance is not strictly defined, especially when a classic movie is remade.

The entities are required to have at least one of the properties unique to human beings, such as “person’s name”, “place of birth”, “date of birth”, or “person’s professions”. This condition removes entities whose information is too sparse for us to collect and label images. It also helps us remove some of the entities whose object type is mislabeled as “people.person” in Freebase.

If the date of birth is available for a given entity in Freebase, the entity cannot be selected if he or she was born before the mid-nineteenth century. The reason is as follows. The first camera specialized for roll film, the “Kodak”, was invented in 1888 [20] and started to become popular in the late nineteenth century. We cannot rely on drawings or sculptures to recognize people’s faces, since whether they are visually similar to the actual person can be subjective and arguable. An interesting example is the sculpture of John Harvard at Harvard University, which is claimed to be inspired by a Harvard student, Sherman Hoar, rather than by Harvard himself, since no one knew what John Harvard had looked like [21].
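The three conditions above can be read as a single filter predicate. The following is a minimal sketch, not the authors' implementation: the `Entity` record, the property names in `HUMAN_PROPERTIES`, and the cutoff year 1850 (one reading of "mid-nineteenth century") are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: "mid-nineteenth century" is interpreted as the year 1850.
BIRTH_YEAR_CUTOFF = 1850

# Hypothetical property keys standing in for the human-specific properties
# named in the text (name, place of birth, date of birth, professions).
HUMAN_PROPERTIES = {"name", "place_of_birth", "date_of_birth", "profession"}


@dataclass
class Entity:
    """Hypothetical, simplified view of a knowledge-base entity."""
    mid: str                                        # machine identifier
    types: set = field(default_factory=set)         # object types
    properties: dict = field(default_factory=dict)  # property name -> value
    birth_year: Optional[int] = None                # None if unknown


def is_candidate(entity: Entity) -> bool:
    """Return True if the entity satisfies all three selection conditions."""
    # Condition 1: the object type must be "people.person".
    if "people.person" not in entity.types:
        return False
    # Condition 2: at least one human-specific property must be present.
    if not any(entity.properties.get(p) for p in HUMAN_PROPERTIES):
        return False
    # Condition 3: if the birth year is known, it must not precede the cutoff.
    if entity.birth_year is not None and entity.birth_year < BIRTH_YEAR_CUTOFF:
        return False
    return True
```

Note that an entity with no recorded date of birth still passes the third condition, which matches the text: the restriction applies only when the date is available.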

In the second step, we rank all the entities in the above subset according to the frequency of their occurrence on the web. Then, we select the top one million entities to form our one million celebrity list and provide their entity keys (mid) in Freebase. The occurrence frequency for a given entity is obtained by counting how many documents contain this entity in a large corpus with billions of documents from the web.
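The ranking step amounts to counting document frequency and keeping the top entries. The sketch below is illustrative only: it assumes each document has already been reduced to the set of entity keys it mentions (the text does not say how mentions are detected), and the function name `top_entities` and the tie-breaking rule are hypothetical.

```python
from collections import Counter
from typing import Iterable


def top_entities(candidates: set, documents: Iterable[set], k: int) -> list:
    """Rank candidate mids by document frequency and return the top k.

    `documents` yields, for each document, the set of mids it mentions
    (assumed to be produced by some upstream entity-detection step).
    """
    doc_freq = Counter()
    for mentioned in documents:
        # A document contributes at most 1 per entity, however many
        # times the entity appears in it (document frequency).
        doc_freq.update(mentioned & candidates)
    # Sort by descending frequency; ties are broken by mid so the
    # result is deterministic (an arbitrary choice for this sketch).
    ranked = sorted(candidates, key=lambda mid: (-doc_freq[mid], mid))
    return ranked[:k]
```

For the dataset described here, `k` would be one million and the corpus would contain billions of documents, so a real pipeline would need a distributed count rather than a single in-memory `Counter`.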

🎉Many thanks to Graviti Open Datasets for contributing the dataset
Basic Information
Application Scenarios: Not Available
Annotations: Not Available
Tasks: Not Available
License: Unknown
Updated on: 2021-01-20 04:34:28
Metadata
Data Type: Not Available
Data Volume: 10M
Annotation Amount: 0
File Size: 0B
Copyright Owner: Microsoft
Annotator: Unknown