Appearance-based gaze estimation is believed to work well in real-world settings, but existing datasets have been collected under controlled laboratory conditions and methods have not been evaluated across multiple datasets. In this work we study appearance-based gaze estimation in the wild. We present the MPIIGaze dataset, which contains 213,659 images we collected from 15 participants during natural everyday laptop use over more than three months. Our dataset is significantly more variable than existing ones with respect to appearance and illumination. We also present a method for in-the-wild appearance-based gaze estimation using multimodal convolutional neural networks that significantly outperforms state-of-the-art methods in the most challenging cross-dataset evaluation. We present an extensive evaluation of several state-of-the-art image-based gaze estimation algorithms on three current datasets, including our own.
We implemented custom software running as a background service on participants' laptops. Every 10 minutes the software automatically asked participants to look at a random sequence of 20 on-screen positions (a recording session), visualized as a grey circle shrinking in size with a white dot in the middle. Participants were asked to fixate on these dots and to confirm each one by pressing the space bar just before the circle disappeared. This procedure ensured that participants concentrated on the task and fixated exactly at the intended on-screen positions. No other instructions were given to them, in particular no constraints on how and where to use their laptops.
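As a rough, hypothetical sketch of how the gaze targets of one such session can be generated (the actual collection software, its shrinking-circle UI, space-bar handling and 10-minute scheduling are not reproduced here), assuming only the Python standard library:

```python
import random

def make_session_targets(screen_w_px, screen_h_px, n_points=20, seed=None):
    """Generate one recording session: a random sequence of n_points
    on-screen gaze-target positions in pixel coordinates.

    The shrinking grey circle, the space-bar confirmation and the
    10-minute scheduling are handled by the (not shown) recording
    software; this only illustrates the target sampling."""
    rng = random.Random(seed)
    return [(rng.randrange(screen_w_px), rng.randrange(screen_h_px))
            for _ in range(n_points)]

# Example: one session's 20 targets for a hypothetical 1280x800 laptop screen.
targets = make_session_targets(1280, 800, seed=0)
```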
We collected a total of 213,659 images from 15 participants. The number of images collected per participant varied from 1,498 to 34,745.
The following figure shows the distribution of collected samples across different factors: the percentage of images with different mean grey-scale intensities within the face region (top left), with horizontally different mean grey-scale intensities between the left and right halves of the face region (top right), collected at different times of day (bottom left), and collected by each participant (bottom right). Representative samples are shown at the top.
The dataset contains three parts: "Data", "Evaluation Subset" and "Annotation Subset".
The "Data" folder includes "Original" and "Normalized" for all the 15 participants. You can also find
the 6 points-based face model we used in this dataset.
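The 6-point face model is typically fitted to the detected facial landmarks to recover 3D head pose. A minimal sketch of such a fitting step, assuming OpenCV and NumPy and the camera intrinsics from a participant's "Calibration" folder (the dataset's own processing pipeline may differ in detail):

```python
import numpy as np
import cv2

def estimate_head_pose(landmarks_2d, face_model_3d, camera_matrix, dist_coeffs=None):
    """Fit the generic 6-point 3D face model to detected 2D landmarks.

    landmarks_2d  : (6, 2) image positions of the four eye corners and two mouth corners
    face_model_3d : (6, 3) corresponding 3D model points (e.g. the model shipped with the dataset)
    camera_matrix : (3, 3) intrinsic matrix of the recording camera
    """
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)  # assume an undistorted image
    ok, rvec, tvec = cv2.solvePnP(face_model_3d.astype(np.float64),
                                  landmarks_2d.astype(np.float64),
                                  camera_matrix.astype(np.float64),
                                  dist_coeffs,
                                  flags=cv2.SOLVEPNP_EPNP)
    rotation, _ = cv2.Rodrigues(rvec)  # 3x3 head rotation in camera coordinates
    return rotation, tvec              # tvec: head translation in camera coordinates
```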
The "Original" folders are the cropped eye rectangle images with the detection results based on face
detector and facial landmark detector. For each participants, the images and annotations are
organized by days. For each day's folder, there are the image collected by that participants and
corresponding "annotation.txt" files. The annotations includes:
In addition, there is a "Calibration" folder for each participant, which contains:
The "Normalized" folders are the eye patch images after the normalization that canceling scaling and rotation via perspective transformation in Sugano et al. Similar to the "Original" folders, all the data are organized by each days for each participants, and the file format is ".mat". The annotation includes:
The folder "Evaluation Subset'' contains:
The folder "Annotation Subset" contains:
- The image list that indicates the 10,848 samples that we manually annotated.
- The corresponding annotations, with the (x, y) positions of six facial landmarks (four eye corners, two mouth corners) and the (x, y) positions of the two pupil centers for each of the above images.

The comparison of the original eye rectangle and the normalized eye patch is shown in the following figure (left: original eye rectangle image, 720 x 1280 pixels; right: normalized eye patch image, 36 x 60 pixels).
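To get started with the files programmatically, here is a minimal loading sketch in Python with SciPy; the example path "Data/Normalized/p00/day01.mat" and the variable layout inside the .mat files are assumptions that should be checked against the actual files (e.g. with scipy.io.whosmat):

```python
import os
import scipy.io

def load_normalized_day(mat_path):
    """Load one day's normalized data for one participant, e.g.
    load_normalized_day("Data/Normalized/p00/day01.mat")  # hypothetical path
    The per-eye variable names inside the .mat file are not assumed here;
    inspect the returned dict (or scipy.io.whosmat(mat_path)) to see them."""
    return scipy.io.loadmat(mat_path, squeeze_me=True, struct_as_record=False)

def load_original_annotations(day_dir):
    """Read one day's "annotation.txt" from the "Original" data as rows of
    floats; the meaning of each column follows the dataset's annotation list."""
    with open(os.path.join(day_dir, "annotation.txt")) as f:
        return [[float(v) for v in line.split()] for line in f if line.strip()]
```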
@inproceedings{zhang2015appearance,
title={Appearance-based gaze estimation in the wild},
author={Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={4511--4520},
year={2015}
}