Yahoo-Flickr Creative Commons 100 Million Dataset
Since its inception in 2004, Flickr has become one of the largest repositories of user-generated multimedia with Creative Commons licenses. The YFCC100M dataset, which has been used by companies and researchers including IBM (for its DiF dataset), Snapchat and Lawrence Livermore National Laboratory, was built using Flickr photos.
All of the images in YFCC100M were shared publicly under licenses that legally require attribution. The MegaFace and COCO datasets both omit this attribution.
The vast quantity of visual content shared on the popular photo-sharing website Flickr makes it a great source for datasets used in research. Founded in 2004, Flickr was one of the first platforms to embrace Creative Commons licensing for user-generated multimedia. The platform allows users to share their images under either a public-domain or a CC license.
By 2014 there were over 100 million CC-licensed photos on the platform. The alluring combination of permissive licensing, a social ecosystem and massive-scale media “in the wild” piqued the interest of researchers at Yahoo Labs, Lawrence Livermore National Laboratory, Snapchat and In-Q-Tel (the venture capital arm of the Central Intelligence Agency). They combined their efforts to produce the Yahoo-Flickr Creative Commons 100 Million dataset (YFCC100M), one of the largest freely usable photo and video datasets released for AI research to date.
YFCC100M is distributed as text files that list 99,171,688 image URLs with their corresponding Flickr metadata. It has served as the basis for many smaller, more specialized datasets, including IBM DiF, MegaFace, FairFace and DiveFace, all of which contain images of people and have been used for face recognition and other computer vision applications.
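Because the dataset is a list of URLs and metadata rather than a bundle of image files, working with it starts with parsing text. The following Python sketch reads tab-separated records into named fields. The four-column layout, the `Record` type and the sample rows are assumptions made for illustration; the real files carry more fields and should be checked against their own documentation.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Record(NamedTuple):
    # Hypothetical, simplified column layout (not the dataset's real schema).
    photo_id: str
    owner: str
    download_url: str
    license_url: str


def read_records(handle) -> Iterator[Record]:
    """Yield one Record per tab-separated line, skipping malformed rows."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(Record._fields):
            continue  # wrong number of fields: ignore the line
        yield Record(*row)


# Made-up sample rows standing in for a metadata file.
SAMPLE = (
    "1001\talice\thttps://example.org/1001.jpg\t"
    "https://creativecommons.org/licenses/by/2.0/\n"
    "broken line without enough fields\n"
    "1002\tbob\thttps://example.org/1002.jpg\t"
    "https://creativecommons.org/licenses/by-nc/2.0/\n"
)

records = list(read_records(io.StringIO(SAMPLE)))
urls = [r.download_url for r in records]
```

Streaming the file row by row, rather than loading it whole, matters at this scale: a list of roughly 100 million records will not fit comfortably in memory on most machines.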
The dataset contains a subset of a billion photographs and more than 50 terabytes of video, primarily short clips cut from longer originals. In addition to the image and video files themselves, YFCC100M includes text files with image descriptions, geotags, captions, human-readable place labels and additional metadata such as real names, biometric face landmark data and attribution information.
The photograph has transformed from an unprocessed roll of C-41 sitting in a fridge 20 years ago to an image that automatically leaves its capture device and is shared via many services. But not all photographs are created equal. Some, like the 467 million Creative Commons licensed photos on Flickr, come with a set of restrictions. Among the restrictive clauses, the most popular option is non-commercial (CC-NC), followed by attribution only (CC-BY).
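The restrictions attached to a photo can be read directly from its license URL, since Creative Commons license URLs follow the pattern `licenses/<clauses>/<version>`. The Python sketch below splits such URLs into their clause codes and tallies them. The `license_clauses` helper and the sample URLs are illustrative assumptions, and the counts say nothing about the real distribution on Flickr.

```python
from collections import Counter
from urllib.parse import urlparse


def license_clauses(license_url: str) -> tuple:
    """Return the clause codes in a CC license URL, e.g. ('by', 'nc')."""
    parts = urlparse(license_url).path.strip("/").split("/")
    # Expected path shape: licenses/<clauses>/<version>
    if len(parts) < 2 or parts[0] != "licenses":
        return ()
    return tuple(parts[1].split("-"))


# Made-up sample; not representative of actual license shares.
SAMPLE_URLS = [
    "https://creativecommons.org/licenses/by-nc/2.0/",
    "https://creativecommons.org/licenses/by-nc/2.0/",
    "https://creativecommons.org/licenses/by/2.0/",
    "https://creativecommons.org/licenses/by-nc-sa/2.0/",
]

# Count photos per full license, and photos carrying the non-commercial clause.
by_license = Counter(license_clauses(url) for url in SAMPLE_URLS)
noncommercial = sum(1 for url in SAMPLE_URLS if "nc" in license_clauses(url))
```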
These restrictions impose a burden on researchers who want to use the images to answer scientific questions but may not have the time or resources to find and contact individual image creators. The problem is especially acute when the images include sensitive data such as real names. One clear example is the MegaFace dataset, which contains 3.3 million faces that are used for face recognition but carry no attribution. Another is the COCO dataset, which also lacks attribution. This is a huge disservice to the thousands of research projects that rely on these publicly available datasets.
Creative Commons was well designed to address the ways images were used in 2004. It loosened the clumsy restrictions of an archaic copyright system, allowing creativity to prosper and communities to grow. However, 18 years later the open licensing system is being put to unexpected uses that go against the expectations of many people who made their work available under a CC license.
Face recognition is a significant example. Its use by academics, commercial organizations and defense contractors is widely accepted, even when the datasets themselves are not open. Several organizations, including Google, Snapchat and In-Q-Tel, have relied on the 3.3 million face photos in the MegaFace and COCO datasets. Both datasets include biometric data such as facial landmarks, along with real names.
Both datasets also omit attribution, which is legally required, depriving the photographers of the credit they deserve. This is a clear violation of a key principle of the open licensing system.
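Carrying attribution through a derived dataset is not technically difficult when the source metadata is kept. As a minimal sketch, the Python below builds a credit line from the title, author, source and license of a photo, the four elements Creative Commons recommends including. The `Photo` type, its field names and the sample values are hypothetical and do not reflect any dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    # Hypothetical fields; real metadata files name and order these differently.
    title: str
    author: str
    source_url: str
    license_name: str
    license_url: str


def attribution_line(photo: Photo) -> str:
    """Format a credit line with title, author, source and license."""
    title = photo.title or "Untitled"  # fall back when no title was given
    return (
        f'"{title}" by {photo.author} ({photo.source_url}), '
        f"licensed under {photo.license_name} ({photo.license_url})"
    )


# Made-up example record.
example = Photo(
    title="Harbor at dusk",
    author="alice",
    source_url="https://example.org/photos/1001",
    license_name="CC BY 2.0",
    license_url="https://creativecommons.org/licenses/by/2.0/",
)
credit = attribution_line(example)
```

Storing a line like this next to every image, or in a sidecar file keyed by photo ID, would let downstream users reproduce the credit without contacting each photographer.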