COSMOS Dataset
The COSMOS dataset consists of images and captions scraped from news articles and other websites, collected for training and evaluating detection of out-of-context use of images. We refer readers to the paper for more details. To get access to the dataset, please fill out this form. We will then provide you with a script to download the dataset.
Dataset Description
Dataset Statistics
The COSMOS dataset consists of three splits: training (about 160K images), validation (about 40K images), and test (1,700 images). For training, we neither have nor use out-of-context annotations; we use these annotations only at the end, to evaluate our model. The dataset statistics are listed below.
Table 1: Dataset stats.
| Split | # Images | # Captions | Context Annotation |
| ----- | -------- | ---------- | ------------------ |
| Train | 161,752  | 360,749    | No                 |
| Valid | 41,006   | 90,036     | No                 |
| Test  | 1,700    | 3,400      | Yes                |
Data Format
The COSMOS training, validation, and test sets are provided as JSON (JavaScript Object Notation) text files in which every data sample is stored as a dictionary with the following attributes:
File Structure for train.json and val.json
{ "img_local_path": <img_path>,
"articles": [
{ "caption": <caption1>,
"article_url": <url1>,
"caption_modified": <caption_mod1>,
"entity_list": <entity_list1>},
{ "caption": <caption2>,
"article_url": <url2>,
"caption_modified": <caption_mod2>,
"entity_list": <entity_list2>},
{ "caption": <caption3>,
"article_url": <url3>,
"caption_modified": <caption_mod3>,
"entity_list": <entity_list3>},
......
],
"maskrcnn_bboxes": [ [x1,y1,x2,y2], [x1,y1,x2,y2], ... ]
}
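For illustration, here is a minimal Python sketch that loads one of these files and walks through a sample. It assumes the file stores a JSON array of such dictionaries and uses only the fields shown above; adjust the parsing (e.g., to one JSON object per line) to match the downloaded data.

import json

# Minimal sketch: load train.json and inspect one sample.
# Assumes the file is a JSON array of sample dictionaries; if the download
# ships one JSON object per line instead, parse it line by line.
with open("train.json") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["img_local_path"])

# Every image can have several associated captions.
for article in sample["articles"]:
    print("caption:         ", article["caption"])
    print("caption_modified:", article["caption_modified"])
    print("source:          ", article["article_url"])

# Bounding boxes are stored as [x1, y1, x2, y2] rectangles.
for x1, y1, x2, y2 in sample["maskrcnn_bboxes"]:
    print(f"box: ({x1}, {y1}) -> ({x2}, {y2})")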
Attributes in Train/Validation files
img_local_path: Source path in the dataset directory for the image
articles: List of dictionaries containing metadata for every caption associated with the image
caption: Original caption scraped from the news website
article_url: Link to the website the image and caption were scraped from
caption_modified: Modified caption after applying spaCy NER (we used these captions as input to our model during experiments)
entity_list: List mapping the modified named entities in the caption to their corresponding hypernyms
maskrcnn_bboxes: List of detected bounding boxes for the image; (x1, y1) refers to the start (top-left) vertex of the rectangle and (x2, y2) to the end (bottom-right) vertex
Note that for detecting bounding boxes, we used the Detectron2 pretrained model linked here. We detect up to 10 bounding boxes per image.
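The exact entity-to-hypernym mapping behind caption_modified and entity_list is described in the paper; the sketch below only illustrates the idea using spaCy's off-the-shelf NER with a hypothetical label-to-hypernym table, not the authors' actual mapping.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hypothetical mapping from spaCy entity labels to coarse hypernyms;
# the mapping actually used for the dataset may differ.
HYPERNYMS = {"PERSON": "person", "ORG": "organization", "GPE": "place", "DATE": "date"}

def modify_caption(caption):
    """Replace named entities with hypernym placeholders, in the spirit of caption_modified."""
    doc = nlp(caption)
    entity_map = {}
    modified = caption
    # Replace from the last entity backwards so character offsets stay valid.
    for ent in reversed(doc.ents):
        hypernym = HYPERNYMS.get(ent.label_)
        if hypernym is None:
            continue
        # Note: if two entities share a hypernym, the earlier one wins here;
        # the dataset's entity_list keeps them distinct.
        entity_map[hypernym] = ent.text
        modified = modified[:ent.start_char] + hypernym + modified[ent.end_char:]
    return modified, entity_map

print(modify_caption("Angela Merkel visited Paris in 2019."))
# -> ('person visited place in date.', {'date': '2019', 'place': 'Paris', 'person': 'Angela Merkel'})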
File Structure for test.json
{
"img_local_path": <img_path>,
"caption1": <caption1>,
"caption1_modified": <caption1_modified>,
"caption1_entities": <caption1_entities>,
"caption2": <caption2>,
"caption2_modified": <caption2_modified>,
"caption2_entities": <caption2_entities>,
"article_url": <article_url>,
"label": "ooc/not-ooc",
"maskrcnn_bboxes": [ [x1,y1,x2,y2], [x1,y1,x2,y2], ... ]
}
Attributes in Test file
img_local_path: Source path in the dataset directory for the image
caption1: First caption associated with the image
caption1_modified: Modified caption1 after applying spaCy NER
caption1_entities: List mapping the modified named entities in caption1 to their corresponding hypernyms
caption2: Second caption associated with the image
caption2_modified: Modified caption2 after applying spaCy NER
caption2_entities: List mapping the modified named entities in caption2 to their corresponding hypernyms
article_url: Link to the website the image and caption were scraped from
label: Class label indicating whether the two captions are out-of-context with respect to the image (1 = out-of-context, 0 = not out-of-context)
maskrcnn_bboxes: List of detected bounding boxes for the image; (x1, y1) refers to the start (top-left) vertex of the rectangle and (x2, y2) to the end (bottom-right) vertex
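Since the test split carries ground-truth labels, evaluating a model reduces to comparing predictions against the label field. A minimal sketch follows; predict_ooc is a hypothetical stand-in for your own model, and the file is again assumed to be a JSON array with labels stored as 1/0 per the description above.

import json

with open("test.json") as f:
    samples = json.load(f)

def predict_ooc(sample):
    """Hypothetical stand-in for a real model: returns 1 (out-of-context) or 0."""
    return 0  # e.g., always predicting "not out-of-context" gives a trivial baseline

# Accuracy over the 1,700 test images.
correct = sum(predict_ooc(s) == s["label"] for s in samples)
print(f"accuracy: {correct / len(samples):.3f}")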
Citation
If you find our dataset or paper useful for your research, please include the following citation:
@misc{aneja2021cosmos,
      title={COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning},
      author={Shivangi Aneja and Chris Bregler and Matthias Nießner},
      year={2021},
      eprint={2101.06278},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Contact Us
If you have questions regarding the dataset or code, please email us at shivangi.aneja@tum.de. We will get back to you as soon as possible.