COSMOS Dataset

The COSMOS dataset consists of images and captions scraped from news articles and other websites. It is designed for training and evaluating methods that detect out-of-context use of images. We refer readers to the paper for more details. To get access to the dataset, please fill out this form. We will then provide you with a script to download the dataset.


Dataset Description

Dataset Statistics

The COSMOS dataset consists of three splits: Training (about 160K images), Validation (about 40K images) and Test (1,700 images). The training and validation splits carry no out-of-context annotations. These annotations exist only for the test split, and we use them only at the end to evaluate our model. The dataset statistics are listed below.

Table 1: Dataset statistics.

Split   # Images   # Captions   Context Annotation
Train   161,752    360,749      No
Valid   41,006     90,036       No
Test    1,700      3,400        Yes
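
The counts in Table 1 also give the average number of captions per image in each split. The short sketch below only redoes that division; the names `SPLITS` and `captions_per_image` are illustrative and not part of the dataset tooling.

```python
# Counts copied from Table 1: split -> (number of images, number of captions).
SPLITS = {
    "Train": (161_752, 360_749),
    "Valid": (41_006, 90_036),
    "Test": (1_700, 3_400),
}


def captions_per_image(n_images: int, n_captions: int) -> float:
    """Average number of captions attached to one image."""
    return n_captions / n_images


for name, (n_images, n_captions) in SPLITS.items():
    ratio = captions_per_image(n_images, n_captions)
    print(f"{name}: {ratio:.2f} captions per image")
```

The training and validation images carry a little over two captions each on average, while every test image has exactly two, which matches the caption1/caption2 layout of the test file.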

Data Format

The COSMOS training, validation and test sets are provided as JSON (JavaScript Object Notation) text files. Every data sample is stored as a dictionary with the following attributes:

File Structure for train.json and val.json

{
    "img_local_path": <img_path>,
    "articles": [
        { "caption": <caption1>,
          "article_url": <url1>,
          "caption_modified": <caption_mod1>,
          "entity_list": <entity_list1> },

        { "caption": <caption2>,
          "article_url": <url2>,
          "caption_modified": <caption_mod2>,
          "entity_list": <entity_list2> },

        { "caption": <caption3>,
          "article_url": <url3>,
          "caption_modified": <caption_mod3>,
          "entity_list": <entity_list3> },

        ......
    ],
    "maskrcnn_bboxes": [ [x1,y1,x2,y2], [x1,y1,x2,y2], ... ]
}
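
A minimal sketch of reading such a file in Python follows. The exact on-disk layout is not stated here, so the hypothetical helper `load_samples` assumes the file is either a single JSON array of dictionaries or one dictionary per line, and handles both; check it against the downloaded files before relying on it.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_samples(path: str) -> List[Dict[str, Any]]:
    """Read COSMOS-style samples from a JSON text file.

    Assumption: the file holds either one JSON array of dictionaries
    or one JSON dictionary per line. Both layouts are accepted.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single top-level dictionary is treated as a one-sample file.
    return data if isinstance(data, list) else [data]


# Usage (the path depends on where the download script places the files):
# samples = load_samples("train.json")
```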

Attributes in Train/Validation files

img_local_path: Path to the image within the dataset directory
articles: List of dictionaries, one per caption associated with the image, each holding the metadata for that caption
caption: Original caption scraped from the news website
article_url: Link to the web page from which the image and caption were scraped
caption_modified: Modified caption after applying spaCy NER (we used these captions as input to our model during experiments)
entity_list: List that maps each modified named entity in the caption to its corresponding hypernym
maskrcnn_bboxes: List of bounding boxes detected in the image. (x1, y1) is the start vertex of the rectangle and (x2, y2) is the end vertex
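
To illustrate how these attributes fit together, the sketch below flattens one training sample into one record per caption. The function name `caption_records` and all values in `example` are made up for illustration; `entity_list` is left empty because its internal structure is not spelled out here.

```python
from typing import Any, Dict, List, Tuple


def caption_records(
    sample: Dict[str, Any], use_modified: bool = True
) -> List[Tuple[str, str, str]]:
    """Return one (image path, caption, article URL) tuple per article entry."""
    key = "caption_modified" if use_modified else "caption"
    return [
        (sample["img_local_path"], article[key], article["article_url"])
        for article in sample["articles"]
    ]


# Made-up sample that follows the documented layout; values are illustrative only.
example = {
    "img_local_path": "train/0.jpg",
    "articles": [
        {
            "caption": "Original caption A",
            "article_url": "https://example.com/a",
            "caption_modified": "Modified caption A",
            "entity_list": [],
        },
        {
            "caption": "Original caption B",
            "article_url": "https://example.com/b",
            "caption_modified": "Modified caption B",
            "entity_list": [],
        },
    ],
    "maskrcnn_bboxes": [[0, 0, 50, 40]],
}

for record in caption_records(example):
    print(record)
```

Passing `use_modified=False` returns the original scraped captions instead of the NER-modified ones.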


Note that for detecting bounding boxes, we used the pretrained Detectron2 model linked here. We detect up to 10 bounding boxes per image.
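
As a small worked example of the [x1, y1, x2, y2] convention, the sketch below computes the width, height and area of a box and keeps the largest boxes of an image. It assumes axis-aligned boxes in pixel coordinates; the helper names are hypothetical and not part of the dataset code.

```python
from typing import List, Sequence, Tuple

MAX_BOXES = 10  # the text states that up to 10 boxes are detected per image


def box_width_height(box: Sequence[float]) -> Tuple[float, float]:
    """Width and height of a box given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = box
    # abs() keeps the result valid regardless of which corner comes first.
    return abs(x2 - x1), abs(y2 - y1)


def box_area(box: Sequence[float]) -> float:
    """Area of an axis-aligned box."""
    width, height = box_width_height(box)
    return width * height


def largest_boxes(
    boxes: List[Sequence[float]], k: int = MAX_BOXES
) -> List[Sequence[float]]:
    """Return at most k boxes, sorted from largest to smallest area."""
    return sorted(boxes, key=box_area, reverse=True)[:k]


print(box_width_height([10, 20, 110, 70]))
print(box_area([10, 20, 110, 70]))
```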


File Structure for test.json

{
    "img_local_path": <img_path>,
    "caption1": <caption1>,
    "caption1_modified": <caption1_modified>,
    "caption1_entities": <caption1_entities>,
    "caption2": <caption2>,
    "caption2_modified": <caption2_modified>,
    "caption2_entities": <caption2_entities>,
    "article_url": <article_url>,
    "label": "ooc/not-ooc",
    "maskrcnn_bboxes": [ [x1,y1,x2,y2], [x1,y1,x2,y2], ... ]
}

Attributes in Test file

img_local_path: Path to the image within the dataset directory
caption1: First caption associated with the image
caption1_modified: Modified caption1 after applying spaCy NER
caption1_entities: List that maps each modified named entity in caption1 to its corresponding hypernym
caption2: Second caption associated with the image
caption2_modified: Modified caption2 after applying spaCy NER
caption2_entities: List that maps each modified named entity in caption2 to its corresponding hypernym
article_url: Link to the web page from which the image and caption were scraped
label: Class label indicating whether the two captions are out-of-context with respect to the image (1 = Out-of-Context, 0 = Not-Out-of-Context)
maskrcnn_bboxes: List of bounding boxes detected in the image. (x1, y1) is the start vertex of the rectangle and (x2, y2) is the end vertex
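
The file structure above writes the label as "ooc/not-ooc", while the attribute list describes it as 1/0. As a hedged sketch, the hypothetical helper below accepts either encoding and extracts the caption pair from a test sample; verify which encoding the downloaded test.json actually uses. All values in `example` are made up.

```python
from typing import Any, Dict, Tuple


def is_out_of_context(label: Any) -> bool:
    """Normalize a test label to True (out-of-context) or False.

    Assumption: labels may appear as 1/0 or as "ooc"/"not-ooc".
    """
    if isinstance(label, str):
        text = label.strip().lower()
        if text in {"1", "ooc"}:
            return True
        if text in {"0", "not-ooc"}:
            return False
    elif label in (0, 1):
        return bool(label)
    raise ValueError(f"unrecognized label: {label!r}")


def caption_pair(
    sample: Dict[str, Any], use_modified: bool = True
) -> Tuple[str, str, bool]:
    """Return (caption1, caption2, out-of-context flag) for one test sample."""
    suffix = "_modified" if use_modified else ""
    return (
        sample["caption1" + suffix],
        sample["caption2" + suffix],
        is_out_of_context(sample["label"]),
    )


# Made-up test sample that follows the documented layout.
example = {
    "img_local_path": "test/0.jpg",
    "caption1": "Original caption one",
    "caption1_modified": "Modified caption one",
    "caption1_entities": [],
    "caption2": "Original caption two",
    "caption2_modified": "Modified caption two",
    "caption2_entities": [],
    "article_url": "https://example.com/article",
    "label": 1,
    "maskrcnn_bboxes": [[0, 0, 50, 40]],
}

print(caption_pair(example))
```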

Citation

If you find our dataset or paper useful for your research, please include the following citation:

@misc{aneja2021cosmos,
      title={COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning}, 
      author={Shivangi Aneja and Chris Bregler and Matthias Nießner},
      year={2021},
      eprint={2101.06278},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Contact Us

If you have questions regarding the dataset or code, please email us at shivangi.aneja@tum.de. We will get back to you as soon as possible.