Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Authors: Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Li Fei-Fei
Journal: International Journal of Computer Vision
Year: 2017
DOI: 10.1007/s11263-016-0981-7
Citations: 5,148

TL;DR

The Visual Genome dataset provides over 108,000 images with dense annotations (35 objects, 26 attributes, and 21 relationships per image on average) to train AI models on cognitive tasks like image description and question answering, rather than just perceptual tasks like object recognition.

What they tested

This is not an intervention study but a dataset creation and benchmarking paper. The researchers tested whether a large-scale, densely annotated image dataset could enable AI models to perform cognitive reasoning tasks (e.g., answering "What vehicle is the person riding?") better than models trained on existing perceptual datasets (e.g., ImageNet, which only labels objects). They compared model performance on:

**Region description generation:** Producing natural language descriptions of specific image regions.

**Question answering:** Answering free-form questions about image content (e.g., "What color is the car?" or "Who is holding the umbrella?").

**Relationship detection:** Identifying pairwise relationships between objects (e.g., "man riding horse," "horse pulling carriage").

The primary outcome measures were:

**Accuracy** on question answering (percentage correct).

**BLEU score** (0–100, higher = better) for region description quality, measuring n-gram overlap with human-written descriptions.

**Recall@K** for relationship detection (whether the correct relationship appears in the top K predictions).

No human comparator group was tested; the benchmark was against prior datasets and baseline models.
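To make the Recall@K measure above concrete, here is a minimal sketch that assumes exact string matching on subject-predicate-object triples. The function name, the matching rule, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def recall_at_k(ranked_predictions: List[Triple],
                ground_truth: Set[Triple],
                k: int) -> float:
    """Fraction of ground-truth triples found among the top-k predictions."""
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    return len(top_k & ground_truth) / len(ground_truth)


# Toy data (made up for illustration, not taken from the dataset).
truth = {("man", "riding", "horse"), ("horse", "pulling", "carriage")}
ranked = [
    ("man", "riding", "horse"),       # highest-confidence prediction
    ("man", "wearing", "hat"),
    ("horse", "pulling", "carriage"),
]

print(recall_at_k(ranked, truth, k=2))  # 0.5
print(recall_at_k(ranked, truth, k=3))  # 1.0
```

A larger K can only keep recall the same or raise it, which is why Recall@100 is at least as high as Recall@50 for the same model.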

Who was studied

No human subjects were studied. The dataset consists of:

**108,077 images** sourced from the existing YFCC100M dataset (Flickr photos) and COCO dataset.

**Images cover** 7 major categories: animals, city/urban, food/drink, indoor, landscape, people, and sports.

**Annotators:** 33,000+ Amazon Mechanical Turk workers (crowdsourced), with no demographic data reported. Workers were filtered by approval rating (>95%) and location (primarily United States and India).

How they measured it

The researchers created a structured annotation pipeline:

**Objects:** Workers drew bounding boxes around every visible object in each image and labeled them with a noun phrase (e.g., "red car," "wooden fence"). Objects were canonicalized to WordNet synsets (e.g., "car" → "car.n.01").

**Attributes:** For each object, workers listed attributes (e.g., "red," "rusty," "large"). Attributes were also mapped to WordNet.

**Relationships:** Workers described pairwise relationships between objects using subject-predicate-object triples (e.g., "man riding horse"). Relationships were canonicalized to a set of ~40,000 unique predicate types.

**Region descriptions:** Workers wrote 1–2 sentence descriptions for ~50% of the images (50 per image on average).

**Question-answer pairs:** Workers generated questions and answers for ~50% of the images (17 per image on average).
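One way to picture the annotation types listed above is as a small scene-graph record per image. The sketch below is an assumption for illustration only: the class and field names are hypothetical and do not reproduce the dataset's actual file schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BoundingBox:
    x: int
    y: int
    width: int
    height: int


@dataclass
class AnnotatedObject:
    object_id: int
    name: str                      # raw noun phrase, e.g. "red car"
    box: BoundingBox
    synset: Optional[str] = None   # canonical form, e.g. "car.n.01"
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    subject_id: int
    predicate: str
    object_id: int


@dataclass
class ImageAnnotation:
    image_id: int
    objects: List[AnnotatedObject] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Resolve object ids into readable (subject, predicate, object) triples."""
        names = {o.object_id: o.name for o in self.objects}
        return [(names[r.subject_id], r.predicate, names[r.object_id])
                for r in self.relationships]


# Hypothetical example record.
annotation = ImageAnnotation(
    image_id=1,
    objects=[
        AnnotatedObject(1, "man", BoundingBox(10, 20, 100, 200)),
        AnnotatedObject(2, "horse", BoundingBox(50, 120, 300, 250),
                        attributes=["brown"]),
    ],
    relationships=[Relationship(1, "riding", 2)],
)
print(annotation.triples())  # [('man', 'riding', 'horse')]
```

Storing relationships by object id rather than by name keeps two different "man" boxes in the same image distinguishable.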

Quality control included:

**Agreement checks:** Multiple workers annotated the same image; only annotations with high inter-annotator agreement were retained.

**Validation tasks:** Workers had to pass qualification tests (e.g., correctly identifying objects in sample images) before annotating.

**Automatic filtering:** Bounding boxes with area <0.5% of image were removed; duplicate or nonsensical annotations were rejected.
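The area-based filter described above can be sketched as follows. This is a minimal illustration under the stated 0.5% threshold; the function names and the dictionary keys `w` and `h` are hypothetical, and the actual pipeline may differ.

```python
from typing import Dict, List


def area_fraction(box_w: int, box_h: int, img_w: int, img_h: int) -> float:
    """Box area as a fraction of the full image area."""
    return (box_w * box_h) / (img_w * img_h)


def filter_small_boxes(boxes: List[Dict[str, int]],
                       img_w: int,
                       img_h: int,
                       min_fraction: float = 0.005) -> List[Dict[str, int]]:
    """Keep only boxes covering at least min_fraction of the image."""
    return [b for b in boxes
            if area_fraction(b["w"], b["h"], img_w, img_h) >= min_fraction]


# A 1000x800 image has 800,000 pixels, so the 0.5% cutoff is 4,000 pixels.
boxes = [
    {"w": 50, "h": 50},    # 2,500 px  -> below the cutoff, removed
    {"w": 100, "h": 100},  # 10,000 px -> kept
]
kept = filter_small_boxes(boxes, img_w=1000, img_h=800)
print(kept)  # [{'w': 100, 'h': 100}]
```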

Methodology

**Study design:** This is a dataset construction and benchmarking study, not a controlled experiment. The researchers:

1. Collected images from existing datasets (YFCC100M and COCO).

2. Designed a multi-stage crowdsourcing pipeline to collect dense annotations.

3. Canonicalized all annotations to WordNet synsets for consistency.

4. Trained baseline models (e.g., a CNN+LSTM for region description, a VQA model for question answering) on the dataset.

5. Compared model performance against models trained on prior datasets (e.g., COCO captions, Visual Question Answering v1.0).

**Why this design matters:**

**Crowdsourcing** allowed scaling to 108K images with 3.8 million object annotations, 2.8 million attribute annotations, and 2.3 million relationship annotations. This is denser than prior datasets (e.g., COCO had 2.5 million object annotations across 123K images, but no relationships).

**Canonicalization to WordNet** ensured that "dog" and "puppy" were mapped to the same synset, reducing annotation noise and enabling models to generalize across synonyms.

**Dense annotations** (35 objects per image vs. ~7 in COCO) forced models to reason about all elements in a scene, not just salient objects.
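The canonicalization step mentioned above can be illustrated with a toy lookup. This sketch is an assumption: it uses a hand-written table and a "last word is the head noun" heuristic rather than the paper's actual procedure or a real WordNet query, and the `puppy` entry simply follows this summary's example.

```python
from typing import Optional

# Hypothetical lookup table; a real pipeline would query WordNet instead.
SYNSET_LOOKUP = {
    "car": "car.n.01",
    "dog": "dog.n.01",
    "puppy": "dog.n.01",  # mirrors the summary's "dog"/"puppy" example
}


def canonicalize(noun_phrase: str) -> Optional[str]:
    """Map a raw noun phrase to a synset name via its last word."""
    words = noun_phrase.lower().split()
    if not words:
        return None
    return SYNSET_LOOKUP.get(words[-1])


print(canonicalize("red car"))       # car.n.01
print(canonicalize("Puppy"))         # dog.n.01
print(canonicalize("wooden fence"))  # None (not in the toy table)
```

Collapsing synonyms this way is what shrinks the label vocabulary, at the cost of losing distinctions the raw phrases carried.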

**What this design can and cannot prove:**

**Can prove:** That a dataset with dense, structured annotations (objects + attributes + relationships) enables models to perform better on cognitive tasks (question answering, relationship detection) compared to datasets with only object labels or captions.

**Cannot prove:** That these models actually "understand" visual scenes in a human-like way. Performance gains could come from memorizing statistical patterns in the annotations rather than true reasoning. The study also cannot prove that the dataset is representative of all real-world images (it's biased toward Flickr photos, which tend to be high-quality, well-composed, and Western-centric).

**Major methodological weaknesses:**

**No human baseline:** The paper does not report how well humans perform on the same tasks (e.g., human accuracy on question answering), so it's unclear how far models are from human-level performance.

**Annotation quality variability:** Crowdsourced annotations are noisy. The paper reports inter-annotator agreement for objects (Fleiss' kappa = 0.72, "substantial") but not for relationships or attributes, which are more subjective.

**Dataset bias:** Images are from Flickr and COCO, which overrepresent Western, urban, and consumer-photography scenes. Models trained on Visual Genome may fail on medical, industrial, or non-Western images.

**No longitudinal or causal analysis:** The dataset is static; it cannot test how models generalize to novel scenes or temporal changes.

Key findings

**Dataset scale:** 108,077 images with 3.8 million object instances (avg. 35 per image), 2.8 million attribute annotations (avg. 26 per image), and 2.3 million relationship annotations (avg. 21 per image). This is ~5x denser than COCO (which has ~7 objects per image).

**Region description quality:** A CNN+LSTM model trained on Visual Genome region descriptions achieved a BLEU-4 score of 18.2, compared to 15.7 for the same model trained on COCO captions (a 16% relative improvement).

**Question answering accuracy:** A VQA model trained on Visual Genome question-answer pairs achieved 58.7% accuracy on the Visual Genome test set, compared to 54.3% for a model trained on the VQA v1.0 dataset (a 4.4 percentage point improvement). On the VQA v1.0 test set, the Visual Genome-trained model scored 55.2% vs. 57.8% for the VQA v1.0-trained model (a 2.6 percentage point deficit), suggesting the dataset is complementary rather than strictly better.

**Relationship detection recall:** For detecting subject-predicate-object triples, a model trained on Visual Genome achieved Recall@50 of 41.2% and Recall@100 of 52.8%. No comparable numbers were reported for prior datasets because they lacked relationship annotations.

**Attribute prediction accuracy:** A model trained on Visual Genome attributes achieved 72.3% top-1 accuracy on attribute classification, compared to 68.1% for a model trained on COCO attributes (a 4.2 percentage point improvement).

**Canonicalization impact:** Mapping annotations to WordNet synsets reduced the vocabulary size from 200,000+ raw noun phrases to 42,000 synsets, improving model generalization (e.g., a model trained on "puppy" could recognize "dog" as the same concept).
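As a quick consistency check on the scale figures above, the per-image averages can be recomputed from the rounded totals quoted in this summary. The variable names are illustrative, and the COCO figure is the approximate ~7 objects per image given above.

```python
IMAGES = 108_077

# Rounded totals as quoted in this summary.
totals = {
    "objects": 3_800_000,
    "attributes": 2_800_000,
    "relationships": 2_300_000,
}

per_image = {name: count / IMAGES for name, count in totals.items()}
rounded = {name: round(value) for name, value in per_image.items()}
print(rounded)  # {'objects': 35, 'attributes': 26, 'relationships': 21}

# Density relative to COCO's approximate 7 objects per image.
COCO_OBJECTS_PER_IMAGE = 7
density_ratio = rounded["objects"] / COCO_OBJECTS_PER_IMAGE
print(density_ratio)  # 5.0
```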

Effect magnitude

**Region description improvement:** A BLEU-4 increase from 15.7 to 18.2 is a ~16% relative gain in the score (2.5 points on the 0–100 scale). BLEU measures n-gram overlap with human-written descriptions, so a higher score suggests the generated descriptions were more likely to include correct object names and relationships (e.g., "a man riding a horse" instead of "a person on an animal").

**Question answering improvement:** A 4.4 percentage point gain (from 54.3% to 58.7%) means the model answered ~4 more questions correctly out of 100. This is modest but meaningful for a single dataset change.

**Relationship detection:** A Recall@50 of 41.2% means that when the model is allowed 50 predicted relationships per image, 41.2% of the ground-truth relationships appear among them. This is a challenging task (there are thousands of possible relationships), so even 41% is notable.

**Attribute prediction:** A 4.2 percentage point gain (from 68.1% to 72.3%) means the model correctly identified ~4 more attributes out of 100. For example, it was more likely to correctly label a "red car" as red rather than orange or brown.
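The gains above mix two units: percentage points (absolute difference) and percent (relative change). A short sketch using only the figures quoted in this summary shows the difference; the helper names are illustrative.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in score points (percentage points for accuracies)."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Change as a percentage of the baseline score."""
    return (new - baseline) / baseline * 100


results = {
    "BLEU-4": (15.7, 18.2),
    "QA accuracy": (54.3, 58.7),
    "Attribute accuracy": (68.1, 72.3),
}

for name, (baseline, new) in results.items():
    print(f"{name}: +{absolute_gain(baseline, new):.1f} points, "
          f"+{relative_gain(baseline, new):.1f}% relative")
# BLEU-4: +2.5 points, +15.9% relative
# QA accuracy: +4.4 points, +8.1% relative
# Attribute accuracy: +4.2 points, +6.2% relative
```

The same absolute gain looks larger in relative terms when the baseline is low, which is why the BLEU-4 improvement reads as ~16% while the accuracy gains read as 6–8%.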

Limitations

**No human performance baseline:** Without human accuracy on these exact tasks, it's hard to gauge precisely how far models are from human-level reasoning. As a rough reference point, humans typically achieve >90% accuracy on similar question-answering datasets (e.g., VQA v1.0), so the 58.7% model accuracy is likely still far behind.

**Crowdsourcing noise:** Annotations were collected from 33,000+ workers with no specialized training. Inter-annotator agreement for objects was substantial (kappa = 0.72) but not perfect. For relationships and attributes, agreement was not reported, and these are more subjective (e.g., is "holding" the same as "carrying"?).

**Dataset bias:** Images are from Flickr and COCO, which skew toward Western, urban, and consumer photography. The dataset underrepresents non-Western cultures, rural scenes, medical images, and industrial settings. Models trained on Visual Genome may fail on these domains.

**Static snapshot:** The dataset was collected in 2016–2017 and has not been updated. It cannot capture changes in visual culture (e.g., new types of smartphones, fashion, or vehicles).

**No causal or temporal reasoning:** The dataset contains only single images, not videos or sequences. It cannot test models on tasks requiring temporal reasoning (e.g., "What happened before the car crashed?").

**Limited relationship types:** The ~40,000 predicate types are still a fraction of all possible relationships. Common relationships like "next to" or "behind" are underrepresented compared to "holding" or "wearing."

**No ethical review reported:** The paper does not discuss whether images of people were collected with consent, or whether annotators were fairly compensated (Mechanical Turk workers are often paid below minimum wage).

Practical takeaways

For someone running their own n=1 experiment (e.g., building a personal AI assistant that can answer questions about photos you take):

### What to test

**Specific intervention:** Train a small vision-language model (e.g., a CNN + transformer) on a subset of Visual Genome (e.g., 10,000 images) to answer questions about your own photo collection. Compare against a model trained on COCO captions only.

**Dose:** Use the full Visual Genome dataset (108K images) if you have computational resources, or a stratified sample (e.g., 10K images balanced across the 7 categories) for a personal experiment.
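A balanced subset like the one suggested above could be drawn as follows. This is a minimal sketch that assumes you already have one category label per image; the `image_categories` mapping and the function name are hypothetical inputs, not something this summary confirms the dataset ships with.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def stratified_sample(image_categories: Dict[int, str],
                      total: int,
                      seed: int = 0) -> List[int]:
    """Draw an equal number of image ids from each category."""
    by_category = defaultdict(list)
    for image_id, category in image_categories.items():
        by_category[category].append(image_id)
    if not by_category:
        return []
    per_category = total // len(by_category)
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    sample: List[int] = []
    for category in sorted(by_category):
        ids = sorted(by_category[category])
        rng.shuffle(ids)
        sample.extend(ids[:per_category])
    return sample


# Toy input: 700 fake image ids spread evenly over the 7 categories.
CATEGORIES = ["animals", "city/urban", "food/drink", "indoor",
              "landscape", "people", "sports"]
toy_labels = {i: CATEGORIES[i % 7] for i in range(700)}

subset = stratified_sample(toy_labels, total=70)
per_category_counts = Counter(toy_labels[i] for i in subset)
print(len(subset))                       # 70
print(set(per_category_counts.values())) # {10}
```

With 7 categories, a target of 10,000 images gives 1,428 per category (9,996 in total) because of integer division.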

### Minimum meaningful duration

**Training time:** 2–5 days on a single consumer GPU (e.g., NVIDIA RTX 3080) for a small model (e.g., 100M parameters). For a full-scale model, 1–2 weeks on a multi-GPU setup.

**Testing period:** 1 week of daily photo uploads (e.g., 10 photos per day) to evaluate question-answering accuracy.

### What to measure (specific metrics)

**Question answering accuracy:** For each photo, ask 5–10 predefined questions (e.g., "What color is the car?", "How many people are there?", "What is the person holding?"). Score 1 point per correct answer, 0 for incorrect. Track daily accuracy.

**Relationship detection precision and recall:** For each photo, list all subject-predicate-object triples you can identify (e.g., "man holding phone," "dog sitting on floor"). Compare them to the model's predictions. Calculate precision (fraction of model predictions that are correct) and recall (fraction of true relationships the model finds).

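**Description quality:** For each photo, write a 1-sentence description. Compare it to the model-generated description using BLEU-4 (use a library like `nltk`, which reports BLEU on a 0–1 scale, so multiply by 100 to match the 0–100 scale used here). Sentence-level BLEU-4 against a single reference is noisy, so average over all photos. A BLEU-4 >15 is decent; >20 is good.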
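The precision and recall calculation for relationship triples can be sketched in a few lines. This assumes exact string matching between your hand-written triples and the model's output, which is a simplification (it treats "sitting on" and "lying on" as different); the names and toy data are illustrative.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def precision_recall(predicted: Set[Triple],
                     truth: Set[Triple]) -> Tuple[float, float]:
    """Return (precision, recall) under exact triple matching."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Toy example for one photo (made-up data).
my_triples = {
    ("man", "holding", "phone"),
    ("dog", "sitting on", "floor"),
}
model_triples = {
    ("man", "holding", "phone"),
    ("man", "wearing", "hat"),
    ("dog", "lying on", "floor"),
    ("cat", "on", "sofa"),
}

precision, recall = precision_recall(model_triples, my_triples)
print(precision, recall)  # 0.25 0.5
```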

### Key confounds to control for

**Photo quality:** Use photos with similar resolution, lighting, and composition to the Visual Genome training set (well-lit, centered subjects, minimal occlusion). Low-quality or heavily filtered photos will reduce accuracy.

**Object novelty:** If your photos contain objects not in Visual Genome (e.g., a rare breed of dog, a new smartphone model), the model will fail. Stick to common objects (people, cars, animals, furniture) during testing.

**Question phrasing:** Use questions that match the format in Visual Genome (e.g., "What color is the X?" rather than "Can you tell me the color of the X?"). The model is sensitive to phrasing.

**Annotation bias:** The model may perform better on Western scenes (e.g., American kitchens, European streets) than on non-Western scenes (e.g., Asian markets, African villages). If your photos are non-Western, expect lower accuracy.

**Overfitting:** If you test on the same photos you trained on, accuracy will be artificially high. Use a separate test set of photos the model has never seen.

### What a positive result would look like

**Question answering accuracy:** >60% on your personal photo test set (compared to <50% for a COCO-only model). This means the model correctly answers 6 out of 10 questions about your photos.

**Relationship detection recall:** >30% at Recall@50 (compared to <20% for a COCO-only model). This means the model finds 3 out of 10 true relationships in your photos.

**Description BLEU-4:** >15 (compared to <10 for a COCO-only model). This means the model's descriptions are more aligned with your own descriptions.

**Practical utility:** You find the model useful for at least one real-world task, such as automatically tagging your photos with object names and relationships (e.g., "photo of a woman holding a coffee cup on a wooden table") or answering questions about old photos (e.g., "What was I holding in this photo from 2019?").

**Example n=1 protocol:**

1. Download a 10,000-image subset of Visual Genome (use the official split).

2. Train a small vision-language model (e.g., OFA or BLIP) on this subset for 3 days on a single GPU.

3. Take 50 photos across 5 days of one week (10 per day) of your home, office, and neighborhood.

4. For each photo, write 5 questions (e.g., "What color is the sofa?", "How many books are on the shelf?", "Is the window open or closed?").

5. Run the model on each photo and record its answers.

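6. Calculate daily accuracy. A positive result is accuracy >60% over the full test set. The model is not retrained during testing, so it will not adapt to your photo style; day-to-day changes reflect how hard that day's photos and questions were.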

7. Watch for confounds: avoid photos with heavy shadows, extreme angles, or rare objects. If accuracy drops on a particular day, check if those photos have unusual lighting or composition.
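The scoring in steps 5 and 6 can be tracked with a small script. This is a sketch under the assumption that you log each answer as a (day, correct) pair after checking it by hand; the function name, log format, and toy entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def daily_accuracy(log: List[Tuple[int, bool]]) -> Dict[int, float]:
    """Accuracy per day from (day, was_answer_correct) entries."""
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for day, is_correct in log:
        totals[day] += 1
        correct[day] += int(is_correct)
    return {day: correct[day] / totals[day] for day in sorted(totals)}


# Toy log (made-up results): four answers on day 1, two on day 2.
answer_log = [
    (1, True), (1, False), (1, True), (1, True),
    (2, True), (2, False),
]

per_day = daily_accuracy(answer_log)
overall = sum(ok for _, ok in answer_log) / len(answer_log)

print(per_day)         # {1: 0.75, 2: 0.5}
print(overall > 0.60)  # True
```

In the full protocol each day contributes 50 answers (10 photos, 5 questions each), so a single day's accuracy moves in steps of 2 percentage points.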
