GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation

Karim Elmaaroufi^1,2, Liheng Lai¹, Justin Svegliato¹, Yutong Bai¹, Sanjit A. Seshia¹, Matei Zaharia¹

¹University of California, Berkeley ²Models for Embodied and Spatial Harmony

Q: How many traffic light(s) are in the image? Choose one: A) 0-5, B) 19-31, C) 6-18, D) Unsure / Not Visible. Respond with the letter only.

GPT-5 A

GRAID C

Q: Divide the image into thirds. In which third does the bicycle primarily appear? Respond with the letter only: A) left third, B) middle third, C) right third.

Gemini 2.5 Pro There is no bicycle visible in the image. The objects in the right third include a shipping container, a ladder, a white plastic chair, and what appears to be a stack of tires.

GRAID C

Q: Is there at least one traffic light to the right of any traffic sign?

Claude 4.5 Sonnet No. The red traffic light visible on the left side of the image is positioned to the left of the traffic signs in the scene, including the directional sign on the right side showing lane directions. I don't see any traffic lights positioned to the right of any traffic signs in this image.

GRAID Yes

Q: Is there at least one car to the right of any person?

Qwen3-Max Yes, there is at least one car to the right of a person.
Looking at the image:
• On the far right side of the scene, near the building entrance, there is a person walking on the sidewalk.
• In the center of the road, there is a silver sedan driving toward the camera.
• From the viewer’s perspective, the car is positioned to the left of that person — not to the right.
However, if we look more carefully at the background on the right side, behind the trees and near the building, there are parked cars visible in the distance. One of these parked cars appears to be located to the right of the person walking on the sidewalk.
Therefore, yes — there is at least one car (a parked one in the background) to the right of a person.

GRAID No

91.16% human-validated accuracy
GRAID generates high-quality spatial reasoning questions from simple 2D geometric relationships

What is GRAID?

Key Insight: Qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. No need for 3D reconstruction or generative models.

📊

8.5M+ High-Quality VQA Pairs

Across 3 datasets with 91.16% human-validated accuracy on GRAID-BDD

✅

No Architecture Changes

Works with any VLM—just add to your object detector to generate GRAID training data

🚀

Proven Transfer

Generalizes across datasets and benchmarks with significant improvements

Abstract

Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning—a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations.

We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of higher quality than existing tools that produce similar datasets as validated by human evaluations.

We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs creating questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy—compared to 57.6% on a dataset generated by recent work.

Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2 11B, and when trained on all question types, achieve improvements on several existing benchmarks such as BLINK.

GRAID vs. Existing Frameworks

Comparison of spatial reasoning data generation approaches

Feature	GRAID	SpatialVLM	SpatialRGPT	SpaRE
Can operate on images only	✓	✓	✓	✗
No VLM architecture changes needed	✓	✓	✗	✓
No lengthy captions required	✓	✓	✓	✗
Avoids single-view 3D reconstruction	✓	✗	✗	✓
Avoids LLM-based QA generation	✓	✓	✓	✗
Open-source implementation by authors	✓	✗	✓	✗

Dataset Quality: Human Validation

57.6%

Existing Methods

SpatialVLM community implementation

91.16%

GRAID

Human evals on GRAID-BDD VQA pairs

We evaluated 317 VQA pairs from GRAID-BDD with four human evaluators and 250 examples from OpenSpaces, a dataset created by the community implementation of SpatialVLM. Our evaluators found less than 9% of VQA pairs to be either invalid or confusing, compared to over 42% for existing methods. This significant improvement stems from GRAID's principled 2D geometric approach that avoids cascading errors from 3D reconstruction and hallucinations from generative models.

How GRAID Works

A simple, three-step framework

Input

Images + Object Detector
→ Bounding Boxes

SPARQ

Predicates filter candidates
+ Question templates

Output

High-quality
VQA pairs

SPARQ (Sieve Predicates And Realize Questions) provides lightweight checks before attempting to generate questions, achieving up to 1400× speedup by filtering out computationally expensive templates when they are not possible. This enables scalable generation of millions of VQA pairs in hours.

Question Diversity

5.3M questions across 5 cognitive categories in GRAID-BDD

Spatial Relations (54.0%)

LeftOf, RightOf, Closer, Farther, etc.

Counting (26.9%)

HowMany, AreMore, WhichMore, etc.

Ranking & Extremes (15.1%)

LargestAppearance, MostAppearance, LeftMost, RightMost, etc.

Localization (2.6%)

IsObjectCentered, Quadrants, Grid Location, Thirds Location, etc.

Size & Aspect (1.4%)

WidthVsHeight, Biggest Box, Leftmost/Rightmost Dimensions, etc.

GRAID Datasets

Autonomous driving datasets are among the largest and most accurately labeled object detection datasets available so we use 3 as input to GRAID to generate over 8.5M high-quality VQA pairs

GRAID-BDD

3.82M VQA pairs

3.34M train / 485k val

18 question types without depth

Download

GRAID-BDD (with depth)

5.30M VQA pairs

4.63M train / 672k val

22 question types with depth

Download

GRAID-NuImages

2.41M VQA pairs

1.94M train / 478k val

18 question types without depth

Download

GRAID-NuImages (with depth)

3.29M VQA pairs

2.65M train / 641k val

22 question types with depth

Download

GRAID-Waymo Unique

13.8k VQA pairs

10.9k train / 2.79k val

18 question types without depth

Download

GRAID-Waymo Unique (with depth)

16.4k VQA pairs

13.1k train / 3.33k val

22 question types with depth

Download

View All Datasets on HuggingFace

Experimental Results

Generalization: Learning Transferable Spatial Concepts

Fine-tuning Llama 3.2 Vision 11B with LoRA on just 6 question types from GRAID-BDD

+47.5%

Overall improvement
on GRAID-BDD

Trained on only 6 types

4/5 → 5/5

Cognitive categories trained
on vs. improved

Generalizes to unseen categories

+38.0%

Cross-dataset transfer
to NuImages

New dataset with different scenes

Models trained on only 6 of 19 kinds of questions representing 4 cognitive categories from GRAID-BDD show improvements across all 19 kinds of questions across all 5 categories and generalize to NuImages—a completely different dataset with distinct scene distributions. This demonstrates that GRAID teaches fundamental spatial reasoning concepts that transfer, rather than template memorization.

Benchmark Performance

Fine-tuning Llama 3.2 Vision 11B on GRAID-BDD improves performance across multiple benchmarks

A-OKVQA

+32.5%

64.02% → 84.80%

RealWorldQA

+26.28%

35.16% → 61.44%

BLINK (Overall)

+15.94%

25.72% → 41.66%

Notable BLINK Task Improvements:

Relative Depth:
+41.13%

Visual Correspondence:
+31.98%

Spatial Relations:
+30.77%

Relative Reflectance:
+28.36%

Example VQA Pairs

from GRAID-BDD dataset

Q: Are there less than 4 car(s) in this image? Respond Yes/No.

A: Yes

Q: Does the leftmost object in the image appear to be wider than it is tall?

A: Yes

Q: Is there at least one truck to the right of any traffic sign?

A: Yes

Q: Are there less than 2 car(s) in this image? Respond Yes/No.

A: No

Q: Are there less than 3 traffic light(s) in this image? Respond Yes/No.

A: No

Q: Rank the 2 kinds of objects that appear the largest (by pixel area) in the image from largest to smallest. Provide your answer as a comma-separated list of object names only.

A: car, traffic sign

BibTeX

@article{elmaaroufi2025graid,
  title={GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation},
  author={Elmaaroufi, Karim and Lai, Liheng and Svegliato, Justin and Bai, Yutong and Seshia, Sanjit A. and Zaharia, Matei},
  journal={arXiv preprint arXiv:2510.22118},
  year={2025}
}

GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation

91.16% human-validated accuracy GRAID generates high-quality spatial reasoning questions from simple 2D geometric relationships

What is GRAID?

8.5M+ High-Quality VQA Pairs

No Architecture Changes

Proven Transfer

Abstract

GRAID vs. Existing Frameworks

Dataset Quality: Human Validation

How GRAID Works

Input

SPARQ

Output

Question Diversity

Spatial Relations (54.0%)

Counting (26.9%)

Ranking & Extremes (15.1%)

Localization (2.6%)

Size & Aspect (1.4%)

GRAID Datasets

GRAID-BDD

GRAID-BDD (with depth)

GRAID-NuImages

GRAID-NuImages (with depth)

GRAID-Waymo Unique

GRAID-Waymo Unique (with depth)

Experimental Results

Generalization: Learning Transferable Spatial Concepts

Benchmark Performance

A-OKVQA

RealWorldQA

BLINK (Overall)

Notable BLINK Task Improvements:

Example VQA Pairs

BibTeX

91.16% human-validated accuracy
GRAID generates high-quality spatial reasoning questions from simple 2D geometric relationships