GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation

¹University of California, Berkeley  ²Models for Embodied and Spatial Harmony

91.16% human-validated accuracy
GRAID generates high-quality spatial reasoning questions from simple 2D geometric relationships

What is GRAID?

Key Insight: Qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. No need for 3D reconstruction or generative models.
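As a concrete illustration of this insight, relations such as "left of" or "appears larger" can be decided directly from detector bounding boxes. The sketch below is minimal and uses illustrative names and thresholds, not GRAID's exact predicate definitions.

from dataclasses import dataclass

@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

def left_of(a: Box, b: Box) -> bool:
    # a is strictly left of b when a's right edge lies left of b's left edge
    return a.x2 < b.x1

def appears_larger(a: Box, b: Box, margin: float = 1.2) -> bool:
    # a clearly appears larger than b when its box area exceeds b's by a margin
    return a.area > margin * b.area

car = Box("car", 40, 210, 180, 300)
bus = Box("bus", 420, 150, 760, 330)
print(left_of(car, bus))         # True
print(appears_larger(bus, car))  # True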


8.5M+ High-Quality VQA Pairs

Across 3 datasets with 91.16% human-validated accuracy on GRAID-BDD

No Architecture Changes

Works with any VLM: just plug your object detector into GRAID to generate training data


Proven Transfer

Generalizes across datasets and benchmarks with significant improvements

Abstract

Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a human validation rate of only 57.6%. This low rate stems from limitations of current approaches: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations.

We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, producing datasets of higher quality than those of existing tools, as validated by human evaluations.

We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy, compared to 57.6% for a dataset generated by recent work.

Critically, we demonstrate that models trained on GRAID data learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2 11B, and, when trained on all question types, improve on several existing benchmarks such as BLINK.

GRAID vs. Existing Frameworks

Comparison of spatial reasoning data generation approaches

Unlike SpatialVLM, SpatialRGPT, and SpaRE, GRAID offers all of the following:
Can operate on images only
No VLM architecture changes needed
No lengthy captions required
Avoids single-view 3D reconstruction
Avoids LLM-based QA generation
Open-source implementation by the authors

Dataset Quality: Human Validation

57.6%: Existing methods (SpatialVLM community implementation)
91.16%: GRAID (human evaluations on GRAID-BDD VQA pairs)

We evaluated 317 VQA pairs from GRAID-BDD and 250 examples from OpenSpaces, a dataset created by the community implementation of SpatialVLM, with four human evaluators. The evaluators found fewer than 9% of GRAID VQA pairs to be invalid or confusing, compared to over 42% for existing methods. This improvement stems from GRAID's principled 2D geometric approach, which avoids cascading errors from 3D reconstruction and hallucinations from generative models.

How GRAID Works

A simple, three-step framework

1. Input: images + an object detector → 2D bounding boxes
2. SPARQ: predicates sieve candidate templates, which are then realized into questions
3. Output: high-quality VQA pairs

SPARQ (Sieve Predicates And Realize Questions) runs lightweight predicate checks before attempting to generate questions, achieving up to a 1400× speedup by skipping computationally expensive templates that cannot apply to a given image. This enables scalable generation of millions of VQA pairs in hours.
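The core of SPARQ is a sieve-then-realize loop: a cheap predicate decides whether a template can possibly apply to an image's detections before the expensive realization step enumerates concrete questions. The sketch below illustrates the pattern with a hypothetical HowMany template; the names and predicate logic are illustrative, not the released implementation.

from typing import Callable, Iterable

Detection = dict              # e.g. {"label": "car", "box": (x1, y1, x2, y2)}
QAPair = tuple[str, str]      # (question, answer)

class Template:
    def __init__(self, name: str,
                 sieve: Callable[[list], bool],
                 realize: Callable[[list], Iterable[QAPair]]):
        self.name = name
        self.sieve = sieve      # cheap predicate: can this template apply at all?
        self.realize = realize  # expensive step: enumerate concrete QA pairs

def generate(detections: list, templates: list) -> list:
    pairs = []
    for t in templates:
        if not t.sieve(detections):  # skip expensive realization early
            continue
        pairs.extend(t.realize(detections))
    return pairs

def how_many_sieve(dets: list) -> bool:
    labels = [d["label"] for d in dets]
    return any(labels.count(l) > 1 for l in set(labels))

def how_many_realize(dets: list) -> Iterable[QAPair]:
    labels = [d["label"] for d in dets]
    for l in sorted(set(labels)):
        if labels.count(l) > 1:
            yield (f"How many {l}s are in the image?", str(labels.count(l)))

templates = [Template("HowMany", how_many_sieve, how_many_realize)]
dets = [{"label": "car", "box": (0, 0, 10, 10)},
        {"label": "car", "box": (20, 0, 30, 10)}]
print(generate(dets, templates))  # [('How many cars are in the image?', '2')]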

Question Diversity

5.3M questions across 5 cognitive categories in GRAID-BDD

Question types distribution

Spatial Relations (54.0%)

LeftOf, RightOf, Closer, Farther, etc.

Counting (26.9%)

HowMany, AreMore, WhichMore, etc.

Ranking & Extremes (15.1%)

LargestAppearance, MostAppearance, LeftMost, RightMost, etc.

Localization (2.6%)

IsObjectCentered, Quadrants, Grid Location, Thirds Location, etc.

Size & Aspect (1.4%)

WidthVsHeight, Biggest Box, Leftmost/Rightmost Dimensions, etc.
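Categories such as Localization and Size & Aspect reduce to simple box geometry. The sketch below shows a quadrant check and an "is the object centered?" check; the rules and tolerances are illustrative, not GRAID's exact definitions.

def box_center(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2

def quadrant(box, img_w, img_h):
    cx, cy = box_center(box)
    horiz = "left" if cx < img_w / 2 else "right"
    vert = "top" if cy < img_h / 2 else "bottom"
    return f"{vert}-{horiz}"

def is_centered(box, img_w, img_h, tol=0.1):
    # Centered if the box center lies within tol * image size of the image center.
    cx, cy = box_center(box)
    return abs(cx - img_w / 2) <= tol * img_w and abs(cy - img_h / 2) <= tol * img_h

print(quadrant((40, 210, 180, 300), img_w=1280, img_h=720))      # 'top-left'
print(is_centered((600, 330, 680, 390), img_w=1280, img_h=720))  # True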

GRAID Datasets

Autonomous driving datasets are among the largest and most accurately labeled object detection datasets available, so we use three of them as input to GRAID to generate over 8.5M high-quality VQA pairs.

GRAID-BDD: 3.82M VQA pairs (3.34M train / 485k val), 18 question types without depth

GRAID-BDD (with depth): 5.30M VQA pairs (4.63M train / 672k val), 22 question types with depth

GRAID-NuImages: 2.41M VQA pairs (1.94M train / 478k val), 18 question types without depth

GRAID-NuImages (with depth): 3.29M VQA pairs (2.65M train / 641k val), 22 question types with depth

GRAID-Waymo Unique: 13.8k VQA pairs (10.9k train / 2.79k val), 18 question types without depth

GRAID-Waymo Unique (with depth): 16.4k VQA pairs (13.1k train / 3.33k val), 22 question types with depth

Experimental Results

Generalization: Learning Transferable Spatial Concepts

Fine-tuning Llama 3.2 Vision 11B with LoRA on just 6 question types from GRAID-BDD

+47.5% overall improvement on GRAID-BDD, trained on only 6 question types
4/5 → 5/5 cognitive categories: trained on 4, improved on all 5, including unseen categories
+38.0% cross-dataset transfer to NuImages, a new dataset with different scenes

Models trained on only 6 of the 19 question types, representing 4 of the 5 cognitive categories in GRAID-BDD, improve across all 19 question types and all 5 categories, and generalize to NuImages, a completely different dataset with distinct scene distributions. This demonstrates that GRAID teaches fundamental spatial reasoning concepts that transfer rather than mere template memorization.
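For reference, below is a rough sketch of what a LoRA fine-tuning setup for Llama 3.2 Vision 11B might look like using the Hugging Face transformers and peft libraries. The model ID is the public Instruct checkpoint; the rank, alpha, target modules, and training loop are assumptions for illustration, not the paper's reported configuration.

import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Illustrative LoRA hyperparameters (not the paper's exact configuration).
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Train with any standard supervised fine-tuning loop on (image, question, answer)
# examples drawn from the 6 selected GRAID-BDD question types.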

Benchmark Performance

Fine-tuning Llama 3.2 Vision 11B on GRAID-BDD improves performance across multiple benchmarks

A-OKVQA: 64.02% → 84.80% (+32.5% relative)

RealWorldQA: 35.16% → 61.44% (+26.28 points)

BLINK (Overall): 25.72% → 41.66% (+15.94 points)

Example VQA Pairs

from the GRAID-BDD dataset

BibTeX

@article{elmaaroufi2025graid,
  title={GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation},
  author={Elmaaroufi, Karim and Lai, Liheng and Svegliato, Justin and Bai, Yutong and Seshia, Sanjit A. and Zaharia, Matei},
  journal={arXiv preprint arXiv:2510.22118},
  year={2025}
}