We introduce GGBench, a geometric generative reasoning benchmark purpose-built for unified multimodal models (UMMs). Unlike prior evaluations that treat discriminative understanding and unconstrained image generation separately, GGBench diagnoses whether a model can fuse language comprehension with precise visual construction. Geometric construction serves as an ideal testbed, revealing how well a system can actively reason and synthesize structured solutions across modalities.
Unified multimodal models (UMMs) herald a shift from passive perception toward proactive,
cross-modal generation. However, current benchmarks rarely stress-test whether these systems
can integrate reasoning with controlled synthesis. Most evaluations remain disjoint—either
probing language understanding or measuring image fidelity in isolation. As a result, we
still lack a principled way to measure generative reasoning.
GGBench closes this gap by framing geometric construction as a rigorous reasoning task. A
model must parse natural-language specifications, plan a construction, and render accurate
intermediate artifacts. This workflow surfaces fine-grained failure modes in alignment,
consistency, and controllability—dimensions that traditional captioning or free-form
generation overlook. By standardizing data, protocol, and diagnostics, GGBench offers
researchers a reproducible lens on how UMMs evolve from understanding to deliberate
problem solving.
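To make this workflow concrete, below is a minimal sketch of how a single construction item could be scored across the three stages. The `ConstructionItem` schema, the callable signatures, and the `judge` grader are illustrative assumptions for exposition, not the released GGBench interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical item schema; field names are illustrative, not the released GGBench format.
@dataclass
class ConstructionItem:
    item_id: str
    category: str              # taxonomy label, e.g. "angle bisector"
    instruction: str           # natural-language construction specification
    reference_plan: List[str]  # gold step-by-step construction plan
    reference_diagram: str     # path to the gold final diagram


def evaluate_item(
    item: ConstructionItem,
    generate_plan: Callable[[str], List[str]],         # model: instruction -> textual plan
    render_steps: Callable[[List[str]], List[bytes]],  # model: plan -> intermediate diagrams
    render_final: Callable[[List[str]], bytes],        # model: plan -> final diagram
    judge: Callable[[str, object, object], float],     # grader: (stage, prediction, reference) -> score
) -> Dict[str, float]:
    """Sketch of a three-stage protocol: planning, middle process, final result."""
    plan = generate_plan(item.instruction)   # stage 1: planning (scored as VLM-T)
    mids = render_steps(plan)                # stage 2: middle process (scored as VLM-I-Mid)
    final = render_final(plan)               # stage 3: final result (scored as VLM-I-Res)
    return {
        "VLM-T": judge("plan", plan, item.reference_plan),
        "VLM-I-Mid": judge("mid", mids, item.reference_plan),
        "VLM-I-Res": judge("final", final, item.reference_diagram),
    }
```

Keeping the model calls separate from the grader is what allows the stage-wise diagnostics described above: textual planning and visual rendering can be scored independently rather than collapsed into a single end-to-end number.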
We curate GGBench, a comprehensive benchmark that provides a standardized taxonomy and evaluation protocol, enabling consistent, category-wise assessment beyond surface-level metrics.
Figure: evaluation radar map and category distribution of GGBench.
Main results on GGBench. VLM-T scores the textual step-by-step reasoning and VLM-I scores diagram quality; the overall VLM-I averages the middle- and final-stage VLM-I scores. Columns are grouped into planning (VLM-T), middle process (VLM-I-Mid), final result (VLM-I-Res, LPIPS, PSNR, SSIM), and overall scores (VLM-I, Human). All values are percentages. A sketch of how the metrics can be computed follows the table.
| Model | VLM-T ↑ | VLM-I-Mid ↑ | VLM-I-Res ↑ | LPIPS (×10⁻²) ↓ | PSNR ↑ | SSIM (×10⁻²) ↑ | VLM-I ↑ | Human ↑ |
|---|---|---|---|---|---|---|---|---|
| **End-to-end UMMs** | | | | | | | | |
| Qwen-Image | — | — | 22.75 | 56.39 | 58.23 | 48.06 | 22.75 | 25.56 |
| Seedream 4.0 | — | — | 24.45 | 51.06 | 59.44 | 56.44 | 24.45 | 37.56 |
| Janus | 33.85 | 21.69 | 19.76 | 57.74 | 57.76 | 60.97 | 20.73 | 19.46 |
| BAGEL | 23.07 | 21.84 | 19.99 | 57.07 | 61.78 | 58.82 | 20.91 | 20.12 |
| Nano Banana | 58.54 | 44.83 | 22.81 | 51.85 | 64.53 | 59.51 | 33.82 | 45.75 |
| **LLMs/LRMs** | | | | | | | | |
| GPT-4o | 59.73 | 26.19 | 2.66 | 95.43 | 5.45 | 5.69 | 14.43 | 23.04 |
| GLM-4.5V | 53.32 | 25.63 | 5.02 | 52.91 | 12.19 | 12.94 | 15.33 | 30.14 |
| Qwen3-14B | 58.65 | 39.30 | 12.97 | 78.81 | 23.92 | 24.81 | 26.13 | 38.23 |
| Gemini 2.5 Pro | 38.50 | 37.41 | 15.80 | 68.39 | 37.17 | 39.73 | 26.61 | 44.68 |
| DeepSeek-R1 | 61.16 | 62.42 | 20.48 | 66.06 | 37.94 | 37.59 | 41.45 | 49.55 |
| GPT-4 | 55.66 | 50.39 | 20.30 | 67.35 | 35.26 | 38.31 | 33.04 | 55.99 |
| Qwen3-VL | 56.40 | 49.55 | 23.94 | 39.40 | 52.33 | 58.71 | 36.74 | 66.77 |
| DeepSeek-V3.1 | 60.24 | 73.13 | 26.41 | 57.21 | 48.33 | 50.12 | 49.77 | 68.12 |
| Claude Sonnet 4.5 | 61.19 | 77.92 | 30.29 | 52.22 | 51.74 | 50.52 | 54.11 | 72.12 |
| GPT-5 | 62.01 | 76.79 | 37.36 | 49.65 | 54.80 | 59.49 | 57.08 | 83.06 |
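For readers who want to reproduce the similarity columns, the sketch below shows one plausible way to compute PSNR, SSIM, and LPIPS with `scikit-image` and the `lpips` package, plus the averaging behind the overall VLM-I column. The exact preprocessing (resolution, color handling, LPIPS backbone) behind the reported numbers is an assumption here, not the official evaluation code.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def pixel_metrics(pred: np.ndarray, ref: np.ndarray) -> dict:
    """pred/ref: uint8 RGB arrays of identical shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ref, pred, data_range=255)
    ssim = structural_similarity(ref, pred, channel_axis=-1, data_range=255)

    # LPIPS expects float tensors in [-1, 1] with shape (N, 3, H, W).
    def to_tensor(im: np.ndarray) -> torch.Tensor:
        return torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0

    loss_fn = lpips.LPIPS(net="alex")  # backbone choice is an assumption
    with torch.no_grad():
        lp = loss_fn(to_tensor(pred), to_tensor(ref)).item()

    # SSIM and LPIPS are scaled by 100 to match the table's ×10⁻² convention.
    return {"PSNR": psnr, "SSIM (x1e-2)": ssim * 100, "LPIPS (x1e-2)": lp * 100}


# The overall VLM-I column averages the middle- and final-stage VLM-I scores,
# e.g. the Janus row: (21.69 + 19.76) / 2 = 20.725, reported as 20.73.
vlm_i_overall = (21.69 + 19.76) / 2
```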
@article{author2025yourproject,
title={Your Research Project Title},
author={Author One and Author Two and Author Three and Author Four},
journal={Journal Name},
year={2025}
}