TL;DR

We introduce GGBench, a geometric generative reasoning benchmark purpose-built for unified multimodal models (UMMs). Unlike prior evaluations that treat discriminative understanding and unconstrained image generation separately, GGBench diagnoses whether a model can fuse language comprehension with precise visual construction. Geometric construction serves as an ideal testbed, revealing how well a system can actively reason and synthesize structured solutions across modalities.

Introduction

Unified multimodal models (UMMs) herald a shift from passive perception toward proactive, cross-modal generation. However, current benchmarks rarely stress-test whether these systems can integrate reasoning with controlled synthesis. Most evaluations remain disjoint—either probing language understanding or measuring image fidelity in isolation. As a result, we still lack a principled way to measure generative reasoning.

GGBench closes this gap by framing geometric construction as a rigorous reasoning task. A model must parse natural-language specifications, plan a construction, and render accurate intermediate artifacts. This workflow surfaces fine-grained failure modes in alignment, consistency, and controllability—dimensions that traditional captioning or free-form generation overlook. By standardizing data, protocol, and diagnostics, GGBench offers researchers a reproducible lens on how UMMs evolve from understanding to deliberate problem solving.
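The parse-plan-render workflow can be made concrete with a minimal sketch. Everything below is a hypothetical illustration: the ConstructionTask fields, the example instruction, and the plan text are assumptions for exposition, not GGBench's actual schema or data format.

from dataclasses import dataclass, field

# Hypothetical representation of one construction task; field names are
# illustrative assumptions, not GGBench's actual schema.
@dataclass
class ConstructionTask:
    instruction: str                                       # natural-language specification
    plan: list[str] = field(default_factory=list)          # model-produced construction steps
    step_images: list[str] = field(default_factory=list)   # paths to intermediate renders
    final_image: str | None = None                         # path to the finished diagram

task = ConstructionTask(
    instruction="Construct the perpendicular bisector of segment AB.",
)

# A model under evaluation would (1) parse the instruction, (2) emit a plan,
# and (3) render intermediate and final diagrams for judging.
task.plan = [
    "Draw circles of equal radius centered at A and B.",
    "Mark the two intersection points of the circles.",
    "Draw the line through the two intersection points.",
]

print(task.instruction)
for i, step in enumerate(task.plan, 1):
    print(f"Step {i}: {step}")

Each stage of this workflow is scored separately, which is what allows the benchmark to localize failures in planning, intermediate construction, or final rendering rather than reporting a single opaque number.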

Benchmark

We curate GGBench, a comprehensive benchmark that provides a standardized taxonomy and evaluation protocol, enabling consistent, category-wise assessment beyond surface-level metrics.
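As a rough illustration of what category-wise assessment looks like in practice, the sketch below groups per-item scores by taxonomy category and averages them. The category names and scores are placeholders, not GGBench's actual taxonomy or results.

from collections import defaultdict

# Hypothetical per-item results: (category, score). Placeholders for
# illustration only, not actual GGBench data.
results = [
    ("angle bisection", 72.0),
    ("angle bisection", 64.5),
    ("perpendicular construction", 58.0),
    ("circle tangency", 41.0),
]

by_category = defaultdict(list)
for category, score in results:
    by_category[category].append(score)

# Per-category means support fine-grained comparison beyond one aggregate number.
for category, scores in sorted(by_category.items()):
    print(f"{category}: {sum(scores) / len(scores):.2f}")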



Figure: Evaluation Radar Map

Figure: Category Distribution

Leaderboard

Main results on GGBench. VLM-T scores the planning stage (step-by-step reasoning); VLM-I-Mid scores the intermediate diagrams of the middle process; VLM-I-Res, LPIPS, PSNR, and SSIM assess the final result; VLM-I and Human are overall scores, with the overall VLM-I computed as the average of the middle and final stages (a worked example of this averaging follows the table). Higher is better for every column except LPIPS, where lower is better. VLM and Human scores are percentages; dashes mark values not reported for a model.

Model                VLM-T ↑   VLM-I-Mid ↑   VLM-I-Res ↑   LPIPS (×10⁻²) ↓   PSNR ↑   SSIM (×10⁻²) ↑   VLM-I ↑   Human ↑

End-to-end UMMs
Qwen-Image           –         –             22.75         56.39             58.23    48.06            22.75     25.56
Seedream 4.0         –         –             24.45         51.06             59.44    56.44            24.45     37.56
Janus                33.85     21.69         19.76         57.74             57.76    60.97            20.73     19.46
BAGEL                23.07     21.84         19.99         57.07             61.78    58.82            20.91     20.12
Nano Banana          58.54     44.83         22.81         51.85             64.53    59.51            33.82     45.75

LLMs/LRMs
GPT-4o               59.73     26.19         2.66          95.43             5.45     5.69             14.43     23.04
GLM-4.5V             53.32     25.63         5.02          52.91             12.19    12.94            15.33     30.14
Qwen3-14B            58.65     39.30         12.97         78.81             23.92    24.81            26.13     38.23
Gemini 2.5 Pro       38.50     37.41         15.80         68.39             37.17    39.73            26.61     44.68
DeepSeek-R1          61.16     62.42         20.48         66.06             37.94    37.59            41.45     49.55
GPT-4                55.66     50.39         20.30         67.35             35.26    38.31            33.04     55.99
Qwen3-VL             56.40     49.55         23.94         39.40             52.33    58.71            36.74     66.77
DeepSeek-V3.1        60.24     73.13         26.41         57.21             48.33    50.12            49.77     68.12
Claude Sonnet 4.5    61.19     77.92         30.29         52.22             51.74    50.52            54.11     72.12
GPT-5                62.01     76.79         37.36         49.65             54.80    59.49            57.08     83.06
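To make the averaging rule from the caption explicit, the minimal sketch below reproduces the overall VLM-I value for the Nano Banana row. The helper function name is ours; the formula simply averages the middle-stage and final-stage VLM-I scores, as stated in the caption.

def overall_vlm_i(vlm_i_mid: float, vlm_i_res: float) -> float:
    """Overall VLM-I = mean of the middle-stage and final-stage VLM-I scores."""
    return (vlm_i_mid + vlm_i_res) / 2

# Sanity check against the Nano Banana row of the leaderboard:
# VLM-I-Mid = 44.83, VLM-I-Res = 22.81, reported overall VLM-I = 33.82
print(f"{overall_vlm_i(44.83, 22.81):.2f}")  # prints 33.82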

BibTeX

@article{author2025yourproject,
  title={Your Research Project Title},
  author={Author One and Author Two and Author Three and Author Four},
  journal={Journal Name},
  year={2025}
}