We introduce GGBench, a geometric generative reasoning benchmark purpose-built for unified multimodal models (UMMs). Unlike prior evaluations that treat discriminative understanding and unconstrained image generation separately, GGBench diagnoses whether a model can fuse language comprehension with precise visual construction. Geometric construction serves as an ideal testbed, revealing how well a system can actively reason and synthesize structured solutions across modalities.
Unified multimodal models (UMMs) herald a shift from passive perception toward proactive,
cross-modal generation. However, current benchmarks rarely stress-test whether these systems
can integrate reasoning with controlled synthesis. Most evaluations remain disjoint—either
probing language understanding or measuring image fidelity in isolation. As a result, we
still lack a principled way to measure generative reasoning.
GGBench closes this gap by framing geometric construction as a rigorous reasoning task. A
model must parse natural-language specifications, plan a construction, and render accurate
intermediate artifacts. This workflow surfaces fine-grained failure modes in alignment,
consistency, and controllability—dimensions that traditional captioning or free-form
generation overlook. By standardizing data, protocol, and diagnostics, GGBench offers
researchers a reproducible lens on how UMMs evolve from understanding to deliberate
problem solving.
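To make this workflow concrete, below is a minimal sketch of how a single construction item could be scored across the three stages. The `ConstructionItem` schema, the callable signatures, and the `judge` grader are illustrative assumptions for exposition, not the released GGBench interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical item schema; field names are illustrative, not the released GGBench format.
@dataclass
class ConstructionItem:
    item_id: str
    category: str              # taxonomy label, e.g. "angle bisector"
    instruction: str           # natural-language construction specification
    reference_plan: List[str]  # gold step-by-step construction plan
    reference_diagram: str     # path to the gold final diagram


def evaluate_item(
    item: ConstructionItem,
    generate_plan: Callable[[str], List[str]],         # model: instruction -> textual plan
    render_steps: Callable[[List[str]], List[bytes]],  # model: plan -> intermediate diagrams
    render_final: Callable[[List[str]], bytes],        # model: plan -> final diagram
    judge: Callable[[str, object, object], float],     # grader: (stage, prediction, reference) -> score
) -> Dict[str, float]:
    """Sketch of a three-stage protocol: planning, middle process, final result."""
    plan = generate_plan(item.instruction)   # stage 1: planning (scored as VLM-T)
    mids = render_steps(plan)                # stage 2: middle process (scored as VLM-I-Mid)
    final = render_final(plan)               # stage 3: final result (scored as VLM-I-Res)
    return {
        "VLM-T": judge("plan", plan, item.reference_plan),
        "VLM-I-Mid": judge("mid", mids, item.reference_plan),
        "VLM-I-Res": judge("final", final, item.reference_diagram),
    }
```

Keeping the model calls separate from the grader is what allows the stage-wise diagnostics described above: textual planning and visual rendering can be scored independently rather than collapsed into a single end-to-end number.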
We curate GGBench, a comprehensive benchmark that provides a standardized taxonomy and evaluation protocol, enabling consistent, category-wise assessment beyond surface-level metrics.
Figure: evaluation radar map and category distribution of GGBench.
Main results on GGBench. VLM-T scores the textual step-by-step reasoning and VLM-I scores diagram quality; the overall VLM-I averages the middle- and final-stage VLM-I scores. Columns are grouped into planning (VLM-T), middle process (VLM-I-Mid), final result (VLM-I-Res, LPIPS, PSNR, SSIM), and overall scores (VLM-I, Human). All values are percentages. A sketch of how the metrics can be computed follows the table.
| Model | VLM-T ↑ | VLM-I-Mid ↑ | VLM-I-Res ↑ | LPIPS (×10⁻²) ↓ | PSNR ↑ | SSIM (×10⁻²) ↑ | VLM-I ↑ | Human ↑ |
|---|---|---|---|---|---|---|---|---|
| **End-to-end UMMs** | | | | | | | | |
| Qwen-Image | — | — | 22.75 | 56.39 | 58.23 | 48.06 | 22.75 | 25.56 |
| Seedream 4.0 | — | — | 24.45 | 51.06 | 59.44 | 56.44 | 24.45 | 37.56 |
| Janus | 33.85 | 21.69 | 19.76 | 57.74 | 57.76 | 60.97 | 20.73 | 19.46 |
| BAGEL | 23.07 | 21.84 | 19.99 | 57.07 | 61.78 | 58.82 | 20.91 | 20.12 |
| Nano Banana | 58.54 | 44.83 | 22.81 | 51.85 | 64.53 | 59.51 | 33.82 | 45.75 |
| **LLMs/LRMs** | | | | | | | | |
| GPT-4o | 59.73 | 26.19 | 2.66 | 95.43 | 5.45 | 5.69 | 14.43 | 23.04 |
| GLM-4.5V | 53.32 | 25.63 | 5.02 | 52.91 | 12.19 | 12.94 | 15.33 | 30.14 |
| Qwen3-14B | 58.65 | 39.30 | 12.97 | 78.81 | 23.92 | 24.81 | 26.13 | 38.23 |
| Gemini 2.5 Pro | 38.50 | 37.41 | 15.80 | 68.39 | 37.17 | 39.73 | 26.61 | 44.68 |
| DeepSeek-R1 | 61.16 | 62.42 | 20.48 | 66.06 | 37.94 | 37.59 | 41.45 | 49.55 |
| GPT-4 | 55.66 | 50.39 | 20.30 | 67.35 | 35.26 | 38.31 | 33.04 | 55.99 |
| Qwen3-VL | 56.40 | 49.55 | 23.94 | 39.40 | 52.33 | 58.71 | 36.74 | 66.77 |
| DeepSeek-V3.1 | 60.24 | 73.13 | 26.41 | 57.21 | 48.33 | 50.12 | 49.77 | 68.12 |
| Claude Sonnet 4.5 | 61.19 | 77.92 | 30.29 | 52.22 | 51.74 | 50.52 | 54.11 | 72.12 |
| GPT-5 | 62.01 | 76.79 | 37.36 | 49.65 | 54.80 | 59.49 | 57.08 | 83.06 |
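For readers who want to reproduce the similarity columns, the sketch below shows one plausible way to compute PSNR, SSIM, and LPIPS with `scikit-image` and the `lpips` package, plus the averaging behind the overall VLM-I column. The exact preprocessing (resolution, color handling, LPIPS backbone) behind the reported numbers is an assumption here, not the official evaluation code.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def pixel_metrics(pred: np.ndarray, ref: np.ndarray) -> dict:
    """pred/ref: uint8 RGB arrays of identical shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ref, pred, data_range=255)
    ssim = structural_similarity(ref, pred, channel_axis=-1, data_range=255)

    # LPIPS expects float tensors in [-1, 1] with shape (N, 3, H, W).
    def to_tensor(im: np.ndarray) -> torch.Tensor:
        return torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0

    loss_fn = lpips.LPIPS(net="alex")  # backbone choice is an assumption
    with torch.no_grad():
        lp = loss_fn(to_tensor(pred), to_tensor(ref)).item()

    # SSIM and LPIPS are scaled by 100 to match the table's ×10⁻² convention.
    return {"PSNR": psnr, "SSIM (x1e-2)": ssim * 100, "LPIPS (x1e-2)": lp * 100}


# The overall VLM-I column averages the middle- and final-stage VLM-I scores,
# e.g. the Janus row: (21.69 + 19.76) / 2 = 20.725, reported as 20.73.
vlm_i_overall = (21.69 + 19.76) / 2
```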
@article{author2025yourproject,
title={Your Research Project Title},
author={Author One and Author Two and Author Three and Author Four},
journal={Journal Name},
year={2025}
}