Figure 1: Pipeline Overview
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, and fundamentally hinders their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To move evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. A comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) shows that specialized T2I models are proficient in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures lag behind closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This suggests that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting both the internalization of world knowledge and generation quality.
Biology
Ecological succession, cellular dynamics, and biological growth processes.
Chemistry
Chemical reactions, phase changes, and molecular synthesis.
Culture
Historical events, artistic evolution, and societal transformations.
Geography
Tectonic shifts, erosion, volcanic activity, and landscape evolution.
Meteorology
Storm formation, cloud dynamics, and atmospheric phenomena.
Physics
Fluid dynamics, zero-gravity behaviors, classical mechanics, and forces.
Generating coherent 4-step sequences based on causal prompts.
Prompt: A bare patch of soil after a volcanic eruption, with scattered rocks and ash under clear daylight.
Explanation: Establishing the primary lifeless state with no vegetation.
Prompt: Small pioneer species like lichens and mosses beginning to colonize the barren volcanic rock.
Explanation: Pioneer species break down rock into soil, initiating transformation.
Prompt: Maturing site where grasses and small shrubs grow in deepening soil, attracting insects.
Explanation: Soil depth increases, supporting complex plants and fauna.
Prompt: A fully developed climax community with mature trees, dense undergrowth, and diverse animal life.
Explanation: Stable ecosystem endpoint resulting from the causal chain.
Prompt: A foggy London street in 1880 with gas lamps, cobblestones, and horse-drawn carriages.
Explanation: Establishing the pre-electrification baseline with traditional infrastructure.
Prompt: Workers excavating the road for Underground lines and installing electric streetlights.
Explanation: The causal trigger of new technology (electricity/steel) altering the urban landscape.
Prompt: Steel-framed buildings rising and early electric trams replacing horse-drawn wagons.
Explanation: Widespread adoption of industrial innovations changing architectural and transit patterns.
Prompt: A bustling 1900s London street with electric trams, automobiles, and subway entrances.
Explanation: The culmination of the Second Industrial Revolution, resulting in a modern electrified city.
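
The two examples above share one shape: four ordered prompts, each paired with a causal explanation. A minimal sketch of how such a sequence could be represented follows. The class and field names (`Stage`, `CausalSequence`, `domain`, `stages`) are hypothetical and are not taken from the Envision codebase; the prompt text is copied from the ecological succession example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One frame of a sequence (hypothetical structure, for illustration)."""
    prompt: str       # text sent to the image generator for this frame
    explanation: str  # why this frame follows causally from the previous one


@dataclass(frozen=True)
class CausalSequence:
    """A four-stage prompt chain within one benchmark domain."""
    domain: str
    stages: tuple[Stage, ...]

    def __post_init__(self) -> None:
        # The benchmark uses four-stage prompts, so reject anything else.
        if len(self.stages) != 4:
            raise ValueError(f"expected 4 stages, got {len(self.stages)}")

    def prompts(self) -> list[str]:
        """Return the stage prompts in causal order."""
        return [stage.prompt for stage in self.stages]


succession = CausalSequence(
    domain="Biology",
    stages=(
        Stage(
            "A bare patch of soil after a volcanic eruption, with scattered "
            "rocks and ash under clear daylight.",
            "Establishing the primary lifeless state with no vegetation.",
        ),
        Stage(
            "Small pioneer species like lichens and mosses beginning to "
            "colonize the barren volcanic rock.",
            "Pioneer species break down rock into soil, initiating transformation.",
        ),
        Stage(
            "Maturing site where grasses and small shrubs grow in deepening "
            "soil, attracting insects.",
            "Soil depth increases, supporting complex plants and fauna.",
        ),
        Stage(
            "A fully developed climax community with mature trees, dense "
            "undergrowth, and diverse animal life.",
            "Stable ecosystem endpoint resulting from the causal chain.",
        ),
    ),
)
```

Keeping the explanation next to each prompt means an evaluator can compare a generated frame against the intended causal step, not only against the prompt wording.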
A rigorous framework assessing 9 key sub-dimensions.
Semantic Consistency
Alignment between visual representation and conceptual meaning.
Factual Consistency
Adherence to empirical facts and scientific knowledge.
Spatial-Temporal
Coherence in spatial evolution and temporal progression.
Basic Properties
Accuracy in object quantities, shapes, and proportions.
Dynamics & Interactivity
Realism in motion trajectories, forces, and interactions.
Physical Reliability
Adherence to fundamental physical laws.
Expressiveness
Emotional impact and visual storytelling capabilities.
Artistic Quality
Overall aesthetic appeal, composition, and style coherence.
Authenticity
Believability and naturalness of generated elements.
Comprehensive performance comparison across domains and detailed dimensions.
| Model | Physics | Chemistry | Biology | Geography | Meteorology | Culture | Overall |
|---|---|---|---|---|---|---|---|
| Dedicated Text-to-Image Models | |||||||
| FLUX-dev | 37.62 | 58.86 | 57.12 | 57.27 | 58.75 | 51.01 | 53.44 |
| FLUX-pro-1.1 | 39.52 | 58.52 | 56.15 | 54.29 | 57.97 | 57.62 | 54.01 |
| FLUX-pro-1.1-ultra | 39.69 | 55.08 | 56.51 | 54.54 | 53.15 | 54.27 | 52.21 |
| FLUX-kontext-pro | 43.78 | 61.72 | 61.36 | 55.00 | 58.41 | 63.45 | 57.29 |
| FLUX-kontext-max | 42.82 | 58.72 | 62.96 | 60.99 | 62.40 | 57.76 | 57.61 |
| SD-3.5-flash | 35.61 | 40.43 | 53.73 | 50.72 | 49.12 | 51.69 | 46.88 |
| SD-3.5-medium | 36.89 | 41.30 | 51.61 | 57.47 | 53.68 | 47.13 | 48.01 |
| SD-3.5-large | 36.07 | 42.32 | 50.24 | 51.12 | 55.43 | 47.34 | 47.09 |
| Closed-Source T2I Models | |||||||
| GPT-4o | 58.87 | 66.55 | 78.55 | 78.40 | 78.69 | 81.83 | 73.81 |
| Gemini-2.5-Flash-Image | 57.47 | 62.91 | 67.63 | 75.38 | 69.74 | 69.94 | 67.18 |
| Unified Multimodal Models | |||||||
| Seedream 4.0 | 51.06 | 57.27 | 76.92 | 66.09 | 67.35 | 65.55 | 64.04 |
| Qwen-Image | 47.98 | 56.22 | 76.40 | 63.81 | 58.94 | 66.01 | 61.56 |
| Hunyuan Image 3.0 | 37.84 | 49.76 | 51.27 | 70.49 | 67.74 | 62.10 | 56.53 |
| Bagel | 39.40 | 56.25 | 57.65 | 51.00 | 58.20 | 72.40 | 55.82 |
| Janus-Pro-7B | 36.24 | 44.08 | 53.09 | 55.05 | 62.70 | 50.52 | 50.28 |
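
The Overall column in the table above appears to match an unweighted mean of the six domain scores, although the page does not say so explicitly. The sketch below checks that observation on three rows copied from the table; the helper names `overall` and `weakest_domain` are illustrative only.

```python
from statistics import mean

DOMAINS = ("Physics", "Chemistry", "Biology", "Geography", "Meteorology", "Culture")

# Per-domain scores copied from the table above (a three-row sample).
domain_scores = {
    "FLUX-dev": (37.62, 58.86, 57.12, 57.27, 58.75, 51.01),
    "Seedream 4.0": (51.06, 57.27, 76.92, 66.09, 67.35, 65.55),
    "Gemini-2.5-Flash-Image": (57.47, 62.91, 67.63, 75.38, 69.74, 69.94),
}


def overall(scores):
    """Unweighted mean of the six domain scores, rounded to two decimals."""
    return round(mean(scores), 2)


def weakest_domain(scores):
    """Name of the domain with the lowest score for one model."""
    return DOMAINS[scores.index(min(scores))]


for model, scores in domain_scores.items():
    print(f"{model}: overall={overall(scores)}, weakest={weakest_domain(scores)}")
```

For these three rows the computed means agree with the reported Overall values, and Physics is the lowest-scoring domain in each case.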
| Model | Sem. Cons. | Fact. Cons. | Spat. Temp. | Consist. Avg | Expr. | Art. Qual. | Auth. | Aesth. Avg | Phys. Rel. | Basic Prop. | Dyn. & Int. | Phys. Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dedicated Text-to-Image Models | ||||||||||||
| FLUX-dev | 54.11 | 56.30 | 40.98 | 50.37 | 71.05 | 75.66 | 53.02 | 66.60 | 45.23 | 58.97 | 45.84 | 49.96 |
| FLUX-pro-1.1 | 55.20 | 55.89 | 40.56 | 50.45 | 72.95 | 76.73 | 52.67 | 67.30 | 46.96 | 58.18 | 47.62 | 50.88 |
| FLUX-pro-1.1-ultra | 52.04 | 54.53 | 40.71 | 49.01 | 68.69 | 72.68 | 50.29 | 63.75 | 44.88 | 58.28 | 45.82 | 49.61 |
| FLUX-kontext-pro | 57.09 | 60.33 | 48.46 | 55.22 | 64.63 | 70.09 | 56.10 | 63.53 | 52.70 | 66.04 | 49.94 | 56.19 |
| FLUX-kontext-max | 60.85 | 63.46 | 46.06 | 56.48 | 68.42 | 71.94 | 55.83 | 65.30 | 51.24 | 64.38 | 49.07 | 54.86 |
| SD-3.5-flash | 48.12 | 46.05 | 36.56 | 42.44 | 62.92 | 71.76 | 48.08 | 60.79 | 41.08 | 52.20 | 39.87 | 44.35 |
| SD-3.5-medium | 49.79 | 50.89 | 35.64 | 44.08 | 61.65 | 67.86 | 48.86 | 59.35 | 41.66 | 54.88 | 42.29 | 46.23 |
| SD-3.5-large | 40.82 | 50.52 | 33.84 | 43.30 | 63.02 | 67.32 | 45.89 | 58.62 | 41.01 | 53.64 | 40.73 | 45.08 |
| Closed-Source T2I Models | ||||||||||||
| GPT-4o | 75.76 | 78.65 | 67.42 | 73.88 | 77.05 | 81.37 | 71.81 | 76.70 | 70.56 | 79.90 | 66.44 | 72.28 |
| Gemini-2.5-Flash-Image | 69.12 | 73.20 | 58.71 | 66.92 | 73.29 | 75.65 | 64.10 | 70.95 | 62.68 | 74.45 | 59.50 | 65.52 |
| Unified Multimodal Models | ||||||||||||
| Seedream 4.0 | 66.15 | 66.78 | 56.79 | 63.18 | 73.54 | 76.12 | 58.61 | 69.32 | 60.05 | 69.83 | 56.91 | 62.24 |
| Qwen-Image | 61.16 | 59.88 | 53.20 | 58.03 | 79.54 | 82.57 | 58.05 | 73.23 | 55.97 | 67.11 | 54.69 | 59.22 |
| Hunyuan Image 3.0 | 57.42 | 58.14 | 46.03 | 53.78 | 73.08 | 76.05 | 53.48 | 67.40 | 49.54 | 61.34 | 50.70 | 53.81 |
| Bagel | 58.61 | 57.50 | 51.11 | 55.70 | 65.84 | 69.17 | 48.89 | 61.17 | 51.94 | 61.94 | 46.11 | 53.32 |
| Janus-Pro-7B | 52.29 | 55.05 | 41.51 | 49.54 | 63.26 | 64.86 | 45.96 | 57.90 | 44.40 | 53.64 | 44.63 | 47.52 |
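
Envision-Score is described as integrating consistency, physicality, and aesthetics, but this page does not give the aggregation formula. The sketch below shows one plausible form, a weighted mean of the three dimension averages. The weights (0.4, 0.4, 0.2) are an assumption, chosen only because they land close to the reported Overall values for the rows checked; they are not confirmed by the source, and the function name `envision_score_sketch` is hypothetical.

```python
# ASSUMPTION: these weights are not stated on this page. They are a guess
# that roughly reproduces the reported Overall scores for sampled rows.
ASSUMED_WEIGHTS = {"consistency": 0.4, "physicality": 0.4, "aesthetics": 0.2}


def envision_score_sketch(consistency, physicality, aesthetics,
                          weights=ASSUMED_WEIGHTS):
    """Weighted mean of the three dimension averages (illustrative only)."""
    total_weight = sum(weights.values())
    weighted_sum = (
        weights["consistency"] * consistency
        + weights["physicality"] * physicality
        + weights["aesthetics"] * aesthetics
    )
    return weighted_sum / total_weight


# Dimension averages (Consist. Avg, Phys. Avg, Aesth. Avg) from the table above.
flux_dev = envision_score_sketch(50.37, 49.96, 66.60)
gpt_4o = envision_score_sketch(73.88, 72.28, 76.70)

print(f"FLUX-dev: {flux_dev:.2f} (reported Overall: 53.44)")
print(f"GPT-4o:   {gpt_4o:.2f} (reported Overall: 73.81)")
```

Under this assumed weighting, strong aesthetics cannot compensate for weak consistency or physicality, in line with the finding that specialized T2I models render well yet score lower overall.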
We invite the community to join us in extending and refining the Envision benchmark.
Evaluate and add results for the latest text-to-image or multimodal models.
We are committed to regularly updating the leaderboard with the latest results.
Submit Pull Request: Submit your contributions via GitHub PR.
@misc{tian2025envisionbenchmarkingunifiedunderstanding,
title={Envision: Benchmarking Unified Understanding \& Generation for Causal World Process Insights},
author={Juanxi Tian and Siyuan Li and Conghui He and Lijun Wu and Cheng Tan},
year={2025},
eprint={2512.01816},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.01816},
}