Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

Shanghai Artificial Intelligence Laboratory


Abstract

Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation encourages overfitting to static pattern matching and semantic fusion, and fundamentally hinders their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event-progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and comprises 1,000 four-stage prompts spanning six scientific and humanities domains. To move evaluation from single images to sequential frames and to assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. A comprehensive evaluation of 15 models (10 specialized T2I models and 5 unified models) shows that specialized T2I models excel at aesthetic rendering yet lack intrinsic world knowledge, whereas unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. Even these unified architectures, however, remain behind closed-source models and struggle with the core challenge of spatiotemporal consistency. These findings demonstrate that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world-knowledge internalization and generation.

Six Core Domains

Biology

Ecological succession, cellular dynamics, and biological growth processes.

Chemistry

Chemical reactions, phase changes, and molecular synthesis.

Culture

Historical events, artistic evolution, and societal transformations.

Geography

Tectonic shifts, erosion, volcanic activity, and landscape evolution.

Meteorology

Storm formation, cloud dynamics, and atmospheric phenomena.

Physics

Fluid dynamics, zero-gravity behaviors, classical mechanics, and forces.

Task Examples

Generating coherent 4-step sequences based on causal prompts.
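The four-stage structure can be illustrated with a small sketch. The domain label follows the six domains above, but the stage texts below are hypothetical examples written for illustration, not actual benchmark entries:

```python
# A hypothetical four-stage causal prompt in the Envision style.
# The stage texts are illustrative, not drawn from the benchmark.
prompt = {
    "domain": "Meteorology",
    "stages": [
        "Stage 1: Warm, moist air rises over a calm sea under a clear sky.",
        "Stage 2: Cumulus clouds form and tower as the updraft strengthens.",
        "Stage 3: The cloud becomes a cumulonimbus; rain and lightning begin.",
        "Stage 4: The storm dissipates, leaving scattered clouds behind.",
    ],
}

# Each stage is rendered as one frame; the model must keep the scene
# consistent while advancing the causal process across the four images.
for i, stage in enumerate(prompt["stages"], start=1):
    assert stage.startswith(f"Stage {i}")
```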

Evaluation Protocol

A rigorous framework assessing nine key sub-dimensions across three axes: consistency, physicality, and aesthetics.


Consistency

Semantic Consistency

Alignment between visual representation and conceptual meaning.

Factual Consistency

Adherence to empirical facts and scientific knowledge.

Spatial-Temporal

Coherence in spatial evolution and temporal progression.

Physicality

Basic Properties

Accuracy in object quantities, shapes, and proportions.

Dynamics & Interactivity

Realism in motion trajectories, forces, and interactions.

Physical Reliability

Adherence to fundamental physical laws.

Aesthetic

Expressiveness

Emotional impact and visual storytelling capabilities.

Artistic Quality

Overall aesthetic appeal, composition, and style coherence.

Authenticity

Believability and naturalness of generated elements.
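The nine sub-dimensions above group into three category averages and an overall Envision-Score. The page does not state the aggregation weights, so the sketch below assumes simple unweighted means; the averages in the detailed table do not exactly match unweighted means of their sub-dimensions, so the official metric likely weights them differently, and this is illustrative only:

```python
# Sketch of rolling up the nine sub-dimension scores into category
# averages and an overall score. Unweighted means are an assumption,
# not the official Envision-Score weighting.
SUB_DIMENSIONS = {
    "consistency": ["semantic", "factual", "spatiotemporal"],
    "physicality": ["basic_properties", "dynamics_interactivity",
                    "physical_reliability"],
    "aesthetic": ["expressiveness", "artistic_quality", "authenticity"],
}

def envision_score(scores):
    """Aggregate per-sub-dimension scores (0-100) into per-category
    averages plus an overall mean of the three categories."""
    result = {}
    for category, subs in SUB_DIMENSIONS.items():
        result[category] = sum(scores[s] for s in subs) / len(subs)
    result["overall"] = sum(result[c] for c in SUB_DIMENSIONS) / len(SUB_DIMENSIONS)
    return result
```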

Leaderboard

Comprehensive performance comparison across domains and detailed dimensions.

Domain-Specific Performance

| Model | Physics | Chemistry | Biology | Geography | Meteorology | Culture | Overall |
|---|---|---|---|---|---|---|---|
| **Dedicated Text-to-Image Models** | | | | | | | |
| FLUX-dev | 37.62 | 58.86 | 57.12 | 57.27 | 58.75 | 51.01 | 53.44 |
| FLUX-pro-1.1 | 39.52 | 58.52 | 56.15 | 54.29 | 57.97 | 57.62 | 54.01 |
| FLUX-pro-1.1-ultra | 39.69 | 55.08 | 56.51 | 54.54 | 53.15 | 54.27 | 52.21 |
| FLUX-kontext-pro | 43.78 | 61.72 | 61.36 | 55.00 | 58.41 | 63.45 | 57.29 |
| FLUX-kontext-max | 42.82 | 58.72 | 62.96 | 60.99 | 62.40 | 57.76 | 57.61 |
| SD-3.5-flash | 35.61 | 40.43 | 53.73 | 50.72 | 49.12 | 51.69 | 46.88 |
| SD-3.5-medium | 36.89 | 41.30 | 51.61 | 57.47 | 53.68 | 47.13 | 48.01 |
| SD-3.5-large | 36.07 | 42.32 | 50.24 | 51.12 | 55.43 | 47.34 | 47.09 |
| **Closed-Source T2I Models** | | | | | | | |
| GPT-4o | 58.87 | 66.55 | 78.55 | 78.40 | 78.69 | 81.83 | 73.81 |
| Gemini-2.5-Flash-Image | 57.47 | 62.91 | 67.63 | 75.38 | 69.74 | 69.94 | 67.18 |
| **Unified Multimodal Models** | | | | | | | |
| Seedream 4.0 | 51.06 | 57.27 | 76.92 | 66.09 | 67.35 | 65.55 | 64.04 |
| Qwen-Image | 47.98 | 56.22 | 76.40 | 63.81 | 58.94 | 66.01 | 61.56 |
| Hunyuan Image 3.0 | 37.84 | 49.76 | 51.27 | 70.49 | 67.74 | 62.10 | 56.53 |
| Bagel | 39.40 | 56.25 | 57.65 | 51.00 | 58.20 | 72.40 | 55.82 |
| Janus-Pro-7B | 36.24 | 44.08 | 53.09 | 55.05 | 62.70 | 50.52 | 50.28 |
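The Overall column appears to be the unweighted mean of the six domain scores, rounded to two decimals. A quick check against two rows of the table (FLUX-dev and Seedream 4.0, scores copied from above):

```python
# Verify that Overall = mean of the six per-domain scores, using two
# rows taken verbatim from the domain table.
def overall(domain_scores):
    """Mean of the six per-domain scores, rounded as in the table."""
    return round(sum(domain_scores) / len(domain_scores), 2)

flux_dev = [37.62, 58.86, 57.12, 57.27, 58.75, 51.01]  # reported Overall: 53.44
seedream = [51.06, 57.27, 76.92, 66.09, 67.35, 65.55]  # reported Overall: 64.04

assert overall(flux_dev) == 53.44
assert overall(seedream) == 64.04
```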

Detailed Evaluation Dimensions

| Model | Sem. Cons. | Fact. Cons. | Spat.-Temp. | Consist. Avg | Expr. | Art. Qual. | Auth. | Aesth. Avg | Phys. Rel. | Basic Prop. | Dyn. & Int. | Phys. Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Dedicated Text-to-Image Models** | | | | | | | | | | | | |
| FLUX-dev | 54.11 | 56.30 | 40.98 | 50.37 | 71.05 | 75.66 | 53.02 | 66.60 | 45.23 | 58.97 | 45.84 | 49.96 |
| FLUX-pro-1.1 | 55.20 | 55.89 | 40.56 | 50.45 | 72.95 | 76.73 | 52.67 | 67.30 | 46.96 | 58.18 | 47.62 | 50.88 |
| FLUX-pro-1.1-ultra | 52.04 | 54.53 | 40.71 | 49.01 | 68.69 | 72.68 | 50.29 | 63.75 | 44.88 | 58.28 | 45.82 | 49.61 |
| FLUX-kontext-pro | 57.09 | 60.33 | 48.46 | 55.22 | 64.63 | 70.09 | 56.10 | 63.53 | 52.70 | 66.04 | 49.94 | 56.19 |
| FLUX-kontext-max | 60.85 | 63.46 | 46.06 | 56.48 | 68.42 | 71.94 | 55.83 | 65.30 | 51.24 | 64.38 | 49.07 | 54.86 |
| SD-3.5-flash | 48.12 | 46.05 | 36.56 | 42.44 | 62.92 | 71.76 | 48.08 | 60.79 | 41.08 | 52.20 | 39.87 | 44.35 |
| SD-3.5-medium | 49.79 | 50.89 | 35.64 | 44.08 | 61.65 | 67.86 | 48.86 | 59.35 | 41.66 | 54.88 | 42.29 | 46.23 |
| SD-3.5-large | 40.82 | 50.52 | 33.84 | 43.30 | 63.02 | 67.32 | 45.89 | 58.62 | 41.01 | 53.64 | 40.73 | 45.08 |
| **Closed-Source T2I Models** | | | | | | | | | | | | |
| GPT-4o | 75.76 | 78.65 | 67.42 | 73.88 | 77.05 | 81.37 | 71.81 | 76.70 | 70.56 | 79.90 | 66.44 | 72.28 |
| Gemini-2.5-Flash-Image | 69.12 | 73.20 | 58.71 | 66.92 | 73.29 | 75.65 | 64.10 | 70.95 | 62.68 | 74.45 | 59.50 | 65.52 |
| **Unified Multimodal Models** | | | | | | | | | | | | |
| Seedream 4.0 | 66.15 | 66.78 | 56.79 | 63.18 | 73.54 | 76.12 | 58.61 | 69.32 | 60.05 | 69.83 | 56.91 | 62.24 |
| Qwen-Image | 61.16 | 59.88 | 53.20 | 58.03 | 79.54 | 82.57 | 58.05 | 73.23 | 55.97 | 67.11 | 54.69 | 59.22 |
| Hunyuan Image 3.0 | 57.42 | 58.14 | 46.03 | 53.78 | 73.08 | 76.05 | 53.48 | 67.40 | 49.54 | 61.34 | 50.70 | 53.81 |
| Bagel | 58.61 | 57.50 | 51.11 | 55.70 | 65.84 | 69.17 | 48.89 | 61.17 | 51.94 | 61.94 | 46.11 | 53.32 |
| Janus-Pro-7B | 52.29 | 55.05 | 41.51 | 49.54 | 63.26 | 64.86 | 45.96 | 57.90 | 44.40 | 53.64 | 44.63 | 47.52 |

Call for Contributions

We invite the community to join us in extending and refining the Envision benchmark.

New Models

Evaluate and add results for the latest text-to-image or multimodal models.

We are committed to regularly updating the leaderboard with the latest results.

Submit Pull Request

Submit your contributions via GitHub PR

BibTeX

@misc{tian2025envisionbenchmarkingunifiedunderstanding,
      title={Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights}, 
      author={Juanxi Tian and Siyuan Li and Conghui He and Lijun Wu and Cheng Tan},
      year={2025},
      eprint={2512.01816},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.01816}, 
}