# HiDream Open-Sources an 8B-Parameter Unified-Architecture Image Model, Challenging the Traditional Diffusion Pipeline

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-05-19 02:03
- AIHOT score: 57
- AIHOT link: https://aihot.virxact.com/items/cmpbitzpl17r9slnz3pfgngl9
- Original link: https://x.com/rohanpaul_ai/status/2056435573855006984

## AI Summary

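HiDream has open-sourced HiDream-O1-Image, an 8B-parameter model. Its core innovation is a pixel-level unified transformer: a single architecture directly processes raw image patches, text, and task conditions, so that text-to-image generation, editing, personalization, and other tasks are all treated as in-context generation, without the traditional VAE and text-encoder pipeline. The model has a built-in reasoning prompt agent and natively supports high-resolution synthesis up to 2048×2048. On performance, it reaches a comparable level with only about one third of the parameters of some comparable models, and it does especially well on text rendering, where its results come close to those of much larger models.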

## Full Text

HiDream just open-sourced an 8B image model with a big message behind it: the old diffusion pipeline (VAE plus text encoder) may not be the only serious path left.

HiDream-O1-Image (8B) claims parity with models over 3x its size (e.g., 27B Qwen-Image).

@HiDream_AI, @vivago_ai

Key Features

🧬 Pixel-Level Unified Transformer - One end-to-end model on raw pixels, no VAE, no disjoint text encoder.

🎨 One Model, Many Tasks - Text-to-image, long-text rendering, instruction editing, subject-driven personalization, and storyboard generation in a single architecture.

🧠 Reasoning-Driven Prompt Agent - Built-in "thinking" agent that resolves implicit knowledge, layout, and text rendering before generation.

🖼️ Native High Resolution - Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.

⚡ Exceptional Efficiency and Versatility at 8B Scale - With only 8B parameters, it is claimed to match or even surpass larger open-source DiTs and leading closed-source models.

Most image models still split the job across a text encoder, a VAE, and a diffusion model, so details can get lost when real pixels are compressed into latent image codes.

HiDream-O1-Image removes that split by using a Pixel-level Unified Transformer, where raw image patches, text tokens, and task conditions enter the same model space.
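
As a rough illustration of what "raw image patches and text tokens in the same model space" can mean, here is a minimal sketch. The patch size of 32, the embedding width of 64, the random projection standing in for a learned embedding, and the names `patchify` and `unpatchify` are all assumptions made for illustration; the post does not say how HiDream-O1-Image actually tokenizes pixels.

```python
import numpy as np

PATCH = 32  # assumed patch size; not stated in the post


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping pixel patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens: np.ndarray, h: int, w: int, patch: int = PATCH) -> np.ndarray:
    """Inverse of patchify: rebuild the (H, W, C) image from its patches."""
    gh, gw = h // patch, w // patch
    c = tokens.shape[1] // (patch * patch)
    grid = tokens.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(h, w, c)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

# Raw pixels become tokens directly: no VAE latent in between.
pixel_tokens = patchify(image)  # shape (64, 3072)

# Hypothetical stand-ins for learned embeddings (random here).
dim = 64
proj = rng.normal(size=(pixel_tokens.shape[1], dim))
text_embeds = rng.normal(size=(12, dim))       # 12 pretend text tokens
image_embeds = (pixel_tokens / 255.0) @ proj   # pixel patches in the same space

# One joint sequence that a single transformer could attend over.
sequence = np.concatenate([text_embeds, image_embeds], axis=0)  # shape (76, 64)
```

Because `patchify` only rearranges pixels, `unpatchify` recovers the image exactly, which is the contrast with a lossy latent encoder. Under the same assumed patch size, a 2,048 × 2,048 image would yield 64 × 64 = 4,096 pixel tokens.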

That means text-to-image, image editing, and subject personalization become variants of one in-context generation task, not separate pipelines.
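
One way to picture "variants of one in-context generation task" is that every task uses the same sequence layout and differs only in what is placed in the context. The sketch below is hypothetical: the `Segment` type, the `build_context` helper, the task labels, and the file names are invented for illustration and are not HiDream's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    kind: str     # "task", "text", or "image"
    payload: str  # placeholder for real tokens or pixel patches


def build_context(
    task: str, prompt: str, reference_images: Optional[List[str]] = None
) -> List[Segment]:
    """Lay out every task the same way: task condition, text, then any images."""
    context = [Segment("task", task), Segment("text", prompt)]
    for ref in reference_images or []:
        context.append(Segment("image", ref))
    return context


# Same builder, three tasks; only the context contents change.
t2i = build_context("text-to-image", "a red bicycle leaning on a brick wall")
edit = build_context("edit", "make the bicycle blue", ["source.png"])
personalize = build_context(
    "personalize", "this dog surfing a wave", ["dog_1.png", "dog_2.png"]
)
```

In this framing, a single model would generate the target image conditioned on whichever context it is given, rather than routing each task to its own pipeline.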

A prompt agent first rewrites messy user requests into clearer visual instructions, reasoning through layout, subject attributes, physics, and context before generation.

The strongest result is text rendering.

On LongText-Bench, the 8B model scores 0.979 in English and 0.978 in Chinese, while the 200B+ model reaches 0.982 and 0.980, respectively.
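
To put those numbers in proportion, the following short calculation works out the gap using only the four scores quoted above. The 200B figure is used as a lower bound for "200B+", which is an assumption about how to read that label.

```python
scores = {
    "8B": {"en": 0.979, "zh": 0.978},
    "200B+": {"en": 0.982, "zh": 0.980},
}

# Absolute gap between the larger and smaller model on each language.
gaps = {
    lang: round(scores["200B+"][lang] - scores["8B"][lang], 3)
    for lang in ("en", "zh")
}

# Smaller model's score as a percentage of the larger model's score.
relative = {
    lang: round(100 * scores["8B"][lang] / scores["200B+"][lang], 1)
    for lang in ("en", "zh")
}

# Lower bound on the parameter ratio, treating "200B+" as at least 200B.
min_param_ratio = 200 / 8

print(gaps)             # {'en': 0.003, 'zh': 0.002}
print(relative)         # {'en': 99.7, 'zh': 99.8}
print(min_param_ratio)  # 25.0
```

So the reported gap is 0.002 to 0.003 points on this benchmark, against a parameter count that is at least 25 times larger.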

That is the part to watch, because clean text inside generated images is still one of the hardest problems for image models.

