Rohan Paul@rohanpaul_ai · 5月19日57HiDream just open-sourced an 8B image model with a big message behind it: the old diffusion pipeline (VAE-plus-text-encoder) may not be the only serious path left.
8B param, HiDream-O1-Image (8B) claims parity with models over 3x its size (e.g., 27B Qwen-Image).
@HiDream_AI , @vivago_ai
Key Features
🧬 Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
🎨 One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, and storyboard generation in a single architecture.
🧠 Reasoning-Driven Prompt Agent — Built-in "thinking" agent that resolves implicit knowledge, layout, and text rendering before generation.
🖼️ Native High Resolution — Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.
⚡ Exceptional Efficiency and Versatility at 8B Scale — With only 8B parameters, achieves performance parity with or even surpasses larger open-source DiTs and leading closed-source models.
Most image models still split the job across a text encoder, a VAE, and a diffusion model, so details can get lost when real pixels are compressed into hidden image codes.
HiDream-O1-Image removes that split by using a Pixel-level Unified Transformer, where raw image patches, text tokens, and task conditions enter the same model space.
That means text-to-image, image editing, and subject personalization become variants of one in-context generation task, not separate pipelines.
A prompt agent first rewrites messy user requests into clearer visual instructions, reasoning through layout, subject attributes, physics, and context before generation.
The strongest result is text rendering.
On LongText-Bench, the 8B model scores 0.979 in English and 0.978 in Chinese, while the 200B+ model reaches 0.982 and 0.980.
That is the part to watch, because clean text inside generated images is still one of the hardest problems for image models.
🧵 1.
译HiDream开源了8B参数的HiDream-O1-Image模型,其核心创新在于采用像素级统一变换器,用单一架构直接处理原始图像块、文本与任务条件,将文本生成图像、编辑、个性化等任务统一为上下文生成,无需传统的VAE和文本编码器管线。该模型内置推理提示代理,能原生支持最高2048×2048的高分辨率合成。在性能上,它在参数量仅为部分同类模型三分之一的情况下,达到了可比的水平,尤其在文本渲染任务上表现出色,结果接近更大规模的模型。