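MaineCoon is a 22B-parameter real-time text-to-audio-video model designed for live AI characters. It reaches 47.5 FPS on a single H100 GPU at a cost below $0.001 per second, and runs in real time at 30 FPS on a single RTX Pro 6000. It combines multi-stage forcing-free streaming training (self-resampling, cross-modal alignment, domain-aware preference optimization, and reinforced online-policy distillation) with an agentic streaming inference framework that supports thousand-second-scale continuous generation. A dual-stream Diffusion Transformer (video and audio streams linked by cross-attention) keeps expression, lip motion, and voice in sync, while a history KV cache and attention sinks keep consecutive chunks coherent. The first frame arrives in under 1 second, and generation runs alongside playback rather than producing a full video first and dubbing it afterward.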
AI video is moving into its real-time reaction era, with MaineCoon now leading in low-latency AI video.
@catnips_ai just introduced MaineCoon, a 22B real-time text-to-audio-video model built for live AI characters rather than offline video generation. The goal is to make AI video feel live by generating synced speech and visuals in real time.
It reaches a record-breaking frame rate of up to 47.5 FPS on a single H100 GPU. Audio-visual generation cost drops well below $0.001 per second and continues to fall.
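To put those figures in perspective, here is a rough back-of-the-envelope calculation. The 47.5 FPS throughput comes from the announcement; the 30 FPS playback rate and the $2.00 per hour H100 rental price are assumptions chosen for illustration, so treat the result as a sanity check, not a measured number.

```python
# Back-of-the-envelope throughput and cost arithmetic.
# ASSUMPTIONS (not from the announcement): 30 FPS playback, $2.00/hour H100 rental.
GENERATION_FPS = 47.5        # reported single-H100 throughput
PLAYBACK_FPS = 30.0          # assumed playback rate
H100_USD_PER_HOUR = 2.00     # hypothetical rental price


def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available per frame at a given frame rate."""
    return 1000.0 / fps


def realtime_factor(generation_fps: float, playback_fps: float) -> float:
    """Seconds of content produced per second of GPU time."""
    return generation_fps / playback_fps


def cost_per_content_second(usd_per_hour: float, rtf: float) -> float:
    """GPU cost to generate one second of audio-video content."""
    return (usd_per_hour / 3600.0) / rtf


rtf = realtime_factor(GENERATION_FPS, PLAYBACK_FPS)
cost = cost_per_content_second(H100_USD_PER_HOUR, rtf)
print(f"per-frame budget: {frame_budget_ms(GENERATION_FPS):.2f} ms")
print(f"real-time factor: {rtf:.2f}x")
print(f"cost per content second: ${cost:.6f}")
```

Under these assumed prices the model has roughly 21 ms per frame, generates about 1.58 seconds of content per GPU second, and the cost works out to a fraction of the $0.001 per second ceiling. A different rental price or playback rate shifts the result proportionally.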
It frames social world models as a paradigm for socially interactive AI. MaineCoon serves as the first generative core of this paradigm and provides a technical foundation for next-generation AI-native social platforms.
It proposes a multi-stage forcing-free streaming training paradigm that includes self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). These components enable 22B-scale native and efficient streaming audio-visual training.
It designs an agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift through agentic cache management, chunk commitment, long-context rollout, and prompt planning.
The big deal is long-duration streaming at low cost.
Text goes in, the first frame appears in under 1s, and the model keeps producing synced video and audio while playback is already happening.
So it is not making a full video first, then dubbing it later. It generates forward in small chunks, and each chunk continues from the last one.
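Chunk-by-chunk generation with a bounded memory can be sketched roughly like this. The announcement mentions a history KV cache and attention sinks; the common attention-sink pattern keeps the earliest entries permanently plus a sliding window of recent ones, and that is what this toy version does. The class name, the `generate_chunk` stand-in, and the sizes are all hypothetical, and plain integers stand in for the key/value tensors a real model would store.

```python
from collections import deque
from typing import Deque, Iterator, List


class SinkWindowCache:
    """Keeps the first `num_sink` entries forever plus the latest `window` entries."""

    def __init__(self, num_sink: int, window: int) -> None:
        self.num_sink = num_sink
        self.sink: List[int] = []
        self.recent: Deque[int] = deque(maxlen=window)

    def append(self, entry: int) -> None:
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)      # earliest entries become permanent sinks
        else:
            self.recent.append(entry)    # deque drops the oldest entry when full

    def context(self) -> List[int]:
        """History the next chunk is allowed to attend to."""
        return self.sink + list(self.recent)


def generate_chunk(chunk_index: int, context: List[int]) -> int:
    """Hypothetical stand-in for the model: returns an id for the new chunk."""
    return chunk_index


def stream(num_chunks: int, cache: SinkWindowCache) -> Iterator[int]:
    """Generate forward one chunk at a time, each conditioned on cached history."""
    for i in range(num_chunks):
        chunk = generate_chunk(i, cache.context())
        cache.append(chunk)
        yield chunk                      # available for playback immediately


cache = SinkWindowCache(num_sink=2, window=3)
chunks = list(stream(8, cache))
print(cache.context())  # [0, 1, 5, 6, 7]
```

The memory footprint stays constant however long the stream runs, which is what makes very long sessions feasible. How MaineCoon actually sizes or manages its cache is not described in the post.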
That is hard because tiny chunks usually break consistency. Faces drift. Voices change. Motion gets weird. Audio and mouth movement separate.
MaineCoon tries to solve this with a dual-stream Diffusion Transformer: one stream for video, one stream for audio, and cross-stream attention between them so expression, lip motion, voice, timing, and body movement stay tied together.
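As a minimal sketch of what cross-stream attention between two token streams means, the toy block below lets video tokens attend to audio tokens and vice versa, then adds the result back as a residual. It is not MaineCoon's architecture: it omits learned query/key/value projections, self-attention, diffusion timesteps, and everything else a real Diffusion Transformer needs, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Each query token takes a softmax-weighted average of the context tokens."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in context])
        mixed = [sum(w * v[d] for w, v in zip(weights, context)) for d in range(len(q))]
        out.append(mixed)
    return out


def dual_stream_block(
    video: List[Vector], audio: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """One toy block: each stream reads the other, then adds a residual."""
    video_update = cross_attend(video, audio)    # video queries attend to audio
    audio_update = cross_attend(audio, video)    # audio queries attend to video
    new_video = [[x + u for x, u in zip(t, upd)] for t, upd in zip(video, video_update)]
    new_audio = [[x + u for x, u in zip(t, upd)] for t, upd in zip(audio, audio_update)]
    return new_video, new_audio


video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[0.5, 0.5], [1.0, -1.0]]
new_video, new_audio = dual_stream_block(video_tokens, audio_tokens)
print(len(new_video), len(new_audio))  # 3 2
```

Each stream keeps its own token count and width, but every updated token now carries information from the other modality, which is the mechanism that lets lip motion and voice timing inform each other.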