# Catnip Launches MaineCoon: A 22B Real-Time Audio-Visual Streaming Foundation Model

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-06-17 04:22
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmqh3przi00o1sle1ac02z5c7
- Original link: https://x.com/rohanpaul_ai/status/2066979625830645803

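## AI Summary

Catnip has launched MaineCoon, a 22B-parameter real-time audio-visual foundation model that turns text prompts into a live character stream with synchronized speech, motion, and expression, and supports interactions of unlimited duration. Billed as the first streaming-native model, MaineCoon delivers a sub-second first frame, 47.5 FPS on a single H100, and 30 FPS on a single RTX Pro 6000, with throughput about 7x that of comparable audio-visual systems in internal tests. Unlike passive video generation, it responds causally in real time, remembers its own imperfect past, and keeps character identity, voice, and rhythm consistent, shifting AI from turn-based replies to a live presence that is "there with you".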

## Body

Catnip just dropped MaineCoon, a 22B real-time audio-visual foundation model that turns text prompts into a live character stream with synced speech, motion, and expression.

The first streaming-native model of its kind.

Sub-second first frame, 47.5 FPS on one H100, 30 FPS on one RTX Pro 6000, and about 7x faster throughput than comparable audio-visual systems in its internal tests.

The big deal is that a normal video generator can wait, revise, and render a finished clip, but a social interface has to move causally, remember its own imperfect past, and stay ahead of playback without breaking identity, voice, or rhythm.
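To make "stay ahead of playback" concrete, here is a minimal arithmetic sketch of what the quoted throughput figures imply per frame. The FPS numbers come from the post; the playback rate is not stated anywhere in the source, so `ASSUMED_PLAYBACK_FPS` is purely an illustrative assumption, and the helper names are hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time per generated frame, in milliseconds, at a given FPS."""
    return 1000.0 / fps


def headroom(gen_fps: float, playback_fps: float) -> float:
    """Generation rate divided by playback rate; >= 1.0 means generation keeps up."""
    return gen_fps / playback_fps


# Throughput figures quoted in the post.
reported_fps = {"H100": 47.5, "RTX Pro 6000": 30.0}

# ASSUMPTION: the post gives no playback frame rate; 25 FPS is illustrative only.
ASSUMED_PLAYBACK_FPS = 25.0

for gpu, fps in reported_fps.items():
    print(
        f"{gpu}: {frame_budget_ms(fps):.1f} ms per frame, "
        f"{headroom(fps, ASSUMED_PLAYBACK_FPS):.2f}x the assumed playback rate"
    )
```

A ratio above 1.0 means frames are produced faster than they are consumed, which is the slack a causal streaming system needs to absorb jitter without stalling. The real margin depends on the actual playback rate, which the post does not give.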

### Quoted tweet

> Catnip: 🥇MaineCoon: From Passive Video to Real-Time AI Presence. The first unlimited-duration interactive audio-visual model. Most AI products today still feel like the...