Artificial Analysis@ArtificialAnlys · Apr 3

Microsoft has released MAI-Transcribe-1: a speech transcription model achieving 3.0% on AA-WER (#4), and is fast at 69x real-time. The model was developed by Microsoft AI (MAI)’s Superintelligence team and supports 25 languages including English, French, Arabic, Japanese, and Chinese. MAI-Transcribe-1 API is currently available in public preview via Azure Speech on Microsoft Foundry. On the Artificial Analysis Speech to Text (STT) leaderboard, MAI-Transcribe-1 achieves a 3.0% word error rate on AA-WER for speech transcription accuracy, positioning it 4th overall behind Mistral’s Voxtral Small (2.9% AA-WER), Google’s Gemini 3.1 Pro High (2.9% AA-WER) and ElevenLabs’ Scribe v2 (2.3% AA-WER). It also stands out as one of the faster high-accuracy transcription models available, processing audio at ~69x real-time. See more details below ⬇️

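Summary: Microsoft AI's Superintelligence team has released the MAI-Transcribe-1 speech transcription model. It achieves a 3.0% word error rate on the AA-WER metric of the Artificial Analysis Speech to Text leaderboard, ranking fourth behind Mistral Voxtral Small, Google Gemini 3.1 Pro High, and ElevenLabs Scribe v2. It processes audio at about 69x real-time, making it a fast, high-accuracy model. It supports 25 languages, including English, French, Arabic, Japanese, and Chinese, and its API is in public preview via Azure Speech on Microsoft Foundry.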
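
For readers unfamiliar with the two figures quoted above, here is a minimal sketch of how a word error rate and a real-time speed factor are conventionally computed. It uses the textbook definition of WER (word-level edit distance divided by reference length). The exact AA-WER methodology (datasets, text normalization, aggregation) is not described in the post, so this should not be read as Artificial Analysis's implementation. The function names `word_error_rate` and `processing_seconds` are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: (substitutions + deletions + insertions) / reference words.

    Assumption: simple lowercase + whitespace tokenization. Real benchmarks
    usually apply heavier normalization (punctuation, numerals, etc.).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Time needed to transcribe audio at `speed_factor` times real-time."""
    return audio_seconds / speed_factor


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # One hour of audio at the ~69x real-time speed quoted in the post.
    print(processing_seconds(3600, 69))
```

Under this definition, a 3.0% WER corresponds to roughly three word-level errors per 100 reference words, and 69x real-time means an hour of audio takes a little under a minute to process.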

Google Gemini@GeminiApp · Apr 3

We want to see what you’ve been making with Lyria 3 Pro in Gemini. 🎶 Share your creations in the replies 👇

Summary: Google's official account has opened a call for creations, inviting users to share in the replies the music they generated with Lyria 3 Pro in Gemini.

OpenAI@OpenAI · Apr 3

ChatGPT is now available in CarPlay. The voice mode you know, now available on-the-go. Rolling out to iPhone users running iOS 26.4+ where CarPlay is supported.

Summary: ChatGPT voice mode is now available in CarPlay. iPhone users running iOS 26.4 or later can use voice interaction in their car's system; the rollout is in progress.

Satya Nadella@satyanadella · Apr 2

We’re bringing our growing MAI model family to every developer in Foundry, including …
· MAI-Transcribe-1, most accurate transcription model in world across 25 languages
· MAI-Voice-1, natural, expressive speech generation
· MAI-Image-2, our most capable image model yet
Start building: https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/

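Summary: The MAI model family arrives on Foundry with three new models: MAI-Transcribe-1 (described as the most accurate transcription model across 25 languages), MAI-Voice-1 (natural speech generation), and MAI-Image-2 (described as Microsoft's most capable image model). Developers can now call them directly through the platform.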

François Chollet@fchollet · Apr 2

One of the best AI products I've seen recently: (drumroll) Adobe Podcast


karminski-牙医@karminski3 · Apr 1

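In the AI era, even embracing uncertainty has to be done fast

Let me share the failure I earned by wasting two days last week. In earlier videos I built several skills for "the lobster" (openclaw) on top of multimodal models, including a backpack manager that recognizes items in a game (Diablo 2) and a flash-note tool that records reading notes when you photograph the book you are reading.

The biggest problem I ran into: with some configuration, openclaw now supports voice input and output, but it depends on STT and TTS. What does that mean? openclaw does not natively support speech-modality models. It can only convert speech to text (STT) as input for a text LLM, then convert the LLM's text output back to speech (TTS). These two conversions are extremely slow, and on top of that the lobster's own context is huge, so it generally takes more than 30 s from voice input until the lobster talks back to you.

Is there an end-to-end solution? Yes: use an omni model. Omni models accept text, audio, image, and video input and produce text and speech output. The biggest advantage of an end-to-end model is low latency: no converting back and forth, one model handles everything. Sounds like it was made for AI assistants like the lobster, right?

So last week I set aside 2 days to get this done. I had two options: modify the lobster's code directly, or write a lobster plugin that hooks in an omni model. The second is obviously faster and less hassle: I would just finish and publish my plugin, and anyone interested could install it and happily use it.

But, and here comes the but, the lobster's channels (Feishu, Discord, WhatsApp, etc.) each hold a connection. For the omni model in the plugin to receive messages from a channel, it has to connect to that channel, and as soon as it connects it kicks out openclaw's own connection and openclaw goes offline. openclaw also offers no way to reuse its own channel. I browsed similar PRs on openclaw and basically all of them were rejected, because the author believes client connections should be managed at the channel layer (which is indeed architecturally reasonable).

That left only the other road: contribute code to the lobster directly. So I looked at pi-ai, the base library the lobster uses to connect to LLMs, and it turned out it does not support OpenAI http://delta.audio style audio streams either. So I first contributed code to pi-ai to support this feature. It was rejected outright. I looked through the X account of the author, @badlogicgames, and it is clear the author wants to take responsibility for their project and is unwilling to accept low-quality AI-generated code (I can understand that; it is a huge foundational library, and my code was written by AI and I only reviewed it, so I can't really object).

At this point every road was blocked, and I had spent 2 days on and off trying to get the lobster to support omni models. I think it is time to cut my losses, so... I may finish my own private fork this weekend if I have time, or just leave it at that. As for the lobster not supporting omni models, that is its own loss. The lobster currently has more than 5,000 open issues and PRs, and the oldest unmerged PR dates back to January 31; waiting for a merge could take forever. I don't know which is better: fully embracing AI-reviewed PRs like openclaw, or strictly enforcing project quality like the pi-ai author. I even think that, with today's AI productivity, it would not be surprising if a new, stronger AI assistant built on omni-model capabilities appeared next month. Under the scouring of AI productivity, you can only embrace uncertainty at lightning speed. Otherwise, hesitate for a moment and you may embrace the wrong thing.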

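Summary: The author spent two days trying to connect an omni model to openclaw to fix its high voice-interaction latency (over 30 seconds). The plugin approach took the system offline because of a channel connection conflict. Modifying the source directly ran into the underlying pi-ai library, which does not support OpenAI-style audio streams, and the PR to add that support was rejected. With every technical path blocked, the author reflects that in an era of exploding AI productivity, one has to embrace uncertainty quickly or risk missing the opportunity because of project architecture limits or maintainers' review standards.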

OpenAI Developers@OpenAIDevs · Mar 28

Build voice agents that do real work. We built a clinic concierge demo for a Singapore health clinic with gpt-realtime-1.5. It speaks naturally with patients, collects the right details, and books appointments in real time.

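Summary: A receptionist demo built with gpt-realtime-1.5 for a Singapore clinic converses naturally with patients, collects visit details, and books appointments in real time, showing what voice agents can do in a real-world setting.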

Demis Hassabis@demishassabis · Mar 27

Gemini 3.1 Flash Live is our highest quality audio & voice model yet - and a big leap towards building next-gen voice-first agents. Lower latency, better precision, more natural interactions... try it now with Gemini Live in the @GeminiApp or build with it in @GoogleAIStudio!

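Summary: Google has released Gemini 3.1 Flash Live, which it calls its highest-quality audio model yet, with lower latency, better precision, more natural conversation, and improved function calling. It is available in the Gemini app and Google AI Studio.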

Google Gemini@GeminiApp · Mar 27

This event is happening soon! Join the Gemini Discord here: http://discord.gg/gemini

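Summary: Gemini will host an event on Discord on March 26 at 11:30am PDT, where Product Manager Joel will give a live demo of the latest Lyria 3 Pro updates. You can join the server now at http://discord.gg/gemini to take part.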

Google DeepMind@GoogleDeepMind · Mar 26

Say hello to Gemini 3.1 Flash Live. 🗣️ Our latest audio model delivers more natural conversations with improved function calling – making it more useful and informed. Here’s what’s new 🧵

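Summary: The Gemini 3.1 Flash Live audio model has been released, with more natural real-time conversation and improved function calling that make the AI assistant more useful and better informed.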

Google Gemini@GeminiApp · Mar 26

Gemini Live just got its biggest upgrade yet, powered by Gemini 3.1 Flash Live.
• Faster responses with fewer awkward pauses
• Smarter & able to follow along 2x longer conversations, so you can stay in the flow
• Dynamically adjusts its answer lengths & tone to match the moment

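Summary: Gemini Live now runs on Gemini 3.1 Flash Live: responses are faster with fewer pauses, it can follow conversations twice as long while staying coherent, and it dynamically adjusts answer length and tone to the situation.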

Sundar Pichai@sundarpichai · Mar 26

Gemini 3.1 Flash Live is our highest-quality audio and voice model yet. Voice capabilities have come a long way and are a big part of how we interact with AI to get things done. 3.1 Flash Live’s improved precision and reasoning make those interactions more natural and intuitive. Available in @GoogleAIStudio through the Gemini Live API in preview.

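Summary: Gemini 3.1 Flash Live has been released as Google's highest-quality audio and voice model yet, with improved precision and reasoning that make interaction more natural and intuitive. It is available in preview in Google AI Studio through the Gemini Live API.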

Google Gemini@GeminiApp · Mar 26

Lyria 3 Pro’s enhanced customization offers more space to experiment and play with longer tracks, so you can now add more details to bring your full vision to life in Gemini.

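Summary: Lyria 3 Pro's upgraded customization gives more room for longer tracks and more freedom to experiment; users can now add more detail in Gemini to realize their full musical vision.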

Demis Hassabis@demishassabis · Mar 26

Perfect background music for flow state at 2am - made with the new Lyria 3 Pro. Google AI subscribers can try it in the @GeminiApp and developers can build with the API in @GoogleAIStudio - have fun!!

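Summary: Google DeepMind has launched Lyria 3 Pro, which can generate high-fidelity music up to 3 minutes long and lets users freely arrange intros, verses, choruses, and bridges. Google AI subscribers can try it in the Gemini app, and developers can build with it through the API in Google AI Studio.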

Google Gemini@GeminiApp · Mar 26

Ready to turn up the volume with Lyria 3 Pro? 🎶 Join us in the Gemini Discord tomorrow (3/26) at 11:30am PDT as Product Manager Joel returns for an encore performance to demo the latest Lyria 3 Pro updates. Join us on Discord now and we'll see you there! http://discord.gg/gemini

Summary: A demo of the Lyria 3 Pro updates will be held in the Gemini Discord on March 26 at 11:30am PDT, with Product Manager Joel returning to show the latest features.

Google Gemini@GeminiApp · Mar 26

Longer tracks are here with Lyria 3 Pro in Gemini! From experimenting with different styles to generating tracks with complex transitions, Lyria 3 Pro makes it easier to bring your full vision to life. Rolling out today to Google AI Plus, Pro, and Ultra users. Learn more 🧵

Summary: Lyria 3 Pro is now in Gemini, supporting longer tracks and complex style transitions. It rolls out starting today to Google AI Plus, Pro, and Ultra subscribers.

Artificial Analysis@ArtificialAnlys · Mar 25

Inworld, ElevenLabs, and MiniMax continue to lead our Text to Speech leaderboard for most preferred models.

Recent checkpoints from each of the labs continue to push the frontier of TTS quality, with 4 out of the top 5 models being released this year. Leading TTS models are increasingly realistic, particularly on relatively straightforward text, with preference differences increasingly coming down to affinity for different voices. Latest results also reflect stronger bot vote filtering, confirmed via triangulation against third-party evaluators. We've also added rank ranges based on each model's 95% confidence interval, showing where a model could land based on its Elo score range.

Key results:
➤ Most preferred: Current top 5 per our TTS leaderboard: 1. Inworld TTS 1.5 Max (Elo of 1,238); 2. ElevenLabs Eleven v3 (1,197); 3. Inworld TTS 1 Max (1,183); 4. Inworld TTS 1.5 Mini (1,182); 5. MiniMax Speech 2.8 HD (1,175)
➤ Price: Kokoro 82M v1.0 (Replicate) leads at $0.65 per 1M characters, followed by Inworld TTS 1 and 1.5 Mini at $5, and AsyncFlow V2 at $8.33
➤ Speed: WaveNet leads for batch generation at 419 characters processed per second, followed by Kokoro 82M v1.0 (Replicate) at 235, and Inworld TTS 1.5 Mini at 214

See below for further detail ⬇️

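Summary: Inworld, ElevenLabs, and MiniMax continue to lead the TTS leaderboard, with models released this year taking four of the top five spots. Leading models are increasingly realistic on straightforward text, and preference differences now come down mainly to voice affinity. The evaluation has added stronger bot-vote filtering and rank ranges based on 95% confidence intervals. On specific metrics, Inworld TTS 1.5 Max ranks first with an Elo of 1,238, Kokoro 82M v1.0 is the cheapest option at $0.65 per 1M characters, and WaveNet leads batch generation speed at 419 characters per second.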
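
To give a feel for what the Elo gaps quoted above imply, here is a minimal sketch using the standard Elo expected-score formula with the conventional scale of 400. That Artificial Analysis fits its ratings with exactly this formula and scale is an assumption (the post does not say), so the resulting percentages are illustrative only. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected preference rate of model A over model B under standard Elo.

    Assumption: the conventional logistic curve with a 400-point scale;
    the leaderboard's actual rating model may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Elo scores as quoted in the post.
    ratings = {
        "Inworld TTS 1.5 Max": 1238,
        "ElevenLabs Eleven v3": 1197,
        "Inworld TTS 1 Max": 1183,
        "Inworld TTS 1.5 Mini": 1182,
        "MiniMax Speech 2.8 HD": 1175,
    }
    leader = "Inworld TTS 1.5 Max"
    for name, rating in ratings.items():
        if name != leader:
            p = elo_expected_score(ratings[leader], rating)
            print(f"{leader} preferred over {name}: {p:.1%}")
```

Under these assumptions, the 41-point gap between first and second place corresponds to the leader being preferred in roughly 56% of head-to-head votes, which fits the post's remark that differences among top models are narrow.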

Google Gemini@GeminiApp · Mar 17

Clever Gemini use case: Custom alarm clock sounds. Create custom tracks that actually get you moving in the Gemini app. Instructions below 👇

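Summary: Generate personalized alarm music directly in the Gemini app and create a custom tone that actually wakes you up. No professional tools are needed: use AI to make your own wake-up audio and have an easier time getting up in the morning.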

Satya Nadella@satyanadella · Feb 24

Today, I saw firsthand how Dragon Copilot will help physicians across the NHS spend more time on patient care and less on paperwork. The work at @MFTnhs is a powerful example of how tech can be used by clinicians to focus on what matters most. https://news.microsoft.com/source/emea/features/ai-tool-for-clinicians/


Ilya Sutskever@ilyasut · Sep 27

In the future, once the robustness of our models will exceed some threshold, we will have *wildly effective* and dirt cheap AI therapy. Will lead to a radical improvement in people’s experience of life. One of the applications I’m most eagerly awaiting.
