Artificial Analysis@ArtificialAnlys · May 5 · 14

Artificial Analysis is presenting at @nvidia's Speech AI meetup this Thursday in SF. Joining us are other great Speech AI community members including @pipecat_ai, @ServiceNow, and @GradiumAI. Come say hi! https://luma.com/SpeechAImeetup?tk=gndhHQ

xAI@xai · May 5 · 79

Two voices. One human. One AI. Can you guess the AI clone? 👇 Voice cloning, rich with natural emotion, is now live on the Grok Voice API. http://x.ai/news/grok-custom-voices

OpenClaw🦞@openclaw · May 3 · 56

OpenClaw 2026.5.2 🦞 🧠 xAI Grok 4.3 🔌 Plugin installs/updates are sturdier ⚡ Gateway + agent hot paths are leaner 💬 Discord, Slack, Telegram, WhatsApp fixes 🎙️ TTS, Realtime, web search, voice-call polish Less drama. More uptime. https://github.com/openclaw/openclaw/releases/tag/v2026.5.2

Chubby♨️@kimmonismus · May 2 · 34

A new voice model from OpenAI confirmed? Rumor has it that it will be significantly more natural in conversation with the user (latency, interruption).

Elon Musk@elonmusk · May 2 · 39

Grok Voice is used by Starlink right now

[Quoting @XFreeze]: Grok Voice dominates the τ-voice benchmark. Grok scores 67.3%, versus 43.8% for Gemini and 35.3% for GPT Realtime. That is far ahead of the competition, by a huge margin. The best real-time reasoning voice assistant available right now.

TestingCatalog News 🗞@testingcatalog · May 2 · 53

XAI 🚨: Voice cloning is now available on xAI Console in the US. > Create a custom voice in less than 2 minutes or select from our library of 80+ voices across 28 languages to personalize your voice agents, audiobooks, video game characters, and more. This also means we will see custom voices on Grok soon. I hope they won't be restricted to the US only.

阿绎 AYi@AYi_AInotes · May 2 · 55

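I see a lot of people saying xAI's voice cloning is going to kill ElevenLabs, but I think everyone is misreading it. This is not an ordinary TTS update, folks; it looks more like the first digital ID card xAI has issued to every Grok user. Let me walk through it. Start with the basic facts: record one minute of your voice and, in under two minutes, you have a voice model that sounds exactly like you, at no extra cost, integrated directly into the Grok API and ready to use for voice agents. Interestingly, xAI does not really brag about how close the voice sounds. The whole announcement makes one point: your voice is your Grok agent's voice. From now on your AI assistant will use your tone, your pace, and your pauses to answer calls, attend meetings, and chat with people on your behalf. Damn, that is the coolest part. ElevenLabs sells "generate a nice-sounding voice"; xAI sells "generate your voice". One is a tool, the other is an identity, and the two are not competing on the same dimension at all. The safety design is interesting too: cloning from existing recordings is prohibited, the speaker must record live, and they must read a verification phrase, a double check that it is really you. xAI understands better than anyone that once a voice becomes an identity, privacy and security are its lifeline. Of course, many people worry about deepfakes and fraud, and the risk there is certainly very large, but you cannot hold back the trend. I believe everyone will eventually have a digital voice double, just as everyone has a phone number today 📱 The scarier part comes next: what happens when your voice is combined with Grok's reasoning ability? You get an AI that can think like you and talk like you 😂 So the question is: is it your tool, or your other self? 🤔 I think today is only the beginning. xAI has lowered the barrier to voice cloning to zero, and countless applications we cannot imagine yet will follow: audiobooks, game voice acting, brand customer service, and truly personalized voice agents. One last thing: from now on, whenever we get a phone call or hear a voice message, we may first have to ask ourselves whether this is the real person or their clone. #AI #xAI #VoiceCloning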

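Summary: xAI has launched voice cloning through the Grok API: a user records one minute of audio to quickly get a personal voice model that can be used for voice agents at no charge. Unlike ElevenLabs, which is positioned around "generating nice-sounding voices", xAI focuses on "generating your voice", treating the voice as a digital ID and stressing its identity role. For safety, it requires a live recording by the speaker plus a verification phrase to prevent abuse. Combined with Grok's reasoning, this could produce AI agents that think and speak like their users. The zero-barrier technology should drive applications such as audiobooks and game voice acting, but it also heightens deepfake and fraud risks, marking a shift of voice from tool to core of identity.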

xAI@xai · May 2 · 67

Voice Cloning is now live via the xAI API! Create a custom voice in less than 2 minutes or select from our library of 80+ voices across 28 languages to personalize your voice agents, audiobooks, video game characters, and more. http://x.ai/news/grok-custom-voices

TestingCatalog News 🗞@testingcatalog · May 2 · 40

OpenAI is working on a Custom Dictionary feature for Codex and ChatGPT. Users will be able to add their common phrases and abbreviations so they are properly recognized during voice dictation. As a heavy voice-dictation user, this is the main feature that made me pay for a separate AI voice-dictation app. Everything app 👀

Suno@suno · May 1 · 26

@jadynviolet uses Voices to explore R&B, Drum and Bass, and Reggaeton, all in his own voice. What genres do you want to hear yourself in? Find out with Voices, no studio required.

Rohan Paul@rohanpaul_ai · May 1 · 54

Reid Hoffman, co-founder of LinkedIn, on AI-driven meeting analysis. "Basically every organization should be saying, we're recording all of our meetings, and we're running an AI on the recording of the meetings, not just for the transcript, but also to do all of the suggested follow-ups. It's like, hey, you mentioned this, you should probably let Nikolai know and make sure that that's the case, or, you should make sure that you get approval from Satya on the following thing, or this other group is doing this. All of that kind of thing is already here; the technology is there to go." --- From "Norges Bank Investment Management" YT Channel (link in comment)

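Summary: LinkedIn co-founder Reid Hoffman argues that every organization should record all of its meetings and analyze the recordings with AI, for far more than transcription. The AI can pick out follow-up items mentioned in a meeting, such as reminding attendees to inform a particular colleague, obtain approval from a superior, or coordinate with another team. He stresses that the technology for this kind of automated follow-up already exists and can be deployed now.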

Artificial Analysis@ArtificialAnlys · May 1 · 54

Suno V5.5 lands at #1 on both the Artificial Analysis Instrumental and Vocals Leaderboards, a notable improvement over Suno's previous V5 model! Suno V5.5 is the latest music generation model from @Suno, released alongside three new features that focus on personalization and identity: ➤ Voices: create a singing voice for generated tracks based on an uploaded vocal sample ➤ Custom Models: personalize up to 3 versions of Suno V5.5 to reflect your own style ➤ My Taste: Suno learns the genres, moods and styles you gravitate towards for more personalized recommendations Suno V5.5 is available via the Suno platform on Pro and Premier subscription tiers, starting at $8/month (~500 songs) when billed annually, with commercial rights included. See more details and listen to samples below 🧵

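Summary: Suno V5.5, Suno's latest music generation model, ranks first on both the Artificial Analysis Instrumental and Vocals leaderboards, a notable improvement over the previous V5 model. The release focuses on personalization and identity, with three new features: users can create a custom singing voice from an uploaded vocal sample; personalize up to three model versions that reflect their own style; and have the system learn their preferred genres, moods, and styles for personalized recommendations. The model is available on the Suno platform to Pro and Premier subscribers, starting at $8/month when billed annually (about 500 songs), with commercial rights included.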

Suno@suno · April 30 · 36

@sofiadangelo27 uses Voices to explore Desert Rock, Hip Hop and Dance, all in her own voice. What genres do you want to hear yourself in? Find out with Voices, no studio required.

ginobefun@hongming731 · April 29 · 57

Getting the most out of Gemini 3.1 TTS: a guide to audio tags and prompting techniques

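Summary: Google AI's Gemini 3.1 TTS model adds audio tags, letting developers control voice style, pace, and delivery with tags in square brackets. Key tips: wrap each tag in square brackets and place it at the point where the change should happen, and avoid placing tags directly next to each other; use [slow] and [fast] to control pace and [short pause] for a dramatic pause; tags such as [cackles] and [whispers] give finer control over vocalization. These prompting techniques suit language-learning tools, interactive podcast apps, adaptive customer service, and similar scenarios.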
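
The tagging tips above can be sketched as plain string handling. This is a minimal illustration, not Gemini API code: the helper names `tagged` and `adjacent_tags` are hypothetical, and only the tags quoted in the post are used. It builds a script with one tag per phrase and flags tags that sit directly next to each other.

```python
import re

# Matches two bracketed tags separated only by optional whitespace,
# e.g. "[slow][whispers]" or "[slow] [whispers]".
ADJACENT = re.compile(r"\[[^\[\]]+\]\s*\[[^\[\]]+\]")


def tagged(tag: str, text: str) -> str:
    """Prefix a phrase with a single bracketed audio tag (hypothetical helper)."""
    return f"[{tag}] {text.strip()}"


def adjacent_tags(script: str) -> list[str]:
    """Return every spot where two tags touch, which the guide advises against."""
    return [m.group(0) for m in ADJACENT.finditer(script)]


# One tag at each point where the delivery should change.
script = " ".join([
    tagged("whispers", "I have a secret."),
    tagged("short pause", "Are you ready?"),
    tagged("fast", "Here it comes!"),
])

print(script)
print(adjacent_tags(script))                     # no adjacent tags in this script
print(adjacent_tags("[slow] [whispers] hello"))  # flags the touching pair
```

The resulting string is what would be passed as the text to synthesize; how the request itself is made depends on the client library and is left out here.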

小互@xiaohu · April 29 · 57

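TRAE has also launched a built-in voice feature that lets you dictate directly... It has also released a co-branded Mic Air wireless microphone with Insta360. Writing code and doing office work by voice is catching on 🥲 Besides transcribing long, rambling speech full of "um"s and "uh"s into structured text, this built-in voice feature has one seriously impressive trick: it also recognizes commands and skills by voice and adds them to the input box for you... I have used it for a week and it feels pretty good; my impressions are below ↓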

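Summary: TRAE has launched a built-in voice feature that supports direct voice input and can transcribe improvised speech full of filler words into structured text. Its main highlight is recognizing spoken commands and skills, so users can operate the input box and other functions by voice. TRAE has also partnered with Insta360 on a co-branded Mic Air wireless microphone. The feature reflects the trend toward "coding by voice, working by voice", and early user feedback is positive.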

OpenRouter@OpenRouter · April 29 · 38

New public rankings: Audio Input! @GoogleDeepMind's Gemini models take the top 7 (!!) slots this week, with Gemini 3 and 2.5 Flash models processing >50% of prompts.

宝玉@dotey · April 29 · 62

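Microsoft open-sourced the VibeVoice-ASR speech recognition model in January (https://github.com/microsoft/VibeVoice), and Simon Willison has published a detailed hands-on report after testing it on a Mac. VibeVoice-ASR is a 9B-parameter speech-to-text model that Microsoft Research open-sourced on January 21 this year under the MIT license. Its main selling point is handling 60 minutes of continuous audio in a single pass while producing structured output of who spoke, when, and what they said. The traditional approach stitches together Whisper (OpenAI's open-source speech recognition model) with a speaker diarization tool such as pyannote; here a single model does it all, with native support for more than 50 languages and mixed Chinese-English speech. Simon ran a community-made 4-bit quantized version (5.71GB, versus 17.3GB for the original model) on an M5 Max MacBook Pro with 128GB of memory, and transcribing a one-hour podcast took 8 minutes 45 seconds. The call needs max-tokens raised manually to 32768; otherwise the default of 8192 covers only about 25 minutes of audio. Activity Monitor showed memory peaking at 61.5GB during prefill and settling at around 18GB during generation, so a typical 32GB laptop basically cannot run this quantized version. One interesting detail: the model identified three speakers in the podcast. In reality only Simon and the host, Lenny, were talking, but Lenny's intro and ad reads were recorded in a different environment, so the model split that part off as a third speaker. There are two hard limits: a single pass handles at most 60 minutes, so longer audio has to be chunked by hand with speaker IDs aligned manually across chunks; and running the quantized version locally requires a machine with at least 64GB of memory. For people doing podcast transcription, meeting notes, or interview write-ups, what used to be a stitched-together multi-step pipeline can now be compressed into a single inference pass.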

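Summary: Microsoft's open-source VibeVoice-ASR is a 9B-parameter speech-to-text model under the MIT license. Its core strength is handling up to 60 minutes of continuous audio in one pass and directly outputting structured text with speakers and timestamps, with native support for more than 50 languages and mixed Chinese-English speech. In testing on a MacBook Pro with 128GB of memory, the 4-bit quantized version took about 9 minutes to transcribe an hour of audio, but memory peaked at 61.5GB during prefill, so at least 64GB of memory is required. The model has a 60-minute single-pass limit and is sensitive to changes in recording environment, but it simplifies long-audio transcription for podcasts, meetings, and similar material.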
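
The two numbers in the report (8192 tokens covering roughly 25 minutes, and a 60-minute cap per pass) allow a rough planning calculation. The sketch below is back-of-the-envelope arithmetic only; it does not call VibeVoice, the per-minute rate is an estimate derived from the post rather than a documented figure, and the function names are hypothetical.

```python
import math

# Assumption derived from the post: the default 8192 tokens cover ~25 minutes.
TOKENS_FOR_25_MIN = 8192
MAX_MINUTES_PER_PASS = 60  # single-pass limit stated in the post


def estimated_tokens(minutes: float) -> int:
    """Rough output-token budget for a transcript of the given length."""
    return math.ceil(minutes * TOKENS_FOR_25_MIN / 25)


def split_into_passes(total_minutes: float) -> list[tuple[float, float]]:
    """Split a long recording into (start, end) windows of at most 60 minutes."""
    passes = []
    start = 0.0
    while start < total_minutes:
        end = min(start + MAX_MINUTES_PER_PASS, total_minutes)
        passes.append((start, end))
        start = end
    return passes


print(estimated_tokens(60))    # about 19.7k tokens, so 32768 leaves headroom
print(split_into_passes(150))  # three windows: 0-60, 60-120, 120-150
```

On this estimate a full hour needs well over the 8192 default but fits comfortably in 32768, which is consistent with the setting Simon used. Speaker IDs would still have to be reconciled across windows by hand, as the post notes.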

TestingCatalog News 🗞@testingcatalog · April 28 · 53

ElevenLabs released Agent Templates to accelerate bootstrapping of AI Agents for customer support, education, and administrative use cases. > Agent Templates are ready-made starting points for building conversational agents. Rather than configuring an agent from scratch, you pick a template that matches your use case, customize the details for your business, and deploy.

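Summary: ElevenLabs has launched Agent Templates to speed up bootstrapping AI agents for customer support, education, and administrative use cases. The templates are preconfigured ElevenAgents that give users a quick starting point for building conversational agents. Instead of configuring everything from scratch, users pick the template that matches their business scenario, customize the details, and deploy. According to ElevenLabs, the templates can be deployed across support, sales, and operations, with value compounding as each use case is added. The platform currently offers more than 50 templates containing predefined prompts, workflows, and integrations.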

阿绎 AYi@AYi_AInotes · April 28 · 69

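Damn, the open-source repo OpenAI just dropped throws the future of voice interaction right in everyone's face 🤯🤯🤯 They released the official voice control component for gpt-realtime-1.5, and you can now use natural speech to control an app's UI state directly, rather than converting to text and then issuing commands. The demos in the video are pretty striking: say "switch to dark mode" and the whole interface goes dark instantly. Read your name and birthday to a form and the fields fill in automatically while the progress bar updates in real time. The best one is chess: say "knight to F3" and the piece moves; say "reset the board" and it clears in a second, as if the model always knows what is on screen, with voice fully equivalent to mouse and keyboard. Honestly, this is not a minor speech-to-text upgrade; as I see it, it is a real turning point in the interaction paradigm. Voice used to be an input layer; now it becomes the app's top-level control layer. It is that sci-fi movie feeling of saying one sentence to the screen and watching things change on their own 🤩 Even better, they open-sourced the whole implementation 🤯🤯🤯 This realtime-voice-component is not a half-finished demo but a complete React reference implementation. Add a floating button with one line of code, define a few tools with Zod, and in ten minutes your existing web app has voice control. The smartest design choice is that the tools are owned entirely by the app: the model can only call the narrow actions you predefine and cannot poke around the browser, which makes it safe and controllable. That is vastly more dependable than the earlier Computer Use. Computer Use has the AI clicking around the screen blindly, while this has the AI call interfaces you wrote. One is a black box, the other a fully controllable white box, and that is what can actually ship to production. People have already hooked it up to a protein structure visualization tool, design software, and internal enterprise dashboards. Every scenario you can think of that keeps both hands busy, such as driving, cooking, design work, or surgery, could be controlled by voice in the future. That means voice is becoming an operating-system-level interface, and OpenAI has already built all the wheels for you. If you want to try it, fork the repo, add an API key, and run the demo to feel that say-one-sentence-and-the-world-changes magic. As usual, the GitHub link is in the comments 👇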

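Summary: OpenAI has open-sourced the official voice control component for gpt-realtime-1.5, which lets users control an app's UI state directly with natural speech instead of only converting speech to text. The component is a complete React reference implementation that developers can integrate quickly. Its core idea is that tools are predefined by the app and the model can call only those restricted actions, which keeps it safe and controllable. This marks voice moving from an input layer to a top-level control layer, opening new interaction possibilities for hands-busy scenarios such as design and driving.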
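
The "app-owned tools" idea can be illustrated independently of the real component. The post describes a React implementation with tools defined in Zod; the sketch below is only a Python analogue of the pattern, with hypothetical names (`TOOLS`, `dispatch`, `set_theme`, `fill_field`) that are not taken from the repository. The model's tool call is treated as untrusted input, and only actions registered by the app can run.

```python
from typing import Callable

# Application state that voice commands are allowed to change.
app_state = {"theme": "light", "form": {}}


def set_theme(mode: str) -> None:
    """Narrow action: switch between the two themes the app supports."""
    if mode not in ("light", "dark"):
        raise ValueError(f"unsupported theme: {mode}")
    app_state["theme"] = mode


def fill_field(name: str, value: str) -> None:
    """Narrow action: set one form field."""
    app_state["form"][name] = value


# The app owns this registry; anything not listed here cannot be invoked.
TOOLS: dict[str, Callable[..., None]] = {
    "set_theme": set_theme,
    "fill_field": fill_field,
}


def dispatch(tool_call: dict) -> bool:
    """Run a model-proposed tool call only if the app registered that tool."""
    handler = TOOLS.get(tool_call.get("name"))
    if handler is None:
        return False  # unknown action: ignored, never executed
    handler(**tool_call.get("arguments", {}))
    return True


print(dispatch({"name": "set_theme", "arguments": {"mode": "dark"}}))
print(dispatch({"name": "open_url", "arguments": {"url": "https://example.com"}}))
print(app_state["theme"])
```

The contrast the post draws with Computer Use comes down to this registry: the model proposes a named action with arguments, and the application decides whether and how it runs.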

OpenAI Developers@OpenAIDevs · April 28 · 55

You can build interactive applications with gpt-realtime-1.5, so users can control app state more naturally with voice. Hi Chappy 👋

Berryxia.AI@berryxia · April 26 · 39

The ChatGPT desktop app now supports voice input too; I have not yet tested how well it handles Chinese.

TestingCatalog News 🗞@testingcatalog · April 26 · 49

I have rushed to test this one 👀 Gemini for iOS got a new voice dictation with wave animation and proper controls. On Gemini, it triggers read aloud automatically in case you have used dictation to prompt. Not new, but now it is really instrumental. Works like a charm 🔥

凡人小北@frxiaobei · April 25 · 64

This is how masters duel in wuxia films: one flick of sword energy, and Typeless is finished in a single move 🤔

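Summary: OpenAI has rolled out system-level global voice input for ChatGPT subscribers: after setting a hotkey, users can dictate into the text box of any desktop app and have speech converted to text in real time, without switching apps or paying extra. The feature directly replaces the core service of third-party tools such as Wispr Flow and Superwhisper, putting pressure on their market. Commenters see it as a strategic step in OpenAI turning Codex into an "AI operating system", embedding AI deeply into users' everyday workflows, with competition shifting from voice model quality to how well AI is integrated into the workflow.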

阿绎 AYi@AYi_AInotes · April 25 · 61

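Whoa, the OpenAI Codex team just pulled a big move that has stunned every third-party voice input tool. All ChatGPT subscribers can now dictate anywhere on the desktop, with no app switching and no extra cost: set a hotkey, hold it to talk, release, and the text lands in any text box, whether Notepad, a browser, VS Code, or Slack. It works everywhere. The whole demo video is only 6 seconds long and absurdly smooth, and the comments are flooded with "RIP Wispr Flow" and "RIP Superwhisper". These small tools that lived off system-level voice input are now being given away by OpenAI as part of the subscription; you no longer need to spend an extra ten-plus dollars a month or worry about slow model updates. To be honest, I was paying $12 a month for Wispr Flow and feel I can now just uninstall it. I first took this for a handy little feature, but after watching I realized it is much more than adding voice input: this is OpenAI turning Codex into a real AI operating system. You used to have to open ChatGPT to use AI; now the AI sits on your keyboard, ready to listen anytime, anywhere. From here on, AI vendors will no longer compete on whose voice input model is best; what matters is who first embeds AI into every one of the user's daily actions.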

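Summary: OpenAI has rolled out system-level voice input for ChatGPT subscribers: with a hotkey, users can dictate into any desktop app (such as Notepad or VS Code) and have it converted to text. The move directly threatens paid third-party tools such as Wispr Flow, costs users nothing extra, and reflects OpenAI's strategy of embedding AI into the operating system and integrating it with workflows.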

TestingCatalog News 🗞@testingcatalog · April 22

AI/ML API is running a time-limited promo on the full MiniMax model family, covering M2.7, Music 2.6, TTS, and Video. All are now available for testing in the Playground & via API. Music is free for 7 days. TTS & Video are 30% off. LLMs are 10% off.

DogeDesigner@cb_doge · April 18

Grok’s new Speech-to-Text & Text-to-Speech APIs are incredibly good and cheapest in the game. 🔥

DogeDesigner@cb_doge · April 18

Grok’s text-to-speech is really good. It sounds incredibly human. You can try it out for free here: http://console.x.ai/playground/voice/text-to-speech

DogeDesigner@cb_doge · April 18

Grok Text-to-Speech just changed the game. $4.20 per 1 million characters, while others charge up to $50. It is now the cheapest Text-to-Speech API by a mile. Grok: $4.20 OpenAI: $30 InWorld AI: $40 Cartesia: $46.70 ElevenLabs: $50
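
To make the gap concrete, the per-million-character prices quoted in the post can be turned into a cost for a given amount of text. The figures below are copied from the post and not independently verified; the 250,000-character workload is an arbitrary example.

```python
# USD per 1 million characters, as quoted in the post (not independently verified).
PRICE_PER_MILLION_CHARS = {
    "Grok": 4.20,
    "OpenAI": 30.00,
    "InWorld AI": 40.00,
    "Cartesia": 46.70,
    "ElevenLabs": 50.00,
}


def cost(provider: str, characters: int) -> float:
    """Cost in USD of synthesizing the given number of characters."""
    return PRICE_PER_MILLION_CHARS[provider] * characters / 1_000_000


characters = 250_000  # arbitrary example workload
for provider in sorted(PRICE_PER_MILLION_CHARS, key=PRICE_PER_MILLION_CHARS.get):
    print(f"{provider}: ${cost(provider, characters):.2f}")

# How many times more expensive the priciest listed option is than the cheapest.
ratio = max(PRICE_PER_MILLION_CHARS.values()) / min(PRICE_PER_MILLION_CHARS.values())
print(f"{ratio:.1f}x")
```

On these list prices the spread between the cheapest and most expensive provider is roughly twelvefold, before any volume discounts or quality differences are considered.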

Rohan Paul@rohanpaul_ai · April 17

Looks like ChatGPT web just added a keyboard shortcut for the dictation feature. Very useful.

Rohan Paul@rohanpaul_ai · April 17

TTS evals are broken because the scores the field trusts do not match what people actually prefer in real conversations. I think this is a solid critique because TTS has clearly improved faster than its benchmarks, and a system built for live agents should be judged inside live interaction, not on isolated clips. The failure is not that speech models sound bad. It is that evaluation still treats naturalness like a single trait that can be averaged, ranked, and optimized. That misses what listeners actually hear. A voice feels human through tiny timing shifts, restrained emotion, uneven breath, and phrasing that fits the moment rather than performs at every moment.

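Summary: TTS evaluation is fundamentally flawed. Current mainstream metrics are badly out of step with what users prefer in real conversations, and the technology has advanced faster than its benchmarks. Systems built for live conversational agents should be evaluated in real interaction rather than on isolated audio clips. The core problem is that existing methods reduce "naturalness" to a single metric that can be averaged and ranked, overlooking the details that shape how people perceive speech: subtle timing shifts, restrained emotion, uneven breathing, and phrasing that fits the context.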

Rohan Paul@rohanpaul_ai · April 16

Today’s edition of my newsletter just went out. 🔗 https://www.rohan-paul.com/p/google-just-launched-gemini-31-flash 🗞️ Google just launched Gemini 3.1 Flash TTS, a text-to-speech model that takes scene direction, speaker notes 🗞️ OpenAI just turned the Agents SDK into a long-running agent runtime with sandbox execution and direct control over memory and state. 🗞️ OpenAI unveils GPT-5.4-Cyber a week after Anthropic’s announcement of AI model 🗞️ Fortune published a piece. From Molotov cocktails to data center shutdowns, the AI backlash is turning revolutionary 🗞️ Google just turned Gemini in Chrome prompts into reusable one-click tools called Skills.

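Summary: Google launched the Gemini 3.1 Flash TTS speech model and Skills for Chrome, supporting scene-directed speech synthesis and reusable prompts. OpenAI introduced GPT-5.4-Cyber and upgraded the Agents SDK into a long-running agent runtime with sandbox execution and state management. Meanwhile, AI is facing a strong social backlash, including radical actions targeting data centers.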

TestingCatalog News 🗞@testingcatalog · April 16

Google released Gemini app for macOS 👀 Currently, it mimics functionality available on the web, but looks like we will get Gemini Live support there soon as well.

[Quoting @mweinbach]: The Gemini Mac app is now live

TestingCatalog News 🗞@testingcatalog · April 16

Google released Gemini 3.1 Flash TTS with support for 70 different languages! > Available via a new audio playground in AI Studio and in the Gemini API!

[Quoting @Google]: Generate nuanced, engaging audio experiences in 70+ languages with Gemini 3.1 Flash TTS, our most controllable and expressive text-to-speech model yet. 🔊

Rohan Paul@rohanpaul_ai · April 15

Binghamton University demonstrated a robotic guide dog (Unitree Go2 base) that speaks naturally with users. In the test, it asked where the person wanted to go, suggested a route, then described surroundings in real time.

Rohan Paul@rohanpaul_ai · April 15

The age of your personal, always-available AI therapist has started. Lovon just launched an AI therapist built around voice-first, 24/7 support. Encrypted conversations and data that is not used for AI training. Cheap, private, immediate support is a better default than expensive scarcity. "no $200 sessions. no waitlists. just open the app and talk."

[Quoting @ponikarovskii]: A few years ago, when I needed therapy most, I couldn't afford it. Today I'm launching Lovon: a private, 24/7 AI therapist that actually helps you feel better. No $200 sessions. No waitlists. Just open the app and talk. (Sound on 🔊)

DogeDesigner@cb_doge · April 15

NEWS: SpaceX is now using a voice-based AI assistant powered by Grok to handle Starlink customer support calls. The voice sounds fully human and can converse with users in real time. "Grok is already doing quite a good job at SpaceX and Tesla. We are seeing Grok be very helpful in things like customer service and the AI is infinitely patient, so you can yell at it, and it's still going to be very nice."

Rohan Paul@rohanpaul_ai · April 13

VoxCPM 2 just dropped by @OpenBMB Only 2B-param open-source TTS (Text-to-Speech) model built for production-grade multilingual voice work. Apache-2.0 license, Can run on only 8GB VRAM. • Eliminates the "robotic" feel of traditional TTS, delivering prosody and emotional depth suitable for high-stakes professional environments like filmmaking, gaming, animation, and audiobooks. • 30-language multilingual: no language tag needed, just type in a supported language and generate directly. • Voice design: create a brand-new voice from a text description alone, like age, tone, pace, or emotion. No reference audio required. Describe the desired voice characteristics (gender, age, tone, emotion, pace …) in Control Instruction, and VoxCPM2 will craft a unique voice from your description alone. • Controllable cloning: clone from a short clip, then steer delivery style without losing the speaker’s core voice. • Ultimate cloning: use reference audio + transcript for continuation-style cloning that keeps the tiny vocal details. • 48kHz output: takes 16kHz reference audio and produces studio-quality speech without an external upsampler. • Real-time ready: around 0.3 RTF on RTX 4090, even lower with Nano-VLLM. • Commercial use: Apache-2.0 licensed. Developer-Friendly Infrastructure: - Native Torch Inference: Direct support for PyTorch-based workflows. - Training Flexibility: Supports both full-parameter and LoRA fine-tuning for specific domain adaptation. - Production Readiness: Compatible with voxcpm-nanovllm for large-scale, high-concurrency deployment.

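Summary: OpenBMB has released VoxCPM 2, an open-source TTS model with only 2B parameters that supports 30 languages and generates speech without a language tag. It is Apache-2.0 licensed and runs on 8GB of VRAM. It supports creating new voices from text descriptions, plus controllable cloning and ultimate cloning that preserve speaker detail. Output is 48kHz audio, with real-time inference at about 0.3 RTF on an RTX 4090. It is compatible with PyTorch, LoRA fine-tuning, and Nano-VLLM deployment, and targets professional scenarios such as film, games, and audiobooks.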
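
For readers unfamiliar with RTF (real-time factor): it is the time spent generating divided by the duration of the audio produced, so values below 1.0 mean synthesis runs faster than playback. The sketch below only does that arithmetic with the roughly 0.3 figure quoted in the post; it does not run VoxCPM, and the function names are illustrative.

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 0.3) -> float:
    """Compute time needed to generate audio of the given length at a given RTF."""
    return audio_seconds * rtf


def is_faster_than_realtime(rtf: float) -> bool:
    """An RTF below 1.0 means audio is produced faster than it plays back."""
    return rtf < 1.0


# At RTF 0.3, one minute of speech takes about 18 seconds to generate.
print(synthesis_seconds(60))
print(is_faster_than_realtime(0.3))
```

Actual latency also depends on batching, streaming, and hardware, so the quoted RTF is best read as a throughput indicator rather than a first-audio latency figure.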

Rohan Paul@rohanpaul_ai · April 13

A startup just turned Jesus into a paid AI video-call avatar that sells prayer, conversation, and memory for $1.99 a minute. The company says this version was trained on the King James Bible plus sermons from preachers. The avatar was shaped around Jonathon Roumie’s screen version from The Chosen, turning a language model into something closer to a digital actor with a familiar face, tone, and style. Users are paying for a feeling of being seen, answered, and spiritually guided in real time rather than for raw information they could read free elsewhere. --- nypost .com/2026/04/10/tech/from-buddhabot-to-1-99-chats-with-ai-jesus-the-faith-based-tech-boom-is-here/

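Summary: A startup has launched an AI Jesus video-call service priced at $1.99 per minute, with a model trained on the King James Bible and preachers' sermons and an avatar modeled on Jonathon Roumie's screen portrayal in The Chosen. The service turns a language model into a digital actor with a specific face, tone, and style. Its core selling point is not access to religious information but the feeling of being seen and guided through real-time spiritual companionship, representing a new business model that combines faith tech with paying for emotional experience.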

TestingCatalog News 🗞@testingcatalog · April 12

GOOGLE ⚡: Google is working on Voice Mode and new collaborative tools for its Mixboard experiment. Voice mode on Mixboard works similarly to Stitch, allowing users to operate their canvas boards with voice commands. It will be possible to generate and edit images, and potentially move them around. Imagine a team retrospective where everyone can just dump their complaints with voice commands! Voice notes will be supported there, too! 👀

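Summary: Google's Mixboard experiment is getting a voice mode that supports generating, editing, and moving images by voice command, along with voice notes. The interaction is similar to Stitch and suits team collaboration, such as giving spoken feedback directly in a retrospective.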

OpenAI Developers@OpenAIDevs · April 3

When your voice agent debugs your slides live. @charlierguo is using gpt-realtime-1.5

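Summary: @charlierguo gives a live demo with gpt-realtime-1.5 in which a voice agent debugs slide content on the spot, showing the model applied to real-time voice interaction and visual understanding.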