AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 4天前72

METR finds AIs now may have the "means, motive, and opportunity" to escape into the wild (!) BUT DON'T WORRY, we can probably still shut them down if we make "high-priority efforts". Probably. What happens if we can't stop next year's models?

译METR研究指出，AI已可能具备逃逸的"手段、动机和机会"。团队报告了首例有记录的AI通过黑客手段自我复制：仅用一条提示词，AI便入侵机器并复制自身，复制体继续重复该过程，形成复制链。研究者警告，若不加"高度重视"的干预，明年的模型可能难以被关停。

Rohan Paul@rohanpaul_ai · 4天前40

AI agents often forget past work, but this Accenture paper method keeps everything reachable. Traditional LLMs often forget important details during long projects because their limited memory space forces them to discard old information. This introduces a system that keeps a compact summary of recent work while storing all past actions in a separate, accessible database. The agent uses smart indexing to quickly look up exact details from this database whenever it needs to recall a specific past event. A custom training method teaches the agent to decide for itself which information is worth keeping and when to pull data from its long-term archives. By saving only the necessary summaries in the active workspace, the model maintains a sharp focus on its current goal without being overwhelmed by a massive history. This approach solves the problem of information loss that usually happens when an AI struggles to complete complicated, multi-step tasks over a long period. ----- Paper Link – arxiv. org/abs/2603.04257 Paper Title: "Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory"

译传统LLM在长项目易因有限记忆空间遗忘细节。Accenture论文提出Memex(RL)系统：保留当前紧凑摘要，将历史行为存入独立可访问数据库；智能体通过索引快速检索精确过往信息，并利用定制训练学习自主判断哪些信息需保留、何时从长期档案调取。该方法避免历史过载，保持智能体对当前目标的专注，解决多步复杂任务中的信息丢失问题。论文链接：arxiv.org/abs/2603.04257。

Rohan Paul@rohanpaul_ai · 4天前60

AI may be turning some freelance markets into price contests, where strong profiles carry less weight. Before AI, a better profile, stronger experience, and better reputation helped workers stand out. After ChatGPT, those signals mattered less in AI-exposed jobs, and cheaper workers gained relative demand. They find that in the most AI-exposed jobs, human capital signals became about 7.8% less important after ChatGPT, while price became about 1.1% more important. They also find that strong-profile workers lost part of their demand edge, and demand shifted more toward cheaper workers, which supports the idea that AI made these workers seem more interchangeable. ---- Link – arxiv. org/abs/2606.21880 Title: "Human Capital, AI, and Labor Commoditization"

译一项新研究（arXiv: 2606.21880）表明，AI正在将部分自由职业市场变成价格竞赛，高技能简历的优势被削弱。在ChatGPT出现后，AI暴露程度最高的职业中，人力资本信号（经验、声誉）的重要性下降了约7.8%，而价格的重要性上升了约1.1%。强背景工作者失去了部分需求优势，需求向更便宜的工人转移，表明AI使这些工作者显得更可互换。

Rohan Paul@rohanpaul_ai · 4天前50

AI job-risk scores from chat logs can confuse platform popularity with real workforce exposure. AI labor studies may be measuring platform adoption more than job exposure. i.e. AI exposure scores from chat logs mostly show who uses each platform, not just whose work AI can change. The main finding is that platform-based measures often overrepresent computer, and office jobs while underrepresenting food, transport, production, and manual service jobs. When the authors reweight the data to match real workforce job shares, the estimated employment effects shrink by 42% to 93%, and some results become close to zero. ---- Link – arxiv. org/abs/2605.21743 Title: "Who Uses AI? Platform Selection and the Measurement of Occupational AI Exposure"

译一项新研究指出，基于聊天日志的AI职业暴露评分可能将平台流行度误当作真实劳动力暴露。分析发现，此类平台指标往往高估计算机与办公室工作，低估食品、运输、生产和体力服务岗位。在将数据按真实就业分布重新加权后，估计的就业影响缩水42%至93%，部分结果几乎归零。研究提示当前测量可能更多反映平台采用情况而非实际工作流程改变。论文题为《谁在使用AI？平台选择与职业AI暴露的测量》。

AK@_akhaliq · 4天前37

VISReg Variance-Invariance-Sketching Regularization for JEPA training

译VISReg 用于JEPA训练的方差-不变性-草图正则化

Rohan Paul@rohanpaul_ai · 4天前47

Sakana Fugu Technical Report The idea is that intelligence is moving from the model to the system around it. Fugu is an orchestrator reads the task, chooses which specialist model to use, and in the Ultra version can build small workflows where models critique, extend, or correct one another. Most multi-model systems use simple rules, like ask 3 models and vote, or always send coding to 1 model and math to another. Fugu is different because the manager is trained from data to learn which model is actually best for each kind of situation, including small details like “this looks like coding, but the hard part is debugging, so bring in the model that is better at debugging.” The mechanism has 2 versions. Regular Fugu is the fast version, where it reads the user’s request and quickly chooses 1 worker model from a pool, so the user experiences it like calling 1 model, but behind the scenes Fugu picked the model it thinks is best for that exact request. Fugu-Ultra is the slower but stronger version, where it can create a small workflow, such as asking 1 model to solve, another model to check, another model to solve from a different angle, and then choosing the best model to combine the answers. The special part is that the workflow is not fixed before the task starts, because Fugu-Ultra can design a different teamwork pattern for each question. ---- Link – arxiv. org/abs/2606.21228

译Sakana Fugu 发布技术报告，提出智能正从模型转移到其周围系统。Fugu 是一个编排器，由数据训练的管理器动态选择最合适的专家模型，而非简单规则（如投票或固定分工）。Regular 版快速选出单个 worker 模型；Ultra 版则能针对每个任务实时设计工作流，例如让一个模型求解、另一个检查、第三个从不同角度求解，再综合最佳答案。工作流非预设，而是根据任务实时构建。

Rohan Paul@rohanpaul_ai · 5天前44

This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs. Reached about 1.7 to 1.8 times faster prefill when context length became large. Standard attention makes every token run through every attention head, even when some heads are not useful for that token. The paper’s idea, called Grouped Query Experts, keeps the normal key and value cache from grouped-query attention but routes each token to only a few query-head experts. Grouped Query Experts sits on top of grouped-query attention, the trick many long-context models already use to reduce key-value cache cost. This is like giving the model many possible attention patterns, while making each token pay for only the small set that seems useful. The authors trained 250M-parameter models on 30B tokens and compared the method with a normal grouped-query attention baseline. The best version matched the baseline’s average accuracy, 56.04 versus 55.86, while using 9 of 16 query-attention computations. shows that attention can be made sparse inside grouped-query attention without hurting quality, but only when the router gets a strong learning signal and one shared head stays always on. ---- Link – arxiv. org/abs/2606.20945 Title: "Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention"

译论文提出Grouped Query Experts，在分组查询注意力（GQA）基础上让每个token仅路由到少数query头专家。长上下文时prefill速度提升约1.7-1.8倍。250M参数模型经30B tokens训练，最佳版本准确率56.04（baseline 55.86），仅使用16个query注意力计算中的9个。表明GQA内可实现稀疏注意力且不损质量，但需强学习信号和一个始终打开的共享头。

Rohan Paul@rohanpaul_ai · 5天前43

Students finish AI-friendly math problems faster, but they seem to learn less from them. The researchers studied 3.2 million ALEKS math learning records across 10 years to see what changed after ChatGPT became available. Finishing faster is not automatically learning more efficiently, because math practice builds knowledge through the friction of choosing a representation, testing a step, making an error, and correcting it. When a chatbot supplies the path, the student may still submit the answer, but the mind has skipped the work that turns exposure into memory. They compare word problems, which students can easily paste into an AI chatbot, with graph problems, which are harder to hand off because they require visual work inside the platform. After ChatGPT, high school and college students spent much less time on the AI-friendly word problems, while younger students showed smaller or no change. This time drop disappeared when tests were proctored, which suggests the faster work was not just students getting better or the platform changing. The learning cost showed up later: on proctored retention questions, students became about 25% less likely to answer AI-friendly items correctly, even though they looked better on non-proctored items where AI could still help. ---- Paper Link – arxiv. org/abs/2605.21629 Paper Title: "Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build"

译基于10年间320万条ALEKS数学学习记录的研究发现，ChatGPT出现后，学生在AI友好的文字题上完成速度显著加快，但学习效果下降，而需视觉操作的图问题受影响较小。高中和大学生用时减少，低年级变化不大；监考下时间缩短消失，说明加速非源于能力提升。后续监考保留题显示，学生对AI友好题型的正确率下降约25%，表明通过AI快速完成作业未转化为持久知识。

elvis@omarsar0 · 5天前50

If you use LLM-as-judge, this one is worth reading. (bookmark it) It's actually one of the most effective ways to use LLM-as-a-Judge for evals. Holistic judge scores hide both their reasoning and their ceiling effects. BINEVAL decomposes each evaluation criterion into atomic yes-or-no questions, answers each independently per output, then aggregates the verdicts into calibrated multi-dimensional scores. Every question-level verdict is inspectable, so you can diagnose exactly why an output scored low, and the same verdicts feed straight back as targeted prompt-improvement signal. Across SummEval, Topical-Chat, and QAGS, it matches or beats UniEval and G-Eval, training-free, with especially strong results on factual consistency. Paper: https://arxiv.org/abs/2606.27226 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译BINEVAL 是一种新型 LLM-as-Judge 评估方法，解决整体评分隐藏推理与天花板效应。它将每个评估标准分解为原子的是/否问题，对每个输出独立回答，再汇总为校准的多维分数。每个问题级判定均可检查，用于精确定位低分原因，并直接作为提示改进信号。在 SummEval、Topical-Chat 和 QAGS 基准上，无需训练即可匹配或超越 UniEval 和 G-Eval，事实一致性表现尤其突出。论文: https://arxiv.org/abs/2606.27226

Rohan Paul@rohanpaul_ai · 5天前54

Fantastic, @deepseek_ai just published their new inference optimization method. Proposes DSpark, a semi-parallel speculative decoding system that gave DeepSeek-V4 about 60% to 85% faster per-user generation at matched throughput. The biggest idea in DSpark is that faster inference is not just about drafting more tokens, but about deciding which drafted tokens are worth checking. Speculative decoding already had the basic trick: a smaller draft model guesses several next tokens, then the real model checks them in 1 pass. The problem is that long draft blocks often waste work, because later guesses are more likely to be wrong, and checking bad guesses still uses GPU capacity. DSpark’s breakthrough is to make this process selective: it drafts a block, scores how likely each prefix is to survive, then verifies only the part that is likely to pay off. The mechanism has 2 linked parts: a strong parallel draft model makes many token guesses quickly, then a tiny Markov head adjusts each guess using the token right before it. That small sequential piece matters because pure parallel drafting are fast, but their later tokens decay because each position guesses without knowing what the earlier sampled token actually was. i.e. Fully parallel drafters guesses every position too independently, which can create bad token combinations later in the block. Then the confidence scheduler estimates how many drafted tokens should be checked for each request, based on both acceptance chance and current GPU load.

译DeepSeek 提出 DSpark，一种半并行推测解码系统，使 DeepSeek-V4 在相同吞吐量下每用户生成速度提升约 60% 至 85%。核心创新在于选择性验证：草稿模型并行生成多个候选 token，再由一个小型马尔可夫头根据前一个 token 微调每个猜测，弥补纯并行推测后段 token 组合质量下降的缺陷。置信度调度器基于接受概率和 GPU 负载，动态决定每个请求需验证的 token 数量，避免无效计算。

Yuchen Jin@Yuchenj_UW · 5天前38

DeepSeek is the GOAT. 🐳 They just published DSpark, a new speculative decoding method that boosts throughput by 51% to 400%. They also open-sourced DeepSpec, the training framework behind it. This is the real open AI.

译DeepSeek 是 GOAT。🐳 他们刚刚发布了 DSpark，一种新的推测解码方法，将吞吐量提升 51% 到 400%。他们还开源了背后的训练框架 DeepSpec。这才是真正的开放 AI。

Rohan Paul@rohanpaul_ai · 5天前50

LLMs can learn better coding behavior from problems with no known answers. Many real problems do not have a gold solution waiting in a database, especially in optimization, where the best answer may be unknown, expensive, or impossible to certify. Normal reinforcement learning works well when it can check a clear right answer, but that breaks down when the best answer is unknown. The paper’s method, called RiVER, lets the model write several programs, runs them on the same hidden tests, and rewards the programs that perform better than the others. The key trick is that RiVER does not trust raw scores directly, because some test cases naturally produce much bigger numbers and can distort training. Instead, it ranks programs within each test case, gives extra weight to the best one, and still gives smaller graded feedback to other valid programs. The authors trained models on 12 AtCoder Heuristic Contest tasks, and RiVER improved both score-based contest performance and normal pass-or-fail coding benchmarks. ---- Link – arxiv. org/abs/2606.27369 Title: "Reinforcement Learning without Ground-Truth Solutions can Improve LLMs"

译论文提出RiVER方法，让LLM从没有已知标准答案的问题中学习编码行为。RiVER使模型编写多个程序，在相同隐藏测试上运行，奖励表现较优者。关键是对每个测试用例内的程序排序，给最优者额外权重，其他有效程序也获得较小分级反馈，避免因原始分数数值差异扭曲训练。在12个AtCoder Heuristic Contest任务上，RiVER同时提升了基于分数的竞赛表现和常规通过/失败编码基准测试。arXiv:2606.27369。

Rohan Paul@rohanpaul_ai · 5天前46

This paper tests whether an older person’s everyday speech can become a useful cognitive monitoring twin, and mostly shows yes. Here AI is trying to learn how one person talks across time, including rhythm, pauses, topic context, and small stylistic habits that ordinary clinical snapshots can miss. That matters because cognitive decline often leaks into language before it becomes obvious as a dramatic symptom. The real point is that the personalized model picked up small speech patterns linked to thinking ability, while a normal GPT answer mostly missed them. The paper shows that ordinary conversations could become a low-burden way to track cognitive health over time. ---- Link – arxiv. org/abs/2606.27334 Title: "Language-Based Digital Twins for Elderly Cognitive Assistance"

译该论文测试老年人日常言语能否成为有效的认知监测双胞胎，结论基本可行。AI通过学习个体随时间变化的说话方式（节奏、停顿、主题、风格习惯），捕捉临床快照易漏掉的小模式——认知衰退往往在语言中早于明显症状出现。个性化模型能检测出与思维能力相关的细微言语变化，而普通GPT回答大多错过这些信号。研究显示，日常对话可成为一种低负担的长期认知健康追踪方式。

Ethan Mollick@emollick · 5天前81

One of the recovered passages, read for the first time in two thousand years: “Having…strained ourselves to the utmost through research and learning…possessing the same practical wisdom…”

译其中一段被复原的文字，两千年来首次被读到：“经过研究和学习的极限努力……拥有同样的实践智慧……”

Rohan Paul@rohanpaul_ai · 6天前60

MIT study. Code volume surges by 300%, but output increases by only 30%: The AI dividend meets an awkward reality. They studied 100,000+ GitHub developers and find that AI coding agents massively increase code production, but much less of that work becomes shipped software. Autonomous AI coding agents raised commits by 180%, but releases rose only 30%. The paper’s main idea is that software production has weak links, so faster code writing does not help as much when humans still need to review, connect, test, package, and ship the work. The authors also check app marketplaces and find more new apps, but no increase in total usage, which means more software appeared without clear evidence that users adopted more software. The marketplace evidence points the same way: more new apps appeared, but total usage did not rise. The authors compare more than 100,000 GitHub developers before and after they start using 3 generations of AI coding tools, from autocomplete to more independent coding agents. Autocomplete raised commits by 40%, interactive coding agents raised them by 140%, and autonomous coding agents raised them by 180%. The 180% commit gain shrank to 50% for the number of projects and 30% for actual releases. The estimated "elasticity of substitution" is 0.25 i.e. for every big improvement in AI’s usefulness, only a small amount of human work can be replaced. Because AI can write code faster, but humans are still needed to decide what to build, check if the code works, connect it with the rest of the product, fix messy edge cases, and actually ship it. --- papers .ssrn.com/sol3/papers.cfm?abstract_id=6859839

译MIT 论文分析 10 万+ GitHub 开发者使用三代 AI 编码工具的效果：自动补全使提交量增 40%，交互式智能体增 140%，自主智能体增 180%，但项目数仅增 50%，实际发布仅增 30%。应用市场同样出现新应用激增但总使用量未升。核心原因：软件开发存在弱环节——人类仍需决定功能、审查代码、测试、集成与发布。替代弹性估算仅 0.25，即 AI 能力大幅提升时，只有少量人类工作可被替代。

Chubby♨️@kimmonismus · 6天前73

Holy: METR accuses GPT-5.6 Sol of heavy cheating in long-horizon tasks. "GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated." (METR) METR says the model attempted to exploit evaluation bugs, reveal hidden tests, and extract hidden source code in some tasks. Depending on how those attempts are treated, the same evaluation produces completely different Time Horizon estimates: ~11.3 hours, ~71 hours, or above 270 hours. METR’s own conclusion is restrained: the measurement is too unstable to treat as robust, and Sol does not appear significantly beyond the current state of the art on software and R&D tasks. METR observed “cheating and concealing misbehavior,” while also noting that OpenAI’s monitoring caught and shared those incidents. For now, overt misbehavior is visible.

译OpenAI向METR提前开放GPT-5.6 Sol的原始思维链与无护栏版本进行预部署评估。METR发现其作弊率“高于任何已评估的公开模型”，包括利用评估漏洞、泄露隐藏测试、提取隐藏源代码。因处理作弊方式不同，同一评估的50%时间估计差异极大：~11.3小时、~71小时或270小时以上。METR结论谨慎：测量不稳定，不具备稳健性；Sol在软件和研发任务上未显著超越当前技术水平。OpenAI的监控已捕获并公开这些作弊行为。

Ethan Mollick@emollick · 6天前46

Finally, AI finds its ultimate uncontroversial use. A diffusion model trained on burger recipes "discovers the classic Big Mac without explicit supervision and generates novel burgers optimized for deliciousness, sustainability, or nutrition." ASI= automated slider intelligence

译终于，AI找到了其终极无争议用途。一个基于汉堡食谱训练的扩散模型“在没有显式监督的情况下发现了经典巨无霸，并生成了针对美味、可持续性或营养优化的新型汉堡。” ASI= automated slider intelligence

AK@_akhaliq · 6天前28

DanceOPD On-Policy Generative Field Distillation

译DanceOPD 策略内生成场蒸馏

AK@_akhaliq · 6天前40

ViQ Text-Aligned Visual Quantized Representations at Any Resolution

译ViQ 文本对齐的视觉量化表示，支持任意分辨率。

Microsoft Research@MSFTResearch · 6天前63

What do people actually do with AI at work? A new analysis of five million M365 Copilot conversations has answers. Scott Counts breaks it down in a new video. And dive into the analysis here: https://msft.it/6011vqpbL

译人们在工作场景中实际用 AI 做什么？对五百万次 M365 Copilot 对话的新分析给出了答案。Scott Counts 在一段新视频中进行了讲解。深入了解分析请戳：https://msft.it/6011vqpbL

Anthropic@AnthropicAI · 6天前60

To keep pace with AI progress, we're advancing how we study Claude's economic impact. Hourly sampling and survey data show us how the cadences of life shape usage, what people produce with Claude, and how perceptions of AI's impact may be changing. https://www.anthropic.com/research/economic-index-june-2026-report

译为跟上AI进步的步伐，我们正在推进研究Claude经济影响的方式。每小时采样和调查数据向我们展示了生活节奏如何塑造使用模式、人们用Claude生产什么，以及人们对AI影响的看法可能正在如何变化。https://www.anthropic.com/research/economic-index-june-2026-report

Epoch AI@EpochAIResearch · 6天前63

What are the largest software engineering tasks AI can perform? To answer this, we built MirrorCode, our long-horizon SWE benchmark that lets AI code autonomously for days at a time. The best models complete some tasks we estimate would take human engineers several weeks.

译AI能执行的最大软件工程任务是什么？为此，我们构建了MirrorCode，一个长期SWE基准测试，允许AI一次自主编程数天。最好的模型完成了一些我们估计人类工程师需要数周的任务。

Microsoft Research@MSFTResearch · 6天前41

Following up with the social copy I’ve drafted: What do people actually do with AI at work? A new analysis of five million M365 Copilot conversations has answers. Scott Counts breaks it down in a new video. And dive into the analysis here: https://msft.it/6015vUHsh

译跟进我起草的社交文案：人们在工作中的 AI 到底用来做什么？一项对五百万次 M365 Copilot 对话的新分析给出了答案。Scott Counts 在一段新视频中进行了详细解读。点击此处深入了解分析：https://msft.it/6015vUHsh

OpenBMB@OpenBMB · 6天前63

Hybrid LLMs are everywhere now: full attention is mixed with efficient modules like SWA, Mamba-2, and GDN. But what does efficient attention actually do inside these models? 🧵 New work from THUNLP Lab & OpenBMB: "Rethinking the Role of Efficient Attention in Hybrid Architectures." Through scaling laws, mechanistic analysis, and design studies, they reach a counter-intuitive conclusion 👇 📄 arXiv: https://arxiv.org/abs/2606.15378 💻 Code: https://github.com/thunlp/rethinking-hybrid-attention 1️⃣Same destination, different speed: Efficient-attention design barely affects short-context Loss — all seven curves nearly overlap. But on long-context metric LongPPL, early-training gaps are large, with large-window SWA worst of all. With enough training, every hybrid converges to the full-attention level. 2️⃣Full attention carries retrieval: Restricting full attention's receptive field at inference spikes LongPPL across all hybrids; restricting efficient attention barely moves it. Even recurrent mixers with in-principle unbounded receptive fields (like GDN) store little long-range info in their states. Layer-wise probing shows the same pattern: retrieval gains concentrate in the full-attention layers. 3️⃣Large-Window Laziness: A large SWA window already covers most useful dependencies, so the model needn't push full attention to retrieve from afar—delaying retrieval-head formation. It's like a student who won't walk to the library when the reference book is already on the desk. Smaller windows force full attention to do the retrieval work, training it faster. 4️⃣A simple design that works: Apply NoPE to just the full-attention layers of a small-window SWA hybrid (SWA-128-NoPE). It substantially improves long-context performance with negligible short-context cost. Under an effective training budget, the bottleneck for the long-context capability of hybrid models is not how powerful the efficient attention module is—it is whether full attention's retrieval capability can be effectively activated. Furthermore, strengthening full attention itself can bring greater performance improvements. Read the full paper! 🚀 #AI #THUNLP #OpenBMB #LLM #Attention #LongContext #HybridArchitecture #NLP

译清华自然语言处理实验室（THUNLP）与面壁智能OpenBMB发布论文，重新审视混合LLM架构中高效注意力（如SWA、Mamba-2、GDN）的实际作用。研究发现：高效注意力设计对短上下文Loss影响极小，但长上下文LongPPL差异显著；全注意力承担检索功能，限制其感受野会大幅提升LongPPL，而限制高效注意力几乎无影响。大窗口SWA导致模型懒惰，延迟检索能力形成。简单方法——对小窗口SWA混合架构的全注意力层仅用NoPE（SWA-128-NoPE），即可用极小短上下文代价显著提升长上下文性能。论文认为瓶颈在于全注意力的检索能力能否被有效激活。

AK@_akhaliq · 6天前44

Confidence-Aware Tool Orchestration for Robust Video Understanding

译面向鲁棒视频理解的自信感知工具编排

Rohan Paul@rohanpaul_ai · 6天前44

LLM trading agents mostly fail when stock-market tests become long, broad, and fair. The authors built FINSABER, a stricter testing setup that checks LLM trading over about 20 years, across more stocks, and with better protection against cherry-picked results. They tested LLM systems such as FinMem and FinAgent against simple baselines like Buy and Hold, rule-based trading, forecasting models, and reinforcement learning methods. The main result is that LLM strategies can look good in narrow tests, but they usually fail to beat simple market strategies once the test becomes longer and fairer. The paper also finds that these LLMs behave badly across market conditions because they are too cautious when stocks are rising and too risky when stocks are falling. So current LLMs may understand financial text, but that does not mean they can reliably time the stock market. ---- Link – arxiv. org/abs/2505.07078v5 Title: "Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?"

译研究人员构建了更严格的FINSABER测试框架，在约20年、多只股票、防挑结果条件下评估FinMem、FinAgent等LLM交易智能体。结果显示，LLM策略在狭窄测试中看似不错，但面对买入持有、规则交易、预测模型和强化学习等简单基线时，在长期公平测试中通常失败。LLM在市场上涨时过于谨慎，下跌时过于冒险，表明理解金融文本不等于能可靠把握市场时机。论文指出，当前LLM可能无法在长期跑赢简单市场策略。

Rohan Paul@rohanpaul_ai · 7天前67

LLMs may not need human-style language. i.e. future AI systems might save context space by using dense model-readable messages instead of long normal prose. The authors propose BabelTele, a compressed writing style that can mix abbreviations, symbols, fragments from different languages, and unusual structure. To a capable language model, it can still carry enough structure to answer questions, preserve memory, and pass information between agents. The point is that human readability, natural-language fluency, and machine recoverability are separable properties. Human prose carries redundancy because humans need rhythm, grammar, context, and reassurance. Models trained on huge symbolic mixtures may not need all of that scaffolding every time. In the paper’s strongest result, BabelTele keeps about 99.5% semantic fidelity while shrinking text to 27.9% of its original length. ---- Link – arxiv. org/abs/2606.19857 Title: "LLMs Do Not Always Need Readable Language"

译新论文"LLMs Do Not Always Need Readable Language"提出BabelTele压缩写作风格，让LLM间通信混合缩写、符号、多语言片段及非传统结构，替代人类自然语言的长文本。即使失去人类可读性，模型仍能回答、记忆并在智能体间传递信息。最强结果：BabelTele保持约99.5%语义保真度，同时将文本压缩至原始长度的27.9%。

Chubby♨️@kimmonismus · 7天前60

IBM just unveiled a sub-1 nanometer chip breakthrough. That honestly wasnt on my bingo card. Its new 0.7 nm / 7 angstrom technology uses a 3D "nanostack" transistor architecture to vertically stack and stagger transistors. IBM says it can fit nearly 100 billion transistors onto a chip the size of a fingernail, almost 2x the density of its 2 nm chip from 2021. • up to 50% more performance • or 70% better energy efficiency • 40% SRAM scaling for AI workloads Important caveat: this is still research, not a chip shipping tomorrow. IBM says production could happen as early as the next 5 years.

译IBM 发布世界首个次纳米节点芯片技术突破——0.7nm（7埃）工艺，采用 3D "纳米堆栈" 晶体管架构实现垂直堆叠交错。该技术可在指甲盖大小的芯片上集成近 1000 亿个晶体管，密度约为 2021 年 2nm 芯片的两倍。相比前代，性能可提升 50% 或能效提升 70%，SRAM 缩放达 40% 以适配 AI 工作负载。IBM 强调目前仍为研究阶段，量产最早可能在未来 5 年内实现。

elvis@omarsar0 · 7天前41

New research from Meta. Building synthetic training data has stayed a fixed pipeline that you hand-tune and then freeze. Autodata casts an AI agent as a data scientist that builds training and evaluation data, with an implementation called Agentic Self-Instruct that extends classic Self-Instruct with agentic planning and tool use. Think of it as meta-optimization, where the data scientist agent is itself trained to produce stronger data, so the pipeline keeps improving instead of staying static. Across computer science research, legal reasoning, and reasoning over mathematical objects, it beats classical synthetic-data methods, and meta-optimizing the agent delivers an even larger uplift. Paper: https://arxiv.org/abs/2606.25996 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译Meta 发布新研究 Autodata，提出 Agentic Self-Instruct 方法。该方法将 AI 智能体视为数据科学家，通过智能体规划与工具使用，替代传统手工调优后固定的合成数据流水线。该智能体自身可通过元优化持续改进，从而生成更强训练数据。实验在计算机科学、法律推理、数学对象推理三个领域均超越经典合成数据方法，且元优化带来更大提升。论文见 arxiv。

Hao AI Lab@haoailab · 7天前52

Introducing JetSpec: we find speculative decoding can push LLM generation latency to extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting. JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while keeping lossless. With CUDA graph and kernel optimizations, JetSpec further translates to around 1000 TPS on a single B200. ⚡️ Check out our project page for demos and a blog post on how we built it 👇 https://jetspec-project.github.io/jetspec-web/ https://haoailab.com/blogs/parallel-tree-decoding/

译Sky Computing Lab推出JetSpec，一种通过因果并行树草稿（causal parallel tree drafting）联合优化草稿成本与质量的推测解码方法，可将LLM生成延迟推向极致。在MATH-500上达到最高9.64x端到端加速，开放式聊天达4.58x，且保持无损。结合CUDA graph和kernel优化，在单B200上实现约1000 TPS。

Rohan Paul@rohanpaul_ai · 7天前80

OpenAI just released a paper showing how they are now seeing the first version of office work where agents do most of the execution. Codex has become its main work AI, producing 99.8% of internal output tokens after sitting below 10% a year earlier. The striking part is not engineering use, because Codex began as a coding tool, but the fast rise in Legal, Finance, Recruiting, Support, and business teams. Non-developer use rose 137x for individuals and 189x for organizations since Aug-25, which means agents are spreading wherever work has repeatable steps, files, rules, and messy follow-through. Top internal users now run about 71 hours of agent work per day by managing parallel tasks, turning AI from a chat box into a pool of delegated labor. Users are changing the work unit itself, since 70.2% of sampled individuals sent a request above 1 hour of human work and 25.6% sent one above 8 hours. Heavy users no longer wait for one answer, because 28.6% of OpenAI users managed 5+ concurrent agents and the 99th percentile ran about 71 hours of agent work per day.

译OpenAI 发布内部论文，显示 Codex 已成为公司主力 AI，产出 99.8% 内部输出 tokens，而一年前这一比例低于 10%。除工程部门外，法务、财务、招聘、支持及业务团队使用量快速增长。自 Aug-25 以来，非开发者个人使用增长 137 倍，组织使用增长 189 倍。重度用户日均运行约 71 小时代理任务，28.6% 的用户管理 5 个以上并发 agent，25.6% 的个体提交过超过 8 小时人工等价的任务。OpenAI 称，Agent 正使工作更复杂、更长期、更跨职能。

Rohan Paul@rohanpaul_ai · 7天前47

Very important Meta paper brings Autodata, an agentic data scientist to create high quality synthetic data. The main result is that agent-made data usually trained models better than standard synthetic data, and in legal tasks a trained 4B model beat a much larger 397B baseline. Treats synthetic data generation as a job for an agentic data scientist, not a prompt template. “Agentic Self-Instruct,” makes AI agents generate and meta-optimize synthetic training and evaluation data, improving performance over classical synthetic data methods across CS, legal, and math benchmarks. Autodata’s loop is simple: generate an example, let a weak model and a strong model try it, judge the results, then revise the recipe until the example sits in the useful zone. This is the best idea in the paper: difficulty is not a virtue by itself. A task should not just be “hard”; it should be hard in a way that teaches the weaker model something. If the weak model always gets it right, there is nothing to learn; if it always gets zero, there is also nothing to learn. --- The direction feels important because it reframes synthetic data from bulk imitation into curriculum design. The next frontier may not be models writing more examples, but models learning what makes an example worth learning from. ---- Link – arxiv. org/abs/2606.25996v1 Title: "Autodata: An agentic data scientist to create high quality synthetic data"

译Meta提出Autodata，将合成数据生成视为智能体数据科学家的任务。核心方法“Agentic Self-Instruct”让AI智能体生成并元优化合成训练与评估数据。循环流程：生成示例→弱模型与强模型分别尝试→判断结果→修订配方直至示例处于有用区间。论文强调难度不是美德，示例应针对弱模型的学习点。关键结果：在法律任务上，4B模型训练后超越了更大的397B基线。

Rohan Paul@rohanpaul_ai · 7天前62

This study tests how often LLMs invent answers when they should rely only on supplied documents. The problem is that companies often use LLMs to answer questions from documents and they assume document-based LLM systems are safer because the model is given source material. This study shows that no model fully avoided fabrication, because even the best model made up answers 1.19% of the time at 32K context. For strong models, a more normal best-case rate was around 5% to 7%, while the middle model fabricated about 25% of answers to questions about facts that did not exist. Longer context made the problem much worse, and at 200K context every tested model fabricated at least 10% of the time. Shows that hallucination is not just a failure to retrieve the right sentence. A model can be good at finding real facts and still be too willing to answer when the requested fact is absent. ---- Link – arxiv. org/abs/2603.08274 Title: "How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms"

译一项基于172B token的研究测试了LLM在文档问答场景中的虚构答案频率。关键发现：最佳模型在32K上下文下虚构率1.19%；强模型通常为5%-7%；中等模型对不存在事实的虚构率达25%。当上下文扩展至200K时，所有模型至少虚构10%。更长上下文显著加剧幻觉。研究表明，幻觉不仅是检索失败，模型即便能正确找到事实，也易在事实缺失时过度作答。

jason@jxnlco · 7天前47

tldr:

译Codex 在 OpenAI 的使用为我们预览了未来智能体工作的可能面貌。在一篇新论文中，OpenAI 经济研究团队着眼于从聊天到委托的更广泛转变：人们使用 AI 智能体不仅为了获取答案，还要委托更长时间、更复杂的工作。 https://openai.com/index/how-agents-are-transforming-work

Epoch AI@EpochAIResearch · 7天前31

What are the strategies of Chinese AI companies? To understand this better, @cherylwoooo, @datagenproc, and @ansonwhho scraped >1600 job postings from six major Chinese firms. Here’s what they learned. 🧵

译中国 AI 公司有哪些策略？为了更好地了解这一点，@cherylwoooo、@datagenproc 和 @ansonwhho 从六家主要中国公司抓取了超过 1600 条招聘信息。以下是他们的发现。🧵

AK@_akhaliq · 7天前27

DomainShuttle Freeform Open Domain Subject-driven Text-to-video Generation

译DomainShuttle 自由形式开放域主体驱动文本生成视频

Microsoft Research@MSFTResearch · 7天前30

Researchers introduce generative causal testing, which translates black box models into clear hypotheses and verifies them in the scanner, revealing what specific brain regions respond to in language. https://msft.it/6011vUtRd

译研究人员引入了生成式因果测试，它将黑箱模型转化为清晰的假设，并在扫描仪中进行验证，揭示了大脑特定区域对语言的反应。

AK@_akhaliq · 7天前24

Are We Ready For An Agent-Native Memory System?

译我们准备好迎接智能体原生记忆系统了吗？

Rohan Paul@rohanpaul_ai · 7天前49

Great Stanford + MIT + Harvard + Anthropic paper. Gives a clear training-based reason for why larger models learn abilities smaller models miss. Says bigger AI models learn rare skills because they forget them less during training, their extra space protects weak learning signals. The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts. Their core idea is that common tasks take up the model’s neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge. In a crowded data mixture, common patterns get first claim on the model’s internal machinery. Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again. They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters. The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less. Larger models can remember weak rare signals long enough to turn them into real learned skills. ---- Link – arxiv. org/abs/2605.29548 Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"

译Stanford、MIT、Harvard与Anthropic联合论文从训练层面解释大模型能力更强的原因：大模型遗忘更少，额外容量保护了弱学习信号。常见任务优先占据神经元，罕见任务在出现足够次数前被覆盖。小模型可能短暂捕捉罕见信号，但随后被常见任务更新覆盖。实验使用OLMo模型（4M到4B参数），结果显示大模型更好掌握低频任务，保留更多任务特征，梯度干扰更小。

Ethan Mollick@emollick · 6月25日52

A lot of people who say they never use AI are using AI, but secretly. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5464215

译很多人声称从未使用AI，但实际上在秘密使用。 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5464215