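DeepAdapt has released ACI (Adaptive Continual Intelligence), a runtime learning layer that shifts repetitive workloads from GPUs to standard CPUs, cutting operating costs by 82% and speeding up inference 33X (159 ms median latency). ACI learns from model decisions, human corrections, and feedback in real time at inference, handles known requests directly on a local CPU, and sends only uncertain or complex requests back to the underlying LLM. Benchmarks: 90% lower token consumption, 5.7X lower production-scale cost, 96% accuracy (vs. 85% without ACI), 85.7% lower energy per 1,000 decisions, and 4.8X fewer rule violations. It needs no fine-tuning or retraining and is plug-and-play, and GPU dependence declines as the system matures. The architecture targets cloud-based LLM agents first and will matter just as much on personal devices.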
DeepAdapt has launched a runtime intelligence layer that cuts AI operating costs by up to 82% and makes inference 33X faster by shifting repetitive workloads from GPUs to standard CPUs.
The company calls it Adaptive Continual Intelligence (ACI).
ACI is a runtime learning layer where analytical learning, supervised learning, and reinforcement learning work together while the system is already in production.
ACI is not caching, memory, a knowledge graph, routing, or a simple optimization trick.
This technique learns from model decisions, corrections, labels, outcomes, and experience, then serves known decisions locally on CPU. Only new, uncertain, or complex requests are routed back to the underlying model.
ACI can also be pre-trained for specific domains, making continual learning faster and cheaper.
DeepAdapt is rolling ACI out first for cloud-based LLM agents, but the same architecture becomes even more important on personal devices, where compute, battery, latency, and local inference reliability are much tighter constraints.
In their benchmarks, ACI has shown up to 90% lower token consumption, 5.7X lower production-scale cost, 33X faster inference with 159 ms median latency, 96% accuracy vs. 85% without ACI, 85.7% lower energy per 1,000 decisions, and 4.8X fewer rule violations.
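The benchmark figures mix two notations, "P% lower" and "N times lower", which makes them hard to compare at a glance. The short Python sketch below converts between the two. The helper names are illustrative only, and the implied baseline latency rests on the assumption (not stated in the announcement) that the 33X figure compares medians.

```python
def reduction_to_factor(pct_lower: float) -> float:
    """Convert 'P% lower' into an 'N times lower' factor: N = 1 / (1 - P/100)."""
    if not 0 <= pct_lower < 100:
        raise ValueError("pct_lower must be in [0, 100)")
    return 1.0 / (1.0 - pct_lower / 100.0)


def factor_to_reduction(factor: float) -> float:
    """Convert 'N times lower' into 'P% lower': P = 100 * (1 - 1/N)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return 100.0 * (1.0 - 1.0 / factor)


# Figures quoted in the announcement.
cost_pct = factor_to_reduction(5.7)        # 5.7X lower cost, about 82.5% lower
token_factor = reduction_to_factor(90.0)   # 90% fewer tokens, about 10X fewer
energy_factor = reduction_to_factor(85.7)  # 85.7% less energy, about 7X less

# Assumption: "33X faster" compares median latencies, so the implied
# baseline median is 33 * 159 ms. The source does not confirm this.
implied_baseline_ms = 33 * 159

print(f"{cost_pct:.1f}% / {token_factor:.1f}X / {energy_factor:.1f}X / {implied_baseline_ms} ms")
```

Read this way, the 5.7X cost figure and the 82% cost reduction are consistent with each other if they describe the same measurement, and the 85.7% energy reduction corresponds to roughly a 7X factor.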
DeepAdapt's layer intercepts user requests and serves known answers from a standard CPU, bypassing the expensive GPU entirely.
New questions go to the GPU, but the system logs the output and any human corrections to learn for the next time.
This keeps the underlying language model entirely frozen while the outer software layer handles all real-time learning and auditing.
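The control flow described above (intercept, serve known decisions locally, forward everything else, log outputs and corrections) can be sketched in a few lines of Python. This is a toy under stated assumptions, not DeepAdapt's implementation: `RuntimeLayer`, `call_model`, and `min_confirmations` are hypothetical names, and the exact-match table used here is a stand-in for whatever ACI actually learns, which the announcement says is not plain caching or routing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class LearnedDecision:
    answer: str
    confirmations: int = 1  # times this answer was observed or confirmed


@dataclass
class RuntimeLayer:
    """Toy runtime layer in front of a frozen model (all names hypothetical)."""
    call_model: Callable[[str], str]  # stand-in for the expensive GPU-backed call
    min_confirmations: int = 2        # assumed rule for "known with confidence"
    learned: Dict[str, LearnedDecision] = field(default_factory=dict)
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def handle(self, request: str) -> str:
        key = self._normalize(request)
        known = self.learned.get(key)
        if known is not None and known.confirmations >= self.min_confirmations:
            # Known decision: answer locally, never touching the model.
            self.audit_log.append((key, "local", known.answer))
            return known.answer
        # New or uncertain request: forward to the model and learn from it.
        answer = self.call_model(request)
        self._observe(key, answer)
        self.audit_log.append((key, "model", answer))
        return answer

    def correct(self, request: str, corrected: str) -> None:
        """Record a human correction; it overrides what was learned so far."""
        key = self._normalize(request)
        self.learned[key] = LearnedDecision(corrected, self.min_confirmations)
        self.audit_log.append((key, "correction", corrected))

    def _observe(self, key: str, answer: str) -> None:
        known = self.learned.get(key)
        if known is not None and known.answer == answer:
            known.confirmations += 1
        else:
            self.learned[key] = LearnedDecision(answer)

    @staticmethod
    def _normalize(request: str) -> str:
        # Collapse case and whitespace so trivially different requests match.
        return " ".join(request.lower().split())


model_calls: List[str] = []


def fake_model(prompt: str) -> str:
    """Placeholder for the frozen LLM; records each call for the demo."""
    model_calls.append(prompt)
    return "approve"


layer = RuntimeLayer(call_model=fake_model)
layer.handle("Refund order 17?")    # forwarded to the model, first observation
layer.handle("refund  order 17?")   # forwarded again, answer confirmed
layer.handle("Refund order 17?")    # served locally from the learned table
layer.correct("Refund order 17?", "deny")
final_answer = layer.handle("Refund order 17?")  # corrected answer, served locally
```

In this sketch the model itself is never modified; all state lives in `learned` and `audit_log`, which mirrors the claim that the base model stays frozen while the outer layer does the learning and auditing. A real system would need a far richer notion of "known" than exact match on a normalized string.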
ACI requires zero training. No fine-tuning. No retraining pipelines. You wire it into your existing stack and it starts learning from real use on the very first request. Every improvement happens at runtime.