Rohan Paul @rohanpaul_ai · X

2026-05-31 04:31 · 33 days ago

AI Summary

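In a local test on a MacBook Pro M5 Max with 64GB, Liquid's LFM2.5-8B-A1B clearly outperformed OpenAI's gpt-oss-20b on a trip-planning task that required 7 tool calls. LFM2.5-8B-A1B used only 4.8GB of memory and completed all 7 of 7 tool calls at 266 tok/s, finishing in 6.9 seconds. By comparison, gpt-oss-20b consumed 11GB of memory, completed only 3 of 7 tool calls, and ran at 146 tok/s, taking 15 seconds. This suggests that an MoE model with a smaller active parameter count (1B) can, with more targeted training, beat a model roughly 2.5 times its total size (20B versus 8B) on tool calling, an agentic task.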

atomic.chat (a desktop app that runs LLMs locally) ran a very revealing comparison for local AI agents, on a MacBook Pro M5 Max, 64GB.

Liquid's much smaller LFM2.5-8B-A1B beat gpt-oss-20b by finishing every required tool call, cutting runtime by more than half, and using 4.8GB RAM instead of 11GB.

The task was not normal chat, because the model had to plan a trip by calling outside tools for 3 weather checks, 2 currency conversions, 1 email, and 1 reminder.
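
To make the "7 required calls" concrete, here is a minimal sketch of how such a run could be scored. The tool names (`get_weather`, `convert_currency`, `send_email`, `set_reminder`) are assumptions for illustration; the post does not say what the tools were actually called or how atomic.chat counted them.

```python
from collections import Counter

# Required calls as described in the post; the tool names are hypothetical.
REQUIRED = Counter({
    "get_weather": 3,
    "convert_currency": 2,
    "send_email": 1,
    "set_reminder": 1,
})


def score_tool_calls(calls: list[str]) -> tuple[int, int]:
    """Return (completed, required), counting each tool only up to its quota."""
    made = Counter(calls)
    completed = sum(min(made[name], quota) for name, quota in REQUIRED.items())
    return completed, sum(REQUIRED.values())


def missing_calls(calls: list[str]) -> dict[str, int]:
    """Return how many calls of each required tool are still outstanding."""
    # Counter subtraction keeps only positive counts.
    return dict(REQUIRED - Counter(calls))


full_run = ["get_weather"] * 3 + ["convert_currency"] * 2 + ["send_email", "set_reminder"]
partial_run = ["get_weather"] * 3  # a made-up partial run, not the one from the test

print(score_tool_calls(full_run))
print(score_tool_calls(partial_run), missing_calls(partial_run))
```

Capping each tool at its quota means that repeating one call seven times does not count as finishing the task.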

The striking part is that LFM2.5-8B-A1B is much smaller in active compute, yet it hit every required call at 266 tok/s, while gpt-oss-20b used 11GB RAM, made only 3/7 calls, and ran at 146 tok/s.
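
A quick check of the arithmetic behind those claims, using only the figures reported in this post and its summary (6.9 s versus 15 s, 4.8GB versus 11GB, 266 versus 146 tok/s). The variable names are illustrative.

```python
# Figures as reported in the post (seconds, GB of RAM, tokens per second).
lfm = {"seconds": 6.9, "ram_gb": 4.8, "tok_per_s": 266}
gpt_oss = {"seconds": 15.0, "ram_gb": 11.0, "tok_per_s": 146}

runtime_cut = 1 - lfm["seconds"] / gpt_oss["seconds"]
ram_ratio = gpt_oss["ram_gb"] / lfm["ram_gb"]
speed_ratio = lfm["tok_per_s"] / gpt_oss["tok_per_s"]

print(f"runtime reduction: {runtime_cut:.0%}")    # 54%
print(f"RAM ratio: {ram_ratio:.2f}x")             # 2.29x
print(f"throughput ratio: {speed_ratio:.2f}x")    # 1.82x
```

The 54% reduction is what "cutting runtime by more than half" refers to.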

Now, tool calling is a control problem before it is a language problem.

The model has to preserve a checklist across context, decide when language should stop and action should begin, and resist the temptation to answer as if partial completion were enough.
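
One way to picture that control problem is a harness-side loop that refuses a final answer while required calls are still outstanding. This is only a sketch under stated assumptions: `run_with_checklist`, the `("call", name)` / `("answer", text)` step format, and both stub models are hypothetical, and nothing here describes how either model or atomic.chat works internally.

```python
from collections import Counter
from typing import Callable, Optional

# A model step returns either ("call", tool_name) or ("answer", text).
Step = tuple[str, str]


def run_with_checklist(
    model_step: Callable[[Counter], Step],
    required: Counter,
    max_steps: int = 20,
) -> tuple[Optional[str], Counter]:
    """Drive a model until it answers with no required calls outstanding."""
    remaining = Counter(required)
    for _ in range(max_steps):
        kind, value = model_step(remaining)
        if kind == "call":
            remaining = remaining - Counter([value])  # zero counts are dropped
        elif not remaining:
            return value, remaining  # answer accepted: checklist is empty
        # Otherwise the answer came too early; loop so the model sees what is left.
    return None, remaining  # gave up with calls outstanding


def eager_stub(remaining: Counter) -> Step:
    """Hypothetical stand-in for a model that works through the checklist."""
    if remaining:
        return ("call", next(iter(remaining)))
    return ("answer", "Trip planned.")


def lazy_stub(remaining: Counter) -> Step:
    """Hypothetical stand-in for a model that answers without acting."""
    return ("answer", "Trip planned.")


required = Counter({"get_weather": 3, "convert_currency": 2,
                    "send_email": 1, "set_reminder": 1})
print(run_with_checklist(eager_stub, required))
print(run_with_checklist(lazy_stub, required))
```

In this sketch the guard lives outside the model. The post's point is that a well-trained model keeps this discipline on its own, without an external loop forcing it.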

A smaller mixture-of-experts model with only a fraction of its parameters active can win if its training shaped those control habits more sharply than a larger model's general fluency did.

atomic.chat: Liquid's LFM2.5-8B-A1B smashed OpenAI's gpt-oss-20b on tool calling. We ran both locally on a MacBook Pro M5 Max, 64GB, and gave each the same trip-planning requ...

