# llama.cpp server adds multi-model management

- Source: Hugging Face: Blog (RSS)
- Published: 2025-12-11 23:47
- AIHOT score: 76
- AIHOT tag: Featured
- AIHOT link: https://aihot.virxact.com/items/cmoegbhak00a4slxx4drrrsd4
- Original link: https://huggingface.co/blog/ggml-org/model-management-in-llamacpp

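## Why it was selected

Locally run models can finally be hot-swapped the way Ollama does it, which makes development and debugging considerably faster.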

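## AI summary

The llama.cpp server has added Ollama-style multi-model management. It uses a multi-process architecture in which each model runs in its own process, so a crash in one model does not affect the others. The server auto-discovers local GGUF model files, loads them on demand, and by default uses LRU eviction to keep at most 4 models loaded at once. Requests are routed to a specific model through the model field of the request, and an API is available to load, unload, and list models. All loaded models inherit the router's shared settings, and a preset file can set parameters per model. The built-in Web UI also supports model switching.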

## Article

New in llama.cpp: Model Management

Team Article. Published December 11, 2025.

By Xuan-Son Nguyen (ngxson, ggml-org) and Victor Mustar (victor, ggml-org)

llama.cpp server now ships with router mode, which lets you dynamically load, unload, and switch between multiple models without restarting.

Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.

This feature was a popular request to bring Ollama-style model management to llama.cpp. It uses a multi-process architecture where each model runs in its own process, so if one model crashes, others remain unaffected.

### Quick Start

Start the server in router mode by not specifying a model:

```sh
llama-server
```

This auto-discovers models from your llama.cpp cache (LLAMA_CACHE or ~/.cache/llama.cpp). If you've previously downloaded models via llama-server -hf user/model, they'll be available automatically.

You can also point to a local directory of GGUF files:

```sh
llama-server --models-dir ./my-models
```
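To check which files are on disk before starting the router, you can list the GGUF files yourself. The sketch below is only an approximation in Python: it resolves the cache directory as described above (LLAMA_CACHE, falling back to ~/.cache/llama.cpp) and walks it recursively. The helper names `cache_dir` and `find_gguf` are hypothetical, and the server's own discovery rules may differ, for example in how it treats subfolders.

```python
import os
from pathlib import Path


def cache_dir() -> Path:
    """Return LLAMA_CACHE if it is set, otherwise ~/.cache/llama.cpp."""
    override = os.environ.get("LLAMA_CACHE")
    return Path(override) if override else Path.home() / ".cache" / "llama.cpp"


def find_gguf(root) -> list:
    """Recursively collect *.gguf files under root, sorted by path.

    Assumption: a plain recursive scan. llama-server's actual scan may be
    stricter or may name models differently.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.gguf") if p.is_file())
```

Calling `find_gguf(cache_dir())` lists the cached files, and `find_gguf("./my-models")` does the same for a custom directory.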

### Features

- Auto-discovery: Scans your llama.cpp cache (default) or a custom --models-dir folder for GGUF files
- On-demand loading: Models load automatically when first requested
- LRU eviction: When you hit --models-max (default: 4), the least-recently-used model unloads
- Request routing: The model field in your request determines which model handles it
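The LRU eviction rule can be illustrated with a toy model. This is not llama.cpp's implementation: the class `LruModelPool` and its methods are hypothetical, and the real loading and unloading of model processes is replaced by bookkeeping in an `OrderedDict`. The only value taken from the text is the default limit of 4.

```python
from collections import OrderedDict


class LruModelPool:
    """Toy bookkeeping for on-demand loading with LRU eviction."""

    def __init__(self, models_max: int = 4):
        self.models_max = models_max
        # Keys are model names, ordered from least to most recently used.
        self._loaded = OrderedDict()

    def request(self, name: str) -> list:
        """Record a request for `name`; return the names evicted to make room."""
        evicted = []
        if name in self._loaded:
            # Already loaded: just mark it as most recently used.
            self._loaded.move_to_end(name)
            return evicted
        # Not loaded: evict least-recently-used models until there is room.
        while len(self._loaded) >= self.models_max:
            oldest, _ = self._loaded.popitem(last=False)
            evicted.append(oldest)
        self._loaded[name] = object()  # placeholder for a model process
        return evicted

    def loaded(self) -> list:
        """Names of loaded models, least recently used first."""
        return list(self._loaded)
```

With `models_max=2`, requesting `a`, `b`, `a`, then `c` evicts `b`, because `a` was touched more recently.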

### Examples

#### Chat with a specific model

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

On the first request, the server automatically loads the model into memory (loading time depends on model size). Subsequent requests to the same model skip the loading step, since the model is already in memory.
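The same request can be sent from a script. The following is a minimal sketch using only Python's standard library. It assumes the server listens on localhost:8080 as in the curl example, and that the response follows the OpenAI chat-completions shape (`choices[0].message.content`). The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed, taken from the curl example


def build_chat_request(model: str, user_text: str, base_url: str = BASE_URL):
    """Build (but do not send) a chat-completions request for one model."""
    payload = {
        "model": model,  # the router uses this field to pick the model
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_text: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text.

    The generous timeout allows for the first request, which may have to
    load the model before answering.
    """
    request = build_chat_request(model, user_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("ggml-org/gemma-3-4b-it-GGUF:Q4_K_M", "Hello!")` mirrors the curl call.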

#### List available models

```sh
curl http://localhost:8080/models
```

Returns all discovered models with their status (loaded, loading, or unloaded).

#### Manually load a model

```sh
curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'
```

#### Unload a model to free VRAM

```sh
curl -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'
```
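The three management endpoints can be wrapped in a few helper functions. This is a sketch under stated assumptions: the helper names are hypothetical, the base URL is the one used in the curl examples, and the response bodies are assumed to be JSON. The post does not show their exact fields, so the helpers return the parsed body without interpreting it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed, taken from the curl examples


def build_request(path: str, payload=None, base_url: str = BASE_URL):
    """Build a GET request, or a JSON POST request when a payload is given."""
    if payload is None:
        return urllib.request.Request(f"{base_url}{path}", method="GET")
    return urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload=None, timeout: float = 600.0):
    """Send the request; return parsed JSON, or None for an empty body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else None


def list_models():
    return call("/models")


def load_model(name: str):
    return call("/models/load", {"model": name})


def unload_model(name: str):
    return call("/models/unload", {"model": name})
```

Explicit `load_model` calls become necessary when the server is started with --no-models-autoload.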

### Key Options

| Flag | Description |
| --- | --- |
| --models-dir PATH | Directory containing your GGUF files |
| --models-max N | Max models loaded simultaneously (default: 4) |
| --no-models-autoload | Disable auto-loading; require explicit /models/load calls |

All model instances inherit settings from the router:

```sh
llama-server --models-dir ./models -c 8192 -ngl 99
```

With this command, every loaded model uses an 8192-token context (-c 8192) and full GPU offload (-ngl 99). You can also define per-model settings using presets:

```sh
llama-server --models-preset config.ini
```

Example config.ini:

```ini
[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temp = 0.7
```
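Before passing a preset file to --models-preset, it can help to check how its sections and keys read. The sketch below parses the example above with Python's `configparser`. This is only an approximation: llama.cpp uses its own parser, which may treat some syntax differently, and the function name `read_presets` is hypothetical.

```python
import configparser

# The preset example from the text, inlined so the sketch is self-contained.
PRESET_TEXT = """
[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temp = 0.7
"""


def read_presets(text: str) -> dict:
    """Return {section name: {key: value}} with all values kept as strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


if __name__ == "__main__":
    for name, settings in read_presets(PRESET_TEXT).items():
        print(name, settings)
```

Each section name is the model name a request would use, and each key matches a long command-line option without the leading dashes, as in the example.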

### Also available in the Web UI

The built-in web UI also supports model switching. Just select a model from the dropdown and it loads automatically.

### Join the Conversation

We hope this feature makes it easier to A/B test different model versions, run multi-tenant deployments, or simply switch models during development without restarting the server.

Have questions or feedback? Drop a comment below or open an issue on GitHub.

### Community

bukit

Dec 11, 2025

Mmproj support?

sbeltz

Dec 12, 2025 (in reply to bukit)

Supported via presets.ini, where you can specify the mmproj (and other long and short arguments) per model.

sbeltz

Dec 12, 2025

Awesome new feature! Can model selection be done on something other than requested model name? Like maybe specify the ranking in presets.ini, and then the highest ranked model that can satisfy the request will be the default. So maybe one model is best for short context, another (or the same with other settings) for when the context gets too long, and another when image input is required.

xbruce22

Dec 12, 2025

This is good addition, Thank you.

etemiz

Dec 12, 2025 (edited Dec 12, 2025)

what is the best way to get <think> </think> and the tokens in between? openAI library is removing them.. i want to run llama-server in console and talk to it using a python library that does not remove the thinking tokens.

i checked the llama-cpp-python but it does not have that.

xbruce22

Dec 16, 2025 (in reply to etemiz)

By default, llama-server in most setups keeps the reasoning content in the reasoning_content field of the response. You can read it from there. Otherwise, use the reasoning-format flag and pass the DeepSeek value to get the pure tokens.
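A minimal sketch of the suggestion above, in Python: given a parsed chat-completions response, read the reasoning text and the final answer separately. It assumes the reasoning sits in `reasoning_content` on the message object, as the reply describes. The function name and the sample response are made up for illustration.

```python
def split_reasoning(response: dict):
    """Return (reasoning, answer) from a parsed chat-completions response.

    Assumption: the server puts the thinking text in
    choices[0].message.reasoning_content. The value is None when absent.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


# Hypothetical response, shaped like an OpenAI-compatible reply.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The user greeted me.",
                "content": "Hello!",
            }
        }
    ]
}

reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Working on the raw JSON this way avoids relying on a client library to expose the extra field.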

razvanab

Dec 13, 2025

Now I can use llama.cpp all the time. A big thank you to the devs.

sbeltz

Dec 13, 2025

Is there currently a way to have a "default" model if the request doesn't specify? Could be the currently loaded model or a specific model. (Just noticed one of my apps broke because it's used to llama-server not requiring a model name.)

milksteak1111

Jan 14 (in reply to sbeltz)

This seems to work:

```ini
[DEFAULT]
port = 8080
n-gpu-layers = -1
device = 0
flash-attn = on
chat-template = jinja
models-max = 4
```

eribob

Dec 14, 2025

Does it unload the current model if VRAM is full, to allow swapping to a new model?

21world

Dec 15, 2025

fun ideas , add personal avatar and p2p social network also emule p2p models storage


JLouisBiz

Dec 26, 2025

Hey there! Just wanted to drop a quick note saying I'm really digging the new router mode in llama.cpp server. It's a game-changer for me, especially when I need to switch between different models. The auto-discovery of models and LRU eviction is pretty neat – no more manual updates or restarts needed. It's like having a dynamic model manager on-the-fly. And the request routing part? Brilliant! Makes my workflow with dmenu smoother. Check out the full experience and check out my dmenu launcher script on the project's GitHub: https://gitea.com/gnusupport/LLM-Helpers/src/branch/main/bin/rcd-llm-dmenu-launcher.sh

It's a win for sure.

melvindave

Jan 3

thanks for the update! does it now behave like ollama?

MagicMorgan

Mar 8

Thank you so much for this, it's great!

akeni23

Mar 11

I want to specifically pin models to a specific GPU (I have multiple) is that possible?

