# OME: Revolutionizing LLM Infrastructure with Model-Driven Architecture

- Source: LMSYS Blog (Chatbot Arena team)
- Published: 2025-07-08 00:00
- AIHOT link: https://aihot.virxact.com/items/cmnxbjke6007nsln0wbn9aid6
- Original link: https://www.lmsys.org/blog/2025-07-08-ome

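## AI Summary

Oracle Cloud Infrastructure has released OME (Open Model Engine), a Kubernetes-native model serving framework. The system adopts a model-driven architecture that treats models as first-class citizens through custom resources such as BaseModel and ServingRuntime, bridging the gap between ML engineers and production teams. OME cuts model onboarding time from months to days and significantly reduces configuration errors. It natively supports enterprise features including multi-node inference, prefill-decode disaggregation, serverless autoscaling, and Multi-LoRA, and it integrates the SGLang runtime so that complex deployment strategies can be encoded, reused, and deployed in one step.

## Body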

The Tale of Two Teams: Why Model Serving Is Broken

In any large organization deploying LLMs, two distinct teams emerge with conflicting needs:

The ML Engineers spend months benchmarking models, experimenting with serving technologies, and crafting optimal deployment strategies. Each model demands different configurations—tensor parallelism for Llama 70B, expert parallelism for DeepSeek V3/R1, specialized settings for multimodal models. The parameters are endless: batch sizes, KV cache configurations, quantization levels. Worse, these configurations shift dramatically across GPU types (H100 vs A100 vs L40S).

The Production Engineers and Data Scientists just want to deploy models. They shouldn’t need to understand the intricacies of tensor parallelism or why a particular model needs 4 GPUs with NVLink. They have customers waiting, applications to build, and business value to deliver.

This gap creates a fundamental problem: MLEs need a way to encode their hard-won serving knowledge into reusable blueprints. Production teams need to deploy models without becoming distributed systems experts. The missing link? A system that understands models as first-class citizens.

The Birth of OME

The Oracle Cloud Infrastructure (OCI) GenAI team faced this exact challenge at scale. Supporting numerous models across diverse GPU hardware, they watched deployment cycles stretch into months. Each new model meant:

Weeks of MLE experimentation to find optimal configurations

Complex documentation that production teams struggled to follow

Deployment failures due to misconfiguration

Inability to reuse knowledge across similar models

The breakthrough came from a simple insight: The model itself should drive the deployment.

A Llama model isn’t just a file—it contains metadata about its architecture, parameter count, and requirements. By making the system model-aware rather than deployment-driven, they could bridge the gap between ML expertise and production simplicity.

This led to OME (Open Model Engine): a Kubernetes operator that treats models as first-class resources. The results were dramatic:

Model onboarding time: Months → Days

Configuration errors: Dramatically reduced

MLE knowledge: Captured and reused automatically

Production deployment: Simple YAML with just a few lines

But here’s what makes it revolutionary: the model-driven architecture makes it easy to encode and reuse sophisticated deployment strategies:

Multi-node serving: Deploy massive models like DeepSeek V3 (685B) across multiple nodes with a simple configuration

Prefill-decode disaggregation: Separate compute-intensive prefill from memory-bound decode, with each component scaling independently

Flexible architectures: Both prefill and decode can run in single-node or multi-node configurations based on your needs

Serverless deployment: Scale-to-zero for cost efficiency when models aren’t in use

Business-driven scaling: Complex autoscaling based on KV cache, tokens/second, latency targets, or any custom metric
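As a rough illustration of metric-driven scaling, the sketch below applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler, `desired = ceil(current * observed / target)`, to a tokens-per-second metric. The function name, the metric, and the replica bounds are assumptions made for this example; they are not OME's API, and real deployments would express this declaratively.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the proportional scaling rule documented for the
// Kubernetes Horizontal Pod Autoscaler:
//
//	desired = ceil(current * observed / target)
//
// and clamps the result to [minReplicas, maxReplicas].
// All names are hypothetical and exist only for this sketch.
func desiredReplicas(current int, observed, target float64, minReplicas, maxReplicas int) int {
	if target <= 0 || current <= 0 {
		return minReplicas
	}
	d := int(math.Ceil(float64(current) * observed / target))
	if d < minReplicas {
		d = minReplicas
	}
	if d > maxReplicas {
		d = maxReplicas
	}
	return d
}

func main() {
	// 4 replicas, each observing 1500 tokens/s against a 1000 tokens/s target.
	fmt.Println(desiredReplicas(4, 1500, 1000, 2, 10)) // prints 6
}
```

The same shape works for any per-replica metric (KV cache utilization, latency) as long as a target value is defined.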

The model-driven approach doesn’t constrain you—it liberates you. Because OME understands models deeply, it can support any deployment pattern your MLEs design while keeping the interface simple for production teams.

Enter OME: A Kubernetes-native platform where models become first-class citizens. Let’s explore how OME’s architecture transforms the chaos of LLM deployment into an elegant, scalable system that serves everyone from ML researchers to production engineers.

The OME Architecture: Models at the Center

Layer 1: Kubernetes API Layer

While users—MLEs, data scientists, production engineers, and applications—interact with OME through simple interfaces, the real magic happens in the Kubernetes API layer below.

Custom Resources - The Foundation of Model-Driven Architecture

At the heart of OME lies its Custom Resource Definitions (CRDs), which transform Kubernetes from a generic container orchestrator into an ML platform. These aren’t just configuration files—they’re the language through which you express your ML requirements.

BaseModel/ClusterBaseModel: Models as First-Class Citizens

What is a Base Model?

A Base Model in OME is a Kubernetes resource that represents a foundation AI model (like GPT, Llama, or Mistral) that you want to use for inference workloads. Think of it as a blueprint that tells OME where to find your model, how to download it, and where to store it on your cluster nodes.

When you create a BaseModel resource, OME automatically handles the complex process of downloading the model files, parsing the model’s configuration to understand its capabilities, and making it available across your cluster nodes where AI workloads can use it.

BaseModel vs ClusterBaseModel

OME provides two types of model resources:

BaseModel is namespace-scoped, meaning it exists within a specific Kubernetes namespace. If you create a BaseModel in the “team-a” namespace, only workloads in that namespace can use it. This is perfect for team-specific models or when you want to isolate model access.

ClusterBaseModel is cluster-scoped, meaning it’s available to workloads in any namespace across your entire cluster. This is ideal for organization-wide models that multiple teams need to access, like a shared Llama-3 model that everyone uses.

Both types use exactly the same specification format—the only difference is their visibility scope.
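The scoping rule can be stated in a few lines of Go. This is a minimal sketch: the `ModelRef` type and `visibleTo` function are hypothetical, and an empty namespace is used here merely as a stand-in for a cluster-scoped resource.

```go
package main

import "fmt"

// ModelRef is a hypothetical, simplified view of a model resource.
// An empty Namespace stands for a cluster-scoped ClusterBaseModel.
type ModelRef struct {
	Name      string
	Namespace string
}

// visibleTo reports whether a workload in workloadNamespace may reference m.
func visibleTo(m ModelRef, workloadNamespace string) bool {
	if m.Namespace == "" {
		return true // ClusterBaseModel: usable from every namespace
	}
	return m.Namespace == workloadNamespace // BaseModel: same namespace only
}

func main() {
	teamModel := ModelRef{Name: "team-a-finetune", Namespace: "team-a"}
	shared := ModelRef{Name: "llama-3-70b-instruct"}
	fmt.Println(visibleTo(teamModel, "team-a")) // true
	fmt.Println(visibleTo(teamModel, "team-b")) // false
	fmt.Println(visibleTo(shared, "team-b"))    // true
}
```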

Traditional platforms treat models as static files to be downloaded and mounted. OME revolutionizes this by making models intelligent, versioned resources that understand their own requirements:

```yaml
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-70b-instruct
spec:
  vendor: meta
  modelType: llama
  modelArchitecture: LlamaForCausalLM
  modelParameterSize: "70B"
  quantization: fp16
  storage:
    storageUri: "hf://meta-llama/Llama-3.3-70B-Instruct"
    path: "/models/llama-3.3-70b"
    nodeSelector:
      gpu.memory: "80Gi" # Only download to nodes with sufficient GPU memory
```

When you create a BaseModel resource, OME’s control plane and data plane components work together to make the model available across your cluster. The BaseModel CRD acts as the declarative specification, while the actual work of downloading, parsing, and distributing models happens in the data plane through the Model Agent.

ServingRuntime: The Brain of Runtime Selection

ClusterServingRuntime is a cluster-scoped resource that manages the runtime environment for model serving. A ClusterServingRuntime defines the templates for Pods that can serve one or more particular models. Each ClusterServingRuntime defines key information such as the container image of the runtime and a list of the models that the runtime supports. Other configuration settings for the runtime can be conveyed through environment variables in the container specification.

These CRDs allow for improved flexibility and extensibility, enabling users to quickly define or customize reusable runtimes without having to modify any controller code or any resources in the controller namespace. The only difference between ServingRuntime and ClusterServingRuntime is that one is namespace-scoped and the other is cluster-scoped.

```yaml
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: sglang-llama-70b
spec:
  supportedModelFormats:
    - modelFormat:
        name: safetensors
      modelArchitecture: LlamaForCausalLM
      modelSizeRange:
        min: "65B"
        max: "75B"
      autoSelect: true
      priority: 100
```

Full runtime specifications for advanced deployments can be found in the OME repository:

Llama 4 Maverick PD Runtime - Prefill-decode disaggregated configuration

DeepSeek RDMA PD Runtime - Multi-node expert parallel serving with RDMA

ServingRuntimes define how to serve different model types, with the actual runtime selection logic handled by the control plane when you create an InferenceService.

Advanced Deployment Architectures

ServingRuntimes serve as blueprints for how router, engine, and decoder components are deployed. Each component (except router) can be configured for single-node, serverless, or multi-node deployment. This flexibility enables cutting-edge serving patterns:

PD-Disaggregated Serving: The state-of-the-art for high-performance LLM serving at scale

This isn’t just incremental improvement—it’s a fundamental advancement in serving architecture that OME makes accessible through simple runtime configuration.

InferenceService: Orchestrating Model Deployments and Ingress

An InferenceService is the central Kubernetes resource in OME that orchestrates the complete lifecycle of model serving. It acts as a declarative specification that describes how you want your AI models deployed, scaled, and served across your cluster.

Think of InferenceService as the “deployment blueprint” for your AI workloads. It brings together models (defined by BaseModel/ClusterBaseModel), runtimes (defined by ServingRuntime/ClusterServingRuntime), and infrastructure configuration to create a complete serving solution. In practice, it ties models and runtimes to standard Kubernetes services, ingress, scheduling, autoscaling, and permission controls, so that together they form a complete serving fleet.

Architecture Overview

OME uses a component-based architecture where InferenceService can be composed of multiple specialized components:

Model: References the AI model to serve (BaseModel/ClusterBaseModel)

Runtime: References the serving runtime environment (ServingRuntime/ClusterServingRuntime)

Engine: Main inference component that processes requests, typically an OpenAI-compatible server handling request processing, tool parsing, and model backend operations

Decoder: Optional component for disaggregated serving (prefill-decode separation)

Router: A standalone high-performance component that enables data parallelism across inference instances, supporting advanced load balancing algorithms (cache-aware, power of two, random, round robin) and acting as a specialized load balancer for prefill-decode disaggregated serving architectures

```yaml
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: production-chat-service
spec:
  model:
    name: llama-3-70b-instruct
  engine:
    minReplicas: 2
    maxReplicas: 10
  decoder: # Only created for disaggregated deployments
    minReplicas: 4
    maxReplicas: 20
  router: # Optional optimal serving routing layer
    minReplicas: 2
```
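Among the router's load-balancing algorithms listed above, "power of two" is the easiest to show in isolation: sample two distinct workers at random and send the request to the less loaded one. The sketch below is a generic illustration of that technique, not the SGLang router's implementation; the function name and the use of in-flight request counts as the load signal are assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickPowerOfTwo samples two distinct workers and returns the index of the
// one with fewer in-flight requests. intn must return a value in [0, n),
// for example rand.Intn. It returns -1 when there are no workers.
func pickPowerOfTwo(inflight []int, intn func(n int) int) int {
	n := len(inflight)
	switch n {
	case 0:
		return -1
	case 1:
		return 0
	}
	i := intn(n)
	j := intn(n - 1)
	if j >= i {
		j++ // shift so that j ranges over the indices other than i
	}
	if inflight[j] < inflight[i] {
		return j
	}
	return i
}

func main() {
	// Hypothetical in-flight request counts for four engine pods.
	inflight := []int{7, 2, 9, 4}
	for r := 0; r < 5; r++ {
		w := pickPowerOfTwo(inflight, rand.Intn)
		inflight[w]++
		fmt.Println("routed to worker", w, "loads now", inflight)
	}
}
```

Because the most loaded worker can never win a comparison against a less loaded one, hot spots are avoided without the router having to scan every worker on each request.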

This component architecture enables sophisticated optimizations impossible with monolithic deployments:

Independent Scaling: Scale compute-heavy prefill separately from memory-bound decode

Resource Optimization: Routers don’t need GPUs, saving precious accelerator resources

Failure Isolation: Component failures don’t bring down the entire service

Performance Tuning: Each component can be optimized for its specific workload

BenchmarkJob: Performance Testing as a First-Class Operation

OME is the only platform that treats performance testing as a core primitive:

```yaml
apiVersion: ome.io/v1beta1
kind: BenchmarkJob
metadata:
  name: llama-70b-production-benchmark
spec:
  # Target service to benchmark
  endpoint:
    inferenceService:
      name: llama-chat-optimized
  outputLocation:
    storageUri: "oci://n/benchmark-results/b/prod/o/llama-70b-bench"
```

This isn’t just about running load tests. BenchmarkJob provides:

genai-bench integration: Industry-standard benchmarking tool

Realistic traffic patterns: Normal distributions, fixed patterns, long-context scenarios

Comprehensive metrics: Tokens/second, TTFT, latency percentiles

Multi-cloud storage: Results stored for historical analysis

Service metadata tracking: GPU types, engine versions for fair comparisons

Admission Webhooks: Validation and Mutation

OME’s admission webhooks act as gatekeepers in the API layer:

Validating Webhooks ensure model-runtime compatibility before resources are created, preventing runtime failures

Mutating Webhooks inject optimal configurations based on model characteristics

Pod Mutating Webhooks handle complex scenarios like:

- RDMA configuration for multi-node deployments
- GPU affinity rules for optimal memory bandwidth
- Security contexts for model encryption

Layer 2: Control Plane - The Orchestrator

The control plane is where OME’s core decision-making lives. This isn’t just CRUD operations on Kubernetes resources. It is a system that makes deployment decisions based on model characteristics, hardware availability, and business requirements.

OME Controller Manager: The Orchestration Brain

The controller manager coordinates all OME operations with a reconciliation loop that’s aware of ML-specific concerns.

Runtime Selection Algorithm

When you deploy a model through an InferenceService, the controller:

Matches model characteristics against all available ServingRuntimes

Scores each runtime based on compatibility and optimization potential

Uses Model Size Range matching—when multiple runtimes support a model, OME selects the one with the closest size range for optimal performance

Handles edge cases like quantized models or models requiring specific GPU features
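The selection steps above can be sketched as a filter followed by a sort. This is a simplified illustration under stated assumptions: the `Runtime` struct is a hypothetical flattened view of a `supportedModelFormats` entry, "closest size range" is interpreted as the narrowest matching range, ties are broken by higher `priority`, and `sglang-llama-generic` is an invented runtime name. OME's actual scoring may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Runtime is a hypothetical, flattened view of one supportedModelFormats entry.
type Runtime struct {
	Name         string
	Architecture string
	MinSize      string // e.g. "65B"
	MaxSize      string // e.g. "75B"
	AutoSelect   bool
	Priority     int
}

// parseBillions converts a size such as "70B" into 70.0.
func parseBillions(s string) (float64, error) {
	s = strings.ToUpper(strings.TrimSpace(s))
	if !strings.HasSuffix(s, "B") {
		return 0, fmt.Errorf("unsupported size %q", s)
	}
	return strconv.ParseFloat(strings.TrimSuffix(s, "B"), 64)
}

// selectRuntime keeps runtimes whose architecture and size range match, then
// prefers the narrowest range and, on ties, the highest priority.
func selectRuntime(arch, size string, runtimes []Runtime) (Runtime, error) {
	want, err := parseBillions(size)
	if err != nil {
		return Runtime{}, err
	}
	type candidate struct {
		rt    Runtime
		width float64
	}
	var cands []candidate
	for _, rt := range runtimes {
		lo, errLo := parseBillions(rt.MinSize)
		hi, errHi := parseBillions(rt.MaxSize)
		if errLo != nil || errHi != nil || !rt.AutoSelect || rt.Architecture != arch {
			continue
		}
		if want >= lo && want <= hi {
			cands = append(cands, candidate{rt, hi - lo})
		}
	}
	if len(cands) == 0 {
		return Runtime{}, fmt.Errorf("no runtime supports %s at %s", arch, size)
	}
	sort.SliceStable(cands, func(i, j int) bool {
		if cands[i].width != cands[j].width {
			return cands[i].width < cands[j].width
		}
		return cands[i].rt.Priority > cands[j].rt.Priority
	})
	return cands[0].rt, nil
}

func main() {
	runtimes := []Runtime{
		{"sglang-llama-generic", "LlamaForCausalLM", "1B", "500B", true, 50},
		{"sglang-llama-70b", "LlamaForCausalLM", "65B", "75B", true, 100},
	}
	rt, err := selectRuntime("LlamaForCausalLM", "70B", runtimes)
	fmt.Println(rt.Name, err) // prints: sglang-llama-70b <nil>
}
```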

Component Orchestration

The InferenceService controller orchestrates multiple components based on your deployment requirements:

```go
// Simplified reconciliation logic showing component-based orchestration.
// Assumes ctrl is "sigs.k8s.io/controller-runtime", client is
// "sigs.k8s.io/controller-runtime/pkg/client", and that Reconciler embeds
// client.Client, as is conventional with controller-runtime.
// Helper methods and error handling for the reconcile* calls are elided.
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch InferenceService and determine deployment mode
	isvc := &omev1beta1.InferenceService{}
	if err := r.Get(ctx, req.NamespacedName, isvc); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	deploymentMode := r.inferDeploymentMode(isvc)

	// 2. Select optimal runtime if not specified
	if isvc.Spec.Runtime == nil {
		isvc.Spec.Runtime = r.selectOptimalRuntime(isvc.Spec.Model, deploymentMode)
	}

	// 3. Reconcile components based on deployment mode
	switch deploymentMode {
	case PDDisaggregated:
		// Deploy separate engine (prefill) and decoder components
		r.reconcileRouter(isvc)  // Cache-aware routing
		r.reconcileEngine(isvc)  // Prefill processing
		r.reconcileDecoder(isvc) // Token generation
	case MultiNode:
		// Deploy using LeaderWorkerSet for distributed serving
		r.reconcileMultiNodeComponents(isvc)
	default:
		// Standard single-component deployment
		r.reconcileEngine(isvc) // Handles both prefill and decode
		if isvc.Spec.Router != nil {
			r.reconcileRouter(isvc) // Optional routing layer
		}
	}
	return ctrl.Result{}, nil
}
```

Deployment Mode Decision Logic

The controller automatically determines the optimal deployment pattern:

RawDeployment: Single engine for models that fit on one node

PDDisaggregated: Separate prefill/decode for high-throughput scenarios

MultiNode: Distributed serving for massive models (e.g., DeepSeek V3 685B)

Serverless: Scale-to-zero for cost optimization (via Knative integration)
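One way to picture the decision behind `inferDeploymentMode` is a small precedence check over the service spec. The sketch below is an assumption-heavy illustration: the `Spec` fields and the order of the checks are invented for this example and are not taken from OME's source.

```go
package main

import "fmt"

// DeploymentMode names follow the list above.
type DeploymentMode string

const (
	RawDeployment   DeploymentMode = "RawDeployment"
	PDDisaggregated DeploymentMode = "PDDisaggregated"
	MultiNode       DeploymentMode = "MultiNode"
	Serverless      DeploymentMode = "Serverless"
)

// Spec is a hypothetical summary of the InferenceService fields that matter here.
type Spec struct {
	HasDecoder      bool // a decoder component is declared
	NodesPerReplica int  // nodes needed to hold one engine replica
	MinReplicas     int  // 0 means the service may scale to zero
}

// inferDeploymentMode applies an assumed precedence: a declared decoder wins,
// then multi-node sizing, then scale-to-zero, otherwise a plain deployment.
func inferDeploymentMode(s Spec) DeploymentMode {
	switch {
	case s.HasDecoder:
		return PDDisaggregated
	case s.NodesPerReplica > 1:
		return MultiNode
	case s.MinReplicas == 0:
		return Serverless
	default:
		return RawDeployment
	}
}

func main() {
	fmt.Println(inferDeploymentMode(Spec{HasDecoder: true, NodesPerReplica: 1, MinReplicas: 2}))
	fmt.Println(inferDeploymentMode(Spec{NodesPerReplica: 2, MinReplicas: 1}))
	fmt.Println(inferDeploymentMode(Spec{NodesPerReplica: 1, MinReplicas: 0}))
	fmt.Println(inferDeploymentMode(Spec{NodesPerReplica: 1, MinReplicas: 2}))
}
```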

Layer 3: Data Plane - Where Models Come to Life

The data plane is where OME’s architectural decisions deliver real value. This layer handles the actual model serving with sophisticated optimizations.

Model Agent: Model Distribution

The Model Agent is OME’s data plane component responsible for making models available across your cluster. When you create a BaseModel resource, the Model Agent springs into action:

What Makes Model Distribution Powerful:

Automatic Model Parsing: Downloads and parses the model’s `config.json` and safetensors file headers, extracting architecture details, parameter counts, supported features, and optimal serving configurations. No more manual specification of model characteristics.

Multi-Cloud Storage Abstraction: The `hf://` prefix in your BaseModel isn’t just syntactic sugar. OME supports multiple storage backends with a unified interface. Switch from HuggingFace to OCI Object Storage by changing one line, with no code modifications needed.

Node-Aware Distribution: Models aren’t blindly copied everywhere. The Model Agent runs as a DaemonSet, honoring node selectors and affinity rules, only downloading models to nodes that match your specifications. This saves precious NVMe space and reduces download times.

Lifecycle Management: Models are tracked, versioned, and health-checked. If a node goes down, OME ensures model availability on other nodes. When you delete a BaseModel, cleanup happens automatically across all nodes.

The Scout-Gopher Architecture

OME’s Model Agent employs a sophisticated producer-consumer pattern:

1. Scout Component: The Distribution Layer

The Scout acts as the brain of model distribution, continuously monitoring the Kubernetes API for BaseModel and ClusterBaseModel resources.

Node-Aware Filtering: Scout evaluates node selectors and affinity rules, ensuring models are only downloaded to appropriate nodes.

Graceful Deletion Handling: When models are deleted, Scout ensures complete cleanup across all nodes before releasing resources, preventing orphaned multi-gigabyte files.
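Node-aware filtering reduces to the standard Kubernetes `nodeSelector` rule: a node qualifies only if it carries every requested label with the requested value. The sketch below illustrates that rule with plain maps; the function names and the node names are hypothetical, and affinity rules are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// matchesSelector reports whether every key/value pair in selector is present
// in nodeLabels, mirroring Kubernetes nodeSelector semantics.
func matchesSelector(nodeLabels, selector map[string]string) bool {
	for k, want := range selector {
		if got, ok := nodeLabels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// targetNodes returns, in sorted order, the nodes that should download a model.
func targetNodes(nodes map[string]map[string]string, selector map[string]string) []string {
	var out []string
	for name, labels := range nodes {
		if matchesSelector(labels, selector) {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	nodes := map[string]map[string]string{
		"gpu-node-a": {"gpu.memory": "80Gi"},
		"gpu-node-b": {"gpu.memory": "40Gi"},
		"cpu-node-c": {},
	}
	selector := map[string]string{"gpu.memory": "80Gi"}
	fmt.Println(targetNodes(nodes, selector)) // prints: [gpu-node-a]
}
```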

2. Gopher Component: The Task Engine

Storage Backend Performance:

OCI Object Storage: Achieves GB/s download speeds through parallel chunk downloads and 20-thread concurrency. A 140GB Llama 3 70B model downloads in minutes.

HuggingFace Hub: Production-grade Golang client with automatic retry, rate limit handling, and resume support for interrupted downloads.

Unified Interface: Switch between storage providers by changing one URI prefix—no code changes needed.
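The unified interface hinges on the URI prefix selecting a backend. A minimal sketch of that dispatch, covering only the two prefixes that appear in this article (`hf://` and `oci://`), might look like the following. The `StorageRef` type, the backend names, and `parseStorageURI` are hypothetical; the set of schemes OME actually accepts is not enumerated here.

```go
package main

import (
	"fmt"
	"strings"
)

// StorageRef is a hypothetical parsed form of a storageUri.
type StorageRef struct {
	Backend string // backend selected by the URI prefix
	Path    string // everything after "scheme://"
}

// backends maps the URI prefixes seen in this article to illustrative names.
var backends = map[string]string{
	"hf":  "huggingface",
	"oci": "oci-object-storage",
}

// parseStorageURI splits a storage URI into a backend and a backend-specific path.
func parseStorageURI(uri string) (StorageRef, error) {
	scheme, rest, ok := strings.Cut(uri, "://")
	if !ok || rest == "" {
		return StorageRef{}, fmt.Errorf("malformed storage URI %q", uri)
	}
	backend, known := backends[scheme]
	if !known {
		return StorageRef{}, fmt.Errorf("unsupported storage scheme %q in this sketch", scheme)
	}
	return StorageRef{Backend: backend, Path: rest}, nil
}

func main() {
	for _, uri := range []string{
		"hf://meta-llama/Llama-3.3-70B-Instruct",
		"oci://n/benchmark-results/b/prod/o/llama-70b-bench",
	} {
		ref, err := parseStorageURI(uri)
		fmt.Println(ref.Backend, ref.Path, err)
	}
}
```

Keeping the backend choice in the URI means the rest of the spec stays identical when a model moves between storage providers.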

3. Model Configuration Parser

The parser automatically extracts model metadata from config.json and safetensors files, determining exact parameter counts and capabilities. This eliminates manual configuration for 30+ supported model architectures.

4. State Management & Cleanup

OME provides self-healing state management through:

ConfigMap Reconciliation: Automatically recreates deleted ConfigMaps through internal cache, ensuring model states are never lost

Node Labels: Enable pod scheduling decisions with labels like `models.ome.io/basemodel.llama-3-70b=Ready`

Finalizer-Based Cleanup: Ensures complete model removal across all nodes before deletion, even handling node failures gracefully

The Result: Production-Grade Model Management

This architecture delivers capabilities unmatched by traditional approaches:

Scale: Tested with large multi-gigabyte models, supporting multiple nodes downloading multiple models simultaneously

Efficiency: Models download once per node, not per pod—saving petabytes of bandwidth

Reliability: Self-healing ConfigMaps, automatic retries, and graceful error handling ensure models are always available

Performance: GB/s download speeds with OCI Object Storage, 20x faster than naive implementations

Intelligence: Automatic model understanding eliminates manual configuration errors

Inference Workloads

Based on your InferenceService specification, OME deploys different components optimized for specific workload patterns, including PD-disaggregated serving, multi-node serving, standard deployment, and serverless deployment.

| Component | When Used | Primary Function | Key Optimizations |
|---|---|---|---|
| Engine | All deployments | Inference server (prefill in PD mode, full inference otherwise) | Compute optimization; batch processing; tensor parallelism |
| Decoder | PD-disaggregated only | Token generation with KV cache from engine | Memory bandwidth optimization; efficient cache management |
| Router | When specified or PD mode | Optimized request distribution | Cache-aware routing; connection pooling; health monitoring |
| Ingress | Automatically created | External API access | TLS termination; rate limiting; request routing |

The beauty of this architecture is its flexibility—start with a simple engine-only deployment and progressively adopt advanced patterns as your needs grow.

Layer 4: External Integrations - Ecosystem Power

OME doesn’t reinvent the wheel—it deeply integrates with the Kubernetes ecosystem:

Kubernetes Ecosystem Integration: Deep integration with modern Kubernetes components including Kueue for gang scheduling of multi-pod workloads, LeaderWorkerSet for resilient multi-node deployments, KEDA for advanced custom metrics-based autoscaling, K8s Gateway API for sophisticated traffic routing, and Gateway API Inference Extension for standardized inference endpoints.

SGLang: First-Class Runtime Support

SGLang is the primary runtime in OME, with deep native integration that showcases OME’s model-driven architecture capabilities.

Native Router Integration

OME provides native integration with SGLang’s router component, implementing:

Kubernetes Service Discovery: The router automatically discovers engine and decoder pods through Kubernetes APIs, adjusting to scaling events and pod lifecycle changes without manual intervention.

Least-Privilege RBAC: Each router receives minimal permissions—only the ability to list, get, and watch pods in its namespace. This prevents cross-tenant information leakage while enabling dynamic discovery.