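Oracle Cloud Infrastructure has released OME (Open Model Engine), a Kubernetes-native model serving framework. It uses a model-driven architecture that treats models as first-class citizens through custom resources such as BaseModel and ServingRuntime, bridging the gap between ML engineers and production teams. OME cuts model onboarding time from months to days, significantly reduces configuration errors, and natively supports enterprise features such as multi-node inference, prefill-decode disaggregation, serverless autoscaling, and Multi-LoRA. It integrates the SGLang runtime, so complex deployment strategies can be encoded once, reused, and deployed in one step.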
Contents
The Tale of Two Teams: Why Model Serving Is Broken
The Birth of OME
The OME Architecture: Models at the Center
Layer 1: Kubernetes API Layer
Custom Resources - The Foundation of Model-Driven Architecture
BaseModel/ClusterBaseModel: Models as First-Class Citizens
ServingRuntime: The Brain of Runtime Selection
InferenceService: Orchestrating Model Deployments and Ingress
BenchmarkJob: Performance Testing as a First-Class Operation
Admission Webhooks: Validation and Mutation
Multi-Cloud Support
Multi-Cluster Management
Join the Revolution
Acknowledgments
OME: Revolutionizing LLM Infrastructure with Model-Driven Architecture
The Tale of Two Teams: Why Model Serving Is Broken
In any large organization deploying LLMs, two distinct teams emerge with conflicting needs:
The ML Engineers spend months benchmarking models, experimenting with serving technologies, and crafting optimal deployment strategies. Each model demands different configurations—tensor parallelism for Llama 70B, expert parallelism for DeepSeek V3/R1, specialized settings for multimodal models. The parameters are endless: batch sizes, KV cache configurations, quantization levels. Worse, these configurations shift dramatically across GPU types (H100 vs A100 vs L40S).
The Production Engineers and Data Scientists just want to deploy models. They shouldn’t need to understand the intricacies of tensor parallelism or why a particular model needs 4 GPUs with NVLink. They have customers waiting, applications to build, and business value to deliver.
This gap creates a fundamental problem: MLEs need a way to encode their hard-won serving knowledge into reusable blueprints. Production teams need to deploy models without becoming distributed systems experts. The missing link? A system that understands models as first-class citizens.
The Birth of OME
The Oracle Cloud Infrastructure (OCI) GenAI team faced this exact challenge at scale. Supporting numerous models across diverse GPU hardware, they watched deployment cycles stretch into months. Each new model meant:
Weeks of MLE experimentation to find optimal configurations
Complex documentation that production teams struggled to follow
Deployment failures due to misconfiguration
Inability to reuse knowledge across similar models
The breakthrough came from a simple insight: The model itself should drive the deployment.
A Llama model isn’t just a file—it contains metadata about its architecture, parameter count, and requirements. By making the system model-aware rather than deployment-driven, they could bridge the gap between ML expertise and production simplicity.
This led to OME (Open Model Engine): a Kubernetes operator that treats models as first-class resources. The results were dramatic:
Model onboarding time: Months → Days
Configuration errors: Dramatically reduced
MLE knowledge: Captured and reused automatically
Production deployment: Simple YAML with just a few lines
But here’s what makes it revolutionary: the model-driven architecture makes it easy to encode and reuse sophisticated deployment strategies:
Multi-node serving: Deploy massive models like DeepSeek V3 (685B) across multiple nodes with a simple configuration
Prefill-decode disaggregation: Separate compute-intensive prefill from memory-bound decode, with each component scaling independently
Flexible architectures: Both prefill and decode can run in single-node or multi-node configurations based on your needs
Serverless deployment: Scale-to-zero for cost efficiency when models aren’t in use
Business-driven scaling: Complex autoscaling based on KV cache, tokens/second, latency targets, or any custom metric
The model-driven approach doesn’t constrain you—it liberates you. Because OME understands models deeply, it can support any deployment pattern your MLEs design while keeping the interface simple for production teams.
Enter OME: A Kubernetes-native platform where models become first-class citizens. Let’s explore how OME’s architecture transforms the chaos of LLM deployment into an elegant, scalable system that serves everyone from ML researchers to production engineers.
The OME Architecture: Models at the Center
Layer 1: Kubernetes API Layer
While users—MLEs, data scientists, production engineers, and applications—interact with OME through simple interfaces, the real magic happens in the Kubernetes API layer below.
Custom Resources - The Foundation of Model-Driven Architecture
At the heart of OME lies its Custom Resource Definitions (CRDs), which transform Kubernetes from a generic container orchestrator into an ML platform. These aren’t just configuration files—they’re the language through which you express your ML requirements.
BaseModel/ClusterBaseModel: Models as First-Class Citizens
What is a Base Model?
A Base Model in OME is a Kubernetes resource that represents a foundation AI model (like GPT, Llama, or Mistral) that you want to use for inference workloads. Think of it as a blueprint that tells OME where to find your model, how to download it, and where to store it on your cluster nodes.
When you create a BaseModel resource, OME automatically handles the complex process of downloading the model files, parsing the model’s configuration to understand its capabilities, and making it available across your cluster nodes where AI workloads can use it.
BaseModel vs ClusterBaseModel
OME provides two types of model resources:
BaseModel is namespace-scoped, meaning it exists within a specific Kubernetes namespace. If you create a BaseModel in the “team-a” namespace, only workloads in that namespace can use it. This is perfect for team-specific models or when you want to isolate model access.
ClusterBaseModel is cluster-scoped, meaning it’s available to workloads in any namespace across your entire cluster. This is ideal for organization-wide models that multiple teams need to access, like a shared Llama-3 model that everyone uses.
Both types use exactly the same specification format—the only difference is their visibility scope.
Traditional platforms treat models as static files to be downloaded and mounted. OME revolutionizes this by making models intelligent, versioned resources that understand their own requirements:
```yaml
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-70b-instruct
spec:
  vendor: meta
  modelType: llama
  modelArchitecture: LlamaForCausalLM
  modelParameterSize: "70B"
  quantization: fp16
  storage:
    storageUri: "hf://meta-llama/Llama-3.3-70B-Instruct"
    path: "/models/llama-3.3-70b"
    nodeSelector:
      gpu.memory: "80Gi"  # Only download to nodes with sufficient GPU memory
```
When you create a BaseModel resource, OME’s control plane and data plane components work together to make the model available across your cluster. The BaseModel CRD acts as the declarative specification, while the actual work of downloading, parsing, and distributing models happens in the data plane through the Model Agent.
ServingRuntime: The Brain of Runtime Selection
ClusterServingRuntime is a cluster-scoped resource that manages the runtime environment for model serving. A ClusterServingRuntime defines the templates for Pods that can serve one or more particular models. Each ClusterServingRuntime defines key information such as the container image of the runtime and a list of the models that the runtime supports. Other configuration settings for the runtime can be conveyed through environment variables in the container specification.
These CRDs allow for improved flexibility and extensibility, enabling users to quickly define or customize reusable runtimes without having to modify any controller code or any resources in the controller namespace. The only difference between ServingRuntime and ClusterServingRuntime is that one is namespace-scoped and the other is cluster-scoped.
ServingRuntimes define how to serve different model types, with the actual runtime selection logic handled by the control plane when you create an InferenceService.
Advanced Deployment Architectures
ServingRuntimes serve as blueprints for how router, engine, and decoder components are deployed. Each component (except router) can be configured for single-node, serverless, or multi-node deployment. This flexibility enables cutting-edge serving patterns:
PD-Disaggregated Serving: The state-of-the-art for high-performance LLM serving at scale
This isn’t just incremental improvement—it’s a fundamental advancement in serving architecture that OME makes accessible through simple runtime configuration.
InferenceService: Orchestrating Model Deployments and Ingress
An InferenceService is the central Kubernetes resource in OME that orchestrates the complete lifecycle of model serving. It acts as a declarative specification that describes how you want your AI models deployed, scaled, and served across your cluster.
Think of InferenceService as the “deployment blueprint” for your AI workloads. It brings together models (defined by BaseModel/ClusterBaseModel), runtimes (defined by ServingRuntime/ClusterServingRuntime), and infrastructure configuration to create a complete serving solution. It combines these with standard Kubernetes Services, ingress, scheduling, autoscaling, and permission controls to form a complete serving fleet.
Architecture Overview
OME uses a component-based architecture where InferenceService can be composed of multiple specialized components:
Model: References the AI model to serve (BaseModel/ClusterBaseModel)
Runtime: References the serving runtime environment (ServingRuntime/ClusterServingRuntime)
Engine: Main inference component that processes requests, typically an OpenAI-compatible server handling request processing, tool parsing, and model backend operations
Decoder: Optional component for disaggregated serving (prefill-decode separation)
Router: A standalone high-performance component that enables data parallelism across inference instances, supporting advanced load balancing algorithms (cache-aware, power of two, random, round robin) and acting as a specialized load balancer for prefill-decode disaggregated serving architectures
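Of the load balancing policies listed for the router, "power of two" is the easiest to show in a few lines. The sketch below is the generic power-of-two-choices technique, not code from the SGLang router: sample two distinct backends at random and send the request to the one with fewer in-flight requests. The `Backend` type, the `pickPowerOfTwo` function, and the injected `intn` random source are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Backend is a hypothetical view of one inference instance.
type Backend struct {
	Name     string
	InFlight int // requests currently being served
}

// pickPowerOfTwo samples two distinct backends and returns the index of the
// one with fewer in-flight requests. intn(n) must return a value in [0, n).
// It returns -1 when there are no backends.
func pickPowerOfTwo(backends []Backend, intn func(n int) int) int {
	switch len(backends) {
	case 0:
		return -1
	case 1:
		return 0
	}
	i := intn(len(backends))
	j := intn(len(backends) - 1)
	if j >= i {
		j++ // shift so that j can never equal i
	}
	if backends[j].InFlight < backends[i].InFlight {
		return j
	}
	return i
}

func main() {
	backends := []Backend{{Name: "engine-0"}, {Name: "engine-1"}, {Name: "engine-2"}}
	for n := 0; n < 9; n++ {
		k := pickPowerOfTwo(backends, rand.Intn)
		backends[k].InFlight++ // pretend the request stays in flight
	}
	fmt.Println(backends)
}
```

Comparing only two candidates keeps the decision cheap while still avoiding the worst hot spots that purely random selection produces.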
BenchmarkJob: Performance Testing as a First-Class Operation
Multi-cloud storage: Results stored for historical analysis
Service metadata tracking: GPU types, engine versions for fair comparisons
Admission Webhooks: Validation and Mutation
OME’s admission webhooks act as gatekeepers in the API layer:
Validating Webhooks ensure model-runtime compatibility before resources are created, preventing runtime failures
Mutating Webhooks inject optimal configurations based on model characteristics
Pod Mutating Webhooks handle complex scenarios like:
RDMA configuration for multi-node deployments
GPU affinity rules for optimal memory bandwidth
Security contexts for model encryption
Layer 2: Control Plane - The Orchestrator
The control plane is where OME’s core logic lives. It does more than CRUD operations on Kubernetes resources: it makes deployment decisions based on model characteristics, hardware availability, and business requirements.
OME Controller Manager: The Orchestration Brain
The controller manager coordinates all OME operations with a reconciliation loop that’s aware of ML-specific concerns.
Runtime Selection Algorithm
When you deploy a model through an InferenceService, the controller:
Matches model characteristics against all available ServingRuntimes
Scores each runtime based on compatibility and optimization potential
Uses Model Size Range matching—when multiple runtimes support a model, OME selects the one with the closest size range for optimal performance
Handles edge cases like quantized models or models requiring specific GPU features
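The selection steps above can be sketched in Go. This is a minimal illustration under stated assumptions, not OME's controller code: the `Runtime` struct is a trimmed-down stand-in for a ServingRuntime, sizes are expressed in billions of parameters, and "closest size range" is interpreted here as the narrowest range that still contains the model size.

```go
package main

import "fmt"

// Runtime is a hypothetical, trimmed-down stand-in for a ServingRuntime.
type Runtime struct {
	Name          string
	Architectures []string // model architectures the runtime can serve
	MinSizeB      float64  // lower bound of the size range, in billions of parameters
	MaxSizeB      float64  // upper bound of the size range, in billions of parameters
}

// supports reports whether the runtime lists the architecture and the model
// size falls inside its size range.
func (rt Runtime) supports(arch string, sizeB float64) bool {
	if sizeB < rt.MinSizeB || sizeB > rt.MaxSizeB {
		return false
	}
	for _, a := range rt.Architectures {
		if a == arch {
			return true
		}
	}
	return false
}

// selectRuntime returns the compatible runtime with the narrowest size range.
// Ties keep the earlier runtime; ok is false when nothing is compatible.
func selectRuntime(runtimes []Runtime, arch string, sizeB float64) (best Runtime, ok bool) {
	for _, rt := range runtimes {
		if !rt.supports(arch, sizeB) {
			continue
		}
		if !ok || rt.MaxSizeB-rt.MinSizeB < best.MaxSizeB-best.MinSizeB {
			best, ok = rt, true
		}
	}
	return best, ok
}

func main() {
	runtimes := []Runtime{
		{Name: "generic-llama", Architectures: []string{"LlamaForCausalLM"}, MinSizeB: 1, MaxSizeB: 500},
		{Name: "llama-70b-tuned", Architectures: []string{"LlamaForCausalLM"}, MinSizeB: 60, MaxSizeB: 80},
	}
	rt, ok := selectRuntime(runtimes, "LlamaForCausalLM", 70)
	fmt.Println(rt.Name, ok) // llama-70b-tuned true
}
```

A real scorer would also weigh quantization and GPU features, as the list notes; those are left out to keep the size-range idea visible.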
Component Orchestration
The InferenceService controller orchestrates multiple components based on your deployment requirements:
```go
// Simplified reconciliation logic showing component-based orchestration.
// Assumes Reconciler embeds a controller-runtime client.Client; error
// handling for the reconcile helpers is omitted for brevity.
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch InferenceService and determine deployment mode
	isvc := &omev1beta1.InferenceService{}
	if err := r.Get(ctx, req.NamespacedName, isvc); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	deploymentMode := r.inferDeploymentMode(isvc)

	// 2. Select optimal runtime if not specified
	if isvc.Spec.Runtime == nil {
		runtime := r.selectOptimalRuntime(isvc.Spec.Model, deploymentMode)
		isvc.Spec.Runtime = runtime
	}

	// 3. Reconcile components based on deployment mode
	switch deploymentMode {
	case PDDisaggregated:
		// Deploy separate engine (prefill) and decoder components
		r.reconcileRouter(isvc)  // Cache-aware routing
		r.reconcileEngine(isvc)  // Prefill processing
		r.reconcileDecoder(isvc) // Token generation
	case MultiNode:
		// Deploy using LeaderWorkerSet for distributed serving
		r.reconcileMultiNodeComponents(isvc)
	default:
		// Standard single-component deployment
		r.reconcileEngine(isvc) // Handles both prefill and decode
		if isvc.Spec.Router != nil {
			r.reconcileRouter(isvc) // Optional routing layer
		}
	}
	return ctrl.Result{}, nil
}
```
Deployment Mode Decision Logic
The controller automatically determines the optimal deployment pattern:
RawDeployment: Single engine for models that fit on one node
PDDisaggregated: Separate prefill/decode for high-throughput scenarios
MultiNode: Distributed serving for massive models (e.g., DeepSeek V3 685B)
Serverless: Scale-to-zero for cost optimization (via Knative integration)
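To make the four modes concrete, here is one way such a decision could be written. The text does not spell out the actual rules, so the conditions and their order below are assumptions, as are the `ServiceSpec` fields: a configured decoder implies PD disaggregation, requested worker nodes imply multi-node serving, and a minimum replica count of zero implies serverless.

```go
package main

import "fmt"

// DeploymentMode names the four patterns listed above.
type DeploymentMode string

const (
	RawDeployment   DeploymentMode = "RawDeployment"
	PDDisaggregated DeploymentMode = "PDDisaggregated"
	MultiNode       DeploymentMode = "MultiNode"
	Serverless      DeploymentMode = "Serverless"
)

// ServiceSpec is a hypothetical summary of the fields that drive the decision.
type ServiceSpec struct {
	HasDecoder  bool // a decoder component is configured
	WorkerNodes int  // extra worker nodes requested for the engine
	MinReplicas int  // zero means the service may scale to zero
}

// inferDeploymentMode checks the most specific pattern first and falls back
// to a plain single-engine deployment.
func inferDeploymentMode(s ServiceSpec) DeploymentMode {
	switch {
	case s.HasDecoder:
		return PDDisaggregated
	case s.WorkerNodes > 0:
		return MultiNode
	case s.MinReplicas == 0:
		return Serverless
	default:
		return RawDeployment
	}
}

func main() {
	fmt.Println(inferDeploymentMode(ServiceSpec{HasDecoder: true, MinReplicas: 1}))
	fmt.Println(inferDeploymentMode(ServiceSpec{WorkerNodes: 1, MinReplicas: 1}))
	fmt.Println(inferDeploymentMode(ServiceSpec{MinReplicas: 0}))
	fmt.Println(inferDeploymentMode(ServiceSpec{MinReplicas: 1}))
}
```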
Layer 3: Data Plane - Where Models Come to Life
The data plane is where OME’s architectural decisions deliver real value. This layer handles the actual model serving with sophisticated optimizations.
Model Agent: Model Distribution
The Model Agent is OME’s data plane component responsible for making models available across your cluster. When you create a BaseModel resource, the Model Agent springs into action:
What Makes Model Distribution Powerful:
Automatic Model Parsing: Downloads and parses the model’s config.json and safetensors file headers, extracting architecture details, parameter counts, supported features, and optimal serving configurations. No more manual specification of model characteristics.
Multi-Cloud Storage Abstraction: The hf:// prefix in your BaseModel isn’t just syntactic sugar. OME supports multiple storage backends with a unified interface. Switch from HuggingFace to OCI Object Storage by changing one line—no code modifications needed.
Node-Aware Distribution: Models aren’t blindly copied everywhere. The Model Agent runs as a DaemonSet, honoring node selectors and affinity rules, only downloading models to nodes that match your specifications. This saves precious NVMe space and reduces download times.
Lifecycle Management: Models are tracked, versioned, and health-checked. If a node goes down, OME ensures model availability on other nodes. When you delete a BaseModel, cleanup happens automatically across all nodes.
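The storage abstraction described in the list above boils down to dispatching on the URI scheme. A minimal sketch, assuming a simple scheme-to-backend lookup: only the `hf://` prefix appears in the text, so the `oci` entry, the backend labels, and the `resolveBackend` function are placeholders rather than OME's actual interface.

```go
package main

import (
	"fmt"
	"strings"
)

// backends maps a URI scheme to a backend label. Only "hf" is taken from the
// article; the "oci" entry is an assumed placeholder.
var backends = map[string]string{
	"hf":  "HuggingFace Hub",
	"oci": "OCI Object Storage",
}

// resolveBackend splits "scheme://location" and looks up the backend for the
// scheme. It returns an error for malformed URIs and unknown schemes.
func resolveBackend(storageURI string) (string, string, error) {
	scheme, location, found := strings.Cut(storageURI, "://")
	if !found || location == "" {
		return "", "", fmt.Errorf("malformed storage URI %q", storageURI)
	}
	backend, ok := backends[scheme]
	if !ok {
		return "", "", fmt.Errorf("unsupported storage scheme %q", scheme)
	}
	return backend, location, nil
}

func main() {
	backend, location, err := resolveBackend("hf://meta-llama/Llama-3.3-70B-Instruct")
	fmt.Println(backend, location, err)
}
```

Because callers only ever see the backend chosen by the prefix, moving a model to another store is a one-line change to `storageUri`.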
The Scout-Gopher Architecture
OME’s Model Agent employs a sophisticated producer-consumer pattern:
Scout Component: The Distribution Layer
The Scout acts as the brain of model distribution, continuously monitoring the Kubernetes API for BaseModel and ClusterBaseModel resources.
Node-Aware Filtering: Scout evaluates node selectors and affinity rules, ensuring models are only downloaded to appropriate nodes.
Graceful Deletion Handling: When models are deleted, Scout ensures complete cleanup across all nodes before releasing resources, preventing orphaned multi-gigabyte files.
Gopher Component: The Task Engine
Storage Backend Performance:
OCI Object Storage: Achieves GB/s download speeds through parallel chunk downloads and 20-thread concurrency. A 140GB Llama 3 70B model downloads in minutes.
HuggingFace Hub: Production-grade Golang client with automatic retry, rate limit handling, and resume support for interrupted downloads.
Unified Interface: Switch between storage providers by changing one URI prefix—no code changes needed.
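The producer-consumer split between Scout and Gopher maps naturally onto Go channels. The sketch below shows only the pattern, with invented names (`Task`, `runAgent`) and a string result standing in for real download or delete work; it is not taken from the Model Agent source.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Task is a hypothetical unit of work the scout hands to the gophers.
type Task struct {
	Model  string
	Action string // "download" or "delete"
}

// runAgent starts `workers` consumer goroutines, feeds them every task, and
// returns one result line per task, sorted for stable output.
func runAgent(tasks []Task, workers int) []string {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan Task)
	results := make(chan string)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() { // gopher: consumes tasks until the queue is closed
			defer wg.Done()
			for t := range queue {
				// A real worker would call a storage backend here.
				results <- t.Action + ":" + t.Model
			}
		}()
	}

	go func() { // scout: produces tasks, then signals that no more will come
		for _, t := range tasks {
			queue <- t
		}
		close(queue)
	}()

	go func() { // close results once every gopher has exited
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out)
	return out
}

func main() {
	tasks := []Task{
		{Model: "llama-3-70b-instruct", Action: "download"},
		{Model: "old-model", Action: "delete"},
	}
	for _, line := range runAgent(tasks, 4) {
		fmt.Println(line)
	}
}
```

Keeping the watcher and the workers on opposite sides of a queue means a slow download never blocks the detection of new or deleted models.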
Model Configuration Parser
The parser automatically extracts model metadata from config.json and safetensors files, determining exact parameter counts and capabilities. This eliminates manual configuration for 30+ supported model architectures.
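Parameter counting from safetensors headers works because the format begins with an 8-byte little-endian length followed by a JSON header that lists every tensor's dtype and shape. The following sketch, which assumes the whole file is already in memory, sums the element counts across tensors; `countParameters` and `tensorInfo` are illustrative names and this is not OME's parser.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

// tensorInfo mirrors the fields of one tensor entry in the JSON header.
type tensorInfo struct {
	DType string  `json:"dtype"`
	Shape []int64 `json:"shape"`
}

// countParameters reads the safetensors header (8-byte little-endian length
// followed by that many bytes of JSON) and sums the element counts of all
// tensors. Tensor data after the header is never touched.
func countParameters(file []byte) (int64, error) {
	if len(file) < 8 {
		return 0, errors.New("file too short for a safetensors header")
	}
	n := binary.LittleEndian.Uint64(file[:8])
	if n > uint64(len(file)-8) {
		return 0, errors.New("header length exceeds file size")
	}
	var header map[string]json.RawMessage
	if err := json.Unmarshal(file[8:8+n], &header); err != nil {
		return 0, err
	}
	var total int64
	for name, raw := range header {
		if name == "__metadata__" { // free-form string metadata, not a tensor
			continue
		}
		var t tensorInfo
		if err := json.Unmarshal(raw, &t); err != nil {
			return 0, fmt.Errorf("tensor %q: %w", name, err)
		}
		elems := int64(1) // an empty shape denotes a scalar, i.e. one element
		for _, d := range t.Shape {
			elems *= d
		}
		total += elems
	}
	return total, nil
}

func main() {
	header := []byte(`{"__metadata__":{"format":"pt"},` +
		`"w":{"dtype":"F16","shape":[4,8],"data_offsets":[0,64]},` +
		`"b":{"dtype":"F16","shape":[8],"data_offsets":[64,80]}}`)
	file := make([]byte, 8, 8+len(header))
	binary.LittleEndian.PutUint64(file, uint64(len(header)))
	file = append(file, header...)

	total, err := countParameters(file)
	fmt.Println(total, err) // 40 <nil>
}
```

Since only the header is needed, a parser can learn a model's size from the first few kilobytes of each shard without reading the weights.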
State Management & Cleanup
OME provides self-healing state management through:
ConfigMap Reconciliation: Automatically recreates deleted ConfigMaps through internal cache, ensuring model states are never lost
Node Labels: Enable pod scheduling decisions with labels like models.ome.io/basemodel.llama-3-70b=Ready
Finalizer-Based Cleanup: Ensures complete model removal across all nodes before deletion, even handling node failures gracefully
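The node label shown in the list above is what lets scheduling decisions find nodes that already hold a model. As a rough illustration, the sketch below filters a set of nodes by that label. The key format follows the example label in the text; the `nodesWithModel` helper, the in-memory node map, and the "Updating" status value are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// readyLabelKey builds the label key in the form quoted in the article.
func readyLabelKey(model string) string {
	return "models.ome.io/basemodel." + model
}

// nodesWithModel returns, in sorted order, the nodes whose label for the
// model has the value "Ready". nodeLabels maps node name to its labels.
func nodesWithModel(nodeLabels map[string]map[string]string, model string) []string {
	key := readyLabelKey(model)
	var ready []string
	for node, labels := range nodeLabels {
		if labels[key] == "Ready" {
			ready = append(ready, node)
		}
	}
	sort.Strings(ready)
	return ready
}

func main() {
	nodes := map[string]map[string]string{
		"gpu-node-a": {"models.ome.io/basemodel.llama-3-70b": "Ready"},
		"gpu-node-b": {"models.ome.io/basemodel.llama-3-70b": "Updating"}, // hypothetical status value
		"cpu-node-c": {},
	}
	fmt.Println(nodesWithModel(nodes, "llama-3-70b")) // [gpu-node-a]
}
```

In a cluster the same filter would normally be expressed as a pod node selector on that label rather than evaluated in application code.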
The Result: Production-Grade Model Management
This architecture delivers capabilities unmatched by traditional approaches:
Scale: Tested with large multi-gigabyte models, supporting multiple nodes downloading multiple models simultaneously
Efficiency: Models download once per node, not per pod—saving petabytes of bandwidth
Reliability: Self-healing ConfigMaps, automatic retries, and graceful error handling ensure models are always available
Performance: GB/s download speeds with OCI Object Storage, 20x faster than naive implementations
Intelligence: Automatic model understanding eliminates manual configuration errors
Inference Workloads
Based on your InferenceService specification, OME deploys different components optimized for specific workload patterns, including PD-disaggregated serving, multi-node serving, standard deployment, and serverless deployment.
| Component | When Used | Primary Function | Key Optimizations |
| --- | --- | --- | --- |
| Engine | All deployments | Inference server (prefill in PD mode, full inference otherwise) | Compute optimization; batch processing; tensor parallelism |
| Decoder | PD-disaggregated only | Token generation with KV cache from engine | Memory bandwidth optimization; efficient cache management |
| Router | When specified or PD mode | Optimized request distribution | Cache-aware routing; connection pooling; health monitoring |
| Ingress | Automatically created | External API access | TLS termination; rate limiting; request routing |
The beauty of this architecture is its flexibility—start with a simple engine-only deployment and progressively adopt advanced patterns as your needs grow.
Layer 4: External Integrations - Ecosystem Power
OME doesn’t reinvent the wheel—it deeply integrates with the Kubernetes ecosystem:
Kubernetes Ecosystem Integration: Deep integration with modern Kubernetes components including Kueue for gang scheduling of multi-pod workloads, LeaderWorkerSet for resilient multi-node deployments, KEDA for advanced custom metrics-based autoscaling, K8s Gateway API for sophisticated traffic routing, and Gateway API Inference Extension for standardized inference endpoints.
SGLang: First-Class Runtime Support
SGLang is the primary runtime in OME, with deep native integration that showcases OME’s model-driven architecture capabilities.
Native Router Integration
OME provides native integration with SGLang’s router component, implementing:
Kubernetes Service Discovery: The router automatically discovers engine and decoder pods through Kubernetes APIs, adjusting to scaling events and pod lifecycle changes without manual intervention.
Least-Privilege RBAC: Each router receives minimal permissions—only the ability to list, get, and watch pods in its namespace. This prevents cross-tenant information leakage while enabling dynamic discovery.
OME: Revolutionizing LLM Infrastructure with Model-Driven Architecture
The Tale of Two Teams: Why Model Serving Is Broken
In any large organization deploying LLMs, two distinct teams emerge with conflicting needs:
The ML Engineers spend months benchmarking models, experimenting with serving technologies, and crafting optimal deployment strategies. Each model demands different configurations—tensor parallelism for Llama 70B, expert parallelism for DeepSeek V3/R1, specialized settings for multimodal models. The parameters are endless: batch sizes, KV cache configurations, quantization levels. Worse, these configurations shift dramatically across GPU types (H100 vs A100 vs L40S).
The Production Engineers and Data Scientists just want to deploy models. They shouldn’t need to understand the intricacies of tensor parallelism or why a particular model needs 4 GPUs with NVLink. They have customers waiting, applications to build, and business value to deliver.
This gap creates a fundamental problem: MLEs need a way to encode their hard-won serving knowledge into reusable blueprints. Production teams need to deploy models without becoming distributed systems experts. The missing link? A system that understands models as first-class citizens.
The Birth of OME
The Oracle Cloud Infrastructure (OCI) GenAI team faced this exact challenge at scale. Supporting numerous models across diverse GPU hardware, they watched deployment cycles stretch into months. Each new model meant:
Weeks of MLE experimentation to find optimal configurations
Complex documentation that production teams struggled to follow
Deployment failures due to misconfiguration
Inability to reuse knowledge across similar models
The breakthrough came from a simple insight: The model itself should drive the deployment.
A Llama model isn’t just a file—it contains metadata about its architecture, parameter count, and requirements. By making the system model-aware rather than deployment-driven, they could bridge the gap between ML expertise and production simplicity.
This led to OME (Open Model Engine): a Kubernetes operator that treats models as first-class resources. The results were dramatic:
Model onboarding time: Months → Days
Configuration errors: Dramatically reduced
MLE knowledge: Captured and reused automatically
Production deployment: Simple YAML with just a few lines
But here’s what makes it revolutionary: the model-driven architecture makes it easy to encode and reuse sophisticated deployment strategies:
Multi-node serving: Deploy massive models like DeepSeek V3 (685B) across multiple nodes with a simple configuration
Prefill-decode disaggregation: Separate compute-intensive prefill from memory-bound decode, with each component scaling independently
Flexible architectures: Both prefill and decode can run in single-node or multi-node configurations based on your needs
Serverless deployment: Scale-to-zero for cost efficiency when models aren’t in use
Business-driven scaling: Complex autoscaling based on KV cache, tokens/second, latency targets, or any custom metric
The model-driven approach doesn’t constrain you—it liberates you. Because OME understands models deeply, it can support any deployment pattern your MLEs design while keeping the interface simple for production teams.
Enter OME: A Kubernetes-native platform where models become first-class citizens. Let’s explore how OME’s architecture transforms the chaos of LLM deployment into an elegant, scalable system that serves everyone from ML researchers to production engineers.
The OME Architecture: Models at the Center
Layer 1: Kubernetes API Layer
While users—MLEs, data scientists, production engineers, and applications—interact with OME through simple interfaces, the real magic happens in the Kubernetes API layer below.
Custom Resources - The Foundation of Model-Driven Architecture
At the heart of OME lies its Custom Resource Definitions (CRDs), which transform Kubernetes from a generic container orchestrator into an ML platform. These aren’t just configuration files—they’re the language through which you express your ML requirements.
BaseModel/ClusterBaseModel: Models as First-Class Citizens
What is a Base Model?
A Base Model in OME is a Kubernetes resource that represents a foundation AI model (like GPT, Llama, or Mistral) that you want to use for inference workloads. Think of it as a blueprint that tells OME where to find your model, how to download it, and where to store it on your cluster nodes.
When you create a BaseModel resource, OME automatically handles the complex process of downloading the model files, parsing the model’s configuration to understand its capabilities, and making it available across your cluster nodes where AI workloads can use it.
BaseModel vs ClusterBaseModel
OME provides two types of model resources:
BaseModel is namespace-scoped, meaning it exists within a specific Kubernetes namespace. If you create a BaseModel in the “team-a” namespace, only workloads in that namespace can use it. This is perfect for team-specific models or when you want to isolate model access.
ClusterBaseModel is cluster-scoped, meaning it’s available to workloads in any namespace across your entire cluster. This is ideal for organization-wide models that multiple teams need to access, like a shared Llama-3 model that everyone uses.
Both types use exactly the same specification format—the only difference is their visibility scope.
Traditional platforms treat models as static files to be downloaded and mounted. OME revolutionizes this by making models intelligent, versioned resources that understand their own requirements:
apiVersion: ome.io/v1beta1 kind: ClusterBaseModel metadata: name: llama-3-70b-instruct spec: vendor: meta modelType: llama modelArchitecture: LlamaForCausalLM modelParameterSize: "70B" quantization: fp16 storage: storageUri: "hf://meta-llama/Llama-3.3-70B-Instruct" path: "/models/llama-3.3-70b" nodeSelector: gpu.memory: "80Gi" # Only download to nodes with sufficient GPU memory
apiVersion: ome.io/v1beta1 kind: ClusterBaseModel metadata: name: llama-3-70b-instruct spec: vendor: meta modelType: llama modelArchitecture: LlamaForCausalLM modelParameterSize: "70B" quantization: fp16 storage: storageUri: "hf://meta-llama/Llama-3.3-70B-Instruct" path: "/models/llama-3.3-70b" nodeSelector: gpu.memory: "80Gi" # Only download to nodes with sufficient GPU memory
When you create a BaseModel resource, OME’s control plane and data plane components work together to make the model available across your cluster. The BaseModel CRD acts as the declarative specification, while the actual work of downloading, parsing, and distributing models happens in the data plane through the Model Agent.
ServingRuntime: The Brain of Runtime Selection
ClusterServingRuntime is a cluster-scoped resource that manages the runtime environment for model serving. A ClusterServingRuntime defines the templates for Pods that can serve one or more particular models. Each ClusterServingRuntime defines key information such as the container image of the runtime and a list of the models that the runtime supports. Other configuration settings for the runtime can be conveyed through environment variables in the container specification.
These CRDs allow for improved flexibility and extensibility, enabling users to quickly define or customize reusable runtimes without having to modify any controller code or any resources in the controller namespace. The only difference between ServingRuntime and ClusterServingRuntime is that one is namespace-scoped and the other is cluster-scoped.
ServingRuntimes define how to serve different model types, with the actual runtime selection logic handled by the control plane when you create an InferenceService.
Advanced Deployment Architectures
ServingRuntimes serve as blueprints for how router, engine, and decoder components are deployed. Each component (except router) can be configured for single-node, serverless, or multi-node deployment. This flexibility enables cutting-edge serving patterns: PD-Disaggregated Serving - The state-of-the-art for high-performance LLM serving at scale
This isn’t just incremental improvement—it’s a fundamental advancement in serving architecture that OME makes accessible through simple runtime configuration.
InferenceService: Orchestrating Model Deployments and Ingress
An InferenceService is the central Kubernetes resource in OME that orchestrates the complete lifecycle of model serving. It acts as a declarative specification that describes how you want your AI models deployed, scaled, and served across your cluster.
Think of InferenceService as the “deployment blueprint” for your AI workloads. It brings together models (defined by BaseModel/ClusterBaseModel), runtimes (defined by ServingRuntime/ClusterServingRuntime), and infrastructure configuration to create a complete serving solution. InferenceService is what puts models, runtimes, as well as traditional Kubernetes services, complex ingress, scheduling, auto scaling, and permission controls all together to form a complete cluster serving fleet.
Architecture Overview
OME uses a component-based architecture where InferenceService can be composed of multiple specialized components:
Model: References the AI model to serve (BaseModel/ClusterBaseModel)
Runtime: References the serving runtime environment (ServingRuntime/ClusterServingRuntime)
Engine: Main inference component that processes requests, typically an OpenAI-compatible server handling request processing, tool parsing, and model backend operations
Decoder: Optional component for disaggregated serving (prefill-decode separation)
Router: A standalone high-performance component that enables data parallelism across inference instances, supporting advanced load balancing algorithms (cache-aware, power of two, random, round robin) and acting as a specialized load balancer for prefill-decode disaggregated serving architectures
Multi-cloud storage: Results stored for historical analysis
Service metadata tracking: GPU types, engine versions for fair comparisons
Admission Webhooks: Validation and Mutation
OME’s admission webhooks act as gatekeepers in the API layer:
Validating Webhooks ensure model-runtime compatibility before resources are created, preventing runtime failures
Mutating Webhooks inject optimal configurations based on model characteristics
Pod Mutating Webhooks handle complex scenarios like: RDMA configuration for multi-node deployments GPU affinity rules for optimal memory bandwidth Security contexts for model encryption
RDMA configuration for multi-node deployments
GPU affinity rules for optimal memory bandwidth
Security contexts for model encryption
Layer 2: Control Plane - The Orchestrator
The control plane is where OME’s main operation lives. This isn’t just CRUD operations on Kubernetes resources—it’s a sophisticated system that makes optimal decisions based on model characteristics, hardware availability, and business requirements.
OME Controller Manager: The Orchestration Brain
The controller manager coordinates all OME operations with a reconciliation loop that’s aware of ML-specific concerns.
Runtime Selection Algorithm
When you deploy a model through an InferenceService, the controller:
Matches model characteristics against all available ServingRuntimes
Scores each runtime based on compatibility and optimization potential
Uses Model Size Range matching—when multiple runtimes support a model, OME selects the one with the closest size range for optimal performance
Handles edge cases like quantized models or models requiring specific GPU features
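The selection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not OME's actual algorithm: the `Runtime` and `Model` types and their fields are hypothetical, and "closest size range" is interpreted here as the narrowest range that still contains the model.

```go
package main

import "fmt"

// Runtime is a hypothetical, trimmed-down view of a ServingRuntime.
type Runtime struct {
	Name          string
	Architectures []string // model architectures the runtime supports
	MinParamsB    float64  // smallest supported model, in billions of parameters
	MaxParamsB    float64  // largest supported model, in billions of parameters
}

// Model is a hypothetical summary of the metadata taken from a BaseModel.
type Model struct {
	Architecture string
	ParamsB      float64
}

// supports reports whether the runtime lists the model's architecture and
// its size range contains the model.
func supports(rt Runtime, m Model) bool {
	if m.ParamsB < rt.MinParamsB || m.ParamsB > rt.MaxParamsB {
		return false
	}
	for _, a := range rt.Architectures {
		if a == m.Architecture {
			return true
		}
	}
	return false
}

// selectRuntime returns the compatible runtime with the narrowest size
// range. The boolean is false when no runtime is compatible.
func selectRuntime(runtimes []Runtime, m Model) (Runtime, bool) {
	var best Runtime
	found := false
	for _, rt := range runtimes {
		if !supports(rt, m) {
			continue
		}
		if !found || rt.MaxParamsB-rt.MinParamsB < best.MaxParamsB-best.MinParamsB {
			best, found = rt, true
		}
	}
	return best, found
}

func main() {
	runtimes := []Runtime{
		{Name: "generic", Architectures: []string{"LlamaForCausalLM"}, MinParamsB: 1, MaxParamsB: 500},
		{Name: "llama-large", Architectures: []string{"LlamaForCausalLM"}, MinParamsB: 60, MaxParamsB: 80},
	}
	rt, ok := selectRuntime(runtimes, Model{Architecture: "LlamaForCausalLM", ParamsB: 70})
	fmt.Println(rt.Name, ok) // prints: llama-large true
}
```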
Component Orchestration
The InferenceService controller orchestrates multiple components based on your deployment requirements:
```go
// Simplified reconciliation logic showing component-based orchestration
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch InferenceService and determine deployment mode
	// (the actual fetch from the API server is omitted in this simplified version)
	isvc := &omev1beta1.InferenceService{}
	deploymentMode := r.inferDeploymentMode(isvc)

	// 2. Select optimal runtime if not specified
	if isvc.Spec.Runtime == nil {
		runtime := r.selectOptimalRuntime(isvc.Spec.Model, deploymentMode)
		isvc.Spec.Runtime = runtime
	}

	// 3. Reconcile components based on deployment mode
	switch deploymentMode {
	case PDDisaggregated:
		// Deploy separate engine (prefill) and decoder components
		r.reconcileRouter(isvc)  // Cache-aware routing
		r.reconcileEngine(isvc)  // Prefill processing
		r.reconcileDecoder(isvc) // Token generation
	case MultiNode:
		// Deploy using LeaderWorkerSet for distributed serving
		r.reconcileMultiNodeComponents(isvc)
	default:
		// Standard single-component deployment
		r.reconcileEngine(isvc) // Handles both prefill and decode
		if isvc.Spec.Router != nil {
			r.reconcileRouter(isvc) // Optional routing layer
		}
	}

	// Error handling and status updates are omitted for brevity
	return ctrl.Result{}, nil
}
```
Deployment Mode Decision Logic
The controller automatically determines the optimal deployment pattern:
RawDeployment: Single engine for models that fit on one node
PDDisaggregated: Separate prefill/decode for high-throughput scenarios
MultiNode: Distributed serving for massive models (e.g., DeepSeek V3 685B)
Serverless: Scale-to-zero for cost optimization (via Knative integration)
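One way to picture this decision is a small rule cascade over the InferenceService spec. The sketch below is hypothetical: the `Spec` fields and the order of the rules are assumptions chosen for illustration, and only the four mode names come from the list above.

```go
package main

import "fmt"

// DeploymentMode names the four patterns listed above.
type DeploymentMode string

const (
	RawDeployment   DeploymentMode = "RawDeployment"
	PDDisaggregated DeploymentMode = "PDDisaggregated"
	MultiNode       DeploymentMode = "MultiNode"
	Serverless      DeploymentMode = "Serverless"
)

// Spec is a hypothetical summary of the fields that could drive the decision.
type Spec struct {
	HasDecoder  bool // a decoder component is defined
	WorkerNodes int  // additional nodes needed beyond the leader
	MinReplicas int  // zero signals that scale-to-zero is wanted
}

// inferDeploymentMode applies the assumed rules in priority order.
func inferDeploymentMode(s Spec) DeploymentMode {
	switch {
	case s.HasDecoder:
		return PDDisaggregated
	case s.WorkerNodes > 0:
		return MultiNode
	case s.MinReplicas == 0:
		return Serverless
	default:
		return RawDeployment
	}
}

func main() {
	fmt.Println(inferDeploymentMode(Spec{MinReplicas: 1}))                   // prints: RawDeployment
	fmt.Println(inferDeploymentMode(Spec{HasDecoder: true, MinReplicas: 1})) // prints: PDDisaggregated
}
```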
Layer 3: Data Plane - Where Models Come to Life
The data plane is where OME’s architectural decisions deliver real value. This layer handles the actual model serving with sophisticated optimizations.
Model Agent: Model Distribution
The Model Agent is OME’s data plane component responsible for making models available across your cluster. When you create a BaseModel resource, the Model Agent springs into action.
What Makes Model Distribution Powerful:
Automatic Model Parsing: Downloads and parses the model’s config.json and safetensors file headers, extracting architecture details, parameter counts, supported features, and optimal serving configurations. No more manual specification of model characteristics.
Multi-Cloud Storage Abstraction: The hf:// prefix in your BaseModel isn’t just syntactic sugar. OME supports multiple storage backends with a unified interface. Switch from HuggingFace to OCI Object Storage by changing one line—no code modifications needed.
Node-Aware Distribution: Models aren’t blindly copied everywhere. The Model Agent runs as a DaemonSet, honoring node selectors and affinity rules, only downloading models to nodes that match your specifications. This saves precious NVMe space and reduces download times.
Lifecycle Management: Models are tracked, versioned, and health-checked. If a node goes down, OME ensures model availability on other nodes. When you delete a BaseModel, cleanup happens automatically across all nodes.
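The storage abstraction described above amounts to dispatching on the URI prefix. A minimal sketch follows, assuming a simple scheme-to-backend table. Only `hf://` appears in this article: the `oci` entry, the `StorageRef` type, and `parseStorageURI` are hypothetical names used for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// StorageRef is a hypothetical parsed form of a BaseModel storage URI.
type StorageRef struct {
	Backend string // human-readable backend name
	Path    string // everything after "scheme://"
}

// backends maps URI schemes to storage backends.
// "hf" comes from the article; "oci" is an assumed prefix.
var backends = map[string]string{
	"hf":  "HuggingFace Hub",
	"oci": "OCI Object Storage",
}

// parseStorageURI splits a URI into scheme and path and looks up the backend.
func parseStorageURI(uri string) (StorageRef, error) {
	scheme, rest, ok := strings.Cut(uri, "://")
	if !ok || rest == "" {
		return StorageRef{}, fmt.Errorf("malformed storage URI %q", uri)
	}
	backend, known := backends[scheme]
	if !known {
		return StorageRef{}, fmt.Errorf("unsupported storage scheme %q", scheme)
	}
	return StorageRef{Backend: backend, Path: rest}, nil
}

func main() {
	// The oci:// path below is a made-up example.
	for _, uri := range []string{
		"hf://meta-llama/Meta-Llama-3-70B-Instruct",
		"oci://example-bucket/models/llama-3-70b",
	} {
		ref, err := parseStorageURI(uri)
		fmt.Println(ref.Backend, ref.Path, err)
	}
}
```

Because callers only see `StorageRef`, moving a model between backends changes the URI string and nothing else.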
The Scout-Gopher Architecture
OME’s Model Agent employs a producer-consumer pattern, with the Scout producing work and the Gopher carrying it out.
Scout Component: The Distribution Layer
The Scout acts as the brain of model distribution, continuously monitoring the Kubernetes API for BaseModel and ClusterBaseModel resources.
Node-Aware Filtering: Scout evaluates node selectors and affinity rules, ensuring models are only downloaded to appropriate nodes.
Graceful Deletion Handling: When models are deleted, Scout ensures complete cleanup across all nodes before releasing resources, preventing orphaned multi-gigabyte files.
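The producer-consumer split can be pictured with a Go channel between the two roles. This is a toy sketch, not the Model Agent's code: the `Task` type, the function names, and the use of an unbuffered channel are all assumptions, and the real work (downloading or deleting files) is replaced by recording a string.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical unit of work handed from producer to consumer.
type Task struct {
	Model  string
	Action string // "download" or "delete"
}

// scout plays the producer: it emits one download task per model it
// observes, then closes the channel to signal that no more work is coming.
func scout(models []string, tasks chan<- Task) {
	for _, m := range models {
		tasks <- Task{Model: m, Action: "download"}
	}
	close(tasks)
}

// gopher plays the consumer: a pool of workers drains the channel and
// records each task it handled.
func gopher(workers int, tasks <-chan Task) []string {
	var (
		mu   sync.Mutex
		done []string
		wg   sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range tasks {
				// A real worker would download or delete model files here.
				mu.Lock()
				done = append(done, t.Action+":"+t.Model)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return done
}

func main() {
	tasks := make(chan Task)
	go scout([]string{"llama-3-70b", "deepseek-v3"}, tasks)
	fmt.Println(len(gopher(2, tasks)), "tasks processed") // prints: 2 tasks processed
}
```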
Gopher Component: The Task Engine
Storage Backend Performance:
OCI Object Storage: Achieves GB/s download speeds through parallel chunk downloads and 20-thread concurrency. A 140GB Llama 3 70B model downloads in minutes.
HuggingFace Hub: Production-grade Golang client with automatic retry, rate limit handling, and resume support for interrupted downloads.
Unified Interface: Switch between storage providers by changing one URI prefix—no code changes needed.
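Parallel chunk downloading, as mentioned for OCI Object Storage, generally means splitting an object into byte ranges and fetching them with a bounded worker pool. The sketch below shows that general technique under simplifying assumptions: `fetchRange` stands in for a ranged GET against any backend, the result is assembled in memory, and none of the names come from OME.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchRange is a stand-in for a ranged read of bytes [start, end).
type fetchRange func(start, end int64) ([]byte, error)

// downloadParallel fetches size bytes in chunkSize pieces using the given
// number of workers and returns the assembled object.
func downloadParallel(size, chunkSize int64, workers int, fetch fetchRange) ([]byte, error) {
	if chunkSize <= 0 || workers <= 0 {
		return nil, fmt.Errorf("chunkSize and workers must be positive")
	}
	type chunk struct{ start, end int64 }
	var (
		out      = make([]byte, size)
		chunks   = make(chan chunk)
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range chunks {
				data, err := fetch(c.start, c.end)
				if err != nil {
					once.Do(func() { firstErr = err }) // remember only the first failure
					continue
				}
				copy(out[c.start:c.end], data) // ranges never overlap, so no lock is needed
			}
		}()
	}
	for start := int64(0); start < size; start += chunkSize {
		end := start + chunkSize
		if end > size {
			end = size
		}
		chunks <- chunk{start, end}
	}
	close(chunks)
	wg.Wait()
	return out, firstErr
}

func main() {
	src := []byte("pretend these bytes are model weights")
	fetch := func(start, end int64) ([]byte, error) { return src[start:end], nil }
	got, err := downloadParallel(int64(len(src)), 4, 20, fetch)
	fmt.Println(string(got) == string(src), err)
}
```

A production downloader would write chunks to disk rather than memory and retry failed ranges. The worker count of 20 in `main` mirrors the concurrency figure quoted above.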
Model Configuration Parser
The parser automatically extracts model metadata from config.json and safetensors files, determining exact parameter counts and capabilities. This eliminates manual configuration for 30+ supported model architectures.
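Parameter counts can be derived from a safetensors file without loading any weights, because the file starts with an 8-byte little-endian header length followed by a JSON header that lists every tensor's shape. The following sketch reads only that header. It is an illustration of the file format, not OME's parser, and the function names and the size limit are assumptions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
	"io"
)

// tensorInfo holds the per-tensor header fields this sketch needs.
type tensorInfo struct {
	DType string  `json:"dtype"`
	Shape []int64 `json:"shape"`
}

// maxHeaderBytes is an arbitrary sanity limit chosen for this sketch.
const maxHeaderBytes = 100 << 20

// countParameters reads the safetensors header from r and sums the element
// counts of all tensors.
func countParameters(r io.Reader) (int64, error) {
	var headerLen uint64
	if err := binary.Read(r, binary.LittleEndian, &headerLen); err != nil {
		return 0, fmt.Errorf("reading header length: %w", err)
	}
	if headerLen > maxHeaderBytes {
		return 0, fmt.Errorf("header length %d exceeds limit", headerLen)
	}
	raw := make(map[string]json.RawMessage)
	if err := json.NewDecoder(io.LimitReader(r, int64(headerLen))).Decode(&raw); err != nil {
		return 0, fmt.Errorf("decoding header: %w", err)
	}
	var total int64
	for name, msg := range raw {
		if name == "__metadata__" {
			continue // free-form metadata, not a tensor
		}
		var t tensorInfo
		if err := json.Unmarshal(msg, &t); err != nil {
			return 0, fmt.Errorf("decoding tensor %q: %w", name, err)
		}
		n := int64(1)
		for _, d := range t.Shape {
			n *= d
		}
		total += n
	}
	return total, nil
}

// fakeSafetensors builds an in-memory file containing only a header, which
// is all countParameters reads.
func fakeSafetensors(header string) io.Reader {
	var buf bytes.Buffer
	_ = binary.Write(&buf, binary.LittleEndian, uint64(len(header)))
	buf.WriteString(header)
	return &buf
}

func main() {
	header := `{"__metadata__":{"format":"pt"},` +
		`"w":{"dtype":"F16","shape":[4,8],"data_offsets":[0,64]},` +
		`"b":{"dtype":"F16","shape":[8],"data_offsets":[64,80]}}`
	n, err := countParameters(fakeSafetensors(header))
	fmt.Println(n, err) // prints: 40 <nil>
}
```

For a sharded checkpoint, the same function would be run on each shard and the results summed.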
State Management & Cleanup
OME provides self-healing state management through:
ConfigMap Reconciliation: Automatically recreates deleted ConfigMaps from an internal cache, ensuring model states are never lost
Node Labels: Enable pod scheduling decisions with labels like models.ome.io/basemodel.llama-3-70b=Ready
Finalizer-Based Cleanup: Ensures complete model removal across all nodes before deletion, even handling node failures gracefully
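To show how such node labels can feed placement decisions, the sketch below filters a set of nodes down to those where a model is marked Ready. The label key format follows the example above. The helper names and the in-memory node map are hypothetical, and in a real cluster this filtering would typically be expressed as a node selector rather than in application code.

```go
package main

import (
	"fmt"
	"sort"
)

// labelPrefix follows the label format shown in the article.
const labelPrefix = "models.ome.io/basemodel."

// modelLabelKey builds the node label key for a given BaseModel name.
func modelLabelKey(model string) string {
	return labelPrefix + model
}

// readyNodes returns the sorted names of nodes whose label marks the model
// as Ready. Any other value, or a missing label, counts as not ready.
func readyNodes(nodes map[string]map[string]string, model string) []string {
	var out []string
	key := modelLabelKey(model)
	for name, labels := range nodes {
		if labels[key] == "Ready" {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Node names are made up for illustration.
	nodes := map[string]map[string]string{
		"gpu-node-1": {"models.ome.io/basemodel.llama-3-70b": "Ready"},
		"gpu-node-2": {},
		"gpu-node-3": {"models.ome.io/basemodel.llama-3-70b": "Ready"},
	}
	fmt.Println(readyNodes(nodes, "llama-3-70b")) // prints: [gpu-node-1 gpu-node-3]
}
```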
The Result: Production-Grade Model Management
This architecture delivers capabilities unmatched by traditional approaches:
Scale: Tested with large multi-gigabyte models, supporting multiple nodes downloading multiple models simultaneously
Efficiency: Models download once per node, not per pod—saving petabytes of bandwidth
Reliability: Self-healing ConfigMaps, automatic retries, and graceful error handling ensure models are always available
Performance: GB/s download speeds with OCI Object Storage, 20x faster than naive implementations
Intelligence: Automatic model understanding eliminates manual configuration errors
Inference Workloads
Based on your InferenceService specification, OME deploys different components optimized for specific workload patterns, including PD-disaggregated serving, multi-node serving, standard deployment, and serverless deployment.
| Component | When Used | Primary Function | Key Optimizations |
| --- | --- | --- | --- |
| Engine | All deployments | Inference server (prefill in PD mode, full inference otherwise) | Compute optimization; batch processing; tensor parallelism |
| Decoder | PD-disaggregated only | Token generation with KV cache from engine | Memory bandwidth optimization; efficient cache management |
| Router | When specified or PD mode | Optimized request distribution | Cache-aware routing; connection pooling; health monitoring |
| Ingress | Automatically created | External API access | TLS termination; rate limiting; request routing |
The beauty of this architecture is its flexibility—start with a simple engine-only deployment and progressively adopt advanced patterns as your needs grow.
Layer 4: External Integrations - Ecosystem Power
OME doesn’t reinvent the wheel—it deeply integrates with the Kubernetes ecosystem:
Kubernetes Ecosystem Integration: Deep integration with modern Kubernetes components including Kueue for gang scheduling of multi-pod workloads, LeaderWorkerSet for resilient multi-node deployments, KEDA for advanced custom metrics-based autoscaling, K8s Gateway API for sophisticated traffic routing, and Gateway API Inference Extension for standardized inference endpoints.
SGLang: First-Class Runtime Support
SGLang is the primary runtime in OME, with deep native integration that showcases OME’s model-driven architecture capabilities.
Native Router Integration
OME provides native integration with SGLang’s router component, implementing:
Kubernetes Service Discovery: The router automatically discovers engine and decoder pods through Kubernetes APIs, adjusting to scaling events and pod lifecycle changes without manual intervention.
Least-Privilege RBAC: Each router receives minimal permissions—only the ability to list, get, and watch pods in its namespace. This prevents cross-tenant information leakage while enabling dynamic discovery.