In January 2026, Microsoft and NVIDIA released the second iteration of the NVIDIA Dynamo Planner, a tool for optimizing large language model (LLM) inference on Azure Kubernetes Service (AKS). The collaboration targets one of the hardest problems in production AI: scaling GPU resources to balance cost, latency, and throughput. This guide walks through Dynamo Planner's architecture, deployment patterns, and configuration strategies for enterprise LLM workloads.
The LLM Inference Challenge
Running LLMs in production presents unique operational challenges that traditional auto-scaling cannot address:
- GPU memory constraints: Models like Llama 3 70B require 140 GB+ of GPU memory for FP16 weights alone
- Variable request latency: Token generation time varies with sequence length
- Batching complexity: Optimal batch size depends on request mix and model size
- Cold start overhead: Model loading takes 30-60 seconds per GPU
- Cost pressure: A100/H100 GPUs cost $2-10/hour each
Dynamo Planner solves these challenges with AI-driven resource planning that understands LLM-specific workload patterns.
Dynamo Planner Architecture
```mermaid
graph TB
    subgraph AKS ["Azure Kubernetes Service"]
        subgraph Dynamo ["Dynamo Planner"]
            Predictor["Workload Predictor"]
            Optimizer["Resource Optimizer"]
            Scheduler["GPU Scheduler"]
        end
        subgraph InferencePool ["Inference Pool"]
            GPU1["Node: 4x A100"]
            GPU2["Node: 4x A100"]
            GPU3["Node: 8x H100"]
        end
        subgraph Models ["Model Deployments"]
            LLM1["LLaMA 70B"]
            LLM2["Mistral 8x7B"]
            Embed["Embedding Model"]
        end
    end
    subgraph External ["External"]
        Metrics["Azure Monitor"]
        Traffic["Inference Requests"]
    end

    Traffic --> Scheduler
    Metrics --> Predictor
    Predictor --> Optimizer
    Optimizer --> Scheduler
    Scheduler --> GPU1
    Scheduler --> GPU2
    Scheduler --> GPU3
    GPU1 --> LLM1
    GPU2 --> LLM2
    GPU3 --> Embed

    style Dynamo fill:#E8F5E9,stroke:#2E7D32
    style InferencePool fill:#E3F2FD,stroke:#1565C0
```
Core Components
| Component | Function | Key Features |
|---|---|---|
| Workload Predictor | Forecasts inference demand | Time-series ML, pattern recognition |
| Resource Optimizer | Calculates optimal GPU allocation | Cost-aware, SLO-driven |
| GPU Scheduler | Places workloads on nodes | Tensor parallelism aware, memory packing |
| Autoscaler Controller | Manages node pool scaling | Predictive scale-up, graceful drain |
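These components are driven by a small set of custom resources in the dynamo.nvidia.com API group, which the rest of this guide uses. Once the planner is installed (next section), a quick way to see what the chart registered is to list those resource kinds; a minimal sketch, since the exact CRD names can vary by chart version:

```bash
# List the custom resource kinds registered by the Dynamo Planner chart
# (exact kinds and short names may differ between chart versions)
kubectl api-resources --api-group=dynamo.nvidia.com -o wide
```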
Deploying Dynamo Planner on AKS
Prerequisites
```bash
# Create an AKS cluster (system node pool; the GPU pool is added next)
az aks create \
  --resource-group rg-llm-production \
  --name aks-llm-cluster \
  --node-count 2 \
  --node-vm-size Standard_D4s_v3 \
  --generate-ssh-keys

# Add GPU node pool with A100 GPUs
az aks nodepool add \
  --resource-group rg-llm-production \
  --cluster-name aks-llm-cluster \
  --name gpupool \
  --node-count 0 \
  --node-vm-size Standard_NC96ads_A100_v4 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10 \
  --node-taints "nvidia.com/gpu=true:NoSchedule"

# Fetch cluster credentials for kubectl and helm
az aks get-credentials --resource-group rg-llm-production --name aks-llm-cluster

# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
```
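Before installing the planner, it is worth confirming that the GPU Operator is healthy and that GPU nodes (once the autoscaler provisions any) advertise the nvidia.com/gpu resource. A minimal check, using the pool and namespace names from the commands above:

```bash
# GPU Operator pods (driver, device plugin, DCGM exporter) should be Running
kubectl get pods -n gpu-operator

# The GPU pool starts at 0 nodes; once a node exists it should expose nvidia.com/gpu
kubectl describe nodes -l agentpool=gpupool | grep -i "nvidia.com/gpu"
```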
Installing Dynamo Planner
```bash
# Add the NVIDIA Dynamo Helm repository
helm repo add nvidia-dynamo https://helm.ngc.nvidia.com/nvidia/dynamo

# Install Dynamo Planner with Azure integration
helm install dynamo-planner nvidia-dynamo/dynamo-planner \
  --namespace dynamo-system \
  --create-namespace \
  --set cloud.provider=azure \
  --set cloud.azure.subscriptionId=$AZURE_SUBSCRIPTION_ID \
  --set cloud.azure.resourceGroup=rg-llm-production \
  --set cloud.azure.aksCluster=aks-llm-cluster \
  --set metrics.azureMonitor.enabled=true \
  --set metrics.azureMonitor.workspaceId=$LOG_ANALYTICS_WORKSPACE_ID \
  --set optimizer.costOptimization.enabled=true \
  --set optimizer.costOptimization.maxHourlyCost=500
```
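After the release is deployed, verify that the planner components from the table above are running before applying any model manifests. A hedged check, since pod names and labels depend on the chart:

```bash
# Confirm the Helm release and the planner pods (predictor, optimizer, scheduler, autoscaler)
helm status dynamo-planner -n dynamo-system
kubectl get pods -n dynamo-system

# Tail planner logs while the first deployments roll out
# (the label selector assumes standard Helm chart labels)
kubectl logs -n dynamo-system -l app.kubernetes.io/instance=dynamo-planner --tail=50
```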
Configuring Model Deployments
Dynamo Planner uses custom resources to define LLM deployments with inference-specific requirements:
```yaml
apiVersion: dynamo.nvidia.com/v1
kind: InferenceDeployment
metadata:
  name: llama-70b-chat
  namespace: llm-inference
spec:
  model:
    name: meta-llama/Llama-3-70B-Instruct
    source: huggingface
    quantization: awq-int4      # 4-bit quantization for memory efficiency
  serving:
    engine: vllm                # or tensorrt-llm, triton
    maxConcurrentRequests: 256
    maxSequenceLength: 8192
  resources:
    gpu:
      type: nvidia-a100-80gb
      count: 4                  # Tensor parallel across 4 GPUs
      memoryFraction: 0.9
  scaling:
    minReplicas: 1
    maxReplicas: 8
    targetLatencyP99: 2000ms
    targetThroughput: 100       # tokens/second per replica
  slo:
    availability: 99.9
    latencyP50: 500ms
    latencyP99: 2000ms
  cost:
    maxHourlyCost: 100
    preferSpotInstances: true
    spotFallbackToOnDemand: true
```
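Rolling this out is a standard apply-and-watch workflow. A minimal sketch, assuming the CRD uses the conventional plural name for its kind:

```bash
# Create the namespace and apply the InferenceDeployment
kubectl create namespace llm-inference
kubectl apply -f llama-70b-chat.yaml

# Watch replicas come up and inspect planner-reported status
# (the plural "inferencedeployments" is assumed from the kind)
kubectl get inferencedeployments -n llm-inference -w
kubectl describe inferencedeployment llama-70b-chat -n llm-inference
```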
Multi-Model Deployment
```yaml
apiVersion: dynamo.nvidia.com/v1
kind: InferencePool
metadata:
  name: production-llm-pool
spec:
  deployments:
    - name: llama-70b-chat
      weight: 60            # 60% of traffic
      priority: high
    - name: mistral-8x7b-instruct
      weight: 30            # 30% of traffic
      priority: medium
    - name: embedding-model
      weight: 10            # 10% of traffic
      priority: low
      canShareGPU: true     # Allow co-location with other models
  routing:
    strategy: latency-aware # or round-robin, cost-optimized
    stickySession: false
  sharedResources:
    nodePool: gpupool
    maxNodes: 10
    enablePacking: true     # Pack small models on same GPU
```
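Once the pool is applied, you can see how the scheduler packed models onto the GPU nodes by looking at pod placement and per-node GPU requests. A hedged sketch using standard kubectl output, with the resource names from the manifest above:

```bash
# Apply the pool and check its status
kubectl apply -f production-llm-pool.yaml
kubectl get inferencepools

# Which model pods landed on which GPU node, and each node's allocated GPU requests
kubectl get pods -n llm-inference -o wide
kubectl describe nodes -l agentpool=gpupool | grep -A 8 "Allocated resources"
```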
Advanced Optimization Features
Predictive Scaling
Dynamo Planner uses ML to predict traffic 15 minutes ahead, pre-warming GPU nodes before demand spikes:
```yaml
apiVersion: dynamo.nvidia.com/v1
kind: ScalingPolicy
metadata:
  name: predictive-scaling
spec:
  targetRef:
    kind: InferenceDeployment
    name: llama-70b-chat
  predictive:
    enabled: true
    lookAheadMinutes: 15
    confidenceThreshold: 0.8
    historicalDataDays: 30
  schedules:
    # Pre-warm for known traffic patterns
    - name: business-hours
      cron: "0 8 * * 1-5"   # 8 AM weekdays
      minReplicas: 4
    - name: weekend-reduction
      cron: "0 0 * * 0,6"   # Midnight Saturday/Sunday
      maxReplicas: 2
  reactive:
    scaleUpThreshold: 70    # CPU/GPU utilization %
    scaleDownThreshold: 30
    scaleUpCooldown: 60s
    scaleDownCooldown: 300s
```
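The easiest way to see the policy in action is to watch replica counts and events around a known traffic ramp, for example just before the business-hours schedule kicks in. A minimal sketch, assuming the model pods carry a label matching the deployment name:

```bash
# Watch replicas being pre-warmed ahead of a predicted or scheduled spike
# (the app label is an assumption about how the planner labels model pods)
kubectl get pods -n llm-inference -l app=llama-70b-chat -w

# Scaling decisions typically surface as Kubernetes events
kubectl get events -n llm-inference --sort-by=.lastTimestamp | grep -i scale
```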
Cost Optimization
```yaml
apiVersion: dynamo.nvidia.com/v1
kind: CostPolicy
metadata:
  name: production-cost-controls
spec:
  budget:
    hourly: 500
    daily: 10000
    monthly: 200000
  strategies:
    - name: spot-instances
      enabled: true
      maxSpotPercentage: 70
      fallbackToOnDemand: true
      interruptionHandling: graceful-drain
    - name: gpu-consolidation
      enabled: true
      consolidationWindow: 5m
      minUtilizationForConsolidation: 30
    - name: model-offloading
      enabled: true
      offloadIdleModelsAfter: 10m
      offloadTarget: cpu    # or disk
  alerts:
    - threshold: 80         # % of budget
      action: notify
      channels: ["slack", "pagerduty"]
    - threshold: 95
      action: scale-down-non-critical
    - threshold: 100
      action: reject-new-requests
```
Enable model-offloading for development and staging environments. Dynamo Planner can offload models to CPU RAM or NVMe during idle periods, reducing GPU costs by up to 80% for non-production workloads.
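Budget consumption and triggered alert thresholds should be visible on the CostPolicy resource itself and in its events; a hedged check, since the exact status fields depend on the controller version:

```bash
# Inspect budget status and any fired alerts on the CostPolicy
kubectl describe costpolicy production-cost-controls
kubectl get events --field-selector involvedObject.kind=CostPolicy --sort-by=.lastTimestamp
```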
Monitoring and Observability
Dynamo Planner integrates with Azure Monitor for comprehensive observability:
```yaml
apiVersion: dynamo.nvidia.com/v1
kind: ObservabilityConfig
metadata:
  name: production-observability
spec:
  metrics:
    azureMonitor:
      enabled: true
      customMetrics:
        - name: llm_tokens_per_second
        - name: llm_time_to_first_token
        - name: llm_queue_depth
        - name: gpu_memory_utilization
  tracing:
    enabled: true
    samplingRate: 0.1       # 10% of requests
    exporter: azure-monitor
  logging:
    level: info
    includePrompts: false   # Privacy: don't log prompts
    includeTokenCounts: true
  dashboards:
    grafana:
      enabled: true
      autoProvision: true
    azureDashboard:
      enabled: true
      resourceGroup: rg-llm-production
```
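With dashboards.grafana.autoProvision enabled, the chart should stand up a Grafana instance with LLM dashboards pre-loaded. A hedged way to reach it locally; the service name and namespace are assumptions, so use whatever the first command returns:

```bash
# Find the provisioned Grafana service (name and namespace are assumptions)
kubectl get svc -n dynamo-system | grep -i grafana

# Port-forward and open http://localhost:3000
# (replace the service name with the one found above)
kubectl port-forward -n dynamo-system svc/dynamo-planner-grafana 3000:80
```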
Key Metrics to Monitor
| Metric | Description | Target |
|---|---|---|
| Time to First Token (TTFT) | Latency before first token generated | <500ms |
| Tokens Per Second (TPS) | Generation throughput | >50/request |
| Queue Depth | Pending requests | <100 |
| GPU Memory Utilization | VRAM usage percentage | 80-90% |
| Cost Per 1K Tokens | Inference cost efficiency | <$0.01 |
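Because the custom metrics are exported to the Log Analytics workspace configured at install time, they can also be queried with KQL from the CLI. A hedged example for p99 time-to-first-token; the table and column names (InsightsMetrics, Name, Val) are assumptions that depend on how the exporter writes custom metrics:

```bash
# Query p99 time-to-first-token over the last hour (table/column names assumed)
az monitor log-analytics query \
  --workspace $LOG_ANALYTICS_WORKSPACE_ID \
  --analytics-query "
    InsightsMetrics
    | where Name == 'llm_time_to_first_token'
    | where TimeGenerated > ago(1h)
    | summarize p99_ms = percentile(Val, 99) by bin(TimeGenerated, 5m)"
```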
Integration with Azure AI Services
```csharp
// C# client using a Dynamo-managed endpoint
using System;
using System.Diagnostics;
using Azure;
using Azure.AI.Inference;

var client = new ChatCompletionsClient(
    new Uri("https://aks-llm-cluster.eastus.inference.ml.azure.com"),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("DYNAMO_API_KEY"))
);

var stopwatch = Stopwatch.StartNew();

var response = await client.CompleteAsync(new ChatCompletionsOptions
{
    Model = "llama-70b-chat", // Maps to the InferenceDeployment name
    Messages =
    {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("Explain Kubernetes pod scheduling.")
    },
    MaxTokens = 1000,
    Temperature = 0.7f
});

stopwatch.Stop();

Console.WriteLine(response.Value.Choices[0].Message.Content);
Console.WriteLine($"Tokens used: {response.Value.Usage.TotalTokens}");
Console.WriteLine($"Generated {response.Value.Usage.CompletionTokens} tokens in {stopwatch.ElapsedMilliseconds} ms");
```
Key Takeaways
- NVIDIA Dynamo Planner provides AI-driven resource optimization specifically designed for LLM inference workloads on Kubernetes.
- Predictive scaling pre-warms GPU nodes before traffic spikes, avoiding most cold start delays.
- Cost controls enable budget limits, spot instance utilization, and automatic model offloading for idle workloads.
- Multi-model deployments with GPU packing maximize utilization across heterogeneous model sizes.
- Azure integration provides native monitoring, managed identity authentication, and AKS node pool management.
Conclusion
NVIDIA Dynamo Planner addresses the operational complexity of running LLMs in production on Kubernetes. By combining predictive scaling, cost optimization, and GPU-aware scheduling, it enables enterprises to deploy large language models with confidence. The tight integration with Azure Kubernetes Service and Azure Monitor makes it particularly attractive for organizations already invested in the Microsoft ecosystem. For teams struggling with GPU utilization, inference latency, or cloud costs, Dynamo Planner represents a significant step toward production-grade AI infrastructure.