DIY LLMOps: Building Your Own AI Platform with Kubernetes and Open Source

After building LLMOps platforms on Kubernetes with fully
open-source tooling, I’ve learned that you don’t need expensive vendor platforms to run production LLM applications.
This guide shows how to build a complete LLMOps platform using Kubernetes, GitHub Actions, and open-source
tools—achieving enterprise-grade capabilities at a fraction of the cost.

1. Why DIY LLMOps?

Commercial LLMOps platforms (Databricks, Azure ML, Vertex AI) are powerful but expensive:

  • High cost: $50K-500K+/year for platform fees alone
  • Vendor lock-in: Proprietary APIs make migration difficult
  • Over-engineered: Most teams don’t need 90% of features
  • Limited customization: Can’t modify to fit your workflows

Open-source alternatives provide roughly 80% of the functionality at about 20% of the cost.

2. Architecture: Complete LLMOps Stack

2.1 Component Overview

  • Infrastructure: Kubernetes (EKS/GKE/AKS)
  • Model Registry: MLflow
  • Training Orchestration: Kubeflow / Ray
  • Serving: vLLM / Text Generation Inference
  • Monitoring: Prometheus + Grafana
  • CI/CD: GitHub Actions
  • Storage: S3 / GCS / Azure Blob

3. Foundation: Kubernetes Setup

# Provision the EKS cluster defined in eks-cluster.tf (below)
terraform init
terraform apply

# eks-cluster.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "llmops-cluster"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # Node groups for different workloads
  eks_managed_node_groups = {
    # General purpose nodes
    general = {
      instance_types = ["m5.2xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }
    
    # GPU nodes for training/inference
    gpu = {
      instance_types = ["p3.2xlarge"]
      ami_type       = "AL2_x86_64_GPU" # GPU-optimized AMI ships the NVIDIA drivers
      min_size       = 0
      max_size       = 5
      desired_size   = 1
      
      labels = {
        workload = "gpu"
      }
      
      # EKS managed node groups use the AWS API effect names (NO_SCHEDULE, not NoSchedule)
      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}
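
Before moving on, it's worth confirming that the GPU node group actually exposes GPUs to the scheduler (the NVIDIA device plugin has to be running for nvidia.com/gpu to show up as an allocatable resource). Here's a minimal sketch using the official kubernetes Python client; the check_gpu_nodes.py filename and the assumption that your kubeconfig already points at the new cluster are mine, not part of the Terraform above.

# check_gpu_nodes.py - sanity-check that GPU nodes registered their GPUs
from kubernetes import client, config

config.load_kube_config()  # assumes kubeconfig points at the new cluster
v1 = client.CoreV1Api()

# Only look at nodes carrying the workload=gpu label from the node group above
for node in v1.list_node(label_selector="workload=gpu").items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")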

4. Model Registry: MLflow Setup

# mlflow-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: llmops
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        # The stock MLflow image may need psycopg2-binary and boto3 added for the Postgres/S3 backends
        image: ghcr.io/mlflow/mlflow:v2.9.0
        args:
          - server
          # NOTE: pull the real DB credentials from a Secret rather than hardcoding them
          - --backend-store-uri=postgresql://mlflow:password@postgres:5432/mlflow
          - --default-artifact-root=s3://my-mlflow-artifacts
          - --host=0.0.0.0
          - --port=5000
        ports:
        - containerPort: 5000
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: access-key-id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: secret-access-key
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: llmops
spec:
  selector:
    app: mlflow
  ports:
  - port: 80
    targetPort: 5000
  type: LoadBalancer
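
With the registry running, training code only needs the tracking URI to start logging. A quick smoke test, assuming the in-cluster DNS name for the Service above (use the LoadBalancer address if you're calling from outside the cluster):

# mlflow_smoke_test.py - confirm the tracking server is reachable
import mlflow

# Assumed in-cluster address; adjust if your Service name or namespace differs
mlflow.set_tracking_uri("http://mlflow-service.llmops.svc.cluster.local")

with mlflow.start_run(run_name="smoke-test"):
    mlflow.log_param("component", "mlflow-server")
    mlflow.log_metric("health", 1.0)

print("Logged a run to", mlflow.get_tracking_uri())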

5. Model Training with Ray

# train_llm.py - LLM fine-tuning on a Ray GPU actor, tracked in MLflow
import ray
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import mlflow

# Assumes MLFLOW_TRACKING_URI is set to the MLflow service from section 4

@ray.remote(num_gpus=1)
class LLMTrainer:
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model = None
        self.tokenizer = None
    
    def load_model(self):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
    
    def train(self, dataset, output_dir: str):
        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            learning_rate=2e-5,
            fp16=True,
            logging_steps=100,
            save_steps=1000,
            evaluation_strategy="steps",
            eval_steps=500
        )
        
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=dataset["train"],
            eval_dataset=dataset["validation"]
        )
        
        # Train with MLflow tracking
        with mlflow.start_run():
            mlflow.log_params({
                "model_name": self.model_name,
                "learning_rate": training_args.learning_rate,
                "batch_size": training_args.per_device_train_batch_size
            })
            
            result = trainer.train()
            
            mlflow.log_metrics({
                "train_loss": result.training_loss,
                "eval_loss": trainer.evaluate()["eval_loss"]
            })
            
            # Save model + tokenizer to MLflow (the transformers flavor expects both)
            mlflow.transformers.log_model(
                transformers_model={"model": self.model, "tokenizer": self.tokenizer},
                artifact_path="model",
                registered_model_name="my-llm"
            )
        
        return result

# Run distributed training
ray.init(address="ray://ray-head:10001")

trainer = LLMTrainer.remote("meta-llama/Llama-2-7b-hf")
ray.get(trainer.load_model.remote())

# Load dataset
from datasets import load_dataset
dataset = load_dataset("your-dataset")

# Train
result = ray.get(trainer.train.remote(dataset, "/models/output"))
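
The script above registers the model in MLflow, but something still has to land the weights on the volume the serving layer reads from. One option is a small Job (or init container) that pulls the newest registered version onto the shared PVC. A sketch, assuming the my-llm registry name from the training script and the /models/my-llm mount path used in the next section:

# fetch_model.py - copy the newest registered model onto the serving volume
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Pick the most recently created version of the registered model
latest = max(client.search_model_versions("name='my-llm'"),
             key=lambda v: int(v.version))

# Download its artifacts to the path vLLM will load from
mlflow.artifacts.download_artifacts(
    artifact_uri=f"models:/my-llm/{latest.version}",
    dst_path="/models/my-llm",
)
print(f"Downloaded my-llm v{latest.version} to /models/my-llm")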

6. Model Serving with vLLM

# vllm-deployment.yaml - High-performance LLM serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: llmops
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        workload: gpu
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model=/models/my-llm
          - --tensor-parallel-size=1
          - --max-model-len=4096
          - --gpu-memory-utilization=0.9
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
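
The manifest above doesn't define a Service, so the hostname below is an assumption (say, a vllm-service ClusterIP on port 8000 in the llmops namespace). Once that exists, any OpenAI SDK can talk to vLLM directly:

# query_vllm.py - call the OpenAI-compatible endpoint exposed by vLLM
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-service.llmops.svc.cluster.local:8000/v1",  # assumed Service name
    api_key="not-needed",  # vLLM ignores the key unless --api-key is set
)

response = client.completions.create(
    model="/models/my-llm",  # vLLM serves the model under its --model path by default
    prompt="Summarize what an LLMOps platform does in one sentence.",
    max_tokens=64,
)
print(response.choices[0].text)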

7. CI/CD Pipeline with GitHub Actions

# .github/workflows/llmops-pipeline.yml
name: LLMOps Pipeline

on:
  push:
    branches: [main]

jobs:
  validate-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: pip install -r requirements.txt
      
      - name: Validate model config
        run: python scripts/validate_model_config.py
      
      - name: Run model tests
        run: pytest tests/model_tests/

  train-model:
    needs: validate-model
    # Assumes a self-hosted runner (or network path) that can reach the in-cluster Ray head
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Ray CLI
        run: pip install -U "ray[default]"

      - name: Trigger Ray training job
        run: |
          ray job submit --address=http://ray-head:8265 \
            --runtime-env-json='{"working_dir": "./"}' \
            -- python train_llm.py
      
      - name: Wait for training completion
        run: python scripts/wait_for_training.py
  
  register-model:
    needs: train-model
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Register model in MLflow
        run: |
          python scripts/register_model.py \
            --run-id=${{ env.MLFLOW_RUN_ID }} \
            --model-name=production-llm \
            --stage=Staging
  
  deploy-staging:
    needs: register-model
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Assumes the runner has kubectl access to the cluster (self-hosted or kubeconfig via secrets)
      - name: Deploy to staging
        run: |
          # Restart so the pods reload the newly registered model from the shared volume
          kubectl rollout restart deployment/vllm-server-staging \
            --namespace=llmops-staging

          kubectl rollout status deployment/vllm-server-staging \
            --namespace=llmops-staging

      - name: Run integration tests
        run: pytest tests/integration/
  
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Promote model to production
        run: |
          python scripts/promote_model.py \
            --model-name=production-llm \
            --stage=Production

      # Assumes the runner has kubectl access to the cluster (self-hosted or kubeconfig via secrets)
      - name: Deploy to production
        run: |
          # Restart so the pods reload the newly promoted model from the shared volume
          kubectl rollout restart deployment/vllm-server \
            --namespace=llmops-prod

          kubectl rollout status deployment/vllm-server \
            --namespace=llmops-prod
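
The workflow leans on a few small helper scripts. Here's one possible shape for scripts/promote_model.py, matching the arguments used above; it's a sketch against the MLflow model registry API, not the only way to do it:

# scripts/promote_model.py - move the newest model version into a registry stage
import argparse
from mlflow.tracking import MlflowClient

parser = argparse.ArgumentParser()
parser.add_argument("--model-name", required=True)
parser.add_argument("--stage", required=True, choices=["Staging", "Production"])
args = parser.parse_args()

client = MlflowClient()

# Promote the most recently created version of the model
latest = max(client.search_model_versions(f"name='{args.model_name}'"),
             key=lambda v: int(v.version))

client.transition_model_version_stage(
    name=args.model_name,
    version=latest.version,
    stage=args.stage,
    archive_existing_versions=True,  # retire whatever was in this stage before
)
print(f"{args.model_name} v{latest.version} -> {args.stage}")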

8. Monitoring Stack

# Install Prometheus + Grafana (kube-prometheus-stack)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values monitoring-values.yaml

# monitoring-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi

grafana:
  # Prefer sourcing the admin password from a Secret rather than committing it
  adminPassword: your-secure-password
  dashboardProviders:
    dashboards.yaml:
      apiVersion: 1
      providers:
      - name: 'llm-dashboards'
        folder: 'LLM Monitoring'
        type: file
        options:
          path: /var/lib/grafana/dashboards
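
kube-prometheus-stack covers cluster and node metrics, but application-level LLM signals (token throughput, end-to-end request latency) have to be exported by your own services. A minimal sketch using prometheus_client; the metric names and the idea of a small gateway in front of vLLM are assumptions, not something the Helm chart provides:

# llm_metrics.py - expose LLM-level metrics for Prometheus to scrape
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

TOKENS_GENERATED = Counter(
    "llm_tokens_generated_total", "Total tokens generated", ["model"]
)
REQUEST_LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end request latency", ["model"]
)

def record_request(model: str, latency_s: float, tokens: int) -> None:
    TOKENS_GENERATED.labels(model=model).inc(tokens)
    REQUEST_LATENCY.labels(model=model).observe(latency_s)

if __name__ == "__main__":
    start_http_server(9100)  # scrape target at :9100/metrics (add a ServiceMonitor for it)
    while True:  # fake traffic so a dashboard has something to show
        record_request("my-llm", random.uniform(0.1, 0.8), random.randint(20, 200))
        time.sleep(1)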

9. Cost Analysis

9.1 DIY LLMOps Platform Cost (AWS)

  • EKS Cluster: $70/month (control plane)
  • Worker Nodes: $800/month (3x m5.2xlarge)
  • GPU Nodes: $2,400/month (1x p3.2xlarge on-demand)
  • Storage (S3): $100/month (models + artifacts)
  • RDS PostgreSQL: $150/month (MLflow backend)
  • Data Transfer: $50/month

Total: ~$3,570/month = $42,840/year

9.2 Comparison vs Commercial Platforms

  • Databricks Lakehouse Platform: $100K-300K/year
  • Azure ML: $80K-200K/year
  • Vertex AI: $75K-250K/year

Savings: 60-85% cost reduction


10. Case Study: Production Implementation

10.1 Deployment Stats

  • Models: 15 LLMs in production
  • Throughput: 50M tokens/day
  • Latency: p95 < 500ms
  • Uptime: 99.8%
  • Team size: 2 ML engineers

10.2 Results

  • $120K annual savings vs Databricks
  • 2-week setup time (vs 3-6 months for custom build)
  • Full control over infrastructure
  • No vendor lock-in

11. Best Practices

  • Start simple: MLflow + vLLM get you 80% of the way there
  • Use managed Kubernetes: EKS/GKE/AKS saves significant ops overhead
  • Use Spot instances for training: up to ~70% cost savings
  • Monitor from day one: Prometheus + Grafana are essential
  • Automate everything: GitHub Actions for CI/CD

12. Conclusion

Building a DIY LLMOps platform with open-source tools provides:

  • 60-85% cost savings vs commercial platforms
  • Full control and customization
  • No vendor lock-in
  • Production-grade capabilities

This approach is a good fit for startups and mid-size companies that aren't ready for $100K+ platform fees.

Written for ML platform engineers and technical leaders building cost-effective LLMOps infrastructure.

