How to deploy an AI model on Kubernetes, step by step

2026-06-12

In my homelab I serve a local AI model behind an OpenAI-compatible API, running on Kubernetes with a consumer GPU. This post walks you through building it from scratch, with the same steps (and the same stumbles) I went through.

What we're building

A llama.cpp Deployment serving a GGUF model (Gemma 4 12B in my case), exposed as a /v1/chat/completions API. Any client that speaks the OpenAI API can talk to your cluster.

Prerequisites

A Kubernetes cluster. I use single-node K3s: it auto-detects the NVIDIA runtime if nvidia-container-toolkit is installed.
A GPU with enough VRAM (we'll do the math below) and its drivers.
The NVIDIA device plugin deployed: it's what advertises nvidia.com/gpu as a schedulable resource.

Step 1: pick a model and quantization

Rule of thumb for a 4-bit GGUF (Q4_K_M): ~0.6 GB of VRAM per billion parameters, plus the KV cache (depends on context size). A 12B model in Q4 is ~7.5 GB of weights; on my 16 GB RTX 5060 Ti it fits with 128K context to spare.
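
To make that arithmetic concrete, here is a rough sketch of the estimate in shell. The 0.6 GB per billion parameters comes from the rule of thumb above; the KV cache formula is the standard one for an f16 cache without sliding-window attention, and the layer count, KV head count, and head dimension passed in below are hypothetical placeholders, not Gemma's real architecture. Read the real values from the model card or the GGUF metadata.

```shell
# Rule-of-thumb weight size for a Q4_K_M GGUF: ~0.6 GB per billion parameters.
weights_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 0.6 }'
}

# Upper-bound KV cache size in GiB for an f16 cache:
# 2 (K and V) * layers * kv_heads * head_dim * context tokens * 2 bytes.
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * c * 2 / (1024 ^ 3) }'
}

weights_gb 12                  # 12 billion parameters
kv_cache_gib 48 8 128 32768    # HYPOTHETICAL architecture: 48 layers, 8 KV heads, head_dim 128, 32K context
```

Models that use sliding-window attention or a quantized KV cache need less than this formula suggests, so treat the result as a ceiling and confirm it against measured VRAM.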

Download the GGUF (Hugging Face: look for the official repo or community ones like bartowski) and drop it on a node disk, e.g. /data/models/.

Step 2: a PersistentVolume for the models

GGUFs weigh gigabytes: they don't belong inside the image. A local PV pointing at the folder on disk; the PVC that claims it must be named llm-models, the name the Deployment references in Step 3:

yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-models
spec:
  capacity: { storage: 30Gi }
  accessModes: [ReadWriteOnce]
  storageClassName: local-hdd
  local: { path: /data/models }
  nodeAffinity:        # a local PV lives on ONE specific node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - { key: kubernetes.io/hostname, operator: In, values: [my-node] }
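
The matching PVC is not shown above, so here is a minimal sketch of what it could look like. It assumes the claim name llm-models (the name the Deployment in Step 3 references), the same local-hdd storage class, and that it is created in the same namespace as the Deployment. The volumeName field binds it to the PV explicitly.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models            # must match claimName in the Deployment
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-hdd # same class as the PV above
  volumeName: llm-models      # bind explicitly to the PV defined above
  resources:
    requests: { storage: 30Gi }
```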

Step 3: the llama.cpp Deployment

The centerpiece. What matters: llama.cpp's server-cuda image, the nvidia.com/gpu: 1 resource, and the server flags.

yaml

apiVersion: apps/v1
kind: Deployment
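metadata: { name: llm }
spec:
  replicas: 1
  strategy: { type: Recreate }   # the GPU can't back 2 replicas at once
  selector:
    matchLabels: { app: llm }    # required in apps/v1; must match the pod template labels
  template:
    metadata:
      labels: { app: llm }
    spec: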
      runtimeClassName: nvidia
      containers:
        - name: llama-server
          image: ghcr.io/ggml-org/llama.cpp:server-cuda  # pin to a digest in production
          args:
            - --model
            - /models/gemma-4-12B-it-Q4_K_M.gguf
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --n-gpu-layers
            - "99"            # every layer on the GPU
            - --ctx-size
            - "32768"
            - --flash-attn
            - "on"
          env:
            - name: LLAMA_API_KEY   # protect the endpoint
              valueFrom: { secretKeyRef: { name: llm-secrets, key: API_KEY } }
          resources:
            limits: { memory: 8Gi, nvidia.com/gpu: "1" }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 30   # loading gigabytes from disk takes a while
          volumeMounts:
            - { name: models, mountPath: /models, readOnly: true }
      volumes:
        - name: models
          persistentVolumeClaim: { claimName: llm-models }

Details that matter:

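Generous probes: the server takes a while to load the model from disk, and /health does not return 200 until loading finishes. A failing readiness probe only keeps the pod out of the Service; it is a liveness probe with too short a delay that makes Kubernetes restart the container before the model finishes loading.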
API key as a Secret, never in the manifest.
--ctx-size sets the context window and the KV cache size: more context = more VRAM. Start conservative and raise it while measuring.
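
If you also want Kubernetes to restart a hung server, one option is a startupProbe that covers the slow load plus a livenessProbe that only takes over once startup has succeeded. This is a sketch to add under the container, not part of my original manifest; the thresholds are assumptions you should size to how long your disk needs to load the model.

```yaml
startupProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 30    # assumed budget: up to ~5 minutes for the model to load
livenessProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 3     # restart only after repeated failures once started
```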

Step 4: Service and verification

A ClusterIP Service on port 8000, then test it from any pod that has curl ($API_KEY is expanded by your local shell, so export it first):

bash

kubectl exec -it deploy/any-pod -- \
  curl http://llm.my-namespace.svc:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hi"}]}'
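
For reference, the ClusterIP Service behind that URL could look like the sketch below. The my-namespace namespace is the placeholder from the curl command, and the app: llm selector is an assumption: it only works if the Deployment's pod template carries that label.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm
  namespace: my-namespace   # placeholder, as in the curl URL
spec:
  type: ClusterIP
  selector: { app: llm }    # ASSUMED pod label; must match the pod template
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
```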

And watch real VRAM with nvidia-smi: it's your source of truth for tuning context and quantization.
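
A small helper makes the nvidia-smi numbers easier to read while tuning. This sketch uses nvidia-smi's CSV query mode and formats the result with awk; vram_summary is a name I made up for illustration. Run it on the GPU node, or inside the pod if nvidia-smi is available there.

```shell
# Turn "used, total" MiB pairs (one line per GPU) into a readable summary.
vram_summary() {
  awk -F', *' '{ printf "GPU %d: %d / %d MiB (%.0f%%)\n", NR - 1, $1, $2, 100 * $1 / $2 }'
}

# Only query the GPU when nvidia-smi is actually present on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total \
    --format=csv,noheader,nounits | vram_summary
fi
```

Compare the reading before and after sending a long prompt: the difference is roughly what the context is costing you.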

Stumbles I learned from

The GPU can be shared: with time-slicing in the device plugin, the same chip serves the LLM and transcodes video (Jellyfin). VRAM isn't partitioned though: do your own math.
Every model has its own sampling settings: I migrated from Qwen to Gemma and answers degenerated into repetition loops. Gemma needs --temp 1.0. Read the model card.
GitOps here too: my manifest lives in Git and ArgoCD applies it. Swapping models is a two-line edit and a push.
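
For the GPU sharing mentioned above, the NVIDIA device plugin reads a time-slicing config along these lines. This is a sketch: the replica count of 2 is an assumption for one LLM plus one transcoder, and how the file reaches the plugin (a ConfigMap, a Helm value) depends on how you installed it.

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2   # ASSUMED: the one physical GPU is advertised as 2 schedulable GPUs
```

Each pod still requests nvidia.com/gpu: 1. The replicas share compute in turns, and nothing stops one of them from using all the VRAM.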

Questions while building yours? Reach out from the home page. And if you want to see this setup in action, this site's chat runs exactly like this.