How to deploy an AI model on Kubernetes, step by step
In my homelab I serve a local AI model behind an OpenAI-compatible API, running on Kubernetes with a consumer GPU. This post walks you through building it from scratch, with the same steps (and the same stumbles) I went through.
What we're building
A llama.cpp Deployment serving a GGUF model (Gemma 4 12B in my case), exposed as a /v1/chat/completions API. Anything that speaks OpenAI can speak to your cluster.
Prerequisites
- A Kubernetes cluster. I use single-node K3s: it auto-detects the NVIDIA runtime if nvidia-container-toolkit is installed.
- A GPU with enough VRAM (we'll do the math below) and its drivers.
- The NVIDIA device plugin deployed: it's what advertises nvidia.com/gpu as a schedulable resource.
Step 1: pick a model and quantization
Rule of thumb for a 4-bit GGUF (Q4_K_M): ~0.6 GB of VRAM per billion parameters, plus the KV cache (depends on context size). A 12B model in Q4 is ~7.5 GB of weights; on my 16 GB RTX 5060 Ti it fits with 128K context to spare.
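To make the rule of thumb concrete, here is the arithmetic for the 12B case as a rough sketch. The first two lines use only the numbers above. The third is the usual KV cache estimate for a standard full-attention transformer, which is an assumption on my part: the symbols (layer count, KV heads, head dimension, context length, and b bytes per cached element, 2 for f16) come from the model card or GGUF metadata, and models with sliding-window attention layers will need less than this.

```latex
\begin{align*}
\text{weights} &\approx 0.6\ \text{GB} \times 12 = 7.2\ \text{GB}
  \quad (\text{close to the} \sim 7.5\ \text{GB file}) \\
\text{headroom} &\approx 16\ \text{GB} - 7.5\ \text{GB} = 8.5\ \text{GB}
  \quad (\text{KV cache, compute buffers, anything else on the GPU}) \\
\text{KV cache} &\approx 2 \cdot n_{\text{layers}} \cdot n_{\text{kv heads}}
  \cdot d_{\text{head}} \cdot n_{\text{ctx}} \cdot b \ \text{bytes}
\end{align*}
```

Treat the result as a starting estimate and confirm it against what the GPU actually reports.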
Download the GGUF (Hugging Face: look for the official repo or community ones like bartowski) and drop it on a node disk, e.g. /data/models/.
Step 2: a PersistentVolume for the models
GGUFs weigh gigabytes: they don't belong inside the image. A local PV pointing at the disk folder, plus its PVC:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-models
spec:
  capacity: { storage: 30Gi }
  accessModes: [ReadWriteOnce]
  storageClassName: local-hdd
  local: { path: /data/models }
  nodeAffinity: # a local PV lives on ONE specific node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - { key: kubernetes.io/hostname, operator: In, values: [my-node] }
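The matching PVC isn't shown above, so here is a minimal sketch of what it could look like. The claim name is taken from the Deployment's claimName (llm-models); pinning volumeName to the PV is my assumption, a simple way to make sure the claim binds to this exact volume. The PVC has to live in the same namespace as the Deployment.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models # must match claimName in the Deployment
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-hdd # same class as the PV
  volumeName: llm-models # assumption: bind explicitly to the PV above
  resources:
    requests: { storage: 30Gi }
```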
Step 3: the llama.cpp Deployment
The centerpiece. What matters: llama.cpp's server-cuda image, the nvidia.com/gpu: 1 resource, and the server flags.
apiVersion: apps/v1
kind: Deployment
metadata: { name: llm }
spec:
  replicas: 1
  strategy: { type: Recreate } # the GPU can't back 2 replicas at once
  selector:
    matchLabels: { app: llm } # required by apps/v1; must match the pod labels below
  template:
    metadata:
      labels: { app: llm }
    spec:
      runtimeClassName: nvidia
      containers:
        - name: llama-server
          image: ghcr.io/ggml-org/llama.cpp:server-cuda # pin to a digest in production
          args:
            - --model
            - /models/gemma-4-12B-it-Q4_K_M.gguf
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --n-gpu-layers
            - "99" # every layer on the GPU
            - --ctx-size
            - "32768"
            - --flash-attn
            - "on"
          env:
            - name: LLAMA_API_KEY # protect the endpoint
              valueFrom: { secretKeyRef: { name: llm-secrets, key: API_KEY } }
          resources:
            limits: { memory: 8Gi, nvidia.com/gpu: "1" }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 30 # loading gigabytes from disk takes a while
          volumeMounts:
            - { name: models, mountPath: /models, readOnly: true }
      volumes:
        - name: models
          persistentVolumeClaim: { claimName: llm-models }
Details that matter:
- Generous probes: the server takes a while to load the model from disk. A readiness probe only keeps traffic away until /health responds, but if you also add a liveness probe and its initialDelaySeconds is too short, Kubernetes will restart the container before loading finishes.
- API key as a Secret, never in the manifest.
- --ctx-size sets the context window and the KV cache size: more context = more VRAM. Start conservative and raise it while measuring.
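If model loading is slow enough that probe timing becomes a worry, one option is a startupProbe, which holds off the other probes until it succeeds. This is a sketch, not what my manifest uses: the period and threshold are assumptions you should size to your own disk and model. The fragment goes under the llama-server container, next to readinessProbe.

```yaml
startupProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10 # assumption: check every 10 s
  failureThreshold: 60 # assumption: allow up to ~10 minutes of loading
```

With this in place, readiness and liveness checks only start once the server has answered /health successfully for the first time.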
Step 4: Service and verification
A ClusterIP Service on port 8000, then test it:
kubectl exec -it deploy/any-pod -- \
  curl http://llm.my-namespace.svc:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hi"}]}'
And watch real VRAM with nvidia-smi: it's your source of truth for tuning context and quantization.
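For reference, the ClusterIP Service from this step could look like the sketch below. The name and namespace are inferred from the URL in the curl command (llm.my-namespace.svc). The app: llm selector is an assumption: it only works if the Deployment's pod template carries that label, so adjust it to whatever labels your pods actually have.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm
  namespace: my-namespace
spec:
  type: ClusterIP
  selector:
    app: llm # assumption: the pod template is labeled app: llm
  ports:
    - name: http
      port: 8000
      targetPort: 8000
```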
Stumbles I learned from
- The GPU can be shared: with time-slicing in the device plugin, the same chip serves the LLM and transcodes video (Jellyfin). VRAM isn't partitioned though: do your own math.
- Every model has its own sampling: I migrated from Qwen to Gemma and answers degenerated into repetition loops. Gemma needs --temp 1.0. Read the model card.
- GitOps here too: my manifest lives in Git and ArgoCD applies it. Swapping models is a two-line edit and a push.
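On the time-slicing point, this is roughly the shape of the sharing config the NVIDIA device plugin reads. It is a sketch rather than my exact file: the replica count of 2 is an assumption, and how you hand the config to the plugin (a ConfigMap, Helm values) depends on how you installed it.

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2 # assumption: advertise the one physical GPU as 2 schedulable units
```

Each pod still requests nvidia.com/gpu: 1, but the scheduler now sees two of them. Nothing here limits how much VRAM each pod takes.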
Questions while building yours? Reach out from the home page — and if you want to see this setup in action, this site's chat runs exactly like this.