A plain-English guide to Large Language Models — what they are, how they actually work under the hood, and how to run your own on Kubernetes. Written for non-technical managers and engineers alike: skim the analogies, then drop into the YAML when you are ready to deploy.

1. What is a Large Language Model, really?
Imagine the autocomplete on your phone. You type "I'll call you when I get" and it suggests "home". That is a tiny language model: it has seen a lot of text messages and learned which words tend to follow which. A Large Language Model is the same idea scaled up by a factor of millions — trained not on your texts but on a large slice of the public internet, books, code, and documentation.
The word "large" is doing real work. These models contain billions of internal numbers (called parameters) that get tuned during training. When people say a model is "7B" or "70B", they mean 7 billion or 70 billion parameters. More parameters generally means more capability — and a bigger appetite for memory and compute.
A useful mental model for managers: an LLM is a tireless, very well-read graduate assistant. It has read more than any human ever could, drafts quickly, and never gets bored — but it sometimes states things with total confidence that are simply wrong, and it has no memory of yesterday unless you remind it.
2. How they actually work
Under the hood there are five ideas. You do not need maths to follow them — each has an everyday analogy.
2.1 Tokens — chopping text into pieces
Models do not read whole words. They break text into tokens: common chunks of characters. "Kubernetes" might become Kub + ernetes; "running" might be run + ning. Roughly, 1 token ≈ 0.75 English words, so 1,000 tokens is about 750 words. This matters commercially: hosted APIs bill per token, and a model's context window (how much it can "see" at once) is measured in tokens.
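The rule of thumb above turns into simple arithmetic. A minimal sketch, assuming the 0.75 words-per-token ratio quoted here (real tokenizers vary by model and language); the price passed in at the end is a made-up placeholder, not a real tariff:

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; the real ratio depends on the tokenizer


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a piece of English text will use."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(word_count: int, price_per_million_tokens: float) -> float:
    """Estimate the bill for processing that text at a per-token price."""
    return estimate_tokens(word_count) * price_per_million_tokens / 1_000_000


if __name__ == "__main__":
    words = 750
    print(estimate_tokens(words))      # 750 words is roughly 1,000 tokens
    print(estimate_cost(words, 2.00))  # 2.00 per million tokens is a placeholder price
```

The same estimate tells you whether a document will fit in a given context window before you send it.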
2.2 Embeddings — turning words into coordinates of meaning
Each token is converted into a long list of numbers — an embedding — that represents its meaning as a point in space. Words used in similar ways end up near each other. "King" and "queen" sit close together; "king" and "banana" sit far apart. This is how a machine that only does arithmetic can manipulate meaning: meaning has been turned into geometry.
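To make "meaning as geometry" concrete, here is a small sketch that measures closeness with cosine similarity, a common choice for comparing embeddings. The three-number vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions and are learned during training.

```python
import math

# Toy 3-dimensional embeddings with made-up values (illustration only).
EMBEDDINGS = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "banana": [0.10, 0.05, 0.95],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"]))   # close to 1
    print(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["banana"]))  # much lower
```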
2.3 The Transformer and "attention" — the 2017 breakthrough
The architecture behind nearly every modern LLM is the Transformer, introduced in 2017. Its key trick is called attention: when processing a word, the model looks at the other words in its context and weighs how relevant each one is. In "The trophy did not fit in the suitcase because it was too big," attention is what lets the model work out that "it" refers to the trophy, not the suitcase. Attention is why these models handle context, nuance, and long-range references so well, and why they need so much compute: every token attends to every other token, so the work grows roughly with the square of the input length.
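The core of attention fits in a few lines. This is a sketch of scaled dot-product attention for a single query, written in plain Python with invented two-number vectors; a real Transformer does the same thing with learned projections, many attention heads, and large matrices.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: list[float], keys: list[list[float]], values: list[list[float]]):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights and the weighted mix of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    mixed = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, mixed


if __name__ == "__main__":
    # Made-up vectors standing in for "trophy" and "suitcase"; the query stands in for "it".
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[10.0, 0.0], [0.0, 10.0]]
    weights, mixed = attend([2.0, 0.5], keys, values)
    print(weights)  # the first key receives the larger weight
    print(mixed)
```

Because every token needs a score against every other token, doubling the input length roughly quadruples the number of scores to compute.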
2.4 Training — three stages
- Pre-training: the model is shown a huge volume of text and, at every position, asked to guess the next token. Get it wrong, nudge the parameters, repeat, across trillions of tokens. This is where it absorbs grammar, facts, reasoning patterns, and code. It is also the expensive part, costing millions of pounds in GPU time for the largest models.
- Fine-tuning: the raw model is then trained on curated examples of helpful question-and-answer behaviour, so it acts like an assistant rather than an autocomplete.
- Alignment (RLHF): finally, humans rank the model's answers and that feedback is used to make it more helpful, honest, and safe. This is why a chat model declines harmful requests and adopts a consistent tone.
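The "guess, then nudge" loop in pre-training needs a score for how wrong each guess was. A common choice is cross-entropy loss, sketched below; the text does not name a specific loss, so treat this as an assumption about typical practice, and the probabilities as made-up values.

```python
import math


def next_token_loss(predicted: dict[str, float], actual: str) -> float:
    """Cross-entropy for one guess: minus the log of the probability given to the true token."""
    return -math.log(predicted[actual])


if __name__ == "__main__":
    # Invented probabilities for the token after "I'll call you when I get".
    confident_right = {"home": 0.90, "there": 0.08, "banana": 0.02}
    unsure = {"home": 0.30, "there": 0.40, "banana": 0.30}
    print(next_token_loss(confident_right, "home"))  # small loss: little to correct
    print(next_token_loss(unsure, "home"))           # larger loss: a bigger nudge to the parameters
```

Training repeatedly adjusts the parameters in whatever direction makes this number smaller, averaged over enormous amounts of text.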
2.5 Inference — how it answers you
When you send a prompt, the model generates the reply one token at a time: it computes a probability for every possible next token, picks one, appends it, then repeats, like a fast, well-read person writing word by word without planning the whole sentence first. A setting called temperature controls how adventurous that pick is: low temperature favours the most likely token and gives safe, repeatable answers (good for code and extraction); high temperature gives creative, varied answers (good for brainstorming). This token-by-token process is called inference, and it is the part you pay for in production, because every request burns GPU cycles.
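A sketch of what temperature does to the next-token choice. The raw scores (often called logits) are invented; the usual approach, assumed here, is to divide them by the temperature before turning them into probabilities and sampling.

```python
import math
import random


def token_probabilities(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over raw scores after dividing them by the temperature (must be > 0)."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_next_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token at random, weighted by its probability."""
    probs = token_probabilities(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    logits = {"home": 2.0, "work": 1.0, "there": 0.5}  # made-up scores
    print(token_probabilities(logits, 0.2))  # nearly all probability on "home"
    print(token_probabilities(logits, 2.0))  # probability spread across all three
    print(sample_next_token(logits, 2.0, random.Random(0)))
```

At low temperature the top token wins almost every time, which is why answers become repeatable; at high temperature the less likely tokens get a real chance.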
3. What LLMs are good and bad at
Knowing the edges of the tool prevents expensive mistakes.
| Genuinely good at | Be careful with |
|---|---|
| Drafting, rewriting and summarising text | Hallucination — inventing plausible but false facts, citations, or APIs |
| Explaining concepts and answering FAQs | Knowledge cutoff — it does not know events after its training date |
| Writing and reviewing code | Exact arithmetic and counting (use a tool/calculator instead) |
| Classifying, extracting and reformatting data | Anything where a confident wrong answer is dangerous without review |
The fix for the first two is often RAG (Retrieval-Augmented Generation): instead of trusting the model's memory, you fetch relevant documents from your own systems and paste them into the prompt, so the model answers from your data. This is how you build a chatbot over your internal wiki with far fewer made-up answers. For arithmetic and high-stakes decisions, the remedies are the ones in the table: give the model a tool, and keep a human reviewing the output.
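The shape of a RAG pipeline can be sketched in a few lines: retrieve, then build the prompt. In this toy version plain word overlap stands in for the embedding search a real system would use, and the documents and question are invented.

```python
def score(question: str, document: str) -> int:
    """Count how many distinct question words also appear in the document."""
    question_words = set(question.lower().split())
    document_words = set(document.lower().split())
    return len(question_words & document_words)


def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents that share the most words with the question."""
    ranked = sorted(documents, key=lambda doc: score(question, doc), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Paste the retrieved text into the prompt so the model answers from it."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    wiki = [
        "Refunds are issued within 14 days of purchase.",
        "The office is closed on public holidays.",
    ]
    print(build_prompt("How many days do refunds take?", wiki))
```

The resulting string is what you send to the model in place of the bare question.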
4. The vocabulary, decoded
| Term | What it means in plain English |
|---|---|
| Parameters (7B/70B) | The tuned internal numbers. More usually means more capable, but heavier to run. |
| Context window | How much text it can hold in working memory at once (in tokens). |
| Inference | Running the model to get an answer — your recurring compute cost. |
| Quantization | Compressing the model (e.g. to 4-bit) so it fits on smaller GPUs with a small quality trade-off. |
| Fine-tuning | Further training on your own examples to specialise behaviour. |
| RAG | Feeding the model your documents at query time so it answers from facts, not memory. |
| VRAM | GPU memory. The single biggest constraint on which models you can run. |
5. Why run your own LLM?
Hosted APIs (OpenAI, Anthropic, and others) are the fastest way to start and are excellent. But there are four reasons organisations choose to self-host an open-weights model such as Llama, Mistral, Qwen, or Gemma:
- Data privacy & compliance. Sensitive data never leaves your network — important for healthcare, finance, legal, and public sector.
- Cost at scale. Above a certain steady request volume, a GPU you own can be cheaper per token than paying an API.
- Control & stability. The model never changes underneath you, and you are not subject to a vendor's rate limits or deprecations.
- Latency & offline use. Inference next to your application, or on-premises with no internet dependency.
The trade-off is that you now own the hardware, scaling, and reliability — which is exactly what Kubernetes is good at.
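The "cost at scale" argument is a break-even calculation. A rough sketch follows; both input figures are hypothetical placeholders to be replaced with your own quotes, and the sum deliberately ignores staff time, idle capacity, and electricity.

```python
def break_even_tokens_per_month(gpu_cost_per_month: float,
                                api_price_per_million_tokens: float) -> float:
    """Monthly token volume above which a fixed GPU cost beats a per-token API bill."""
    return gpu_cost_per_month / api_price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: 1,000 per month for a GPU, 2.00 per million API tokens.
    print(break_even_tokens_per_month(1000.0, 2.00))
```

Below the break-even volume the hosted API is cheaper; above it, a well-utilised GPU of your own starts to win.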
6. The hardware reality (read this before you deploy)
An LLM's weights must fit in GPU memory (VRAM). A rough rule of thumb:
| Model size | Half precision (FP16) | Quantized (4-bit) |
|---|---|---|
| 7–8B (e.g. Mistral 7B, Llama 3 8B) | ~16 GB VRAM | ~5–6 GB VRAM |
| 13–14B | ~28 GB VRAM | ~10 GB VRAM |
| 70B | ~140 GB (multi-GPU) | ~40 GB VRAM |
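The figures in the table follow from one multiplication: parameter count times bytes per parameter. A minimal sketch, counting the weights only:

```python
def weights_gb(parameters_billion: float, bits_per_parameter: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_parameter = bits_per_parameter / 8
    return parameters_billion * bytes_per_parameter


if __name__ == "__main__":
    for size in (8, 14, 70):
        # columns: model size in billions, GB at 16-bit, GB at 4-bit
        print(size, weights_gb(size, 16), weights_gb(size, 4))
```

The quantized column in the table sits somewhat above this raw result, most likely because quantization metadata, the cache for in-flight requests, and the runtime itself also need room. Treat the output as a floor, not a budget.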
Two serving engines dominate real deployments:
- Ollama — the easiest on-ramp. Great for development, internal tools, and CPU or single-GPU nodes. Pulls quantized models with one command.
- vLLM — the production workhorse. High-throughput, batches many requests, and exposes an OpenAI-compatible API so your existing code works with a one-line URL change. (Hugging Face TGI is a close alternative.)
7. Deploying your own LLM on Kubernetes
Everything below assumes a cluster with at least one GPU node and the NVIDIA device plugin installed, which exposes GPUs as the schedulable resource nvidia.com/gpu. We will build it up piece by piece.
7.1 A namespace and a place to store model weights
Model files are large (gigabytes) and slow to download, so we cache them on a PersistentVolumeClaim rather than re-pulling on every pod restart.
apiVersion: v1
kind: Namespace
metadata:
  name: llm
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: llm
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi   # weights are big; size generously
  # storageClassName: fast-ssd   # use an SSD class if your cluster offers one
7.2 Option A — Ollama (the easy start)
Ollama is ideal for a first deployment or an internal tool. This runs it with one GPU; delete the nvidia.com/gpu limit to run CPU-only on a beefy node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llm
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              nvidia.com/gpu: 1   # remove this line for CPU-only
              memory: 16Gi
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: llm
spec:
  selector:
    app: ollama
  ports:
    - port: 80
      targetPort: 11434
Once the pod is running, pull a model into the cache and chat with it:
# pull a quantized model into the PVC (one-off)
kubectl -n llm exec deploy/ollama -- ollama pull llama3
# ask it something from inside the cluster
kubectl -n llm exec deploy/ollama -- \
  ollama run llama3 "Explain Kubernetes in one sentence."
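Applications would normally talk to Ollama over HTTP through the Service, not through kubectl. A hedged sketch using only the Python standard library and Ollama's /api/generate endpoint; the URL assumes the Service name, namespace, and port 80 defined above, and that the llama3 model has already been pulled.

```python
import json
import urllib.request

# Assumed in-cluster address: Service "ollama" in namespace "llm", listening on port 80.
OLLAMA_URL = "http://ollama.llm.svc.cluster.local/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3", timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text (requires network access to the cluster)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Setting "stream" to False asks for the whole answer in one JSON reply, which keeps the client simple at the cost of waiting for generation to finish.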
7.3 Option B — vLLM (production throughput, OpenAI-compatible)
For real traffic, vLLM serves many concurrent requests efficiently and speaks the OpenAI API dialect. Gated models (like Llama) need a Hugging Face token, stored as a Secret.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: llm
type: Opaque
stringData:
  token: "hf_xxxxxxxxxxxxxxxxxxxxxxxx"   # your Hugging Face access token
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-mistral
  namespace: llm
  labels:
    app: vllm-mistral
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-mistral
  template:
    metadata:
      labels:
        app: vllm-mistral
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "mistralai/Mistral-7B-Instruct-v0.3"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: cache
              mountPath: /root/.cache/huggingface
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60   # first start downloads weights
            periodSeconds: 15
            failureThreshold: 40
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-mistral
  namespace: llm
spec:
  selector:
    app: vllm-mistral
  ports:
    - port: 80
      targetPort: 8000
Because vLLM is OpenAI-compatible, application code only needs the in-cluster URL — no SDK changes:
curl http://vllm-mistral.llm.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Summarise our refund policy."}],
    "temperature": 0.2
  }'
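The same call from application code, sketched with only the Python standard library so it carries no SDK assumption. The URL and model name are the ones used above; the response parsing assumes the standard OpenAI chat-completions reply shape.

```python
import json
import urllib.request

# In-cluster address of the Service defined above (port 80 forwards to vLLM on 8000).
VLLM_URL = "http://vllm-mistral.llm.svc.cluster.local/v1/chat/completions"
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"


def build_chat_request(question: str, temperature: float = 0.2) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, timeout: float = 300.0) -> str:
    """Send the request and return the assistant's reply text (needs access to the cluster)."""
    with urllib.request.urlopen(build_chat_request(question), timeout=timeout) as response:
        body = json.loads(response.read())
    return body["choices"][0]["message"]["content"]
```

If your code already uses an OpenAI client library, pointing its base URL at the Service should achieve the same thing.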
7.4 Exposing it with an Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api
  namespace: llm
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"   # long generations
spec:
  ingressClassName: nginx
  rules:
    - host: llm.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-mistral
                port:
                  number: 80
7.5 Scaling — and why it is different with GPUs
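You can autoscale, but remember that each replica needs its own whole GPU: you cannot fractionally share one in the simple case, and there is no point scaling beyond the GPUs you physically have. Each replica also needs the model weights, and the ReadWriteOnce claim from 7.1 can only be mounted on one node at a time, so extra replicas on other nodes need ReadWriteMany storage or a cache of their own. Scale on a queue or request-rate signal (KEDA is excellent here) rather than CPU; the CPU-based autoscaler below is only the simplest starting point.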
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-mistral
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-mistral
  minReplicas: 1
  maxReplicas: 4   # never exceed the number of GPUs in the cluster
  metrics:
    # Utilization is measured against the container's CPU request, so the
    # Deployment must set resources.requests.cpu for this metric to work.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  # For real LLM scaling, prefer KEDA on a queue depth or
  # requests-per-second metric exported by vLLM, not CPU.
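How the autoscaler arrives at a replica count can be sketched with the scaling formula from the Kubernetes documentation: the current replica count times the ratio of the observed metric to its target, rounded up. The sketch below ignores the tolerance band and stabilisation windows the real controller applies, and its default limits mirror the minReplicas and maxReplicas above.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Approximate the HPA calculation, clamped to the configured replica limits."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(2, 90, 70))   # load above target: scale up
    print(desired_replicas(4, 140, 70))  # wants more, but capped at maxReplicas
    print(desired_replicas(1, 10, 70))   # quiet: stays at minReplicas
```

The cap is the important part for GPU workloads: once maxReplicas equals your GPU count, further load has to queue, not scale.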
8. A pragmatic rollout checklist
- Prototype on a hosted API to validate the use case before buying GPUs.
- Pick an open-weights model that fits your hardware (start with a 7–8B model, quantized).
- Start with Ollama for an internal pilot; graduate to vLLM when you need throughput.
- Add RAG over your own documents to cut hallucinations and keep answers current.
- Keep a human in the loop wherever a wrong answer carries real cost.
- Measure inference cost and latency per request — that is your true unit economics.
