Large Language Models Explained: How LLMs Work and How to Run Your Own on Kubernetes

Tokens, embeddings, transformers, training and inference — explained for managers and engineers — plus production-ready Kubernetes YAML to deploy your own LLM with Ollama and vLLM

June 5, 2026 · Technology · 8 min read
Tags: AI, MLOps, Kubernetes

A plain-English guide to Large Language Models — what they are, how they actually work under the hood, and how to run your own on Kubernetes. Written for non-technical managers and engineers alike: skim the analogies, then drop into the YAML when you are ready to deploy.

The one-paragraph version. A Large Language Model is a very large pattern-matching machine that has read an enormous amount of text and learned to predict what word comes next. By predicting the next word over and over, it can write essays, answer questions, summarise documents, and generate code. It does not "understand" in the human sense — it is statistics at extraordinary scale — but the results are good enough to be genuinely useful, provided you know its limits.

1. What is a Large Language Model, really?

Imagine the autocomplete on your phone. You type "I'll call you when I get" and it suggests "home". That is a tiny language model: it has seen a lot of text messages and learned which words tend to follow which. A Large Language Model is the same idea scaled up by a factor of millions — trained not on your texts but on a large slice of the public internet, books, code, and documentation.

The word "large" is doing real work. These models contain billions of internal numbers (called parameters) that get tuned during training. When people say a model is "7B" or "70B", they mean 7 billion or 70 billion parameters. More parameters generally means more capability — and a bigger appetite for memory and compute.

A useful mental model for managers: an LLM is a tireless, very well-read graduate assistant. It has read more than any human ever could, drafts quickly, and never gets bored — but it sometimes states things with total confidence that are simply wrong, and it has no memory of yesterday unless you remind it.

2. How they actually work

Under the hood there are five ideas. You do not need maths to follow them — each has an everyday analogy.

2.1 Tokens — chopping text into pieces

Models do not read whole words. They break text into tokens: common chunks of characters. "Kubernetes" might become Kub + ernetes; "running" might be run + ning. Roughly, 1 token ≈ 0.75 English words, so 1,000 tokens is about 750 words. This matters commercially: hosted APIs bill per token, and a model's context window (how much it can "see" at once) is measured in tokens.
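
To make the splitting concrete, here is a minimal sketch of greedy longest-match tokenisation. The vocabulary is a hypothetical toy chosen to reproduce the examples above; real tokenisers learn tens of thousands of pieces from data, and the actual splits for any given model will differ.

```python
# Toy vocabulary: an assumption for illustration, not any real model's vocabulary.
TOY_VOCAB = {"Kub", "ernetes", "run", "ning"}

def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown characters become single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate piece first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback: emit the raw character
            i += 1
    return tokens

def estimate_tokens(word_count, words_per_token=0.75):
    """Rule of thumb from the text: 1 token is roughly 0.75 English words."""
    return round(word_count / words_per_token)

print(tokenize("Kubernetes"))  # ['Kub', 'ernetes']
print(tokenize("running"))     # ['run', 'ning']
print(estimate_tokens(750))    # 1000
```

The estimate is only useful for budgeting; for billing or context-window checks, count tokens with the tokeniser that ships with the model you actually use.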

2.2 Embeddings — turning words into coordinates of meaning

Each token is converted into a long list of numbers — an embedding — that represents its meaning as a point in space. Words used in similar ways end up near each other. "King" and "queen" sit close together; "king" and "banana" sit far apart. This is how a machine that only does arithmetic can manipulate meaning: meaning has been turned into geometry.
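
The "near" and "far" in that description are usually measured with cosine similarity. The sketch below uses hypothetical three-dimensional vectors picked by hand; real embeddings have hundreds or thousands of dimensions and are learned during training, not written down.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
EMBEDDINGS = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
fruit = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["banana"])
print(f"king vs queen:  {royal:.2f}")
print(f"king vs banana: {fruit:.2f}")
```

With these toy vectors the first score comes out close to 1 and the second much lower, which is the geometric version of "sits close together" and "sits far apart".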

2.3 The Transformer and "attention" — the 2017 breakthrough

The architecture behind nearly every modern LLM is the Transformer, introduced in 2017. Its key trick is called attention: when processing a token, the model looks at the other tokens in its context and weighs how relevant each one is. In "The trophy did not fit in the suitcase because it was too big," attention is what lets the model work out that "it" refers to the trophy, not the suitcase. Attention is why these models handle context, nuance, and long-range references so well. It is also why they need so much compute: every token is compared with every other token, so the work grows with the square of the input length.
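
The core calculation is small enough to sketch. Below is scaled dot-product attention for a single query, written in plain Python. The two-dimensional vectors are hypothetical and hand-picked so that "it" resembles "trophy"; in a real model the query, key, and value vectors come from learned projections of the embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Hypothetical vectors for three words in the sentence.
words = ["trophy", "suitcase", "big"]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = keys
query_it = [2.0, 0.2]  # the vector for "it", made to point towards "trophy"

weights, _ = attend(query_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

Here "trophy" receives the largest weight. A full layer repeats this for every token, which is where the token-by-token comparison cost comes from.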

2.4 Training — three stages

  1. Pre-training: the model is shown enormous amounts of text and, at every position, asked to guess the next token from the ones before it. Get it wrong, nudge the parameters, repeat, across trillions of tokens. This is where it absorbs grammar, facts, reasoning patterns, and code. It is also the expensive part, costing millions of pounds in GPU time.
  2. Fine-tuning: the raw model is then trained on curated examples of helpful question-and-answer behaviour, so it acts like an assistant rather than an autocomplete.
  3. Alignment (RLHF): finally, humans rank the model's answers and that feedback is used to make it more helpful, honest, and safe. This is why a chat model declines harmful requests and adopts a consistent tone.
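
Pre-training at full scale adjusts billions of parameters by gradient descent, which does not fit in a few lines. As a stand-in, the toy bigram counter below pursues the same objective (predict the next word from what came before) by simple counting. It is an analogy for the training goal, not a description of how a Transformer is trained, and the three-sentence corpus is made up.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which word follows which; the counts play the role of learned parameters."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower, or None for a word never seen in training."""
    candidates = follows.get(word.lower())
    return candidates.most_common(1)[0][0] if candidates else None

# Hypothetical training text.
corpus = [
    "I'll call you when I get home",
    "text me when I get home",
    "call me when I get there",
]
model = train_bigrams(corpus)
print(predict_next(model, "get"))  # home
```

The model answers "home" only because that continuation was most common in its training text, which is also why the quality and coverage of training data matter so much.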

2.5 Inference — how it answers you

When you send a prompt, the model generates the reply one token at a time: it assigns a probability to every possible next token, picks one, appends it, then repeats, much like a fast, well-read person writing word by word without planning the whole sentence first. A setting called temperature controls how adventurous the pick is: low temperature almost always takes the most likely token, giving safe, repeatable answers (good for code and extraction); high temperature gives less likely tokens a real chance, producing creative, varied answers (good for brainstorming). This token-by-token process is called inference, and it is the part you pay for in production, because every request burns GPU cycles.
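
Temperature is a single division applied to the model's raw scores before they are turned into probabilities. The sketch below shows the effect on four candidate tokens; the scores are hypothetical, and a real model produces one score for every token in its vocabulary.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(tokens, logits, temperature, rng):
    """Pick one token at random, weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["home", "there", "back", "lost"]
logits = [3.0, 2.0, 1.0, 0.0]  # hypothetical raw scores

cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
print("T=0.2:", [round(p, 3) for p in cold])
print("T=2.0:", [round(p, 3) for p in hot])
print(sample_next_token(tokens, logits, 0.2, random.Random(0)))
```

At the low setting almost all of the probability lands on "home"; at the high setting the four options are much closer together, so repeated runs produce different continuations.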

Manager's takeaway. Training builds the model once (someone else usually pays for that). Inference is the recurring cost you own when you run it: it scales with how many people use it and how long the answers are. Most "how much will our AI cost?" questions are really inference questions.

3. What LLMs are good and bad at

Knowing the edges of the tool prevents expensive mistakes.

Genuinely good at | Be careful with
Drafting, rewriting and summarising text | Hallucination: inventing plausible but false facts, citations, or APIs
Explaining concepts and answering FAQs | Knowledge cutoff: it does not know events after its training date
Writing and reviewing code | Exact arithmetic and counting (use a tool/calculator instead)
Classifying, extracting and reformatting data | Anything where a confident wrong answer is dangerous without review

The usual fix for hallucination and stale knowledge is RAG (Retrieval-Augmented Generation): instead of trusting the model's memory, you fetch relevant documents from your own systems and paste them into the prompt, so the model answers from your data. This is how you build a chatbot over your internal wiki with far less risk of the model making things up.
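
The fetch-then-paste step can be sketched in a few lines. This version ranks documents by naive word overlap, which is a deliberate simplification; production systems typically rank by embedding similarity. The wiki snippets and the prompt wording are hypothetical.

```python
def score(question, document):
    """Naive relevance: the number of lowercase words the two texts share."""
    q_words = set(question.lower().split())
    return len(q_words & set(document.lower().split()))

def build_prompt(question, documents, top_k=1):
    """Retrieve the best-matching documents and place them ahead of the question."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    context = "\n".join(ranked[:top_k])
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical internal wiki snippets.
wiki = [
    "Refunds are issued within 14 days of a return being received.",
    "The office is closed on public holidays.",
]
prompt = build_prompt("How many days do refunds take?", wiki)
print(prompt)
```

The resulting string is what gets sent to the model. The instruction to admit ignorance matters as much as the retrieved text, because it gives the model an alternative to guessing.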

4. The vocabulary, decoded

Term | What it means in plain English
Parameters (7B/70B) | The tuned internal numbers. More = smarter but heavier.
Context window | How much text it can hold in working memory at once (in tokens).
Inference | Running the model to get an answer; your recurring compute cost.
Quantization | Compressing the model (e.g. to 4-bit) so it fits on smaller GPUs with a small quality trade-off.
Fine-tuning | Further training on your own examples to specialise behaviour.
RAG | Feeding the model your documents at query time so it answers from facts, not memory.
VRAM | GPU memory. The single biggest constraint on which models you can run.
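
Of these terms, quantization is the easiest to show in code. The sketch below is a simplified symmetric 4-bit scheme with one scale for the whole list; the formats used in practice are more elaborate (for example, separate scales per small block of weights), so treat this as the idea, not a real file format. The weight values are made up.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-7, 7] plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_weights]

weights = [0.42, -1.30, 0.07, 0.91, -0.55]  # hypothetical values
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)  # [2, -7, 0, 5, -3]
print(f"largest rounding error: {max_error:.3f}")
```

Each weight now needs 4 bits, down from 16, at the cost of a rounding error of at most half the scale. That error is the "small quality trade-off" in the table.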

5. Why run your own LLM?

Hosted APIs (OpenAI, Anthropic, and others) are the fastest way to start and are excellent. But there are four reasons organisations choose to self-host an open-weights model such as Llama, Mistral, Qwen, or Gemma:

  • Data privacy & compliance. Sensitive data never leaves your network — important for healthcare, finance, legal, and public sector.
  • Cost at scale. Above a certain steady request volume, a GPU you own can be cheaper per token than paying an API.
  • Control & stability. The model never changes underneath you, and you are not subject to a vendor's rate limits or deprecations.
  • Latency & offline use. Inference next to your application, or on-premises with no internet dependency.

The trade-off is that you now own the hardware, scaling, and reliability — which is exactly what Kubernetes is good at.
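
The cost-at-scale argument reduces to a break-even calculation. The prices below are hypothetical placeholders, not quotes from any vendor, and the sketch ignores staff time and assumes the GPU can actually serve the volume in question.

```python
def breakeven_tokens_per_month(gpu_cost_per_month, api_price_per_million_tokens):
    """Monthly token volume above which a fixed-cost GPU beats per-token API pricing."""
    return gpu_cost_per_month / api_price_per_million_tokens * 1_000_000

# Hypothetical prices for illustration only; substitute your own figures.
gpu_monthly = 1500.0     # amortised hardware, power and hosting
api_per_million = 3.0    # blended input/output price per million tokens

threshold = breakeven_tokens_per_month(gpu_monthly, api_per_million)
print(f"Break-even at about {threshold / 1e6:.0f} million tokens per month")
```

Below the threshold the hosted API is cheaper; above it, each additional token on your own GPU is effectively free until the hardware saturates.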

6. The hardware reality (read this before you deploy)

An LLM's weights must fit in GPU memory (VRAM). A rough rule of thumb:

Model size | Unquantized (FP16) | Quantized (4-bit)
7–8B (e.g. Mistral 7B, Llama 3 8B) | ~16 GB VRAM | ~5–6 GB VRAM
13–14B | ~28 GB VRAM | ~10 GB VRAM
70B | ~140 GB (multi-GPU) | ~40 GB VRAM
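
The FP16 column follows from simple arithmetic: parameters multiplied by bytes per parameter. A minimal sketch, counting the weights only and taking 1 GB as one billion bytes:

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Memory for the weights alone: parameters x bytes per parameter."""
    return params_billions * bits_per_param / 8

for size in (8, 14, 70):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: FP16 {fp16:.0f} GB, 4-bit {int4:.0f} GB")
```

The 4-bit results (4, 7, and 35 GB) come out lower than the table's figures. That gap is expected: the table is a rule of thumb that leaves headroom, since quantized formats carry some extra data and the runtime needs memory beyond the weights for the context it is processing.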

Two serving engines dominate real deployments:

  • Ollama — the easiest on-ramp. Great for development, internal tools, and CPU or single-GPU nodes. Pulls quantized models with one command.
  • vLLM — the production workhorse. High-throughput, batches many requests, and exposes an OpenAI-compatible API so your existing code works with a one-line URL change. (Hugging Face TGI is a close alternative.)

7. Deploying your own LLM on Kubernetes

Everything below assumes a cluster with at least one GPU node and the NVIDIA device plugin installed, which exposes GPUs as the schedulable resource nvidia.com/gpu. We will build it up piece by piece.

7.1 A namespace and a place to store model weights

Model files are large (gigabytes) and slow to download, so we cache them on a PersistentVolumeClaim rather than re-pulling on every pod restart.

apiVersion: v1
kind: Namespace
metadata:
  name: llm
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: llm
spec:
  accessModes:
    - ReadWriteOnce           # one node at a time; pods on other nodes need ReadWriteMany storage
  resources:
    requests:
      storage: 100Gi          # weights are big; size generously
  # storageClassName: fast-ssd   # use an SSD class if your cluster offers one

7.2 Option A — Ollama (the easy start)

Ollama is ideal for a first deployment or an internal tool. This runs it with one GPU; delete the nvidia.com/gpu limit to run CPU-only on a beefy node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llm
  labels:
    app: ollama
spec:
  replicas: 1
  strategy:
    type: Recreate            # stop the old pod first so it releases the GPU before the new one starts
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest   # pin a specific version tag for production
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              nvidia.com/gpu: 1        # remove this line for CPU-only
              memory: 16Gi
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: llm
spec:
  selector:
    app: ollama
  ports:
    - port: 80
      targetPort: 11434

Once the pod is running, pull a model into the cache and chat with it:

# pull a quantized model into the PVC (one-off)
kubectl -n llm exec deploy/ollama -- ollama pull llama3

# ask it something from inside the cluster
kubectl -n llm exec deploy/ollama -- \
  ollama run llama3 "Explain Kubernetes in one sentence."
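
Applications normally talk to Ollama over HTTP, not through kubectl. The sketch below builds a request for Ollama's /api/generate endpoint using only the standard library. It assumes the caller runs in another pod in the same cluster and reaches the Service defined above on port 80; the function names are illustrative.

```python
import json
import urllib.request

# Assumption: in-cluster DNS name of the Service defined above (port 80 -> 11434).
OLLAMA_URL = "http://ollama.llm.svc.cluster.local/api/generate"

def build_request(prompt, model="llama3", url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt):
    """Send the prompt and return the generated text (needs network access to the Service)."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]

# Example call, to be run from inside the cluster:
# print(generate("Explain Kubernetes in one sentence."))
```

Setting "stream" to False returns one JSON object containing the whole answer, which keeps the client simple; streaming is the better choice for chat interfaces where users watch the text appear.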

7.3 Option B — vLLM (production throughput, OpenAI-compatible)

For real traffic, vLLM serves many concurrent requests efficiently and speaks the OpenAI API dialect. Gated models (like Llama) need a Hugging Face token, stored as a Secret.

apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: llm
type: Opaque
stringData:
  token: "hf_xxxxxxxxxxxxxxxxxxxxxxxx"   # your Hugging Face access token
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-mistral
  namespace: llm
  labels:
    app: vllm-mistral
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-mistral
  template:
    metadata:
      labels:
        app: vllm-mistral
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a specific version tag for production
          args:
            - "--model"
            - "mistralai/Mistral-7B-Instruct-v0.3"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            requests:
              cpu: "4"                 # assumed value; the CPU-based HPA in 7.5 needs a CPU request
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: cache
              mountPath: /root/.cache/huggingface
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60      # first start downloads weights
            periodSeconds: 15
            failureThreshold: 40
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-mistral
  namespace: llm
spec:
  selector:
    app: vllm-mistral
  ports:
    - port: 80
      targetPort: 8000

Because vLLM is OpenAI-compatible, application code only needs the in-cluster URL — no SDK changes:

curl http://vllm-mistral.llm.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Summarise our refund policy."}],
    "temperature": 0.2
  }'
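
The same call from application code might look like the sketch below, which uses only the standard library so that nothing beyond the URL is assumed. The helper names are illustrative; the request and response shapes follow the OpenAI chat completions format that the curl example uses.

```python
import json
import urllib.request

# Assumption: in-cluster DNS name of the vllm-mistral Service defined above.
BASE_URL = "http://vllm-mistral.llm.svc.cluster.local/v1"
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"

def build_chat_request(user_message, temperature=0.2):
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_reply(response_body):
    """Pull the assistant's text out of a chat completion response."""
    return response_body["choices"][0]["message"]["content"]

def chat(user_message):
    """Send one message and return the reply (needs network access to the Service)."""
    with urllib.request.urlopen(build_chat_request(user_message), timeout=300) as resp:
        return extract_reply(json.loads(resp.read()))

# Example call, to be run from inside the cluster:
# print(chat("Summarise our refund policy."))
```

If your application already uses an OpenAI client library, pointing its base URL at the Service should achieve the same result without hand-built requests.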

7.4 Exposing it with an Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api
  namespace: llm
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"   # long generations
spec:
  ingressClassName: nginx
  rules:
    - host: llm.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-mistral
                port:
                  number: 80

7.5 Scaling — and why it is different with GPUs

You can autoscale, but remember each replica needs its own whole GPU — you cannot fractionally share one in the simple case, and there is no point scaling beyond the GPUs you physically have. Scale on a queue or request-rate signal (KEDA is excellent here) rather than CPU.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-mistral
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-mistral
  minReplicas: 1
  maxReplicas: 4          # never exceed the number of GPUs in the cluster
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  # For real LLM scaling, prefer KEDA on a queue depth or
  # requests-per-second metric exported by vLLM, not CPU.
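
The autoscaler's decision is a one-line formula: desired replicas equal the current replicas multiplied by the ratio of the observed metric to its target, rounded up. The sketch below applies that formula and then caps the result at the number of GPUs, which is the role maxReplicas plays in the manifest. The queue-depth figures are hypothetical.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     gpus_available, min_replicas=1):
    """HPA-style replica calculation, capped by the GPUs that actually exist."""
    wanted = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(wanted, gpus_available))

# Hypothetical load: 2 replicas each seeing 30 queued requests, target of 10 per replica.
print(desired_replicas(2, 30, 10, gpus_available=4))  # 4, although the raw formula asks for 6
```

When the cap is what limits the result, adding replicas is no longer an option, and the remaining levers are a smaller or more heavily quantized model, shorter outputs, or more hardware.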

8. A pragmatic rollout checklist

  1. Prototype on a hosted API to validate the use case before buying GPUs.
  2. Pick an open-weights model that fits your hardware (start with a 7–8B model, quantized).
  3. Start with Ollama for an internal pilot; graduate to vLLM when you need throughput.
  4. Add RAG over your own documents to cut hallucinations and keep answers current.
  5. Keep a human in the loop wherever a wrong answer carries real cost.
  6. Measure inference cost and latency per request; that is your true unit economics.

Bottom line. An LLM is next-word prediction at a scale that becomes genuinely useful. Treat it as a brilliant but unreliable assistant: lean on it for drafting, summarising, classifying, and coding; verify anything that matters; ground it in your own data with RAG; and when privacy, cost, or control demand it, run your own on Kubernetes with Ollama to start and vLLM to scale.
