BASEL.FM

Personal dispatches & reflections

Running LLMs in Secure & Air-Gapped Infrastructure

There's a version of this job that's clean and simple: you pick a model, call an API, and ship. OpenAI, Anthropic, Google — they handle the infrastructure, you handle the application. Done.

Then there's the other version. The one where the client is a defense contractor, a hospital, a government agency, or any institution that operates under strict data sovereignty requirements. The one where the hardware lives in a room you need a badge to enter, and that hardware has never seen the internet — and never will.

That's what this post is about.


What is an Air-Gapped Environment?

Before anything else, let's make sure we're speaking the same language.

An air-gapped system is a network or machine that is completely physically isolated from external networks — no internet, no Wi-Fi, no Bluetooth, nothing. The term comes from the literal "air gap" between the machine and any unsecured network. To move data in or out, you need physical media: a USB drive, a hard drive, a disc. There's no remote connection to exploit.

This sounds extreme, but it's standard practice wherever the data is too sensitive to risk exposure: defense, government, healthcare, finance. The security benefit is obvious: you can't hack what you can't reach. The operational cost is equally obvious: you have to do everything manually — including deploying AI.


Prerequisites

Before you attempt this, you should be comfortable with the following:

  • Linux — you'll be living in the terminal
  • Docker (Dockerfile + Compose) — the entire deployment is containerized
  • LLM fundamentals — quantization, VRAM budgeting, tensor parallelism

If any of those feel shaky, shore them up first. The air-gapped deployment layer adds complexity on top of all three.


The LLM Serving Framework Landscape

There are several frameworks for running LLMs locally on your own hardware. Here's how they compare:

Framework      Best For                        Key Trait
llama.cpp      CPU / edge / no-GPU             Extreme portability, C++ core, minimal dependencies
Ollama         Local dev & prototyping         Single-command simplicity, wraps llama.cpp
vLLM           Multi-user production (no K8s)  High throughput via PagedAttention, OpenAI-compatible
NVIDIA Triton  NVIDIA-native production        Deep NVIDIA ecosystem integration
TensorRT-LLM   Max performance on NVIDIA       Requires model compilation, highest throughput ceiling
Ray Serve      Distributed / cluster-based     Native Kubernetes + distributed compute
AIBrix         Serverless LLM on K8s           Kubernetes-native, autoscaling

After working through most of these, my recommendation for bare-metal GPU clusters without Kubernetes is vLLM. It hits the right balance: production-grade performance without requiring a container orchestration setup, an OpenAI-compatible API out of the box, and solid multi-GPU support.

A quick note on the others: Ollama is fantastic for personal use and prototyping, but under concurrent load its static memory allocation model creates a throughput ceiling. llama.cpp shines on CPU-only or edge deployments. TensorRT-LLM can be faster, but it requires model compilation and is deeply tied to the NVIDIA toolchain — more engineering overhead than most teams want.


Why vLLM?

vLLM was built at UC Berkeley and introduced PagedAttention — a memory-management technique for the attention KV cache that handles GPU memory the way operating systems handle virtual memory: in fixed-size pages.

Instead of pre-allocating a contiguous memory block for each request's KV cache (which wastes 60–80% of GPU memory due to fragmentation), PagedAttention allocates memory on demand in fixed-size blocks. The result: near-zero memory waste, higher batch sizes, and 2–4× higher throughput compared to previous systems at the same latency.
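A toy sketch makes the difference concrete. This is not vLLM's actual allocator (the block size and request lengths are invented for illustration), but it shows why on-demand block allocation reserves far less memory than worst-case contiguous preallocation:

```python
# Toy illustration of paged vs. contiguous KV-cache reservation.
# Numbers are made up; this is not vLLM's real allocator.

BLOCK_SIZE = 16  # tokens per block, mimicking PagedAttention's fixed-size pages

def contiguous_reserved(max_seq_len: int, requests: list[int]) -> int:
    # Naive serving: reserve max_seq_len tokens per request up front,
    # regardless of how many tokens each request actually generates.
    return max_seq_len * len(requests)

def paged_reserved(requests: list[int]) -> int:
    # Paged: allocate whole blocks on demand, so per-request waste
    # is always less than one block.
    blocks = sum(-(-n // BLOCK_SIZE) for n in requests)  # ceiling division
    return blocks * BLOCK_SIZE

if __name__ == "__main__":
    # Three concurrent requests that actually generated 40, 100, and 700 tokens.
    reqs = [40, 100, 700]
    naive = contiguous_reserved(max_seq_len=2048, requests=reqs)
    paged = paged_reserved(reqs)
    print(f"contiguous: {naive} tokens reserved")  # 6144
    print(f"paged:      {paged} tokens reserved")  # 864
    print(f"waste avoided: {1 - paged / naive:.0%}")  # 86%
```

The naive scheme pays for the maximum possible sequence length on every request; the paged scheme pays only for blocks actually touched, which is what lets vLLM pack more concurrent requests onto the same GPUs.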

In practice, this matters a lot when multiple users are hitting the same endpoint simultaneously. vLLM handles concurrent requests gracefully. Ollama doesn't.


The Full Deployment Workflow

Let's go through each step.


Step 1 — Choose Your Model

Start at the vLLM supported models list to confirm your model is supported. Then browse Hugging Face for the details you'll need:

  • Parameter count → determines base VRAM requirements
  • Number of attention heads / shards → determines valid tensor-parallel-size values
  • Quantization options → bitsandbytes, AWQ, GPTQ, etc.
  • Architecture → some newer architectures require the latest vLLM image
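Those first bullets feed directly into a back-of-envelope VRAM check. The constants below are rough rules of thumb, not guarantees — this covers weights plus an assumed ~20% runtime overhead; the KV cache comes on top and grows with context length and concurrency:

```python
# Rough VRAM sizing sketch. Bytes-per-parameter values are the usual
# rules of thumb; the 20% overhead factor is an assumption, not a spec.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_weight_vram_gb(params_billion: float, precision: str,
                            overhead_factor: float = 1.2) -> float:
    """VRAM (GB) to hold the weights plus ~20% headroom for the CUDA
    context and runtime. KV cache is NOT included."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead_factor

# e.g. a 70B model quantized to 4-bit:
# estimate_weight_vram_gb(70, "int4")  -> ~42 GB before KV cache
```

If the estimate doesn't fit your GPUs even before the KV cache, pick a smaller model or a more aggressive quantization.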

Step 2 — Pull the vLLM Docker Image

vLLM distributes official Docker images via Docker Hub. The image must be compatible with your hardware's CUDA version — check nvidia-smi before pulling.

docker pull vllm/vllm-openai:latest

Note: When a new model architecture is released, you sometimes need the latest vLLM image to support it. If a model fails to load with a cryptic error, pulling a newer image is often the fix before going deeper into debugging.
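If you're provisioning more than one host, the nvidia-smi check is worth scripting. A small sketch that parses the CUDA version out of the banner (the exact banner layout varies between driver versions, so treat the parser as illustrative):

```python
import subprocess

def parse_cuda_version(nvidia_smi_output: str) -> str:
    # Pull the "CUDA Version: X.Y" field out of the nvidia-smi banner.
    for line in nvidia_smi_output.splitlines():
        if "CUDA Version" in line:
            return line.split("CUDA Version:")[1].strip().split()[0]
    raise RuntimeError("CUDA version not found in nvidia-smi output")

def host_cuda_version() -> str:
    # Requires the NVIDIA driver to be installed and nvidia-smi on PATH.
    out = subprocess.run(["nvidia-smi"], capture_output=True,
                         text=True, check=True)
    return parse_cuda_version(out.stdout)
```
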


Step 3 — Download the Model

Run this on your staging machine (internet-connected). vLLM will download the model weights from Hugging Face into the mounted volume.

# docker-compose.staging.yml
services:
  llm-server:
    container_name: llm-server
    image: vllm/vllm-openai:latest
    network_mode: host
    ipc: host                        # full shared memory access for multi-GPU comms
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2               # adjust to your GPU count
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1     # GPU IDs to use
    volumes:
      - model-cache:/root/.cache/huggingface
    entrypoint: python3
    command: >
      -m vllm.entrypoints.openai.api_server
      --model <model_id>
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --quantization bitsandbytes
      --gpu-memory-utilization 0.95

volumes:
  model-cache:

Understanding --tensor-parallel-size

This flag tells vLLM how many GPUs to split the model across. The hard constraint: the model's number of attention heads must be evenly divisible by this number.

Check the model card before setting this value. If you get it wrong, vLLM will tell you on startup.
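The divisibility rule is easy to check ahead of time. A tiny helper that enumerates valid settings for a given model and node (head-count divisibility is the headline constraint described above; vLLM performs its own validation at startup):

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    # --tensor-parallel-size must evenly divide the attention head count.
    return [n for n in range(1, max_gpus + 1)
            if num_attention_heads % n == 0]

# e.g. a model with 32 attention heads on a node with up to 8 GPUs:
# valid_tp_sizes(32, 8) -> [1, 2, 4, 8]
```
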


Step 4 — Move to Air-Gapped Infrastructure

This is where it gets more involved. Two artifacts have to cross the gap: the Docker image and the model weights. For the image, you have two options:

Option A — Internal Docker Registry:

# Tag for your internal registry
docker tag vllm/vllm-openai:latest <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest

# Push to the internal registry (accessible within the air-gapped network)
docker push <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest
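A common snag with Option A: if the internal registry serves plain HTTP rather than TLS (typical on isolated networks), each Docker host must be told to trust it. Assuming the default daemon config location, that means adding the registry to /etc/docker/daemon.json and restarting the Docker daemon:

```json
{
  "insecure-registries": ["<REGISTRY_IP>:<REGISTRY_PORT>"]
}
```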

Option B — Tarball transfer (truly offline):

# Export the image on staging
docker save vllm/vllm-openai:latest | gzip > vllm-image.tar.gz

# Load it on the air-gapped machine after physical transfer
docker load < vllm-image.tar.gz

Move the model files:

# If the two environments share a network segment (staging to production)
rsync -avz /path/to/model-cache/ user@production-host:/path/to/model-cache/

# If truly offline: copy to external drive, physically walk it over, copy off
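Multi-gigabyte weight files plus physical transfer is a recipe for silent corruption, so hash everything on both sides of the gap. A minimal sketch — run it on staging before the copy and on the air-gapped host after; the digests must match:

```python
import hashlib
import sys

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in 1 MiB chunks so multi-GB model shards
    # never need to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    print(sha256_of(sys.argv[1]))
```
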

Step 5 — Serve on Air-Gapped Infrastructure

The production Compose file is nearly identical to staging — two things change: the image source and the volume mount.

# docker-compose.production.yml
services:
  llm-server:
    container_name: llm-server
    image: <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest  # from internal registry
    network_mode: host
    ipc: host
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    volumes:
      - /data/models:/root/.cache/huggingface  # local path to copied model files
    entrypoint: python3
    command: >
      -m vllm.entrypoints.openai.api_server
      --model <model_id>
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --quantization bitsandbytes
      --gpu-memory-utilization 0.95

Start it detached:

docker compose -f docker-compose.production.yml up -d

Watch startup logs:

docker compose logs -f llm-server

You'll see vLLM loading the model shards across GPUs, memory allocation logs, and finally a message confirming the FastAPI server is running on your configured port.
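Large models can take minutes to load, so anything scripted to run right after `docker compose up` should wait for readiness rather than fire blindly. A standard-library sketch — the /health and /v1/models endpoints are vLLM's, but the helper names and polling cadence here are my own:

```python
import json
import time
import urllib.request

def parse_model_ids(payload: dict) -> list[str]:
    # Extract model IDs from an OpenAI-style /v1/models response body.
    return [m["id"] for m in payload.get("data", [])]

def wait_until_ready(base_url: str, timeout_s: float = 600) -> list[str]:
    # Poll /health until the server answers, then return the loaded model IDs.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as r:
                if r.status == 200:
                    with urllib.request.urlopen(f"{base_url}/v1/models",
                                                timeout=5) as m:
                        return parse_model_ids(json.load(m))
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(5)
    raise TimeoutError(f"vLLM at {base_url} not ready after {timeout_s}s")

if __name__ == "__main__":
    print(wait_until_ready("http://localhost:8000"))
```
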


Step 6 — Use the API

vLLM exposes a fully OpenAI-compatible REST API. You can use it exactly like the OpenAI API — just point your client at the local endpoint.

from openai import OpenAI

client = OpenAI(
    base_url="http://<HOST>:<PORT>/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="<model_id>",
    messages=[
        {"role": "user", "content": "Explain transformer attention in one paragraph."}
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

The key endpoints:

Endpoint                   Description
GET /v1/models             List loaded models
POST /v1/chat/completions  Chat (OpenAI-compatible)
POST /v1/completions       Text completion
GET /metrics               Prometheus-compatible metrics
GET /health                Health check

Wrapping Up

This workflow gives you a fully self-contained LLM serving stack that:

  • Runs entirely on-premise — no outbound network calls, ever
  • Scales across GPUs via tensor parallelism
  • Exposes a standard API your existing application code can use without changes
  • Works where cloud is a non-starter — classified environments, healthcare, finance

The operational overhead is real: you own the updates, the image management, the model versioning, and the hardware. But in contexts where data sovereignty isn't negotiable, that's the tradeoff you're making, and vLLM makes it about as smooth as it gets without Kubernetes in the picture.

If your environment does have Kubernetes, the conversation shifts to Ray Serve or AIBrix. That's a different story.


Have questions or ran into something I didn't cover? Feel free to reach out.