intelnav · llama.cpp · p2p · inference · rust

IntelNav: running a 33B model across three 8GB GPUs

Notes on building a peer-to-peer LLM inference network — how the chain is shaped, what crosses the wire, and why a 33B model fits on hardware that can't hold half of it.

A single 8GB consumer GPU cannot hold DeepSeek-Coder-33B at Q4_K_M. Three of them, stitched together over the network, can. That's the one-line pitch for IntelNav — and after months of pulling on that thread, I want to write down what the system actually looks like under the hood.

The shape of the chain

A transformer is a stack of identical-looking blocks. If you cut the stack into contiguous ranges and put each range on a different machine, the only thing you have to ship between machines is the hidden state at the cut. That hidden state is just a tensor — a few KB per token at typical hidden sizes — which is small enough that a residential uplink doesn't kneecap you.
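
To put a number on that claim (hidden sizes are from the published model configs; the f16 wire dtype is my assumption):

fn main() {
    // bytes per token at a cut = hidden_size * bytes_per_element (f16 = 2)
    let qwen = 896 * 2;      // Qwen2.5-0.5B hidden size
    let deepseek = 7168 * 2; // DeepSeek-Coder-33B hidden size
    println!("Qwen2.5-0.5B:       {qwen} B/token");     // 1792
    println!("DeepSeek-Coder-33B: {deepseek} B/token"); // 14336
}

Even at f32 the 33B cut is about 28 KB per token, trivial next to shipping weights.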

Concretely, this is what local-swarm.sh spins up against Qwen2.5-0.5B:

chat client      peer A          peer B          tail peer
─────────────    ──────────      ──────────      ──────────────
tokenize         forward         forward         forward
+ embed          layers 6..12    layers 12..18   layers 18..24
forward                                          + lm-head
layers 0..6                                      + sample
   │                │                │               │
   ▼                ▼                ▼               ▼
   └── ForwardHidden ─→ ForwardHidden ─→ ForwardHidden ─→ token ──┐
                                                                   │
   ┌────────────────────── render ←──────────────────────────────┘

Twenty-four blocks of Qwen, four parties, one prompt. The chat client owns the tokenizer, the embedding, and the first slice. Each peer owns a contiguous range. The tail peer owns the lm-head, samples a token, and streams it back upstream so the next forward can start.

The same protocol scales to DeepSeek-Coder-33B. Same shape, just a longer stack and bigger slices.

Why no single peer is enough

llama.cpp at Q4_K_M needs roughly 19GB for DeepSeek-Coder-33B's weights, plus headroom for KV cache and activations. An RTX 3070 with 8GB cannot hold even half of that. Two of them still can't. Three of them, dividing the layer stack at sensible boundaries, can — because each only ever loads its own slice's weights.

This is the key trick. A peer's VRAM budget bounds the slice it can host, not the size of the model the swarm can serve. As long as some union of peers covers the layer range [0, N), a chain exists.
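
The feasibility condition is short enough to write down. A minimal sketch, assuming slices advertise as half-open (start, end) ranges; this is not IntelNav's actual scheduler:

// Does the union of advertised slices cover layers [0, n)?
// Greedy sweep over a sorted list of half-open ranges.
fn chain_exists(mut slices: Vec<(usize, usize)>, n: usize) -> bool {
    slices.sort();
    let mut frontier = 0;
    for (start, end) in slices {
        if start > frontier {
            return false; // a gap no peer covers
        }
        frontier = frontier.max(end);
        if frontier >= n {
            return true;
        }
    }
    frontier >= n
}

// chain_exists(vec![(0, 6), (6, 12), (12, 18), (18, 24)], 24) == true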

What crosses the wire

Mid-chain peers never see plaintext. What flows between them is a CBOR-framed ForwardHidden message — the tensor of activations after the previous slice's blocks. The chat client tokenizes and embeds locally; only the entry peer ever decrypts the prompt itself, and it does that under an ephemeral X25519 exchange whose shared secret feeds an AES-256-GCM session key.

So the boundary is the chain, not a vendor's TLS terminator. If a hop is curious, it sees opaque activations, not "what is the strategy doc you haven't shown your boss yet." Hidden states are not cryptographically protected — that's an honest line in the threat model — but reconstructing the prompt from a mid-chain residual is meaningfully harder than reading it off a vendor log.
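
For concreteness, here is one plausible shape for that frame in Rust with serde and ciborium. The field names are mine; the post only specifies CBOR framing:

use serde::{Deserialize, Serialize};

// Hypothetical wire schema for ForwardHidden: shape metadata plus raw
// activation bytes. This is all a mid-chain peer ever sees.
#[derive(Serialize, Deserialize)]
struct ForwardHidden {
    session_id: u64,  // which in-flight turn this frame belongs to
    layer_end: u32,   // data is the output of blocks [.., layer_end)
    n_tokens: u32,    // sequence positions carried in this frame
    hidden_size: u32, // elements per position
    data: Vec<u8>,    // raw activation bytes
}

fn encode(msg: &ForwardHidden) -> Vec<u8> {
    let mut buf = Vec::new();
    ciborium::into_writer(msg, &mut buf).expect("CBOR encode");
    buf
}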

Identities are Ed25519. The peer ID is multihash(pubkey). No bearer tokens, no API keys, no account.
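
In ed25519-dalek and multihash terms that derivation could look like the sketch below; whether IntelNav uses the identity code or a digest code is my guess, the post just says multihash(pubkey):

use ed25519_dalek::SigningKey;
use multihash::Multihash;

// Peer id as an identity multihash (code 0x00, assumed) over the raw
// 32-byte Ed25519 public key.
fn peer_id(key: &SigningKey) -> Multihash<64> {
    let pubkey = key.verifying_key().to_bytes();
    Multihash::wrap(0x00, &pubkey).expect("32 bytes fit in a 64-byte multihash")
}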

Why llama.cpp instead of PyTorch

PyTorch was the obvious choice and I deliberately did not pick it. Three reasons:

  1. Vendor-agnostic GPU. llama.cpp's ggml backend supports CUDA, ROCm, Vulkan, SYCL, and Apple Silicon from the same source tree. A swarm has to absorb whatever GPU a volunteer happens to own. Forcing CUDA-only would shrink the population to one vendor.
  2. Single-file weights. A GGUF is a self-describing file with quantization baked in. Chunking it for distribution is a content-hash problem, not a serialization problem.
  3. Memory predictability. ggml's memory layout for a layer range is deterministic — I can probe a peer's free VRAM and tell them exactly which slices fit. PyTorch's autograd graphs and CUDA caching allocator make that prediction noisy. The sketch after this list shows the arithmetic.
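
Illustrative only, and the per-layer numbers are my own arithmetic rather than ggml's real accounting, but the fit check reduces to something like:

// Can a peer with free_vram bytes host a slice of n_layers blocks?
// With a deterministic layout this is multiplication, not profiling.
fn slice_fits(free_vram: u64, n_layers: u64, weight_bytes_per_layer: u64,
              kv_bytes_per_layer: u64, headroom: u64) -> bool {
    n_layers * (weight_bytes_per_layer + kv_bytes_per_layer) + headroom <= free_vram
}

// e.g. ~19 GB of Q4_K_M weights over DeepSeek-Coder-33B's 62 blocks is
// ~310 MB/layer, so an 8 GB card keeping 1 GB of headroom hosts around
// 20 blocks once KV cache is budgeted.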

The price is that I had to fork llama.cpp to expose layer-range forward and a partial-model loader. That fork lives at IntelNav/llama.cpp and CI builds prebuilt tarballs that intelnav-node dlopens at runtime, so the Rust crate never has to link the C++ build directly.
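
From the Rust side that seam can be pictured with the libloading crate. The library path and C symbol below are invented for illustration; only the load-at-runtime pattern is from the post:

use libloading::{Library, Symbol};

// Open the prebuilt ggml runtime at startup instead of linking it.
// inav_forward_range is a hypothetical C ABI: run blocks [start, end)
// in place over a hidden-state buffer of n_tokens rows.
fn load_runtime(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    unsafe {
        let lib = Library::new(path)?;
        let forward: Symbol<unsafe extern "C" fn(*mut f32, u32, u32, u32) -> i32> =
            lib.get(b"inav_forward_range\0")?;
        let _ = forward; // in reality held for the life of the process
    }
    Ok(())
}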

Peer discovery

Slices are advertised on a Kademlia DHT. A provider record is keyed by

blake3("intelnav/shard/v1|<model_cid>|<start>|<end>")

and carries the peer's id, its listen multiaddrs, an optional chunks_url for downloading the shard, and an optional forward_url for inference. Multiple peers PUT under the same key — Kademlia stores them as separate records; the consumer dedupes on peer_id and freshness-ranks on minted_at.
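
The derivation, as written, with the blake3 crate (assuming the bracketed fields interpolate as decimal integers):

// blake3 over the pipe-delimited preimage from the post.
fn shard_key(model_cid: &str, start: u32, end: u32) -> [u8; 32] {
    let preimage = format!("intelnav/shard/v1|{model_cid}|{start}|{end}");
    *blake3::hash(preimage.as_bytes()).as_bytes()
}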

When a chat client boots, it walks the DHT, ranks the candidates for each range by TCP probe latency, builds a ChainTarget by greedy-picking the best provider per range, and hands the result to the chain driver. If a hop dies mid-stream, the driver fails over to the next-best candidate for that range without restarting the turn.
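
The ranking step is plain TCP probing, something like the sketch below with std only; the real driver presumably probes concurrently and caches results:

use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

// Probe each candidate with a bounded TCP connect, fastest first.
// Unreachable candidates drop out instead of poisoning the ranking.
fn rank_by_probe(addrs: Vec<SocketAddr>) -> Vec<(SocketAddr, Duration)> {
    let mut ranked: Vec<(SocketAddr, Duration)> = addrs
        .into_iter()
        .filter_map(|addr| {
            let t0 = Instant::now();
            TcpStream::connect_timeout(&addr, Duration::from_millis(500))
                .ok()
                .map(|_stream| (addr, t0.elapsed()))
        })
        .collect();
    ranked.sort_by_key(|&(_, latency)| latency);
    ranked
}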

Two binaries, one library

The codebase has two binaries (intelnav and intelnav-node) sharing one app library. The split exists for a single user-visible reason: closing the chat window can't take you off the swarm. If hosting and chatting were the same process, every user would inadvertently leave the moment they closed their terminal, and the swarm would be entirely transient.

The chat binary spawns a client-only libp2p host — DHT queries, no announce loop. The node binary is the daemon: full libp2p, announce loop every 5 minutes, in-process chunk HTTP server, in-process inference forward listener, drain watchdog, control RPC over a Unix socket. It runs as a systemd user unit, so it survives reboots without root.

Both binaries load the same Ed25519 seed from ~/.local/share/intelnav/peer.key, so they appear to the swarm as the same peer with the same id. No double identity, no IPC ceremony.
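
Loading that shared seed is small enough to sketch with ed25519-dalek v2 (tilde expansion and real error handling left out):

use ed25519_dalek::SigningKey;
use std::{fs, io};

// Read the 32-byte seed both binaries share and rebuild the keypair.
// Same seed in, same public key and therefore same peer id out.
fn load_identity(path: &str) -> io::Result<SigningKey> {
    let bytes = fs::read(path)?;
    let seed: [u8; 32] = bytes
        .get(..32)
        .and_then(|s| s.try_into().ok())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "need a 32-byte seed"))?;
    Ok(SigningKey::from_bytes(&seed))
}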

The contribution gate

There is no read-only mode. On first launch, after the TUI generates a config and pulls a signed bootstrap seed list, the user picks a slice their machine can host or takes the relay-only path. Chat doesn't open until that call is made. A swarm that reads without giving back is just the people who give back; the gate makes that explicit.

Where this sits in the literature

IntelNav is in the same family as Petals (Borzunov et al. 2022) and SWARM Parallelism (Ryabinin et al. 2023). The shared idea — pipeline-parallel transformer inference across volunteer machines — is a decade-old research thread now. What's specific to IntelNav is the combination of a vendor-agnostic ggml runtime, a content-addressed model store with on-the-fly chunking, a Kademlia shard index, and a hard contribution gate. The first three are engineering. The fourth is a social design choice.

What's slow, and why I'm fine with that

A four-hop chain is slower than a hosted call to a vendor model. Of course it is. Every hop adds a round trip and an activation copy; residential uplinks are not datacenter NICs. The interesting question is not whether a swarm can beat a datacenter today — it can't — but whether the protocol scales as the population grows.

Tor was unusably slow in 2003. BitTorrent was unusably slow in 2002. Both were population problems, not protocol problems. IntelNav is in the same place. If you want fast inference today, use a vendor; if you want inference that no single party logs, run a node.

What's next

The forward channel currently runs over plain TCP framed with CBOR; an upcoming patch wraps it in Noise XX so the inference path matches the libp2p hops. After that: signed bootstrap manifests (today it's trust-on-first-fetch with caching), and a real evaluation pass on the 33B reference deployment.

If you want to try it on Linux:

curl -fsSL https://intelnav.net/install.sh | sh

Or pull the source and run a three-peer chain on one machine:

bash scripts/local-swarm.sh setup
bash scripts/local-swarm.sh start
bash scripts/local-swarm.sh ask "what is 17 squared?"

The protocol is the real one. The sandbox is just that all four parties live on the same box.