hi i’m sathwick.

I Rebuilt YouTube’s Load Balancing Algorithm in Go

2026-04-20T00:00:00+00:00

If you had to guess how a system like YouTube distributes traffic across millions of backend servers, you’d probably default to a classic approach like round-robin load balancing.

But Prequal challenges that intuition. Instead of balancing traffic evenly, it focuses on balancing wait time, routing requests based on how quickly they can actually be served rather than just spreading them uniformly.

According to Google, this approach is already deployed across 20+ services, including YouTube’s serving stack (NSDI ‘24 paper).

Over the past few weeks, I’ve been reimplementing this algorithm in Go, partly to understand it deeply and partly for the bragging rights of building my own load balancer from scratch.

This post is a technical walkthrough of both the paper and the codebase.

I believe the most interesting part of this repo is not just that it implements Prequal. It is that the repo preserves the engineering process of getting to a result you can trust. There are wrong runs, methodological mistakes, a regime pivot, overhead profiling, and a final bounded claim rather than an “it worked on my machine.”

Key Takeaways

Prequal is a load-balancing algorithm Google reports deploying across 20+ services, including YouTube’s serving stack (NSDI ‘24 paper).

This Go reimplementation, packaged as a Kubernetes ingress controller, cuts p99 tail latency by 8.6x vs round-robin in a paper-aligned heterogeneous regime (16 backends, 16x service-time skew, I/O-bound).

In a small CPU-bound regime (4 backends), the same implementation is roughly 25% slower than round-robin. The negative case is published alongside the positive one.

The interesting engineering story is not the algorithm itself. It is the benchmark protocol, investigation trail, and regime pivot that turned a 10x-worse false negative into a bounded, defensible claim.

What problem does Prequal solve?

Prequal, introduced at NSDI ‘24 by Wydrowski et al., replaces CPU-based balancing with active probing of two per-backend signals: requests-in-flight (RIF) and recent latency. A hot-cold lexicographic rule picks the lowest-latency backend below an RIF quantile threshold (default 0.75), falling back to lowest-RIF when every candidate is congested.

Prequal’s central claim, from the paper, is that the right signal for load balancing is not CPU utilization but expected wait time, and the paper reports that Google runs this approach across 20+ services including YouTube. The algorithm replaces smoothed load metrics with active probes of requests-in-flight and latency, then uses a hot-cold lexicographic rule on those two signals to pick a backend.

The paper starts from a real production observation inside Google: in large multi-tenant systems, balancing CPU evenly across replicas is not the same thing as minimizing latency. A backend can look “lightly loaded” according to a smoothed resource metric and still be a bad place to send the next request because it is on a noisy host, has a growing queue, or has just crossed into a regime where service time gets ugly.

That is part of what makes the paper compelling. The authors are not proposing a clever synthetic algorithm in the abstract. They are describing the load-balancing approach Google says it uses in production, especially in YouTube’s serving stack, after living with the failure modes of more conventional strategies.

Prequal’s answer is to use two signals:

RIF: requests in flight
latency: a backend-reported estimate of recent service latency

And then to sample those signals by probing backends asynchronously.

The selection rule from the paper is the part worth remembering. Prequal does not combine latency and RIF into one score by default. It uses a lexicographic rule:

Split candidates into “cold” and “hot” using an RIF quantile threshold.
If any cold candidates exist, pick the one with the lowest latency.
If every candidate is hot, pick the one with the lowest RIF.

That rule matters because it captures something simple and useful:

latency is the best tie-breaker among backends that are not yet visibly congested
once everything is congested, queue depth wins and you should pick the least loaded one

The paper calls this the hot-cold lexicographic rule, or HCL. In this repo, that is the heart of the algorithm.

If you have read about the power of two choices algorithm before, Prequal lives in the same family. Power of two picks two random backends and sends the request to whichever has fewer in-flight connections. It is cheap, surprisingly close to optimal, and widely deployed. HAProxy’s own benchmark (Power of Two Load Balancing) shows it beating round-robin on peak connection skew but still losing a few percent to a full least-connections scan. Prequal generalizes the idea: instead of two random picks checked synchronously, it keeps a small pool of asynchronous probe results and selects from that pool using both RIF and latency, not just connection count.
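For reference, here is a minimal sketch of plain power of two choices, the baseline idea that Prequal extends. It is not code from the repo: the function name, the `newRNG` helper, and the use of a bare slice of in-flight counts are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newRNG returns a seeded generator so runs are repeatable (illustrative helper).
func newRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// pickTwoChoices samples two distinct backends uniformly at random and
// returns the index of the one with fewer requests in flight.
// rif[i] is the current in-flight count for backend i.
func pickTwoChoices(rng *rand.Rand, rif []int) int {
	n := len(rif)
	if n == 0 {
		return -1
	}
	if n == 1 {
		return 0
	}
	a := rng.Intn(n)
	b := rng.Intn(n - 1)
	if b >= a {
		b++ // shift so that b can never equal a
	}
	if rif[b] < rif[a] {
		return b
	}
	return a
}

func main() {
	rng := newRNG(1)
	rif := []int{9, 2, 7, 4} // backend 0 is the most loaded
	for i := 0; i < 5; i++ {
		fmt.Println("picked backend", pickTwoChoices(rng, rif))
	}
}
```

With these counts, backend 0 is never picked, because whichever backend it is compared against has strictly fewer requests in flight. Note that the decision uses only connection count, which is the gap Prequal fills by adding a latency signal.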

The other big paper idea is async probing. Synchronous probing would put an extra network hop in the critical path of every request. Prequal instead probes off the request path, stores recent probe observations in a bounded pool, and reuses them enough to be cheap without letting them go stale.

What I’ve actually built

At runtime this project is one Go binary with two jobs:

a Kubernetes controller that watches Ingress and EndpointSlice
an HTTP reverse proxy that receives requests and selects backends

Around that, the repo includes:

a Rust backend used for controlled benchmarks
benchmark manifests for uniform and heterogeneous workloads
k6 scripts for open-loop, ramp, burst, overload, multi-route, and long-duration tests
Prometheus and Grafana assets
frozen benchmark reports and investigation logs

The top-level structure is clean and maps well to the architecture:

controller/      Kubernetes reconciliation and route state
server/          Reverse proxy and request-path selection
loadbalancer/    Prequal, least-connections, round-robin, probe logic, RIF, latency, pools
backend/         Rust benchmark backend exposing /work and /prequal/probe
observability/   Prometheus metrics
tree/            Host/path trie for ingress routing
benchmark/       Manifests, k6 scripts, dashboards, reports, investigations, raw results

The entrypoint in main.go wires all of that together:

store := controller.NewBackendIPStore()
ctrl := controller.NewController(factory, store, queue)

tracker := &loadbalancer.RIFTracker{}
latencyTracker := loadbalancer.NewLatencyTracker()

cfg := loadbalancer.DefaultProbeConfig()
cfg.ApplyEnv()

pools := pool.NewRoutePools(pool.PoolConfig{
    MaxSize:     cfg.PoolMaxSize,
    MaxAge:      cfg.PoolMaxAge,
    ReuseLimit:  cfg.PoolReuseLimit,
    QRIF:        cfg.QRIF,
    MaxProbeAge: cfg.MaxProbeAge,
}, cfg.PoolMaintenanceInterval)

prober := loadbalancer.NewProber(pools, store, cfg, stop)
proxyServer := server.NewProxyServerWithConfig(
    ctrl.GetRouter(), store, selectors, tracker, latencyTracker, pools, prober, serverCfg,
)

That composition is the design of the load balancer:

the controller owns route and endpoint discovery
the proxy owns request forwarding
the load balancer owns route-local state and selection policy

Control plane: translating Kubernetes into route state

The control plane lives in controller/. It uses shared informers to watch Ingress and EndpointSlice, then builds two in-memory structures:

a host/path router
a route-key to endpoint list store

The important point is that the controller does not directly configure nginx or write files. It builds local state for the in-process proxy.

Watching ingress and endpoints

controller.NewController registers event handlers for both resource types:

ingress add, update, delete
endpointslice add, update, delete

On an ingress event, the controller queues the ingress key for reconciliation. On an endpoint event, it finds which ingresses depend on the service and requeues those.

That dependency mapping is stored in serviceToIngress, which is what lets endpoint churn trigger only the routes that care about it.
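A reverse index like this can be sketched in a few lines. The version below is an illustration rather than the repo's code: the `depIndex` type and its method names are hypothetical, and a real controller would also guard the map with a mutex because informer callbacks run concurrently.

```go
package main

import (
	"fmt"
	"sort"
)

// depIndex maps a service key ("namespace/name") to the set of ingress keys
// that reference it. Type and method names are illustrative.
type depIndex struct {
	serviceToIngress map[string]map[string]struct{}
}

func newDepIndex() *depIndex {
	return &depIndex{serviceToIngress: map[string]map[string]struct{}{}}
}

// setIngress replaces the set of services recorded for one ingress.
func (d *depIndex) setIngress(ingress string, services []string) {
	// Drop old edges first so removed references stop triggering requeues.
	for svc, set := range d.serviceToIngress {
		delete(set, ingress)
		if len(set) == 0 {
			delete(d.serviceToIngress, svc)
		}
	}
	for _, svc := range services {
		if d.serviceToIngress[svc] == nil {
			d.serviceToIngress[svc] = map[string]struct{}{}
		}
		d.serviceToIngress[svc][ingress] = struct{}{}
	}
}

// ingressesFor returns the ingress keys to requeue when a service's
// endpoints change, sorted for stable output.
func (d *depIndex) ingressesFor(service string) []string {
	out := make([]string, 0, len(d.serviceToIngress[service]))
	for ing := range d.serviceToIngress[service] {
		out = append(out, ing)
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := newDepIndex()
	idx.setIngress("default/web", []string{"default/api", "default/static"})
	idx.setIngress("default/admin", []string{"default/api"})
	fmt.Println(idx.ingressesFor("default/api"))    // [default/admin default/web]
	fmt.Println(idx.ingressesFor("default/static")) // [default/web]
}
```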

The ingress reconciliation path in controller/controller.go does four things:

filters to this controller’s class
parses rules into route specs
updates the router
refreshes the endpoint store for each referenced service/port

The filtering logic accepts:

spec.ingressClassName == "prequal"
legacy kubernetes.io/ingress.class: prequal
a fallback label ingress.class=prequal
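As a rough illustration of that filter, the sketch below checks the three sources and accepts the ingress if any of them names the class. The `ingressMeta` struct is a hypothetical stand-in for the handful of Ingress fields involved (the real code works on Kubernetes API types), and treating the three checks as a plain OR is my assumption about precedence.

```go
package main

import "fmt"

// ingressMeta is a stand-in for the few Ingress fields the filter needs.
// It is illustrative only and is not a Kubernetes API type.
type ingressMeta struct {
	IngressClassName *string
	Annotations      map[string]string
	Labels           map[string]string
}

const className = "prequal"

// ownedByPrequal reports whether any of the three accepted markers
// names this controller's class.
func ownedByPrequal(in ingressMeta) bool {
	if in.IngressClassName != nil && *in.IngressClassName == className {
		return true
	}
	// Reading a missing key (or a nil map) yields "", which never matches.
	if in.Annotations["kubernetes.io/ingress.class"] == className {
		return true
	}
	return in.Labels["ingress.class"] == className
}

func main() {
	name := "prequal"
	fmt.Println(ownedByPrequal(ingressMeta{IngressClassName: &name})) // true
	fmt.Println(ownedByPrequal(ingressMeta{
		Annotations: map[string]string{"kubernetes.io/ingress.class": "nginx"},
	})) // false
}
```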

Route matching with a trie

The router in controller/router.go delegates path matching to tree/, which implements a segment trie. Each host gets a HostConfig with a list of paths plus a trie built from those paths.

tree.Match supports:

exact paths
prefix paths
longest-prefix semantics
default-host fallback when a specific host is missing

That means route resolution is:

match host
walk the trie by URL segments
prefer exact matches, otherwise keep the best prefix match

Kubernetes objects are converted once into RouteSpec, and the request path never has to understand Kubernetes types.
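To make the matching rules concrete, here is a small segment trie with exact and longest-prefix matching. It is a sketch under assumptions, not the code in tree/: the `node` type, its fields, and the choice to store route keys as plain strings are all mine, and host handling is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one path segment in the trie. Field names are illustrative.
type node struct {
	children map[string]*node
	exact    string // route key for an exact path ending here ("" if none)
	prefix   string // route key for a prefix path ending here ("" if none)
}

func newNode() *node { return &node{children: map[string]*node{}} }

// segments splits "/api/v1/" into ["api", "v1"], ignoring empty segments.
func segments(path string) []string {
	return strings.FieldsFunc(path, func(r rune) bool { return r == '/' })
}

func (n *node) insert(path, routeKey string, exact bool) {
	cur := n
	for _, seg := range segments(path) {
		next, ok := cur.children[seg]
		if !ok {
			next = newNode()
			cur.children[seg] = next
		}
		cur = next
	}
	if exact {
		cur.exact = routeKey
	} else {
		cur.prefix = routeKey
	}
}

// match walks the trie by URL segment. An exact route on the full path wins;
// otherwise the deepest prefix route seen along the way is returned.
func (n *node) match(path string) (string, bool) {
	cur, best := n, n.prefix
	matchedAll := true
	for _, seg := range segments(path) {
		next, ok := cur.children[seg]
		if !ok {
			matchedAll = false
			break
		}
		cur = next
		if cur.prefix != "" {
			best = cur.prefix
		}
	}
	if matchedAll && cur.exact != "" {
		return cur.exact, true
	}
	return best, best != ""
}

func main() {
	root := newNode()
	root.insert("/", "default", false)
	root.insert("/api", "api", false)
	root.insert("/api/v1/users", "users-exact", true)
	for _, p := range []string{"/api/v1/users", "/api/v1/users/7", "/apiary"} {
		key, _ := root.match(p)
		fmt.Println(p, "->", key)
	}
}
```

Matching by whole segments is what keeps /apiary from being captured by the /api prefix: it falls through to the root route instead.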

Endpoint storage is route-local

The controller stores endpoints in BackendIPStore, keyed by a route key derived from namespace, service, and port:

namespace/service
namespace/service:port
namespace/service:portName

That matters because the load-balancing state is also route-local. If two ingress routes point at different services, their probe history does not mix. If two routes point at the same service but different ports, their state stays separate too.

Dataplane: the reverse proxy request path

The dataplane lives in server/server.go. This is where a request enters the proxy, gets matched to a route, resolves candidate backends, triggers async probes, selects one backend, and is forwarded with httputil.ReverseProxy. For the full upstream context of how a request reaches this point (DNS, service mesh, kube-proxy, endpoints), see my earlier walkthrough on the request flow from a user to a Kubernetes pod.

The flow is:

normalize host
match route from the trie
fetch candidate backends from the endpoint store
trigger async probes for that route
pick a backend using the requested algorithm
increment RIF counters
proxy the request
record observed latency locally

That is all in one request handler, which makes the architecture easy to follow.

The selection branch is especially important:

func (p *ProxyServer) selectBackend(routeKey, algo string, backends []*controller.Endpoint) (*controller.Endpoint, error) {
    switch algo {
    case "prequal", "":
        entry, err := p.pools.Select(routeKey, backends)
        if err != nil {
            return nil, err
        }
        p.pools.IncrementRIF(routeKey, entry.Backend)
        return entry.Endpoint, nil
    default:
        sel, exists := p.selectors[algo]
        if !exists {
            return p.selectBackend(routeKey, "prequal", backends)
        }
        return sel.Select(backends)
    }
}

A few design choices I’ve made here:

First, Prequal is the default. If the ingress annotation lb/algo is empty, the proxy uses Prequal.

Second, round-robin and least-connections are included as alternative selectors.

This makes the benchmark harness clean. The same controller, proxy, transport, and backend stack can be benchmarked with different selection rules by patching one ingress annotation.

Third, the proxy increments both a global RIF tracker and the selected pool entry’s RIF view. That keeps the request path’s immediate state and the pool’s sampled state reasonably aligned.

How is Prequal implemented in Go?

The implementation is split across:

loadbalancer/prober.go
loadbalancer/pool/pool.go
loadbalancer/pool/pools.go
loadbalancer/rif.go
loadbalancer/latency.go
loadbalancer/config.go

This is the core of the repo.

Route-local probe pools

The paper’s async probing design only works if sampled state is bounded and per-route. That is what RoutePools does: it owns one ProbePool per route key.

Each ProbeEntry holds:

backend address
endpoint pointer
RIF
latency
probe timestamp
UsesLeft

UsesLeft is the local implementation of probe reuse. A probe can be selected a limited number of times before it is evicted.

The pool is bounded by both size and age:

MaxSize
MaxAge
MaxProbeAge

That gives the load balancer three protection mechanisms against stale decisions:

cap how many probe samples are retained
remove entries that are too old in wall-clock terms
remove entries once they have been reused enough
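The three mechanisms can be sketched as one small type. This is an illustration, not the repo's ProbePool: the names are hypothetical, there is no locking, only a single age limit is modeled, and evicting the oldest entry when the pool is full is my assumption about the size cap.

```go
package main

import (
	"fmt"
	"time"
)

type entry struct {
	Backend  string
	RIF      int
	Latency  time.Duration
	ProbedAt time.Time
	UsesLeft int
}

type pool struct {
	maxSize    int
	maxAge     time.Duration
	reuseLimit int
	entries    []*entry
}

// demoPool and at are small helpers for the example below.
func demoPool() *pool       { return &pool{maxSize: 2, maxAge: time.Second, reuseLimit: 2} }
func at(ms int64) time.Time { return time.UnixMilli(ms) }

// add inserts a fresh probe result, evicting the oldest entry when full.
func (p *pool) add(backend string, rif int, lat time.Duration, now time.Time) {
	if len(p.entries) > 0 && len(p.entries) >= p.maxSize {
		oldest := 0
		for i, e := range p.entries {
			if e.ProbedAt.Before(p.entries[oldest].ProbedAt) {
				oldest = i
			}
		}
		p.entries = append(p.entries[:oldest], p.entries[oldest+1:]...)
	}
	p.entries = append(p.entries, &entry{
		Backend: backend, RIF: rif, Latency: lat, ProbedAt: now, UsesLeft: p.reuseLimit,
	})
}

// expire drops entries older than maxAge.
func (p *pool) expire(now time.Time) {
	kept := p.entries[:0]
	for _, e := range p.entries {
		if now.Sub(e.ProbedAt) <= p.maxAge {
			kept = append(kept, e)
		}
	}
	p.entries = kept
}

// use consumes one use of entries[i] and evicts it once it is exhausted.
func (p *pool) use(i int) *entry {
	e := p.entries[i]
	e.UsesLeft--
	if e.UsesLeft <= 0 {
		p.entries = append(p.entries[:i], p.entries[i+1:]...)
	}
	return e
}

func main() {
	p := demoPool()
	p.add("10.0.0.1:8080", 1, 12*time.Millisecond, at(0))
	p.add("10.0.0.2:8080", 4, 9*time.Millisecond, at(100))
	p.add("10.0.0.3:8080", 0, 15*time.Millisecond, at(200)) // full: oldest entry goes
	p.use(0)
	p.use(0) // second use exhausts the entry
	p.expire(at(1300))
	fmt.Println("entries left:", len(p.entries)) // entries left: 0
}
```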

HCL in code

The HCL selection rule is implemented directly in loadbalancer/pool/pool.go. This is the most important code in the project.

The key selection logic looks like this:

threshold := rifs[idx]

var bestCold *ProbeEntry
var bestHot *ProbeEntry
allHot := true

for i, e := range pool.entries {
    if e.RIF <= threshold {
        allHot = false
        if bestCold == nil || e.Latency < bestCold.Latency {
            bestCold = e
            bestColdIndex = i
        }
        continue
    }
    if bestHot == nil || e.RIF < bestHot.RIF {
        bestHot = e
        bestHotIndex = i
    }
}

selected := bestCold
if allHot {
    selected = bestHot
}

That is a direct implementation of the paper’s idea:

compute the RIF quantile threshold
treat RIF <= threshold as cold
among cold entries, minimize latency
if nothing is cold, minimize RIF

It is not trying to be clever, which I think is the right call here. Algorithm code gets dangerous when it becomes hard to explain. This one stays direct.
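The excerpt above starts from an already-computed threshold, so here is a hedged sketch of how a quantile threshold could be derived and fed into the same rule. The floor-index quantile, the `probe` struct, and the choice of which RIF samples to rank are assumptions for illustration and are not taken from pool.go.

```go
package main

import (
	"fmt"
	"sort"
)

// rifQuantile returns the q-quantile (0 <= q <= 1) of the RIF samples using
// a simple floor-index rule on a sorted copy.
func rifQuantile(samples []int, q float64) int {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int(nil), samples...)
	sort.Ints(sorted)
	idx := int(q * float64(len(sorted)-1))
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

type probe struct {
	Backend   string
	RIF       int
	LatencyMs float64
}

// selectHCL applies the hot-cold rule: lowest latency among cold probes
// (RIF <= threshold), otherwise lowest RIF among hot probes.
func selectHCL(probes []probe, threshold int) (probe, bool) {
	var best probe
	found, bestIsCold := false, false
	for _, p := range probes {
		cold := p.RIF <= threshold
		switch {
		case !found, // first candidate
			cold && !bestIsCold, // any cold probe beats a hot one
			cold && bestIsCold && p.LatencyMs < best.LatencyMs,
			!cold && !bestIsCold && p.RIF < best.RIF:
			best, found, bestIsCold = p, true, cold
		}
	}
	return best, found
}

func main() {
	threshold := rifQuantile([]int{13, 0, 8, 1, 5, 1, 3, 2}, 0.75)
	probes := []probe{{"a", 2, 30}, {"b", 4, 12}, {"c", 9, 3}}
	best, _ := selectHCL(probes, threshold)
	fmt.Println(threshold, best.Backend) // 5 b
}
```

In the example, backend c reports the lowest latency but sits above the threshold, so it is skipped in favor of b, the fastest cold backend.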

The fallback behavior is also worth noting. If the pool has fewer than two entries, the code falls back to a random backend from the full backend list. That is how the system behaves before warmup or after starvation:

if len(pool.entries) < 2 {
    ep := allBackends[rand.Intn(len(allBackends))]
    return &ProbeEntry{Backend: ep.String(), Endpoint: ep}, len(pool.entries), nil
}

The benchmark campaign explicitly tracks how often that fallback happens. In the decisive Prequal runs, it is zero, which matters because otherwise a “Prequal win” might secretly be a random-selection run.

Async probing

loadbalancer.Prober is the other half of the design. It is responsible for:

sending HTTP probes to /prequal/probe
decoding backend-reported RIF and latency
rejecting stale probe responses
feeding entries into the route-local pool
keeping pools warm in the background

The request path never blocks on probe completion. Instead, TriggerProbes(routeKey) enqueues work:

func (pr *Prober) TriggerProbes(routeKey string) {
    if routeKey == "" {
        return
    }
    n := pr.probesForQuery()
    for i := 0; i < n; i++ {
        pr.enqueueProbe(routeKey)
    }
}

Workers consume those route keys, sample a backend for the route, call the probe endpoint, and add a ProbeEntry to the right pool.
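The property that matters is that enqueueing never blocks the request path. One common way to get that in Go is a buffered channel with a non-blocking send, sketched below. This is not the repo's Prober: the type, the drop-on-full policy, and the callback signature are assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type prober struct {
	queue   chan string  // route keys waiting to be probed
	dropped atomic.Int64 // probes discarded because the queue was full
}

func newProber(queueSize int) *prober {
	return &prober{queue: make(chan string, queueSize)}
}

// enqueue never blocks: when the queue is full the probe is dropped and
// counted, so a slow prober cannot add latency to the request path.
func (p *prober) enqueue(routeKey string) bool {
	select {
	case p.queue <- routeKey:
		return true
	default:
		p.dropped.Add(1)
		return false
	}
}

// run starts n workers that drain the queue and call probe for each key.
// The workers exit once the queue is closed and empty.
func (p *prober) run(n int, probe func(routeKey string)) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range p.queue {
				probe(key)
			}
		}()
	}
	return wg
}

func main() {
	p := newProber(2)
	for i := 0; i < 3; i++ {
		p.enqueue("default/api:80") // the third call finds the queue full
	}
	wg := p.run(2, func(key string) { fmt.Println("probing", key) })
	close(p.queue)
	wg.Wait()
	fmt.Println("dropped:", p.dropped.Load()) // dropped: 1
}
```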

The configuration lives in loadbalancer/config.go, and the defaults are important because they define the repo’s behavior:

pool size: 16
pool max age: 1s
reuse limit: 3
QRIF: 0.75
probes per query: 1.0
probe workers: 8
background interval: 100ms
probe timeout: 100ms
max probe age: 2s

Those values are not arbitrary, but they are also not identical to the paper’s defaults. More on this later.

RIF and latency tracking

Besides backend-reported probe data, the proxy maintains its own local trackers:

RIFTracker uses sync.Map and atomic.Int64
LatencyTracker keeps a per-backend circular buffer and reports a local median

These are used for two things:

supporting least-connections
optionally seeding the probe pool when the async prober is disabled

The pool can be bootstrap-seeded from local observations if the async prober is absent, but once the prober exists, backend probes become the authoritative source.
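For readers who have not combined sync.Map with atomic counters before, here is a minimal sketch of both trackers. It mirrors the description above, but the type names, method names, and buffer handling are my own guesses at a reasonable shape rather than the contents of rif.go or latency.go.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// rifTracker keeps one atomic in-flight counter per backend.
type rifTracker struct{ counts sync.Map } // backend -> *atomic.Int64

func (t *rifTracker) counter(backend string) *atomic.Int64 {
	c, _ := t.counts.LoadOrStore(backend, new(atomic.Int64))
	return c.(*atomic.Int64)
}

func (t *rifTracker) Inc(backend string) int64 { return t.counter(backend).Add(1) }
func (t *rifTracker) Dec(backend string) int64 { return t.counter(backend).Add(-1) }
func (t *rifTracker) Get(backend string) int64 { return t.counter(backend).Load() }

// latencyRing is a fixed-size circular buffer of recent latencies (ms).
type latencyRing struct {
	mu     sync.Mutex
	buf    []float64
	next   int // slot the next sample overwrites
	filled int // number of valid samples, at most len(buf)
}

func newLatencyRing(size int) *latencyRing {
	if size < 1 {
		size = 1
	}
	return &latencyRing{buf: make([]float64, size)}
}

func (r *latencyRing) Record(ms float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.buf[r.next] = ms
	r.next = (r.next + 1) % len(r.buf)
	if r.filled < len(r.buf) {
		r.filled++
	}
}

// Median returns the median of the retained samples, or false when empty.
func (r *latencyRing) Median() (float64, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.filled == 0 {
		return 0, false
	}
	s := append([]float64(nil), r.buf[:r.filled]...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid], true
	}
	return (s[mid-1] + s[mid]) / 2, true
}

func main() {
	var rif rifTracker
	rif.Inc("10.0.0.1:8080")
	rif.Inc("10.0.0.1:8080")
	rif.Dec("10.0.0.1:8080")
	ring := newLatencyRing(3)
	for _, ms := range []float64{10, 50, 20, 5} { // the fourth sample overwrites 10
		ring.Record(ms)
	}
	m, _ := ring.Median()
	fmt.Println(rif.Get("10.0.0.1:8080"), m) // 1 20
}
```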

The benchmark backend

The Rust backend in backend/src/main.rs is part of the implementation model.

It exposes three endpoints:

POST /work
GET /health
GET /prequal/probe

/work simulates the backend’s actual service time. /prequal/probe exposes the two signals Prequal needs.

The backend’s internal state is visible in exactly the way the algorithm expects, which is what makes controlled experiments possible at all.

Work mode: CPU-bound or I/O-bound

The backend has two modes:

CPU-bound SHA256 loop
I/O-bound sleep mode

That switch is controlled by IO_BOUND_MODE.

In CPU-bound mode, each request burns CPU with repeated hashing. In I/O-bound mode, each request sleeps for iterations * IO_BOUND_BASE_US.

That one switch ends up being central to the benchmark story. On small CPU-bound fleets, Prequal loses. In the paper-aligned I/O-bound skewed regime, it wins decisively.

Probe responses are RIF-conditioned

The backend tracks current RIF with an atomic counter and stores recent latency samples in five buckets:

0
1
2..3
4..7
8+

The probe handler looks at the current RIF, chooses the matching bucket, and returns the median latency from that bucket or the nearest non-empty bucket.

This means the probe latency signal is not a raw median across all recent requests. It is conditioned on queue depth.
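The bucket lookup is easy to get subtly wrong, so here is a sketch of it. The actual backend is written in Rust; this version is in Go to match the rest of the post, and the tie-break (preferring the lower bucket when two non-empty buckets are equally close) is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

const numBuckets = 5

// bucketFor maps a RIF value to its bucket: 0, 1, 2..3, 4..7, 8+.
func bucketFor(rif int) int {
	switch {
	case rif <= 0:
		return 0
	case rif == 1:
		return 1
	case rif <= 3:
		return 2
	case rif <= 7:
		return 3
	default:
		return 4
	}
}

// median returns the median of a non-empty slice without modifying it.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// conditionedLatency returns the median latency of the bucket matching the
// current RIF, or of the nearest non-empty bucket when that one is empty.
func conditionedLatency(buckets [numBuckets][]float64, rif int) (float64, bool) {
	want := bucketFor(rif)
	for dist := 0; dist < numBuckets; dist++ {
		for _, i := range []int{want - dist, want + dist} {
			if i >= 0 && i < numBuckets && len(buckets[i]) > 0 {
				return median(buckets[i]), true
			}
		}
	}
	return 0, false
}

func main() {
	var buckets [numBuckets][]float64
	buckets[1] = []float64{10, 12, 14} // samples observed at RIF 1
	buckets[4] = []float64{90, 110}    // samples observed at RIF 8+
	for _, rif := range []int{1, 3, 6} {
		ms, _ := conditionedLatency(buckets, rif)
		fmt.Printf("rif=%d latency=%.0fms\n", rif, ms)
	}
}
```

With this data, a probe at RIF 6 reports 100ms rather than a blended figure, because its own bucket is empty and the 8+ bucket is the closest one holding samples.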

The probe response shape is simple:

{
  "rif": 3,
  "latency_median_ms": 12.5,
  "timestamp_ms": 1710000000000
}

The Go prober then turns that into a ProbeEntry, rejects it if the timestamp is too stale, and inserts it into the right pool.
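A minimal sketch of that decode-and-reject step follows. The JSON field names come from the response shown above; the struct, the function name, and passing time as plain milliseconds are assumptions made to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// probeResponse mirrors the JSON shape returned by /prequal/probe.
type probeResponse struct {
	RIF             int     `json:"rif"`
	LatencyMedianMs float64 `json:"latency_median_ms"`
	TimestampMs     int64   `json:"timestamp_ms"`
}

var errStale = errors.New("probe response is stale")

// parseProbe decodes a probe body and rejects it when its timestamp is more
// than maxAgeMs older than nowMs. Both times are Unix milliseconds.
func parseProbe(body []byte, nowMs, maxAgeMs int64) (probeResponse, error) {
	var r probeResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return r, fmt.Errorf("decode probe: %w", err)
	}
	if nowMs-r.TimestampMs > maxAgeMs {
		return r, errStale
	}
	return r, nil
}

func main() {
	body := []byte(`{"rif": 3, "latency_median_ms": 12.5, "timestamp_ms": 1710000000000}`)

	r, err := parseProbe(body, 1710000000500, 2000) // 500ms old, within 2s
	fmt.Println(r.RIF, r.LatencyMedianMs, err)      // 3 12.5 <nil>

	_, err = parseProbe(body, 1710000005000, 2000) // 5s old
	fmt.Println(err)                               // probe response is stale
}
```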

Fault injection is built in

The backend also supports probe-fault modes:

timeout
HTTP 500
malformed JSON
stale timestamp

This makes it possible to test whether the prober correctly records failures, drops bad data, and avoids poisoning the pool with stale observations.

The observability path

The controller and proxy expose Prometheus metrics for request latency, per-algorithm selections, probe success and failure counts, probe queue depth, pool occupancy, and active backend count. A debug server exposes /metrics, /routes, /healthz, /readyz, and pprof endpoints. That is enough to explain benchmark results, not just report them.

This repo instruments the controller and proxy heavily enough that you can explain a benchmark result rather than just report it.

The Prometheus metrics in observability/metrics.go cover:

total requests and request latency
backend selections by route and algorithm
no-route and no-backend events
reconciliation counts and durations
probes sent, succeeded, failed, dropped
probe queue depth
pool occupancy
selection algorithm usage
active backend count

And the debug server in debug.go exposes:

/metrics
/routes
/healthz
/readyz
pprof endpoints

That instrumentation is what makes the benchmark investigation credible. The repo can answer questions like:

Did Prequal actually avoid the slow replicas?
Was the pool starving?
Were requests falling back to random?
Was the controller CPU-bound?
Was mutex contention the problem?

The observability here is how the benchmark results got debugged, not dashboard decoration.

Benchmarking

A controlled protocol (interleaved algorithm order, controller rollout-restart per run, warmup period, full metadata capture) turned an early 10x-worse false negative into a reproducible 8.6x p99 improvement. The methodology fix alone, with zero algorithm code changed, produced a 56x p99 reduction in the heterogeneous run.

The benchmark harness in benchmark/ is extensive enough that it deserves to be treated as part of the software, not just support files.

There are several traffic models:

steady_state.js
open_loop.js
rate_ramp.js
burst.js
overload.js
long_duration.js
multi_route.js

And there are matching manifests for:

uniform workloads
heterogeneous workloads
multi-route workloads
route-scale
long-duration
fault injection

The comparisons are made by holding everything constant except the algorithm annotation on the ingress.

That gives a fair comparison:

same controller binary
same route matching
same transport
same backend images
same cluster
same load script
same observability stack

Only the backend selection rule changes.

The controlled protocol

The methodology fix alone, with zero algorithm code changed, produced a 56x p99 reduction in the E-B heterogeneous run. The seven competing hypotheses are walked through in the repo’s investigation logs.

An early heterogeneous run made Prequal look dramatically worse than the baselines, but it turned out the main problem was protocol:

algorithms were run sequentially instead of interleaved
controller state leaked across runs
probe pools were not reset cleanly between repetitions

The fix became the canonical benchmark protocol:

interleaved algorithm order
controller rollout restart before every run
warmup period before measurement
full metadata capture per run

That protocol is encoded in benchmark/scripts/run_interleaved_campaign.sh and benchmark/scripts/run_campaign.sh.

Why open-loop matters

The decisive C2 benchmark uses k6 constant-arrival-rate mode:

export const options = {
  scenarios: {
    open_loop: {
      executor: 'constant-arrival-rate',
      rate,
      timeUnit: '1s',
      duration,
      preAllocatedVUs,
      maxVUs,
    },
  },
}

This is a better choice than closed-loop when the point is to expose queueing behavior. Closed-loop traffic self-throttles when latency rises. Open-loop keeps pushing at the target rate and makes tail failures visible.

The ramp benchmark does the complementary thing: it increases arrival rate in stages until the system crosses its comfortable regime.

Together, those two tests are enough to answer the important question: does Prequal help under skew and near saturation, which is exactly where the paper says it should?

How much faster is Prequal than round-robin?

On a 16-backend cluster with 16x service-time skew (I/O-bound, open-loop), this Go Prequal implementation cut p99 tail latency by 8.6x versus round-robin and 8.5x versus least-connections. On a 4-backend CPU-bound cluster, the same implementation was roughly 25% slower than round-robin on throughput. Prequal is a tail-latency tool, not a small-fleet tool.

Across five-run interleaved campaigns on a 16-backend cluster with 16x service-time skew, this Prequal implementation cut p99 tail latency by 8.6x against round-robin and 8.5x against least-connections on an open-loop workload, and by 6.8x against round-robin on a rate ramp. On a small 4-backend CPU-bound workload, the same implementation was roughly 25% slower than round-robin on throughput. The repo’s final claim is deliberately bounded, and I think it is the right one.

Small CPU-bound fleet: Prequal loses

In the small-fleet CPU-bound regime, this implementation does not win.

The report’s C1 numbers are:

algorithm	throughput rps	p99 ms
prequal	4162	29.86
round-robin	5366	20.22
least-connections	4965	23.48

That is a 22% throughput deficit versus round-robin (4162 vs 5366 rps), the gap summarized above as roughly 25%.

I profiled it and the overhead investigation concludes that:

there is no hot path dominating controller CPU
mutex contention is negligible
heap use is tiny
the measurable cost is mostly diffuse probe/network competition in a regime where the algorithm does not have enough diversity or skew to pay for itself

Paper-aligned skewed I/O-bound regime: Prequal wins hard

The story changes once the benchmark is moved closer to the paper’s assumptions:

16 backends instead of 4
14 fast, 2 slow
16x service-time skew
I/O-bound backend mode
open-loop and ramp traffic

In the E-B heterogeneous open-loop campaign, the median p99 numbers are:

algorithm	p99 ms	p99.9 ms
prequal	94.20	272.38
round-robin	807.32	887.90
least-connections	802.84	1006.78

That is an 8.6x p99 improvement versus round-robin and 8.5x versus least-connections, the better of the two baselines.

In the E-B ramp campaign:

algorithm	p99 ms	p99.9 ms
prequal	123.39	824.08
round-robin	831.58	1250.41
least-connections	867.93	1596.51

That is still a 6.8x p99 improvement.

The selection-rate data explains why. Prequal pushes traffic almost entirely to the fast backends and drives the two slow replicas down to nearly zero selections per second. Round-robin, by definition, keeps giving the slow pair their fair share. Least-connections improves the p95, but still reacts too slowly to avoid queueing at the slow replicas, so the tail remains pinned near their service time.

This mechanism lines up with the result.

The win is in the tail, not the center

One subtle but important point from the data is that p50 is basically the same across algorithms in the winning regime. The advantage is almost entirely in p99 and p99.9.

That is exactly what you would expect if the algorithm is avoiding pathological queueing rather than making the median request faster.

It is also why Prequal is interesting. If your median is already fine, the only remaining reason to build a more sophisticated load balancer is to keep a minority of requests from getting stuck behind bad backend choices.

How faithful is this Go implementation to the Prequal paper?

This Go implementation defaults to Q_RIF = 0.75 versus the paper’s 0.84, probes-per-query = 1.0 versus the paper’s 3 (testbed) and 5 (YouTube production), and hardcodes probe reuse at 3 rather than deriving it from pool size and fleet size. The frozen benchmark report documents every divergence explicitly.

This repo is faithful to the paper’s central ideas, but it is not a line-by-line reproduction. The frozen report documents the differences clearly, and they matter.

The most important divergences are:

`QRIF` default is `0.75`, not the paper’s `~0.84`

The paper’s baseline uses Q_RIF = 2^(-0.25) ≈ 0.84. This repo defaults to 0.75.

That is within the paper’s recommended band and probably a minor difference, but it is still a difference.

Probes per query is `1.0`, not `3` or `5`

The paper uses 3 probes per query in testbed experiments and mentions 5 in YouTube production. This repo defaults to 1.0.

That choice was motivated by keeping probe overhead reasonable on a small local cluster, but it is a substantial departure.

Probe reuse is a fixed constant

The paper derives reuse behavior from a formula involving pool size, fleet size, probe rate, and removal rate. This repo hardcodes PoolReuseLimit = 3.

That is a reasonable engineering choice for a local implementation, but it means the repo is approximating one part of the paper’s mechanics rather than reproducing it exactly.

Probe removal is maintenance-driven

The paper frames removal as a per-query process. This repo performs cleanup and “remove worst” behavior on a maintenance tick plus reuse depletion.

That keeps work off the request hot path, which is sensible for Go code in a proxy, but it is another behavioral difference.

Backend probing is not yet sampling without replacement

The report points out a latent issue: ProbeRandom picks one backend at a time with rand.Intn, so if probes-per-query were raised above 1, the implementation would not yet match the paper’s “sample without replacement” requirement.
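One possible shape for the fix is a partial Fisher-Yates shuffle, which yields distinct backends for each query. This is a sketch of the general technique and not a patch to ProbeRandom; the function name and the `newRNG` helper are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newRNG returns a seeded generator so runs are repeatable (illustrative helper).
func newRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// sampleDistinct returns up to k distinct indices in [0, n), chosen uniformly
// at random with a partial Fisher-Yates shuffle.
func sampleDistinct(rng *rand.Rand, n, k int) []int {
	if k > n {
		k = n
	}
	if k <= 0 {
		return nil
	}
	idx := make([]int, n)
	for i := range idx {
		idx[i] = i
	}
	for i := 0; i < k; i++ {
		j := i + rng.Intn(n-i) // pick from the not-yet-chosen tail
		idx[i], idx[j] = idx[j], idx[i]
	}
	return idx[:k]
}

func main() {
	rng := newRNG(7)
	// Three probes for one query across 16 backends, never the same backend twice.
	fmt.Println(sampleDistinct(rng, 16, 3))
}
```

This allocates an index slice per call, which is fine at 16 backends. On a large fleet, a version that tracks only the chosen indices would avoid the O(n) setup.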

Final thoughts

Writing this was a lot of fun—and watching Claude run tests and benchmarks made it even better. There were a couple of ideas I wanted to explore further but didn’t get to, mainly because I was short on time and wanted to move on to the next project.

One idea was to build a sidecar container or probe that uses eBPF to attach to the main backend container, collect both latency and RIF, and expose that data through a path that the ingress controller could use to make smarter routing decisions. Another was to design a centralized probe pool that multiple ingress controller pods could read from, or to implement a gossip protocol between controller pods to share this information in a decentralized way.

But eventually, it was time to move on. As always, the next project tends to pull more than the last.

Until next time.

References

Wydrowski, Kleinberg, Rumble, Archer. Load is not what you should balance: Introducing Prequal, NSDI 2024.
Repo benchmark report: benchmark/REPORT.md
Tail-spike investigation: benchmark/investigations/2026-04-19-c2-tail-spike.md
Regime pivot investigation: benchmark/investigations/2026-04-20-regime-pivot.md
Overhead profiling investigation: benchmark/investigations/2026-04-20-prequal-overhead-profiling.md
HAProxy Technologies. Power of Two Load Balancing — context for why Prequal’s quantile-over-pool design is a more sophisticated variant of the same “sample a subset, don’t score every backend” idea.
Full source code: github.com/sathwick-p/prequal — the controller, reverse proxy, Rust benchmark backend, k6 scripts, Prometheus/Grafana assets, and frozen benchmark reports all live here.

Reverse-Engineering Claude Code: A Deep Dive into Anthropic’s AI-Powered CLI

2026-03-31T00:00:00+00:00

Introduction: What is Claude Code?
High-Level Architecture
Startup: The Race Against Time
The Query Engine: Brains of the Operation
The Tool System: 60+ Tools Behind a Single Interface
The Permission System: Safety at Every Layer
Terminal UI: React, but for Your Terminal
The Command System: 100+ Slash Commands
Skills, Plugins, and MCP: The Extensibility Trifecta
Context Management: Fighting the Token Limit
State Management: Immutable Store for a Mutable World
Session Persistence and History
Multi-Agent Architecture: Subagents, Swarms, and Worktrees
Error Recovery: A System That Refuses to Crash
Cost Tracking and Telemetry
Execution Modes: One Codebase, Many Faces
BUDDY: A Tamagotchi-Style AI Pet
KAIROS: Persistent Assistant Mode and Auto-Dreaming
ULTRAPLAN: Remote Planning Sessions
Coordinator Mode: Multi-Agent Orchestrator
The Memory System: Persistent AI Memory
Hooks: User-Defined Automation
Voice Mode, Bridge, and Infrastructure
Vim Mode, Keybindings, and Developer Ergonomics
Key Engineering Patterns and Takeaways
Conclusion

1. Introduction: What is Claude Code?

Claude Code is Anthropic’s official CLI tool — an interactive, AI-powered development assistant that lives in your terminal. It lets developers have natural-language conversations with Claude to edit files, run shell commands, search codebases, manage Git workflows, create pull requests, debug issues, and much more.

But underneath the conversational interface lies a remarkably sophisticated piece of software engineering: a custom React-based terminal renderer, a multi-layered permission system, an elastic tool discovery mechanism, a self-healing query loop with automatic context compression, and an extensibility framework spanning skills, plugins, and the Model Context Protocol (MCP).

This article is a deep technical analysis of the Claude Code source code (330+ utility files, 45+ tool implementations, 100+ slash commands, 146 UI components, and a custom terminal rendering framework), all written in TypeScript with React, running on Bun.

Let’s take it apart, piece by piece.

2. High-Level Architecture

Claude Code follows a layered architecture where each layer has clear responsibilities:

Tech Stack

Layer	Technology
Language	TypeScript (strict mode)
Runtime	Bun (with Node.js 18+ compatibility)
UI Framework	React 18 with custom terminal reconciler
Layout Engine	Yoga (Facebook’s flexbox implementation)
API Client	`@anthropic-ai/sdk`
Extensibility	Model Context Protocol (MCP) SDK
Validation	Zod (schema-driven I/O for all tools)
CLI Framework	Commander.js
Linting	Biome + ESLint

Directory Structure

src/
├── main.tsx                 # Application entry (~800KB, bootstraps everything)
├── QueryEngine.ts           # Conversation management & API orchestration
├── query.ts                 # Query loop state machine (retries, compaction, recovery)
├── Tool.ts                  # Unified tool interface (generic over Input/Output/Progress)
├── tools.ts                 # Tool registry with feature-gated loading
├── commands.ts              # Command registry with lazy dispatch
├── context.ts               # System/user context builder (git, CLAUDE.md, date)
├── cost-tracker.ts          # Per-model usage accumulation and display
├── history.ts               # Session history (JSONL, dedup, paste refs)
├── setup.ts                 # Pre-action configuration and auth
├── entrypoints/             # CLI, SDK, MCP entry points
├── tools/                   # 45+ tool implementations (Bash, FileRead, Agent, etc.)
├── commands/                # 100+ slash command implementations
├── components/              # 146 React terminal components
├── ink/                     # Custom terminal rendering framework (~90 files)
├── services/                # API, analytics, MCP, compact, plugins
├── hooks/                   # 85+ hook implementations
├── state/                   # AppState store (Zustand-like)
├── utils/                   # 330+ utilities (git, config, permissions, etc.)
├── skills/                  # Skill loading, bundled skills
├── keybindings/             # Dynamic keybinding system
├── vim/                     # Full vi/vim mode
├── bridge/                  # CCR bridge (WebSocket to claude.ai)
├── coordinator/             # Multi-agent coordination
├── remote/                  # Remote session management
├── tasks/                   # Background task system
├── migrations/              # Versioned data migrations
└── types/                   # Shared type definitions

3. Startup: The Race Against Time

Claude Code’s startup is aggressively optimized. The goal: minimize time-to-first-render so the developer is never left staring at a blank terminal.

3.1 Parallelized Prefetching

Before any module imports happen, three critical operations fire in parallel:

// main.tsx — lines 1-20, before any other imports
profileCheckpoint('main_tsx_entry')
startMdmRawRead()        // macOS MDM policy read (subprocess)
startKeychainPrefetch()   // OAuth + API key keychain reads (2 subprocesses)

This exploits a clever insight: TypeScript module evaluation takes ~135ms anyway (sequential by nature). By spawning subprocesses immediately, macOS keychain reads (~65ms total) run entirely in parallel with import resolution, becoming effectively free.
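
To make the overlap concrete, here is a minimal sketch of the pattern, not the actual main.tsx code. `fakeKeychainRead`, `getApiKey`, and `startupCost` are hypothetical names, and the timings are simply the figures quoted above.

```typescript
// Hypothetical stand-in for a slow subprocess read (keychain, MDM policy).
function fakeKeychainRead(ms: number): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve("secret"), ms));
}

// 1. Start the slow work immediately and keep the promise; do not await yet.
const keychainPromise: Promise<string> = fakeKeychainRead(65);

// 2. ...synchronous module evaluation would happen here in the real CLI...

// 3. Await only at the point where the value is actually needed.
async function getApiKey(): Promise<string> {
  return keychainPromise;
}

// Back-of-the-envelope cost model for the two orderings.
function startupCost(importMs: number, prefetchMs: number) {
  return {
    sequential: importMs + prefetchMs,          // read first, then import
    overlapped: Math.max(importMs, prefetchMs), // read runs during import
  };
}
```

With the numbers above, `startupCost(135, 65)` gives 200 ms for the sequential ordering versus 135 ms when overlapped, which is why the keychain reads are described as effectively free.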

3.2 Initialization Sequence

The init() function (memoized to prevent re-entrancy) orchestrates 16 setup stages:

Config validation — Parse and validate all JSON config files
Safe environment variables — Apply non-sensitive env vars before trust dialog
CA certificates — Load extra root CAs before first TLS handshake
Graceful shutdown handlers — Register SIGINT/SIGTERM handlers
OAuth population — Async account info fetch
IDE detection — JetBrains, VS Code identification
Remote settings — Fetch managed settings from server (async, awaited later)
Policy limits — Load org-enforced limits (async)
First-start timestamp — Analytics marker
mTLS configuration — Client certificate setup
Proxy agents — Configure HTTP/HTTPS proxies
API preconnection — TCP+TLS handshake overlaps with remaining init
Upstream proxy (CCR) — CONNECT relay for organization credentials
Shell detection — Windows-specific shell resolution
LSP manager — Language Server Protocol cleanup handlers
Team cleanup — Multi-agent swarm cleanup on shutdown

3.3 Fast Paths

Before full initialization, fast paths handle quick-exit commands:

--version — Print version and exit (no init, no React)
--dump-system-prompt — Output the system prompt and exit
mcp serve — Start MCP server mode (different init path)

3.4 Startup Profiling

A sampled profiler (startupProfiler.ts) measures every phase:

100% of internal builds get sampled
0.5% of external users are sampled
CLAUDE_CODE_PROFILE_STARTUP=1 forces full profiling with memory snapshots

The decision is made once at module load time — non-sampled users pay zero profiling overhead.
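
A decide-once sampler along these lines could look roughly like the sketch below. The names (`decideSampling`, `makeCheckpoint`, `SamplingInput`) are hypothetical and the real startupProfiler.ts may be structured quite differently; only the 100% / 0.5% / forced rules come from the description above.

```typescript
type SamplingInput = {
  isInternalBuild: boolean;
  forceProfile: boolean; // e.g. CLAUDE_CODE_PROFILE_STARTUP=1 was set
  roll: number;          // uniform random in [0, 1), drawn once at load
};

const EXTERNAL_SAMPLE_RATE = 0.005; // 0.5% of external users

function decideSampling(input: SamplingInput): boolean {
  if (input.forceProfile) return true;    // explicit opt-in always wins
  if (input.isInternalBuild) return true; // 100% of internal builds
  return input.roll < EXTERNAL_SAMPLE_RATE;
}

// The checkpoint function is chosen up front, so the non-sampled path
// is a no-op rather than a per-call "am I sampled?" check.
function makeCheckpoint(sampled: boolean, sink: string[]): (name: string) => void {
  if (!sampled) return () => {};
  return (name: string) => {
    sink.push(name);
  };
}
```

Passing the random roll in as a parameter, rather than calling `Math.random()` inside, keeps the decision deterministic and easy to test.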

3.5 Entrypoint Resolution

The system identifies its execution context early and sets CLAUDE_CODE_ENTRYPOINT:

Value	Context
`cli`	Interactive terminal session
`sdk-cli`	Non-interactive (print mode, piped)
`mcp`	Running as an MCP server
`local-agent`	Spawned as a subagent
`claude-code-github-action`	GitHub Actions CI

This gates feature loading — for example, REPL components only load in interactive mode.

4. The Query Engine: Brains of the Operation

The query engine is the core loop that manages conversations with Claude. It’s split across two files: QueryEngine.ts (session-level orchestration) and query.ts (per-turn state machine).

4.1 QueryEngine: The Session Coordinator

The QueryEngine class is a singleton per conversation. It persists state across turns and coordinates:

System context building (git status, CLAUDE.md files, date)
Message management (accumulation, normalization, persistence)
API calls (streaming, retries, fallback)
Permission tracking (denial counts for SDK reporting)
Cost accumulation (per-model token tracking)

Key method: submitMessage(prompt, options) — an AsyncGenerator that yields SDK messages throughout the turn. Before entering the query loop, it:

Creates a file history snapshot (for undo/restore)
Records the transcript to disk before the API call (even if the process is killed mid-request, the conversation is resumable)
Wraps canUseTool to track permission denials

4.2 The Query Loop: A Resilient State Machine

The query() function in query.ts is where the magic happens. It’s a while(true) loop managing a mutable state object:

queryLoop():
  while(true):
    Prefetch memory + skills (parallel)
    Apply message compaction (snip, microcompact, context collapse)
    Call API with streaming
    Handle streaming errors (fallback, retry)
    Execute tools (concurrent or serial)
    Check recovery paths (compact, collapse drain, token escalation)
    Continue loop or return

The state object tracks everything needed across iterations:

type State = {
  messages: Message[]
  toolUseContext: ToolAvailabilityContext
  maxOutputTokensRecoveryCount: number  // 0–3 limit
  autoCompactTracking: CompactState     // Compaction state + failure count
  pendingToolUseSummary: Promise<...>   // Async tool summaries
  transition: TransitionReason          // Why the loop didn't terminate
}

4.3 Streaming and Tool Execution

The query loop streams API responses and processes them incrementally:

Stream start — Yields stream_request_start event
Accumulation — Collects assistantMessages, toolUseBlocks, toolResults
Usage tracking — Tracks currentMessageUsage and lastStopReason
Tool dispatch — Routes tool calls to the orchestrator

Tool execution uses a sophisticated concurrency model:

partitionToolCalls(blocks[]):
  ├─ Batch 1: Read-only tools A, B, C  → runConcurrently(max=10)
  ├─ Batch 2: Write tool D              → runSerially()
  ├─ Batch 3: Read-only tools E, F      → runConcurrently(max=10)
  └─ ...

Each tool’s isConcurrencySafe() method determines if it can run in parallel. Read-only tools (glob, grep, file reads) run concurrently; write tools (edits, bash with side effects) run serially with context propagation between batches.
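
The batching rule can be sketched in a few lines. This is an illustration under assumptions, not the real partitionToolCalls: `SketchToolCall` and `ToolBatch` are hypothetical types, the `concurrencySafe` flag stands in for the result of `isConcurrencySafe()`, and the max=10 concurrency cap is not modeled.

```typescript
type SketchToolCall = { name: string; concurrencySafe: boolean };
type ToolBatch = { mode: "concurrent" | "serial"; calls: SketchToolCall[] };

// Group consecutive concurrency-safe calls into one batch;
// every unsafe (write) call becomes its own serial batch.
function partitionToolCallsSketch(calls: SketchToolCall[]): ToolBatch[] {
  const batches: ToolBatch[] = [];
  for (const call of calls) {
    const last = batches[batches.length - 1];
    if (call.concurrencySafe && last !== undefined && last.mode === "concurrent") {
      last.calls.push(call); // extend the current read-only batch
    } else {
      batches.push({
        mode: call.concurrencySafe ? "concurrent" : "serial",
        calls: [call],
      });
    }
  }
  return batches;
}
```

Because batches preserve the original order, a write in the middle acts as a barrier: reads after it only start once it has finished.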

A streaming tool executor can even begin executing tools while the model is still streaming, reducing latency by overlapping computation and I/O.

4.4 Token Budget Continuation

When the model’s output budget is approaching exhaustion but the task isn’t complete, the engine:

Injects an invisible meta-message: “Resume directly — no apology, no recap”
Continues the loop with a token_budget_continuation transition
Tracks cumulative tokens without interrupting the user
Detects diminishing returns to avoid infinite loops

Maximum 3 consecutive output-token recovery attempts before surfacing the stop reason.

5. The Tool System: 60+ Tools Behind a Single Interface

Every tool in Claude Code conforms to a single generic interface:

interface Tool<Input, Output, Progress> {
  name: string
  description(): string          // Dynamic, permission-context-aware
  prompt(): string               // System prompt additions
  inputSchema: ZodSchema<Input>  // Zod → JSON Schema for API

  call(input: Input, context: ToolContext): Promise<ToolResult<Output>>
  checkPermissions(input: Input): PermissionResult
  validateInput(input: Input): ValidationResult
  isConcurrencySafe(input: Input): boolean

  // 4-tier rendering
  renderToolUseMessage(input: Input): ReactNode
  renderToolUseProgressMessage(input: Input, progress: Progress): ReactNode
  renderToolResultMessage(output: Output): ReactNode
  renderToolUseErrorMessage(error: Error): ReactNode

  mapToolResultToToolResultBlockParam(output: Output, id: string): ToolResultBlockParam
}

5.1 The Tool Registry

Tools are loaded through a feature-gated registry:

assembleToolPool(permissionContext, mcpTools):
  getTools(permissionContext)        // Filter built-ins by deny rules
  filterToolsByDenyRules()           // Remove blanket-denied MCP tools
  uniqBy(name)                       // Deduplicate (built-ins win)
  sort(name)                         // Alphabetical for prompt cache stability

Sorting by name is a subtle but important optimization: it keeps the tool list in the same order across requests, maximizing prompt cache hit rates on the API side.
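
As a rough sketch of that pipeline (with hypothetical names `PoolTool` and `assembleToolPoolSketch`, and a plain list of denied names standing in for the real deny rules):

```typescript
type PoolTool = { name: string; origin: "builtin" | "mcp" };

function assembleToolPoolSketch(
  builtins: PoolTool[],
  mcpTools: PoolTool[],
  deniedNames: string[],
): PoolTool[] {
  const seen = new Set<string>();
  const pool: PoolTool[] = [];
  // Built-ins come first, so on a name clash the built-in wins.
  for (const tool of [...builtins, ...mcpTools]) {
    if (deniedNames.includes(tool.name)) continue; // dropped by deny rule
    if (seen.has(tool.name)) continue;             // duplicate name
    seen.add(tool.name);
    pool.push(tool);
  }
  // Plain code-unit comparison: the order does not depend on locale,
  // so the same tools always serialize in the same order.
  return pool.sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}
```

The sort uses a locale-independent comparison on purpose: the goal is a byte-identical tool list across requests, not a human-friendly ordering.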

5.2 Deferred Tool Discovery

Not all 60+ tools are sent to the model in every request. Tools marked shouldDefer: true are hidden until the model explicitly searches for them via ToolSearchTool:

Model: "I need to create a task..."
  → Calls ToolSearchTool("task create")
  → Returns TaskCreateTool schema
  → Model calls TaskCreateTool in the same turn

~18 tools are deferred: LSP, TaskCreate, MCPTool, SkillTool, EnterPlanMode, etc. This keeps the base prompt under 200K tokens while allowing elastic discovery.

5.3 Key Tool Implementations

BashTool — Command Execution with Guardrails

The most frequently used tool runs shell commands with extensive safety:

30K character result limit — Large outputs persist to disk with a preview
Sandbox awareness — Detects containerized vs. native execution
Background tasks — Auto-backgrounds commands exceeding 15 seconds
Search classification — Marks ls, grep, cat output as collapsible in the UI
Permission dialogs — sed edits show a preview before execution

FileEditTool — Precision String Replacement

Rather than rewriting entire files, the edit tool does surgical string replacement:

Old/new string matching — Finds exact occurrences, replaces one or all
1 GiB size limit — Prevents OOM on massive files
Git-aware diffing — Shows before/after diff via gitDiff()
Undo integration — Plugs into FileHistory for one-click undo

AgentTool — Subagent Spawning

Claude Code can spawn child agents for parallel work:

Isolation modes — Worktree (isolated git branch) or remote (CCR)
Model selection — Override with opus | sonnet | haiku
Background execution — Agents run async with notification on completion
Named addressing — SendMessage to named agents for multi-agent coordination
Permission inheritance — Child agents inherit or restrict parent permissions

GrepTool — Content Search (Ripgrep Wrapper)

Wraps rg with sensible defaults for LLM use:

250-line default limit — Prevents context flooding
Multiline mode — rg -U --multiline-dotall for cross-line patterns
VCS exclusion — Auto-skips .git, .svn, .hg
Three output modes — Content, file paths only, or match counts

LSPTool — Language Intelligence

9 operations powered by Language Server Protocol:

goToDefinition, findReferences, hover
documentSymbol, workspaceSymbol
goToImplementation, prepareCallHierarchy
incomingCalls, outgoingCalls

Only loaded when an LSP server is connected. Deferred by default.

WebSearchTool — Native Web Search

Server-side web search (beta feature):

Max 8 searches per invocation
Domain filtering — allowed_domains and blocked_domains parameters
Streaming results — Interleaves text and citation blocks

5.4 Tool Result Budgeting

Every tool has a maxResultSizeChars limit:

Tool	Limit
BashTool	30,000 chars
GrepTool	20,000 chars
FileReadTool	Infinity (never persists)

When output exceeds the limit, it’s saved to ~/.claude/tool-results/{uuid}/output.txt and the model receives a preview with a file reference. FileReadTool is exempt because persisting its output would create a circular dependency (Read → persist → model reads persisted file → …).
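
The budgeting decision itself is simple enough to sketch. The function below is a hypothetical illustration (`budgetToolResult`, `BudgetedResult`, and the `previewChars` default are all assumptions), and it only computes the path rather than writing the file.

```typescript
type BudgetedResult =
  | { kind: "inline"; content: string }
  | { kind: "persisted"; preview: string; path: string };

function budgetToolResult(
  output: string,
  maxResultSizeChars: number,
  resultId: string,
  previewChars = 2000, // hypothetical preview size
): BudgetedResult {
  // With a limit of Infinity this branch is always taken,
  // so such a tool never persists its output.
  if (output.length <= maxResultSizeChars) {
    return { kind: "inline", content: output };
  }
  return {
    kind: "persisted",
    preview: output.slice(0, previewChars),
    path: `~/.claude/tool-results/${resultId}/output.txt`,
  };
}
```

Setting the limit to `Infinity` is a neat way to express the FileReadTool exemption without a special case in the budgeting code.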

5.5 Lazy Schemas

Tool input schemas use a lazySchema() factory that defers Zod instantiation:

const schema = lazySchema(() => z.object({
  command: z.string(),
  timeout: z.number().optional(),
}))

This prevents circular import cycles (Tool.ts ← tools/ ← Tool.ts) and enables mid-session schema changes when feature flags flip.
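
A deferred factory of this kind can be sketched without Zod at all. The version below is an assumption about the general shape, not the real lazySchema(): it builds the value on first access and caches it forever, so supporting mid-session schema changes would additionally need some invalidation hook that is not shown here.

```typescript
// Generic "build on first use" wrapper; T would be a Zod schema in practice.
function lazySchemaSketch<T>(factory: () => T): () => T {
  let built = false;
  let cached: T | undefined;
  return () => {
    if (!built) {
      cached = factory(); // runs at first access, not at import time
      built = true;
    }
    return cached as T;
  };
}
```

Because the factory runs only when the getter is first called, modules can import each other freely at load time without touching the schema objects.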

6. The Permission System: Safety at Every Layer

Claude Code’s permission system is one of its most sophisticated subsystems — a multi-layered defense that balances safety with developer productivity.

6.1 Permission Modes

Five public modes control the default behavior:

Mode	Behavior
`default`	Ask for destructive operations
`plan`	Read-only + AskUserQuestion (design phase)
`acceptEdits`	Auto-approve file edits, ask for shell
`bypassPermissions`	Full access (dangerous, opt-in)
`dontAsk`	Auto-deny unsafe commands

Plus two internal modes:

auto — ML classifier evaluates each command
bubble — Internal delegation to parent agent

6.2 Rule System

Permission rules form a priority cascade:

type PermissionRule = {
  source: 'userSettings' | 'projectSettings' | 'localSettings' | 'cliArg' | 'session'
  ruleBehavior: 'allow' | 'deny' | 'ask'
  ruleValue: { toolName: string, ruleContent?: string }
}

Rules support glob patterns: Bash(git push*) allows any git push command, Bash(python:*) allows all Python commands.
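
To illustrate how such a rule string might be parsed and matched, here is a small sketch. The matching semantics are assumptions on my part (a trailing `*` is treated as a raw prefix match, and `name:*` as "this command word followed by any arguments"); `parseRule` and `matchesCommand` are hypothetical helpers, not functions from the codebase.

```typescript
type ParsedRule = { toolName: string; ruleContent?: string };

// "Bash(git push*)" -> { toolName: "Bash", ruleContent: "git push*" }
function parseRule(rule: string): ParsedRule {
  const open = rule.indexOf("(");
  if (open === -1 || !rule.endsWith(")")) return { toolName: rule };
  return { toolName: rule.slice(0, open), ruleContent: rule.slice(open + 1, -1) };
}

function matchesCommand(rule: ParsedRule, toolName: string, command: string): boolean {
  if (rule.toolName !== toolName) return false;
  if (rule.ruleContent === undefined) return true; // tool-level rule
  const content = rule.ruleContent;
  if (content.endsWith(":*")) {
    // Assumed meaning: the command word itself, optionally with arguments.
    const word = content.slice(0, -2);
    return command === word || command.startsWith(word + " ");
  }
  if (content.endsWith("*")) {
    return command.startsWith(content.slice(0, -1)); // raw prefix match
  }
  return command === content; // exact match
}
```

Under these assumptions `Bash(git push*)` matches `git push origin main` but not `git pull`, and `Bash(python:*)` matches `python script.py` but not `python3 script.py`.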

6.3 Decision Pipeline

For every tool call:

1. validateInput()        → Tool-specific validation (size limits, blocked patterns)
2. checkPermissions()     → Rule matching + classifier + hooks
3. Decision:
   ├─ allow  → Execute immediately
   ├─ deny   → Return error to model
   └─ ask    → Show permission dialog to user
4. Pre/Post hooks         → Can modify input or block execution

6.4 Dangerous Pattern Detection

The system identifies permission rules that are too broad to auto-allow:

Tool-level allow (no content restriction) — Would allow ALL commands
Interpreter prefixes — python:*, node:*, ruby:* (arbitrary code execution)
Wildcards — *, python* (too permissive)
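
A check along those lines might look like the sketch below. The interpreter list contains only the three examples named above, and `isDangerousAllowRule` is a hypothetical name; the real detector presumably covers more cases.

```typescript
// Only the interpreters mentioned in the text; a real list would be longer.
const INTERPRETER_PREFIXES = ["python", "node", "ruby"];

// Returns true when an allow rule is too broad to be auto-allowed.
function isDangerousAllowRule(ruleContent: string | undefined): boolean {
  // Tool-level allow: no content restriction at all.
  if (ruleContent === undefined || ruleContent.trim() === "") return true;
  const content = ruleContent.trim();
  if (content === "*") return true; // bare wildcard
  for (const interp of INTERPRETER_PREFIXES) {
    // "python:*" and "python*" both allow arbitrary code execution.
    if (content === `${interp}:*` || content === `${interp}*`) return true;
  }
  return false;
}
```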

6.5 Three-Way Permission Result

Every permission check returns a typed union:

type PermissionResult =
  | { behavior: 'allow', updatedInput?: Input }   // Hooks can modify input
  | { behavior: 'ask', message: string }           // Prompt user
  | { behavior: 'deny', message: string }          // Block with explanation

The updatedInput field is powerful: pre-execution hooks can transparently modify tool parameters (e.g., adding safety flags to shell commands).

7. Terminal UI: React, but for Your Terminal

Perhaps the most impressive subsystem in Claude Code is its custom terminal rendering framework — a complete reimplementation of React rendering for terminal environments, rivaling web browsers in sophistication.

7.1 The Rendering Pipeline

React Components
    ↓
Custom React Reconciler (createReconciler API)
    ↓
Virtual DOM Tree (ink-box, ink-text, ink-root, ink-link)
    ↓
Yoga Layout Engine (flexbox calculations)
    ↓
Output Builder (write / blit / clip / clear / shift operations)
    ↓
Screen Buffer (2D cell array with interned styles + hyperlinks)
    ↓
Diff Engine (compare with previous frame)
    ↓
ANSI Escape Sequences → TTY

7.2 Custom React Reconciler

Claude Code implements a custom React host configuration using createReconciler:

Element types:

ink-root — Root container
ink-box — Flexbox layout container (like <div>)
ink-text — Text content
ink-virtual-text — Nested text (layout optimization)
ink-link — OSC 8 hyperlinks
ink-progress — Progress indicators
ink-raw-ansi — Raw ANSI passthrough (bypasses measurement)

The reconciler tracks three categories of changes separately:

Styles — Passed to Yoga for layout recalculation
Text styles — Colorization, bold, italic, etc.
Event handlers — Stored separately to prevent handler identity changes from invalidating the dirty flag

7.3 Yoga Layout Engine

Rather than manual ANSI cursor positioning, Claude Code uses Yoga — Facebook’s cross-platform flexbox implementation — for layout:

<Box flexDirection="row" gap={1} paddingX={2}>
  <Box flexGrow={1}>
    <Text>Left panel</Text>
  </Box>
  <Box width={30}>
    <Text>Right sidebar</Text>
  </Box>
</Box>

This brings responsive, declarative layouts to the terminal. Text nodes register measure functions with Yoga:

node.yogaNode.setMeasureFunc((width, measureMode) => {
  const wrapped = wrapText(text, width)
  return { width: actualWidth, height: numLines }
})

A generational reset pattern prevents memory leaks from native Yoga bindings:

if (now - lastPoolResetTime > SESSION_POOL_RESET_MS) {
  migrateScreenPools()  // Free and recreate all Yoga nodes
}

7.4 The Dirty Flag Cascade

Nodes track a dirty flag that cascades upward:

function markDirty(node: DOMElement) {
  node.dirty = true
  if (node.parentNode) markDirty(node.parentNode)
}

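Layout then revisits only dirty nodes (a changed node and its ancestors). Any subtree whose root is still clean is skipped, so re-layout cost tracks the size of the change rather than the size of the tree.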
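
Here is a small self-contained sketch of the idea, using a hypothetical `LayoutNode` type rather than the real DOMElement. It adds one refinement that is my own assumption: marking stops early at an already-dirty node, which is safe as long as dirty flags are only ever set through `markDirtySketch` and cleared by a pass from the root.

```typescript
type LayoutNode = {
  name: string;
  dirty: boolean;
  parent: LayoutNode | null;
  children: LayoutNode[];
};

function addChild(parent: LayoutNode | null, name: string): LayoutNode {
  const node: LayoutNode = { name, dirty: false, parent, children: [] };
  if (parent !== null) parent.children.push(node);
  return node;
}

// Cascade upward; an already-dirty node implies dirty ancestors, so stop.
function markDirtySketch(node: LayoutNode): void {
  let current: LayoutNode | null = node;
  while (current !== null && !current.dirty) {
    current.dirty = true;
    current = current.parent;
  }
}

// Visit only dirty nodes; clean subtrees keep their previous layout.
function relayout(node: LayoutNode, visited: string[]): void {
  if (!node.dirty) return;
  visited.push(node.name);
  for (const child of node.children) relayout(child, visited);
  node.dirty = false;
}
```

For a tree `root → (a → a1, b)`, marking `a1` dirty and calling `relayout(root, visited)` visits `root`, `a`, and `a1`, and never touches `b`.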

7.5 Double Buffering and Blitting

The renderer uses classic graphics techniques:

Double buffering:

private frontFrame: Frame   // Currently displayed
private backFrame: Frame    // Being rendered into
// After render: swap pointers
[this.frontFrame, this.backFrame] = [this.backFrame, this.frontFrame]

Blitting (copy unchanged regions):

blit(src: Screen, x, y, width, height)
// If a region hasn't changed, copy from previous frame
// instead of re-rendering — the "GPU blit" technique for terminals

When a selection overlay is applied, it “contaminates” the frame, disabling blit for the next render to prevent visual artifacts.

7.6 Screen Buffer: The 2D Cell Model

The screen is a 2D array of cells:

type Cell = {
  char: string          // Interned via CharPool
  width: CellWidth      // 1 (normal), 2 (wide/CJK/emoji), -1 (tail of wide char)
  styleId: number       // Interned via StylePool
  hyperlink?: number    // Interned via HyperlinkPool
}

Three interning pools minimize memory and enable O(1) comparisons:

CharPool — Deduplicates character strings, returns integer IDs
StylePool — Deduplicates ANSI style combinations, pre-computes transition sequences
HyperlinkPool — Deduplicates OSC 8 URLs (reset every 5 minutes to bound growth)

The style pool’s transition() method is especially clever:

// Pre-computed: "how to go from style A to style B"
transition(fromId: number, toId: number): string {
  const key = fromId * 0x100000 + toId
  return transitionCache.get(key)  // O(1) vs. diffing AnsiCode arrays
}
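
To show how the packed key and the cache fit together, here is a simplified sketch. `TransitionCacheSketch` is a hypothetical class, styles are modeled as plain lists of SGR codes, and the transition it computes is the naive "reset, then apply the target style" rather than a minimal diff.

```typescript
class TransitionCacheSketch {
  private readonly styles: string[][]; // styleId -> SGR codes, e.g. ["31", "1"]
  private readonly cache = new Map<number, string>();
  computeCount = 0; // exposed only to demonstrate cache hits

  constructor(styles: string[][]) {
    this.styles = styles;
  }

  transition(fromId: number, toId: number): string {
    // Pack both IDs into one number; unique as long as toId < 0x100000 (2^20).
    const key = fromId * 0x100000 + toId;
    let sequence = this.cache.get(key);
    if (sequence === undefined) {
      sequence = this.compute(fromId, toId);
      this.cache.set(key, sequence);
    }
    return sequence;
  }

  private compute(fromId: number, toId: number): string {
    this.computeCount++;
    if (fromId === toId) return ""; // nothing to change
    const codes = this.styles[toId] ?? [];
    // Naive but correct: reset everything, then apply the target style.
    return "\x1b[0m" + (codes.length > 0 ? `\x1b[${codes.join(";")}m` : "");
  }
}
```

A numeric key avoids allocating a string like `"3:7"` on every lookup, which matters when the renderer asks for a transition at every style change in a frame.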

7.7 Scroll Optimization

ScrollBox uses hardware scroll regions when available:

CSI top;bottom r    → Set scroll region (DECSTBM)
CSI n S             → Scroll up n lines (SU)

This is dramatically faster than rewriting 50+ rows of content. For smooth animation, scroll deltas accumulate and drain at terminal-specific rates:

// Native terminals: proportional drain (~3/4 per frame)
const step = Math.max(MIN, (abs * 3) >> 2)

// xterm.js: adaptive (instant for ≤5, smaller steps for fast scrolls)
const step = abs <= 5 ? abs : abs < 12 ? 2 : 3
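
To see what those two formulas do over several frames, the sketch below drains a pending scroll delta frame by frame. `MIN_STEP = 4` is a hypothetical value (the real MIN is not shown above), and clamping the native step to the remaining distance is my own assumption to avoid overshooting.

```typescript
const MIN_STEP = 4; // hypothetical; the real constant is not given

// Native terminals: drain about 3/4 of what is left each frame.
function nativeStep(abs: number): number {
  return Math.min(abs, Math.max(MIN_STEP, (abs * 3) >> 2));
}

// xterm.js: finish small deltas at once, otherwise take small fixed steps.
function xtermStep(abs: number): number {
  return abs <= 5 ? abs : abs < 12 ? 2 : 3;
}

// Returns the number of rows scrolled on each successive frame.
function drainFrames(pending: number, stepFor: (abs: number) => number): number[] {
  const steps: number[] = [];
  let remaining = Math.abs(pending);
  while (remaining > 0) {
    const step = stepFor(remaining);
    steps.push(step);
    remaining -= step;
  }
  return steps;
}
```

Under these assumptions a 40-row delta on a native terminal drains as 30, 7, 3, while an 8-row delta on xterm.js drains as 2, 2, 4.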

7.8 Event System

Events follow DOM semantics with capture and bubble phases:

function collectListeners(target, event): DispatchListener[] {
  // Walk from target to root
  // Capture handlers: root-first
  // Bubble handlers: target-first
}
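
The ordering described in those comments can be sketched as follows. `EvNode` and `DispatchEntry` are hypothetical types, and handlers are reduced to boolean flags so that the example only demonstrates dispatch order.

```typescript
type EvNode = {
  name: string;
  parent: EvNode | null;
  hasCapture: boolean;
  hasBubble: boolean;
};
type DispatchEntry = { node: string; phase: "capture" | "bubble" };

function collectListenersSketch(target: EvNode): DispatchEntry[] {
  const captures: DispatchEntry[] = [];
  const bubbles: DispatchEntry[] = [];
  // Walk from the target up to the root.
  for (let node: EvNode | null = target; node !== null; node = node.parent) {
    // unshift reverses the walk, so capture handlers end up root-first.
    if (node.hasCapture) captures.unshift({ node: node.name, phase: "capture" });
    // push keeps the walk order, so bubble handlers stay target-first.
    if (node.hasBubble) bubbles.push({ node: node.name, phase: "bubble" });
  }
  return [...captures, ...bubbles];
}
```

A single upward walk is enough to produce both phases, because capture order is just the reverse of bubble order.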

Event priority mirrors web browsers:

Priority	Events
Discrete (sync)	`keydown`, `keyup`, `click`, `focus`, `blur`, `paste`
Continuous (batched)	`resize`, `scroll`, `mousemove`

7.9 Text Selection

Full text selection with word and line modes:

Character mode — Drag selects character by character
Word mode — Double-click selects word; subsequent drag extends by word boundaries
Line mode — Triple-click selects line; drag extends by lines
Scroll tracking — Text that scrolls off-screen is accumulated for correct copy
Soft-wrap handling — Wrapped lines are joined into logical lines when copying

7.10 Keyboard Input Parsing

Terminal keyboard input is notoriously ambiguous. The parser handles multiple protocols:

Kitty Keyboard Protocol — CSI u with codepoint + modifiers
xterm modifyOtherKeys — CSI 27; modifier; keycode ~
Legacy function keys — F1-F12 with their many escape sequence variants
SGR mouse events — CSI < button; col; row M/m
Terminal identity detection — XTVERSION response parsing for feature detection

8. The Command System: 100+ Slash Commands

8.1 Architecture

Commands use a declarative registration model with three types:

Type	Execution Model	Example
`PromptCommand`	Expands to text sent to Claude	`/commit`, `/review`
`LocalCommand`	Synchronous text output, no UI	`/clear`, `/help`, `/status`
`LocalJSXCommand`	React component rendered to terminal	`/config`, `/mcp`, `/doctor`

The command registry is memoized and lazy-loaded:

const COMMANDS = memoize(() => [
  // Static commands array — module imports deferred until first call
])

const loadAllCommands = memoize((cwd: string) => {
  // Merges: COMMANDS() + skills + plugins + workflows + MCP commands
})

8.2 Command Discovery Pipeline

getCommands(cwd)
  ├─ loadAllCommands(cwd) [memoized by CWD]
  │   ├─ getSkills()          → Disk, bundled, plugin, MCP skills
  │   ├─ getPluginCommands()  → Marketplace + built-in plugins
  │   ├─ getWorkflowCommands()→ Automation workflows [feature-gated]
  │   └─ COMMANDS()           → Static built-in commands
  ├─ getDynamicSkills()       → Session-discovered skills
  ├─ Filter by availability   → Auth provider gating
  ├─ Filter by isEnabled()    → Feature flag gating
  └─ Dedupe + sort

8.3 Remote and Bridge Filtering

Commands are pre-filtered based on execution context:

Remote mode — Only REMOTE_SAFE_COMMANDS (session, exit, clear, help, theme, cost…)
Bridge mode — Only BRIDGE_SAFE_COMMANDS (prompt-type skills, plus text-output locals like clear, cost, summary)
Local JSX commands — Always blocked over bridge (can’t render React over WebSocket)

8.4 Notable Command Implementations

`/commit` — Git Safety Protocol

The commit command enforces strict safety rules:

Never git commit --amend (only create new commits)
Never skip hooks (--no-verify, --no-gpg-sign)
Never use -i flags (interactive mode unsupported)
Warn on secrets (.env, credentials.json)
Restricted tool access: only Bash(git add:*), Bash(git status:*), Bash(git commit:*)

`/init` — Interactive Project Setup

Multi-phase onboarding:

Ask what to set up (CLAUDE.md, skills, hooks)
Survey codebase (manifest files, README, CI, existing config)
Interview user on gaps
Synthesize proposal and create artifacts

`/doctor` — Self-Diagnostics

Checks system health: API connectivity, auth status, model availability, MCP server connections, permission configuration.

9. Skills, Plugins, and MCP: The Extensibility Trifecta

9.1 Skills

Skills are markdown-based prompt templates with frontmatter metadata:

---
name: my-skill
description: What this skill does
whenToUse: When Claude should invoke it
allowedTools: [Bash, Read, Edit]
model: claude-sonnet-4-6
userInvocable: true
---

Skill prompt content here...

Discovery sources (5):

.claude/skills/ — Project-level skills
~/.claude/skills/ — User-level skills
Bundled skills — Compiled into the binary
Plugin skills — From installed plugins
MCP skill builders — Auto-generated from MCP servers with Prompt capability

Forked execution: Skills with context: 'fork' run in isolated subagents with their own token budgets, preventing large skills from consuming session context.

Bundled skills support lazy extraction of reference files to disk with per-process nonce-based path protection (defends against symlink/TOCTOU attacks).

9.2 Plugins

Plugins bundle skills, hooks, and MCP servers:

Plugin
├─ Skills (markdown files)
├─ Hooks (pre/post tool execution)
├─ MCP Servers (tool providers)
└─ Options (user-configurable variables)

Types:

Built-in plugins — Pre-installed, togglable, {name}@builtin
Marketplace plugins — Installed to ~/.claude/plugins, versioned
Project plugins — --plugin-dir for session-only plugins

Plugin variables are substituted into prompts at invocation time via substitutePluginVariables().

9.3 Model Context Protocol (MCP)

MCP is the primary extensibility mechanism for bringing external tools into Claude Code.

Supported transports:

stdio — Local subprocess
sse / http / ws — Network-based (with optional OAuth/XAA)
sdk — Embedded SDK
claudeai-proxy — Claude.ai tunnel

Config scopes (priority order):

local — .mcp.json in project root
project — .claude/.mcp.json
user — ~/.claude/.mcp.json
userSettings — settings.json mcpServers
policySettings — Managed organizational policy
enterprise — Enterprise-managed
claudeai — Claude.ai-managed
dynamic — Runtime-injected

Connection lifecycle:

MCPServerConnection =
  | ConnectedMCPServer     → Ready to use
  | FailedMCPServer        → Connection error
  | NeedsAuthMCPServer     → Awaiting OAuth
  | PendingMCPServer       → Reconnecting (max attempts)
  | DisabledMCPServer      → Explicitly disabled

MCP tools are normalized and prefixed: mcp__server__toolname. They receive the same permission checks, deny rules, and analytics as built-in tools.

10. Context Management: Fighting the Token Limit

With conversations that can last hours and generate hundreds of tool calls, managing the context window is critical. Claude Code uses a multi-strategy approach.

10.1 Auto-Compaction

When token count exceeds context_window - 13,000:

Strip images/documents from older messages (replace with [image] markers)
Group messages by API round (assistant + tool results)
Call the compaction model to generate a summary
Replace old messages with a CompactBoundaryMessage
Re-inject up to 5 files + skills post-compaction (50K token budget for files, 25K for skills)

A circuit breaker prevents thrashing: max 3 consecutive compaction failures before giving up.
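
The trigger and the circuit breaker together amount to very little code. The sketch below uses hypothetical names (`CompactTracker`, `shouldAutoCompact`, `recordCompactResult`) and only the two constants quoted above.

```typescript
const COMPACT_BUFFER_TOKENS = 13_000;
const MAX_CONSECUTIVE_FAILURES = 3;

type CompactTracker = { consecutiveFailures: number };

function shouldAutoCompact(
  tokenCount: number,
  contextWindow: number,
  tracker: CompactTracker,
): boolean {
  // Breaker open: stop retrying a compaction that keeps failing.
  if (tracker.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) return false;
  return tokenCount > contextWindow - COMPACT_BUFFER_TOKENS;
}

function recordCompactResult(tracker: CompactTracker, succeeded: boolean): void {
  // Any success resets the streak; only consecutive failures trip the breaker.
  tracker.consecutiveFailures = succeeded ? 0 : tracker.consecutiveFailures + 1;
}
```

For a 200,000-token window, compaction would kick in once the conversation passes 187,000 tokens.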

10.2 Microcompaction

Lighter-weight compression for tool results:

Time-based — Clear tool results older than a TTL
Size-based — Truncate when accumulated tool result tokens exceed threshold
Tool-specific — Only compacts: FileRead, Bash, Grep, Glob, WebSearch, WebFetch, FileEdit, FileWrite
Cache-aware — A “cached” variant preserves prompt cache integrity via CacheEditsBlock

10.3 Snip Compaction

A history truncation strategy (feature-gated):

Remove old messages beyond a snip boundary
Preserve the assistant’s “protected tail” for context continuity
Track tokens freed for accurate token budget calculations
Full history preserved in REPL for UI scrollback (non-destructive)

10.4 Context Collapse

Staged collapses are committed lazily — only when the API returns a 413 (prompt too long):

API 413 → Collapse drain (commit staged collapses)
        → If insufficient → Reactive compact (full summarization)
        → If still insufficient → Surface error to user

10.5 System Context

Two tiers of context are injected into every request:

System context (memoized per session):

Git status (branch, recent commits, file status — truncated at 2000 chars)
Cache breaker (optional debug injection)

User context (memoized per session):

CLAUDE.md file contents (auto-discovered from project + parent directories)
Current date (ISO format)

11. State Management: Immutable Store for a Mutable World

11.1 The Store

Claude Code uses a minimal, Zustand-like store:

type Store<T> = {
  getState: () => T
  setState: (updater: (prev: T) => T) => void
  subscribe: (listener: Listener) => () => void
}

No middleware
Synchronous updates
Identity comparison (Object.is) gates listener invocation
React integration via useSyncExternalStore
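
A store with exactly these properties fits in about twenty lines. The sketch below is my own minimal version (`createMiniStore` and `MiniStore` are hypothetical names), not the code from the state/ directory.

```typescript
type StoreListener = () => void;
type MiniStore<T> = {
  getState: () => T;
  setState: (updater: (prev: T) => T) => void;
  subscribe: (listener: StoreListener) => () => void;
};

function createMiniStore<T>(initial: T): MiniStore<T> {
  let state = initial;
  const listeners = new Set<StoreListener>();
  return {
    getState: () => state,
    setState: (updater) => {
      const next = updater(state);
      // Identity gate: returning the same object means "no change".
      if (Object.is(next, state)) return;
      state = next;
      listeners.forEach((listener) => listener()); // synchronous notify
    },
    subscribe: (listener) => {
      listeners.add(listener);
      return () => {
        listeners.delete(listener); // unsubscribe handle
      };
    },
  };
}
```

The `subscribe` and `getState` pair is the shape that React's `useSyncExternalStore` expects, which is what makes the React integration a one-liner.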

11.2 AppState: The Unified State Object

The AppState object contains everything:

Core settings:

settings — User preferences (theme, model, etc.)
mainLoopModel — Current AI model for the session
toolPermissionContext — Safety mode and rules
expandedView — 'none' | 'tasks' | 'teammates'

Bridge state (Claude.ai integration):

replBridgeEnabled / replBridgeConnected / replBridgeSessionActive
replBridgeConnectUrl / replBridgeError

Multi-agent state:

tasks: { [taskId: string]: TaskState }
agentNameRegistry: Map
foregroundedTaskId / viewingAgentTaskId

MCP state:

mcp.clients: MCPServerConnection[]
mcp.tools, mcp.commands, mcp.resources

Speculation state (parallel model execution):

type SpeculationState =
  | { status: 'idle' }
  | { status: 'active', messagesRef, writtenPathsRef, boundary, isPipelined }

Speculation is a latency optimization: while the user is still typing, the model begins generating a response speculatively. File writes go to an overlay filesystem (writtenPathsRef), and on completion, the overlay is either committed (if the user’s actual input matches the speculation boundary) or discarded. isPipelined indicates whether a suggestion was already generated and is queued for display.

11.3 Centralized Side Effects

All state mutations that affect external systems flow through onChangeAppState():

Permission mode changes → Notify CCR bridge
Model changes → Persist to user settings
Settings mutations → Clear auth caches
View changes → Persist UI state

One choke point, no scattered side effects.

12. Session Persistence and History

12.1 Transcript Recording

The engine records transcripts with ordering guarantees:

Assistant messages — Fire-and-forget (lazy JSON stringify with 100ms drain)
User/boundary messages — Blocking await (ordering guarantee)
Pre-compact flush — Writes preserved segment before compaction boundary

Even if the process is killed mid-request, the conversation is resumable from the last recorded transcript.

12.2 History System

Two-level history with deduplication:

In-memory: pendingEntries[] — Queue before flush to disk

On-disk: ~/.claude/history.jsonl — Append-only log

type LogEntry = {
  display: string                    // Formatted prompt for Ctrl+R picker
  project: string                    // Current project root
  sessionId: SessionId
  timestamp: number
  pastedContents?: Record<number, StoredPastedContent>
}

Key algorithms:

Dedup by display text (newest first) for Ctrl+R
Current-session-first ordering (up-arrow doesn’t interleave sessions)
Small pastes (<1KB) inlined; large pastes stored with hash references
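
The first two of these can be sketched as below. `HistoryEntry` is a trimmed-down, hypothetical version of LogEntry, and the sketch assumes the log is in append order so that the last entry is the newest.

```typescript
type HistoryEntry = { display: string; sessionId: string; timestamp: number };

// Walk the append-only log backwards and keep the newest copy of each prompt.
function dedupNewestFirst(entries: HistoryEntry[]): HistoryEntry[] {
  const seen = new Set<string>();
  const result: HistoryEntry[] = [];
  for (let i = entries.length - 1; i >= 0; i--) {
    const entry = entries[i];
    if (entry === undefined || seen.has(entry.display)) continue;
    seen.add(entry.display);
    result.push(entry);
  }
  return result;
}

// Stable partition: this session's entries first, relative order preserved.
function currentSessionFirst(entries: HistoryEntry[], sessionId: string): HistoryEntry[] {
  const mine = entries.filter((e) => e.sessionId === sessionId);
  const others = entries.filter((e) => e.sessionId !== sessionId);
  return [...mine, ...others];
}
```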

12.3 Cost State Persistence

Session costs survive process restarts:

getStoredSessionCosts()       // Retrieve if session ID matches
saveCurrentSessionCosts()     // Persist before session switch
restoreCostStateForSession()  // Restore on resume (validates session ID)

13. Multi-Agent Architecture: Subagents, Swarms, and Worktrees

13.1 Agent Spawning

The AgentTool spawns child agents with configurable isolation:

Default — Shared filesystem, separate conversation context
Worktree — Isolated git branch copy, changes merged on exit
Remote (CCR) — Runs on a separate machine

Agents are addressable by name:

Model: "Ask the test-runner agent to run the suite"
  → SendMessage(to: "test-runner", message: "Run the test suite")

13.2 Task System

Background tasks use file-based IPC with concurrent-session locking:

type TaskType = 'local_bash' | 'local_agent' | 'remote_agent'
              | 'in_process_teammate' | 'local_workflow' | 'monitor_mcp' | 'dream'

type TaskStatus = 'pending' | 'running' | 'completed' | 'failed' | 'killed'

Task IDs use base-36 encoding with type prefixes (b=bash, a=agent, r=remote, etc.).

Lock retries use 30 attempts with 5-100ms backoff (~2.6s max wait) for swarm coordination across tmux/iTerm2 panes.
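
The quoted figures are consistent with a capped exponential backoff, although the exact schedule is not stated, so treat the following as a plausibility check rather than the real implementation. `backoffSchedule` is a hypothetical helper that doubles the delay from 5 ms up to a 100 ms cap.

```typescript
// Delay before each attempt: base, 2*base, 4*base, ... capped at capMs.
function backoffSchedule(attempts: number, baseMs: number, capMs: number): number[] {
  const delays: number[] = [];
  let delay = baseMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(delay);
    delay = Math.min(capMs, delay * 2);
  }
  return delays;
}

const lockDelays = backoffSchedule(30, 5, 100);
const totalWaitMs = lockDelays.reduce((sum, ms) => sum + ms, 0);
```

Under this assumed schedule the first five waits are 5, 10, 20, 40, and 80 ms (155 ms in total) and the remaining 25 waits are 100 ms each, for a worst case of 2,655 ms, in the same ballpark as the roughly 2.6 s quoted above.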

13.3 Worktree Isolation

EnterWorktreeTool / ExitWorktreeTool provide git-level isolation:

Create a temporary git worktree on a new branch
Agent works in the worktree (safe to make destructive changes)
On exit: keep changes (merge) or discard (clean up)

14. Error Recovery: A System That Refuses to Crash

14.1 API Error Recovery

The retry system handles transient and permanent errors differently:

Permanent (fail fast):

Error	Strategy
401	OAuth refresh → retry once → clear credentials
400	Invalid request, no retry
403	Permission denied, no retry

Persistent retry mode (for unattended operation):

Env var: CLAUDE_CODE_UNATTENDED_RETRY
Indefinite 429/529 retries with max 5-minute backoff
30-second heartbeat keep-alive messages

14.2 Prompt-Too-Long Recovery

When the API returns 413:

413 Prompt Too Long
  ├─ 1. Collapse drain (commit staged context collapses)
  ├─ 2. Reactive compact (generate full conversation summary)
  └─ 3. Surface error if all paths exhausted

The error is withheld from the SDK until recovery paths are exhausted — the user never sees a 413 if compaction can resolve it.
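The "withhold until exhausted" behavior can be pictured as a ladder of recovery steps tried in order. This is only a sketch: `RecoveryStep`, `RecoveryOutcome`, and `runRecoveryLadder` are hypothetical names, and the real steps are asynchronous operations rather than simple callbacks.

```typescript
// Hypothetical shape: each step reports whether it resolved the error.
type RecoveryStep = { name: string; attempt: () => boolean }

type RecoveryOutcome =
  | { recovered: true; by: string }
  | { recovered: false; tried: string[] }

function runRecoveryLadder(steps: RecoveryStep[]): RecoveryOutcome {
  const tried: string[] = []
  for (const step of steps) {
    tried.push(step.name)
    // Stop at the first step that succeeds; the caller never sees the error.
    if (step.attempt()) {
      return { recovered: true, by: step.name }
    }
  }
  // Every path failed: only now would the original 413 be surfaced.
  return { recovered: false, tried }
}
```

A caller would pass the collapse drain first and the reactive compact second, matching the order in the diagram.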

14.3 Max Output Tokens Recovery

max_output_tokens stop reason
  ├─ 1. Escalate to 64K tokens (once per turn)
  ├─ 2. Inject meta recovery message ("Resume directly")
  ├─ 3. Max 3 attempts before surfacing
  └─ 4. Withhold intermediate errors

14.4 Model Fallback

On persistent 529 errors:

Switch to fallback model (e.g., Sonnet when Opus is overloaded)
Strip thinking blocks (model-bound signatures)
Log fallback event with chain ID
Yield system message about the switch

14.5 Streaming Fallback

If streaming fails mid-response:

Retry with non-streaming request
Tombstone orphaned messages
Clear assistant messages to restart the turn
Fresh tool executor to prevent orphan results

15. Cost Tracking and Telemetry

15.1 Usage Accumulation

Per-model tracking:

Input tokens, output tokens
Cache read/write tokens
Web search requests
USD cost (calculated via calculateUSDCost())

Advisor model costs are recursively accumulated from getAdvisorUsage().

15.2 Display

formatTotalCost() produces a multi-line report:

Total cost
Per-model breakdown
API/wall-clock duration
Lines of code changed
Unknown model cost disclaimer

15.3 Telemetry

Analytics use a decoupled sink pattern:

attachAnalyticsSink() called during startup
Events queued until sink is available (prevents import cycles)
Datadog fanout + first-party event logging
PII-tagged fields for compliance
OpenTelemetry spans for LLM request tracing

Gateway detection identifies proxy infrastructure from response headers: LiteLLM, Helicone, Portkey, Cloudflare AI Gateway, Kong, Braintrust, Databricks.

16. Execution Modes: One Codebase, Many Faces

Claude Code runs in multiple modes from a single codebase:

Interactive CLI (Default)

Full React terminal UI with REPL loop, text selection, mouse support, and rich rendering.

Non-Interactive / Headless

--print mode outputs the response to stdout. --output saves to a file. No user interaction — suitable for scripts, CI/CD, and piping.

MCP Server Mode

claude mcp serve runs Claude Code as an MCP server, exposing its tools to other MCP clients.

Bridge Mode (Claude.ai Integration)

WebSocket connection to claude.ai for remote control:

CLI sends status updates to the web UI
Web UI sends control commands back
Bidirectional message adaptation (SDK format ↔ local format)
Viewer-only mode for read-only clients

Remote / Teleport

claude remote-control exposes the CLI as a WebSocket server. Users can connect via claude.ai’s web interface or QR code.

Local Agent Mode

Subprocesses spawned for multi-agent swarms. Each agent gets its own session, AppState, and task directory. Communication via file I/O.

Coordinator Mode

Orchestrates multiple agents working in parallel on different aspects of a task. (See section 20, Coordinator Mode.)

17. BUDDY: A Tamagotchi-Style AI Pet

One of the most surprising finds in the codebase: a fully implemented Tamagotchi-style virtual companion that lives beside the user’s input box. What started as an April Fools feature (teaser window: April 1-7, 2026) became a real, permanent feature.

17.1 How Your Buddy Is Born

Every companion is deterministically generated from the user’s account ID using a Mulberry32 seeded PRNG:

// Mulberry32 — tiny seeded PRNG, good enough for picking ducks
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return function () {
    a |= 0
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), 1 | a)
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

The seed is hash(userId + 'friend-2026-401'). This means your companion is unique to you but identical across devices and sessions — you always get the same one.
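To make the determinism concrete, here is a small sketch that seeds the generator from a user ID. The text does not say which `hash()` is used, so FNV-1a stands in for it here as an assumption; `fnv1aSketch`, `mulberry32Sketch`, and `firstRolls` are hypothetical names.

```typescript
// FNV-1a (32-bit): a stand-in for the unspecified hash() in the text.
function fnv1aSketch(input: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  return h >>> 0
}

// Same Mulberry32 generator as above, repeated so the sketch is self-contained.
function mulberry32Sketch(seed: number): () => number {
  let a = seed >>> 0
  return function () {
    a |= 0
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), 1 | a)
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// The first `count` rolls for a user: identical on every device and session.
function firstRolls(userId: string, count: number): number[] {
  const next = mulberry32Sketch(fnv1aSketch(userId + 'friend-2026-401'))
  const rolls: number[] = []
  for (let i = 0; i < count; i++) {
    rolls.push(next())
  }
  return rolls
}
```

Calling `firstRolls` twice with the same ID yields the same sequence, which is the property the companion generation relies on.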

17.2 Species, Rarity, and Cosmetics

18 species: duck, goose, blob, cat, dragon, octopus, owl, penguin, turtle, snail, ghost, axolotl, capybara, cactus, robot, rabbit, mushroom, chonk

Species names are encoded as String.fromCharCode(0x64,0x75,0x63,0x6b) rather than string literals to avoid tripping an excluded-strings build check (one species name collides with a model codename).

Rarity tiers (weighted random):

Tier	Weight	Stat Floor	Hat?
Common	60%	5	None
Uncommon	25%	15	Random
Rare	10%	25	Random
Epic	4%	35	Random
Legendary	1%	50	Random
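A weighted pick over these tiers can be sketched as a walk down the cumulative weights. The weights come from the table; the function name `pickRarity` and the exact selection code are assumptions, not the codebase's implementation.

```typescript
type Rarity = 'common' | 'uncommon' | 'rare' | 'epic' | 'legendary'

// Weights from the table above, as integer percentages (they sum to 100).
const RARITY_WEIGHTS: Array<{ tier: Rarity; weight: number }> = [
  { tier: 'common', weight: 60 },
  { tier: 'uncommon', weight: 25 },
  { tier: 'rare', weight: 10 },
  { tier: 'epic', weight: 4 },
  { tier: 'legendary', weight: 1 },
]

// `roll` is a uniform random number in [0, 1), e.g. from the seeded PRNG.
function pickRarity(roll: number): Rarity {
  let remaining = roll * 100
  for (const entry of RARITY_WEIGHTS) {
    if (remaining < entry.weight) return entry.tier
    remaining -= entry.weight
  }
  // Guards against floating-point edge cases at the very top of the range.
  return 'legendary'
}
```

Because the roll comes from a seeded generator, the same account always lands in the same tier.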

Cosmetics:

6 eye styles: ·, +, x, @, °, and a special star eye
8 hats: none, crown, tophat, propeller, halo, wizard, beanie, tinyduck
1% shiny chance — independent of rarity
5 stats: DEBUGGING, PATIENCE, CHAOS, WISDOM, SNARK — one peak stat, one dump stat, rest scattered. Higher rarity = higher stat floors.

Each companion’s rarity star rating is displayed in a themed color: common (inactive), uncommon (green), rare (permission blue), epic (auto-accept purple), legendary (warning gold).

17.3 Soul Generation

On first “hatch,” Claude generates a unique name and personality for the companion. This is stored permanently in the user’s global config as StoredCompanion:

type StoredCompanion = CompanionSoul & { hatchedAt: number }
type CompanionSoul = { name: string; personality: string }

Critically, only the soul persists — bones (species, rarity, stats) are regenerated from the userId hash every time. This prevents users from editing their config to fake a legendary, and allows species renames without breaking stored companions.

17.4 Sprite Animation System

Each species has 3 animation frames as 5-line, 12-character-wide ASCII art:

Frame 0 (idle):     Frame 1 (fidget):   Frame 2 (rare):
    __                   __               __
  <(· )___             <(· )___         <(· )___
   (  ._>               (  ._>           (  .__>
    `--´                 `--´~            `--´

Eye placeholders {E} are replaced with the companion’s assigned eye character at render time. Hat lines overlay the top row (only when the species’ top row is blank).

The idle sequence cycles at 500ms per tick:

const IDLE_SEQUENCE = [0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, 2, 0, 0, 0]
// -1 = "blink on frame 0" (eyes temporarily replaced)

This creates a natural feel: mostly still, occasional fidgets, rare blinks.
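As a sketch of how a tick counter might map onto that sequence (the helper name `frameForTick` is hypothetical, and ticks are assumed to be non-negative integers):

```typescript
const IDLE_SEQUENCE_SKETCH = [0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, 2, 0, 0, 0]

// Maps a tick to the sprite frame to draw and whether the eyes are blinking.
function frameForTick(tick: number): { frame: number; blink: boolean } {
  const step = IDLE_SEQUENCE_SKETCH[tick % IDLE_SEQUENCE_SKETCH.length] ?? 0
  // -1 means "draw frame 0 with the eyes swapped for a blink".
  if (step === -1) {
    return { frame: 0, blink: true }
  }
  return { frame: step, blink: false }
}
```

With 15 entries at 500ms per tick, one full loop takes 7.5 seconds: a single fidget, a single blink, and a single rare frame per loop.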

17.5 Speech Bubbles and Interaction

The companion renders as a CompanionSprite React component positioned beside the prompt input. It features:

Speech bubbles with a SpeechBubble component using rounded borders
Bubbles display for ~10 seconds (20 ticks at 500ms), fading out over the final 3 seconds
/buddy pet triggers a floating heart animation (2.5 seconds) with hearts drifting upward
The companion can react to conversation events via companionReaction in AppState
When terminal is too narrow (<100 cols), the full sprite is hidden and replaced with a compact face-only rendering

A companion intro is injected as a special attachment into the conversation, informing Claude that a small creature named X sits beside the input box and occasionally comments in bubbles.

17.6 Teaser and Release Strategy

export function isBuddyTeaserWindow(): boolean {
  const d = new Date()
  return d.getFullYear() === 2026 && d.getMonth() === 3 && d.getDate() <= 7
}

The teaser uses local dates, not UTC — creating a rolling 24-hour wave across timezones for sustained social media buzz rather than a single UTC-midnight spike. During the teaser window, users who haven’t hatched a companion see a rainbow-colored /buddy notification.
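The local-date behavior is easier to check with a variant that takes the date as a parameter. This is a test-friendly sketch, not the shipped function; `isInBuddyTeaserWindow` is a hypothetical name.

```typescript
// Same check as above, but with the clock injected so it can be exercised.
function isInBuddyTeaserWindow(now: Date): boolean {
  // getMonth() is zero-based, so 3 is April. All getters use local time.
  return now.getFullYear() === 2026 && now.getMonth() === 3 && now.getDate() <= 7
}

// Dates built with the local-time constructor: new Date(year, monthIndex, day, hour, minute)
const lastMinuteOfWindow = isInBuddyTeaserWindow(new Date(2026, 3, 7, 23, 59)) // true
const dayAfterWindow = isInBuddyTeaserWindow(new Date(2026, 3, 8, 0, 0)) // false
```

Since each user's clock decides when April 1 starts and April 7 ends, the window opens and closes timezone by timezone.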

18. KAIROS: Persistent Assistant Mode and Auto-Dreaming

KAIROS (feature-flagged as KAIROS) is a complete alternate UX where Claude becomes a long-lived autonomous agent that persists across sessions — the “Always-On Claude.”

18.1 Auto-Dreaming: Memory Consolidation

The most concrete KAIROS subsystem in the codebase is the auto-dream system (services/autoDream/). This is a background memory consolidation agent that runs as a forked subagent.

Gate order (cheapest checks first):

Time gate: Hours since last consolidation >= minHours (default: 24h)
Session gate: Number of transcript sessions since last consolidation >= minSessions (default: 5)
Lock gate: No other process is mid-consolidation (file lock with mtime-based conflict detection)
Scan throttle: Even when the time gate passes, session scanning is throttled to every 10 minutes
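The cheapest-first ordering can be sketched as a chain of early returns. The defaults (24 hours, 5 sessions) come from the list above; the type and function names are hypothetical, and the 10-minute scan throttle is left out for brevity.

```typescript
type DreamGateInput = {
  hoursSinceLastConsolidation: number
  sessionsSinceLastConsolidation: number
  lockHeldByOtherProcess: boolean
}

type DreamGateResult =
  | { run: true }
  | { run: false; blockedBy: 'time' | 'sessions' | 'lock' }

function checkDreamGates(
  input: DreamGateInput,
  minHours: number = 24,
  minSessions: number = 5,
): DreamGateResult {
  // Cheapest check first: a timestamp comparison.
  if (input.hoursSinceLastConsolidation < minHours) {
    return { run: false, blockedBy: 'time' }
  }
  // Next: counting sessions, which requires scanning transcripts.
  if (input.sessionsSinceLastConsolidation < minSessions) {
    return { run: false, blockedBy: 'sessions' }
  }
  // Last: the lock, which touches shared state on disk.
  if (input.lockHeldByOtherProcess) {
    return { run: false, blockedBy: 'lock' }
  }
  return { run: true }
}
```

Ordering the gates this way means most invocations exit after a single comparison.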

The 4-phase dream prompt:

Phase 1 — Orient
  └─ ls the memory directory, read the index, skim existing topic files

Phase 2 — Gather recent signal
  └─ Check daily logs, find drifted memories, grep transcripts narrowly

Phase 3 — Consolidate
  └─ Write/update memory files, merge duplicates, convert relative dates

Phase 4 — Prune and index
  └─ Update the entrypoint index (max ~25KB), remove stale pointers

Tool constraints for dream runs: Bash is restricted to read-only commands (ls, find, grep, cat, stat, wc, head, tail), and write operations through Bash are denied. File edits go through the normal Edit/Write tools, with permission granted via createAutoMemCanUseTool().

Dream task lifecycle:

type DreamTaskState = {
  type: 'dream'
  phase: 'starting' | 'updating'    // Flips when first Edit/Write lands
  sessionsReviewing: number
  filesTouched: string[]             // Paths observed in Edit/Write tool_use
  turns: DreamTurn[]                 // Last 30 assistant turns (rolling window)
  abortController?: AbortController
  priorMtime: number                 // For lock rollback on failure
}

Users can kill a running dream from the background tasks dialog (Shift+Down). On kill, the lock mtime is rewound so the next session can retry.

18.2 KAIROS Integration Points

KAIROS is referenced throughout the codebase:

getKairosActive() in bootstrap state — gates whether KAIROS mode is active
Auto-dream is disabled in KAIROS mode (KAIROS uses its own disk-skill dream variant)
Brief mode (BriefTool) — all output goes through SendUserMessage tool (structured markdown + attachments + status)
Proactive prompts — periodic check-ins where Claude decides what to do next
15-second blocking budget — commands exceeding 15s are auto-backgrounded
Exclusive tools: SendUserFile, PushNotification, SubscribePR (GitHub webhook subscriptions), SleepTool
Append-only daily logs at ~/.claude/projects//memory/logs/YYYY/MM/YYYY-MM-DD.md
Midnight boundary handling — flushes yesterday’s transcript on date change so the dream process can find it

18.3 Session History (Assistant Mode)

assistant/sessionHistory.ts provides paginated session event retrieval for KAIROS:

type HistoryPage = {
  events: SDKMessage[]
  firstId: string | null    // Cursor for next-older page
  hasMore: boolean
}

Uses OAuth-authenticated API calls to CCR (Claude Code Remote) to fetch session transcripts. Pagination goes backwards (newest → oldest) with 100 events per page and 15-second timeout per request.

19. ULTRAPLAN: Remote Planning Sessions

ULTRAPLAN is an interactive planning system that farms out complex exploration to a remote Claude Code instance (CCR) for up to 30 minutes.

19.1 How It Works

User types “ultraplan” (keyword detection, not slash command) or uses /ultraplan
A remote CCR session is created with plan mode pre-configured
The CLI polls the remote session every 3 seconds for up to 30 minutes
Remote Claude explores, plans, and calls ExitPlanMode when ready
User approves or rejects the plan in the browser (claude.ai)
Rejected plans loop back for iteration

19.2 Keyword Detection

The keyword trigger system (utils/ultraplan/keyword.ts) is remarkably sophisticated. It finds “ultraplan” in user input while avoiding false positives:

Skipped contexts:

Inside paired delimiters (backticks, quotes, brackets, angle brackets)
Path-like context (src/ultraplan/foo.ts, ultraplan.tsx)
Identifier-like context (--ultraplan-mode, ultraplan-s)
Followed by ? (questions about the feature shouldn’t invoke it)
Slash command input (/rename ultraplan foo runs /rename, not ultraplan)

When triggered, “ultraplan” is replaced with “plan” in the forwarded prompt to keep it grammatical: "please ultraplan this" → "please plan this".
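A much-simplified version of this detection fits in a single regular expression. This is a sketch under stated assumptions, not the logic in keyword.ts: it only accepts the keyword when it is surrounded by whitespace (or light punctuation), which happens to reject paths, identifiers, directly quoted uses, and a trailing question mark, but it does not track paired delimiters across several words. The names are hypothetical.

```typescript
// Keyword must follow start-of-input or whitespace, and be followed by
// end-of-input, whitespace, or one of , ! ; :
const ULTRAPLAN_PATTERN = /(^|\s)ultraplan(?=$|\s|[,!;:])/i

// Returns the rewritten prompt, or null when the keyword should not trigger.
function rewriteUltraplanPrompt(input: string): string | null {
  // Slash command input never triggers the keyword.
  if (/^\s*\//.test(input)) return null
  if (!ULTRAPLAN_PATTERN.test(input)) return null
  // Keep the leading whitespace ($1) and swap the keyword for "plan".
  return input.replace(ULTRAPLAN_PATTERN, '$1plan')
}
```

For example, `rewriteUltraplanPrompt('please ultraplan this')` returns `'please plan this'`, while `'what is ultraplan?'` and `'/rename ultraplan foo'` both return `null`.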

19.3 Two Execution Paths on Approval

On approval, the user chooses one of two paths:

Path	What Happens
“remote”	Execute the plan in the cloud CCR instance
“teleport to terminal”	Archive the remote session, execute locally

The teleport path uses a sentinel string __ULTRAPLAN_TELEPORT_LOCAL__ embedded in the browser’s rejection feedback. The rejection keeps the remote in plan mode, but the plan text is extracted from the feedback and executed locally.

19.4 Event Stream Scanning

The ExitPlanModeScanner class is a pure stateful classifier for the CCR event stream:

type ScanResult =
  | { kind: 'approved'; plan: string }
  | { kind: 'teleport'; plan: string }
  | { kind: 'rejected'; id: string }
  | { kind: 'pending' }
  | { kind: 'terminated'; subtype: string }
  | { kind: 'unchanged' }

Phase tracking for the UI pill:

running → (turn ends, no ExitPlanMode) → needs_input
needs_input → (user replies in browser) → running
running → (ExitPlanMode emitted) → plan_ready
plan_ready → (rejected) → running
plan_ready → (approved) → poll resolves, pill removed
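The transitions above form a small state machine. A minimal sketch, assuming hypothetical event names and a terminal `resolved` phase to represent "poll resolves, pill removed":

```typescript
type PillPhase = 'running' | 'needs_input' | 'plan_ready' | 'resolved'
type PillEvent =
  | 'turn_ended_without_exit'
  | 'user_replied'
  | 'exit_plan_mode'
  | 'rejected'
  | 'approved'

// Returns the next phase; events that do not apply leave the phase unchanged.
function nextPillPhase(phase: PillPhase, event: PillEvent): PillPhase {
  switch (phase) {
    case 'running':
      if (event === 'turn_ended_without_exit') return 'needs_input'
      if (event === 'exit_plan_mode') return 'plan_ready'
      return phase
    case 'needs_input':
      return event === 'user_replied' ? 'running' : phase
    case 'plan_ready':
      if (event === 'rejected') return 'running'
      if (event === 'approved') return 'resolved'
      return phase
    default:
      // 'resolved' is terminal in this sketch.
      return phase
  }
}
```

Keeping the transition function pure, as the scanner is described to be, makes it straightforward to replay a recorded event stream against it.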

Resilience: The poller tolerates up to 5 consecutive network failures before aborting. A 30-minute poll at a 3-second interval makes ~600 API calls, so at any nonzero failure rate an occasional blip is close to certain.

20. Coordinator Mode: Multi-Agent Orchestrator

Coordinator Mode (CLAUDE_CODE_COORDINATOR_MODE=1) transforms Claude Code from a single-agent assistant into a multi-agent orchestrator where a master coordinator directs multiple parallel workers.

20.1 Architecture

Coordinator (you)
  ├─ AgentTool → Worker A (research)     ─┐
  ├─ AgentTool → Worker B (research)     ─┤ Run in parallel
  ├─ AgentTool → Worker C (implement)    ─┘
  └─ SendMessage → Continue Worker A with synthesized spec

The coordinator’s system prompt enforces a strict discipline:

“Never write ‘based on your findings’”: the coordinator must synthesize worker research into specific specs with file paths, line numbers, and exactly what to change
Workers report back as XML messages with status, summary, result, and usage
The coordinator never polls — workers push completion notifications
Workers get isolated scratch directories (via tengu_scratch feature gate) for durable cross-worker knowledge

20.2 Worker Capabilities

Workers spawned via AgentTool have access to standard tools (or a simplified Bash/Read/Edit set in CLAUDE_CODE_SIMPLE mode), plus MCP tools from configured servers. The coordinator injects a workerToolsContext into the system prompt listing exactly which tools workers can use.

20.3 Task Workflow

The coordinator system prompt defines four phases:

Phase	Who	Purpose
Research	Workers (parallel)	Investigate codebase, find files
Synthesis	Coordinator	Read findings, craft implementation specs
Implementation	Workers	Make changes per spec, commit
Verification	Workers	Prove the code works (not just confirm it exists)

Concurrency rules:

Read-only tasks (research) — run in parallel freely
Write-heavy tasks (implementation) — one at a time per set of files
Verification — can run alongside implementation on different file areas

20.4 Continue vs. Spawn

The system provides explicit guidance on when to continue an existing worker vs. spawn fresh:

Situation	Mechanism	Reason
Research explored the exact files to edit	Continue	Worker already has files in context
Research was broad, implementation is narrow	Spawn fresh	Avoid exploration noise
Correcting a failure	Continue	Worker has error context
Verifying another worker’s code	Spawn fresh	Verifier needs fresh eyes
Wrong approach entirely	Spawn fresh	Wrong context pollutes retry

20.5 Session Mode Matching

When resuming a session, coordinator mode is automatically matched to the stored session mode:

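export function matchSessionMode(sessionMode: 'coordinator' | 'normal' | undefined): string | undefined {
  // No stored mode for this session: leave the current mode untouched
  if (sessionMode === undefined) return undefined
  // Elided in the excerpt; assumed here to be derived from the stored mode
  const sessionIsCoordinator = sessionMode === 'coordinator'
  // If current mode doesn't match the resumed session, flip the env var
  if (sessionIsCoordinator) {
    process.env.CLAUDE_CODE_COORDINATOR_MODE = '1'
  } else {
    delete process.env.CLAUDE_CODE_COORDINATOR_MODE
  }
  // The excerpt omits the return value; undefined is a placeholder
  return undefined
}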

This prevents a normal session from being resumed in coordinator mode (or vice versa), which would cause confusion.

21. The Memory System: Persistent AI Memory

Claude Code has a sophisticated file-based memory system (memdir/) that allows it to remember context across conversations — user preferences, project knowledge, feedback, and reference pointers.

21.1 Memory Architecture

Memories are stored as individual markdown files with YAML frontmatter at ~/.claude/projects//memory/:

---
name: user_role
description: User is a senior backend engineer focused on Rust
type: user
---

User is a senior backend engineer at Acme Corp, primarily works in Rust...

An index file MEMORY.md (max 200 lines / 25KB) serves as a table of contents — it’s loaded into every conversation’s system prompt so Claude knows what memories exist without reading them all.

21.2 Four Memory Types

Type	Purpose	Example
`user`	Role, preferences, knowledge level	“User is a data scientist, new to React”
`feedback`	How to approach work (corrections + confirmations)	“Don’t mock the database in integration tests”
`project`	Ongoing work, goals, deadlines	“Merge freeze begins 2026-03-05 for mobile release”
`reference`	Pointers to external systems	“Pipeline bugs tracked in Linear project INGEST”

The system explicitly does NOT save: code patterns, architecture, git history, debugging solutions, or anything derivable from the current project state.

21.3 Intelligent Memory Recall

Not all memories are loaded every turn. Instead, a Sonnet-powered relevance selector (findRelevantMemories.ts) runs as a side query:

Scan all .md files in the memory directory (max 200, newest-first)
Parse frontmatter headers (name, description, type) from the first 30 lines
Send the user’s query + memory manifest to Sonnet with structured JSON output
Sonnet returns up to 5 most relevant filenames
Those files are injected into the conversation context
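The header-parsing step can be sketched in a few lines. This is an illustration only: it handles flat `key: value` pairs rather than full YAML, and `MemoryHeader` and `parseMemoryHeader` are hypothetical names rather than the ones in findRelevantMemories.ts.

```typescript
type MemoryHeader = { name?: string; description?: string; type?: string }

// Reads name/description/type from the frontmatter within the first maxLines lines.
function parseMemoryHeader(markdown: string, maxLines: number = 30): MemoryHeader | null {
  const lines = markdown.split('\n').slice(0, maxLines)
  const first = lines.shift()
  // Frontmatter must open on the very first line.
  if (first === undefined || first.trim() !== '---') return null
  const header: MemoryHeader = {}
  for (const line of lines) {
    if (line.trim() === '---') return header // closing delimiter reached
    const colon = line.indexOf(':')
    if (colon === -1) continue
    const key = line.slice(0, colon).trim()
    const value = line.slice(colon + 1).trim()
    if (key === 'name' || key === 'description' || key === 'type') {
      header[key] = value
    }
  }
  // No closing delimiter within the scanned window: treat as unparseable.
  return null
}
```

Capping the scan at 30 lines keeps the manifest cheap to build even when a memory file's body is long.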

A clever optimization: recently-used tools are passed to the selector so it skips reference docs for tools Claude is already exercising (e.g., don’t surface MCP spawn docs when Claude is actively using the spawn tool).

21.4 Path Security

The memory path system includes robust security validation:

Rejects relative paths, root paths, UNC paths, null bytes
projectSettings (committed to repo) is intentionally excluded from autoMemoryDirectory — a malicious repo could otherwise set autoMemoryDirectory: "~/.ssh" and gain write access to sensitive directories
All worktrees of the same git repo share one memory directory (via findCanonicalGitRoot)
CLAUDE_COWORK_MEMORY_PATH_OVERRIDE for SDK/Cowork integration
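The first bullet's rejections can be sketched as a pure string check. This is a POSIX-only illustration with a hypothetical name (`isAcceptableMemoryDir`); it does not cover the projectSettings exclusion or any normalization the real validator may perform.

```typescript
// Returns true only for absolute, non-root paths without null bytes or UNC prefixes.
function isAcceptableMemoryDir(path: string): boolean {
  // Null bytes can truncate paths in native APIs.
  if (path.indexOf('\0') !== -1) return false
  // UNC-style prefixes (\\server\share or //server/share).
  if (path.startsWith('\\\\') || path.startsWith('//')) return false
  // Relative paths, including unexpanded "~/...".
  if (!path.startsWith('/')) return false
  // The filesystem root itself (with any number of trailing slashes).
  if (path.replace(/\/+$/, '') === '') return false
  return true
}
```

Note that `~/.ssh` is rejected here simply because it is not absolute; the projectSettings exclusion described above is what stops a malicious repo from supplying an absolute equivalent.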

21.5 Team Memory

When the TEAMMEM feature is enabled, memories split into private (per-user) and team (shared) directories. User preferences stay private; project conventions and reference pointers default to team scope. A conflict rule prevents private feedback memories from contradicting team-level ones.

22. Hooks: User-Defined Automation

The hooks system (schemas/hooks.ts) lets users attach automated behaviors to Claude Code events — shell commands, LLM prompts, HTTP calls, or agent verifiers that fire before/after tool use, message submission, and more.

22.1 Four Hook Types

type HookCommand =
  | { type: 'command'; command: string; shell?: 'bash' | 'powershell' }
  | { type: 'prompt'; prompt: string; model?: string }
  | { type: 'http'; url: string; headers?: Record<string, string> }
  | { type: 'agent'; prompt: string; model?: string }

Command hooks run shell commands with optional timeout, async/background execution, and one-shot mode (once: true — runs once then auto-removes).

Prompt hooks evaluate an LLM prompt with $ARGUMENTS placeholder for the hook input JSON.

HTTP hooks POST the hook input to a URL with configurable headers and env var interpolation (only explicitly-allowed env vars are resolved — prevents leaking secrets).

Agent hooks run a full agentic verification loop (“Verify that unit tests ran and passed”) with configurable model and timeout.

22.2 Event-Matcher-Hook Pipeline

Hooks are configured in settings.json as a three-level structure:

Event → Matcher[] → Hook[]

Each event (PreToolUse, PostToolUse, PreMessage, PostMessage, etc.) has an array of matchers with optional permission-rule-syntax patterns (e.g., "Bash(git *)" — only fires for git commands). Each matcher has an array of hooks to execute.

The if condition field uses the same permission rule syntax as the tool permission system, evaluated against tool_name and tool_input — so hooks can fire selectively without spawning a process for every tool call.

22.3 Advanced Hook Features

async: true — Hook runs in background without blocking the model
asyncRewake: true — Runs in background but wakes the model on exit code 2 (blocking error)
once: true — Auto-removes after first execution (useful for one-time setup)
statusMessage — Custom spinner text while the hook runs
Environment variable interpolation in HTTP headers with explicit allowlist
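The allowlisted interpolation is worth a sketch, because the allowlist is what keeps secrets out of outbound headers. The `${NAME}` placeholder syntax and the choice to replace disallowed variables with an empty string are assumptions made for illustration, as is the function name.

```typescript
// Expands ${NAME} placeholders, but only for names on the allowlist.
function interpolateHeaderValue(
  template: string,
  env: Record<string, string | undefined>,
  allowedVars: string[],
): string {
  return template.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match: string, name: string) => {
    // Not allowlisted: the environment is never consulted for this name.
    if (allowedVars.indexOf(name) === -1) return ''
    const value = env[name]
    return value === undefined ? '' : value
  })
}
```

With `['API_TOKEN']` as the allowlist, `'Bearer ${API_TOKEN}'` expands normally, while a header that references `${AWS_SECRET_ACCESS_KEY}` comes out empty instead of leaking the value.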

23. Voice Mode, Bridge, and Infrastructure

23.1 Voice Mode

Claude Code includes a voice input mode (feature-flagged as VOICE_MODE) that allows voice-to-text interaction:

Requires Anthropic OAuth (not API keys, Bedrock, or Vertex) — uses the voice_stream endpoint on claude.ai
Protected by a GrowthBook kill-switch (tengu_amber_quartz_disabled) for emergency off
Auth check uses memoized keychain reads (~20-50ms first call, cache hit thereafter)
The /voice command, ConfigTool, and a VoiceModeNotice component all gate on isVoiceModeEnabled()

23.2 The Bridge System (31 Files)

The bridge (bridge/) is the most substantial networking subsystem — a persistent WebSocket connection between the local CLI and claude.ai’s web interface (CCR). It enables using Claude Code from a browser while the actual tools execute locally.

Key components:

bridgeMain.ts — Main bridge loop with exponential backoff (2s initial → 2min cap → 10min give-up)
replBridge.ts / replBridgeTransport.ts — REPL-side bridge handle, message framing
bridgeApi.ts — API client with JWT refresh, trusted device tokens, session validation
bridgeMessaging.ts / inboundMessages.ts — Message adaptation (SDK format ↔ local format)
bridgePermissionCallbacks.ts — Permission request mediation between web UI and local CLI
sessionRunner.ts — Spawns agent sessions per work item, manages worktrees
capacityWake.ts — Wakes idle bridge when capacity becomes available
workSecret.ts — Encrypted work routing between bridge workers

The bridge handles session lifecycle, token refresh, trusted device enrollment, and graceful reconnection — essentially a mini-RPC framework over WebSocket.

23.3 Direct Connect

The server/ directory implements Direct Connect — a WebSocket-based protocol for external clients to connect to a running Claude Code instance:

class DirectConnectSessionManager {
  connect(): void                    // Open WebSocket
  sendMessage(content): boolean      // Send user message
  respondToPermissionRequest(...)    // Handle tool permission prompts
  sendInterrupt(): void              // Cancel current request
  disconnect(): void                 // Close connection
}

Messages are JSON-over-WebSocket using the SDK message format. Control requests (permission prompts) are forwarded to the client, which responds with allow/deny decisions. This enables IDE integrations and custom frontends.

23.4 Upstream Proxy (CCR Security)

When running inside a CCR container, the upstream proxy system (upstreamproxy/) provides secure network access:

Read session token from /run/ccr/session_token
Set prctl(PR_SET_DUMPABLE, 0) — blocks same-UID ptrace (prevents prompt-injected gdb -p $PPID from scraping the token off the heap)
Download CA certificate and concatenate with system bundle for MITM proxy trust
Start local CONNECT→WebSocket relay on a random port
Unlink the token file (token stays heap-only; file is gone before the agent loop can access it)
Inject HTTPS_PROXY / SSL_CERT_FILE env vars for all subprocesses

Every step fails open — a broken proxy never breaks an otherwise-working session. The NO_PROXY list covers loopback, RFC1918, IMDS, Anthropic API, GitHub, and package registries.

23.5 Output Styles

The outputStyles/ system lets users customize Claude’s response format via markdown files:

Project styles: .claude/output-styles/*.md
User styles: ~/.claude/output-styles/*.md
Plugin styles: provided by installed plugins

Each style file has frontmatter (name, description, keep-coding-instructions) and a prompt body that shapes how Claude formats its responses.

23.6 Native TypeScript Modules

native-ts/ contains TypeScript bindings for performance-critical native code:

yoga-layout/ — TypeScript interface to the Yoga layout engine (flexbox calculations)
file-index/ — Native file indexing for fast codebase search
color-diff/ — Native color difference calculations (for theme/styling)

23.7 Moreright (Internal-Only)

The moreright/ directory contains a stub for an internal-only feature. The external build ships a no-op implementation with onBeforeQuery, onTurnComplete, and render all returning trivially. The real implementation is internal to Anthropic.

24. Vim Mode, Keybindings, and Developer Ergonomics

24.1 Vim Mode

A full vi command system:

Motions — h, j, k, l, w, b, e, 0, $, gg, G
Operators — d (delete), c (change), y (yank)
Text Objects — iw (inner word), ap (a paragraph)
Modal State Machine — Insert, Normal, Visual modes

All compiled to a single-pass command matcher for low-latency input processing.

24.2 Dynamic Keybindings

Context-aware keybinding resolution:

type KeybindingContext = {
  focus?: 'prompt' | 'file' | 'terminal'
  isRecording?: boolean
  vimMode?: boolean
  mode?: 'insert' | 'normal' | 'visual'
}

Users can define chord bindings: ctrl+k ctrl+o maps to custom actions via ~/.claude/keybindings.json.

24.3 Debug Tools

CLAUDE_CODE_DEBUG_REPAINTS=1 — Shows component owner chain for every repaint
CLAUDE_CODE_COMMIT_LOG=/tmp/commits.log — Logs slow renders for profiling
CLAUDE_CODE_PROFILE_STARTUP=1 — Full startup profiling with memory snapshots

25. Key Engineering Patterns and Takeaways

Pattern 1: Lazy Everything

Claude Code is aggressive about deferral:

Lazy schemas — Zod instantiation deferred via lazySchema()
Lazy commands — Module imports via load() functions
Lazy tools — 18 tools deferred to ToolSearchTool
Lazy modules — Dynamic imports for OpenTelemetry, analytics, heavy components
Lazy bundled skills — Reference files extracted on first use

Pattern 2: Memoization by Identity

Key functions are memoized to prevent redundant work:

COMMANDS() — Memoized, cleared by clearCommandMemoizationCaches()
loadAllCommands(cwd) — Memoized by working directory
init() — Memoized to prevent re-entrancy
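The pattern behind these three cases is memoization keyed by an argument, with an explicit way to clear the cache. A minimal sketch, with `memoizeByKey` as a hypothetical helper rather than the codebase's own:

```typescript
// Wraps `compute` so each key is computed once until clear() is called.
function memoizeByKey<R>(compute: (key: string) => R): {
  get: (key: string) => R
  clear: () => void
} {
  let cache = new Map<string, R>()
  return {
    get(key: string): R {
      if (!cache.has(key)) {
        cache.set(key, compute(key))
      }
      return cache.get(key) as R
    },
    clear(): void {
      cache = new Map<string, R>()
    },
  }
}
```

Keying by working directory, as `loadAllCommands(cwd)` does, corresponds to calling `get(cwd)`; a cache-clearing function such as `clearCommandMemoizationCaches()` corresponds to `clear()`.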

Pattern 3: Feature Flags for Dead Code Elimination

Bun’s feature() function enables compile-time dead code elimination:

if (feature('COORDINATOR_MODE')) {
  // This entire block is removed from the binary when the flag is off
  const { CoordinatorUI } = await import('./coordinator/index.js')
}

Pattern 4: Interning for Performance

Three interning pools (chars, styles, hyperlinks) reduce memory and enable O(1) comparison by integer ID instead of string equality. The style pool even pre-computes ANSI transition sequences.
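An interning pool is small enough to sketch in full. This is an illustration of the technique with hypothetical names (`InternPool`, `intern`, `lookup`), not the renderer's actual pool, and it leaves out the pre-computed ANSI transitions.

```typescript
// Maps each distinct string to a small integer ID, assigned on first sight.
class InternPool {
  private ids = new Map<string, number>()
  private values: string[] = []

  intern(value: string): number {
    const existing = this.ids.get(value)
    if (existing !== undefined) return existing
    const id = this.values.length
    this.ids.set(value, id)
    this.values.push(value)
    return id
  }

  // Recovers the original string for an ID, if one was assigned.
  lookup(id: number): string | undefined {
    return this.values[id]
  }
}
```

Once two cells hold interned IDs, checking whether their styles match is a single integer comparison, and each distinct style string is stored only once.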

Pattern 5: Fail-Closed Security

The buildTool() factory provides safe defaults for 7 commonly-stubbed methods. Permissions default to “ask” — a tool must explicitly opt into auto-approval.

Pattern 6: Centralized Side Effects

onChangeAppState() is the single choke point for all state mutations that affect external systems. No scattered useEffect side effects.

Pattern 7: File-Based IPC

Multi-agent coordination uses files, not sockets:

Task outputs in ~/.claude/
History in ~/.claude/history.jsonl
Session transcripts for resume
Lock files with retry backoff for concurrent access

Pattern 8: Prompt Cache Stability

Tools are sorted alphabetically before being sent to the API. This keeps the tool list in the same order across requests, maximizing prompt cache hit rates.
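The idea is that the serialized tool list should not depend on registration order. A sketch, assuming tools are identified by a `name` field and compared by plain code-unit order (the helper name is hypothetical):

```typescript
// Returns a sorted copy; the input array is left untouched.
function stableToolOrder<T extends { name: string }>(tools: T[]): T[] {
  return tools.slice().sort((a, b) => {
    if (a.name < b.name) return -1
    if (a.name > b.name) return 1
    return 0
  })
}
```

Two sessions that register `Read`, `Bash`, `Edit` in different orders both send `Bash`, `Edit`, `Read`, so the request prefix is identical and can be served from the prompt cache.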

Pattern 9: Progressive Disclosure

The deferred tool system implements progressive disclosure at the API level:

Base prompt stays under 200K tokens
Model discovers additional tools on demand via ToolSearchTool
Discovered tools are callable in the same turn

Pattern 10: Three-Tier Configuration

Settings are resolved from multiple sources with clear precedence:

MDM Policy (highest) → Remote Managed → User Settings
→ Project Config → Global Config → Defaults (lowest)
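One way to picture that precedence chain is a per-key lookup that walks the layers from highest to lowest. Whether the real resolver works per key or deep-merges objects is not stated, so treat this as a sketch under that assumption; the type and function names are hypothetical.

```typescript
type SettingsLayer = { source: string; values: Record<string, unknown> }

// Returns the value from the highest-precedence layer that defines `key`.
function resolveSetting(
  key: string,
  layersHighestFirst: SettingsLayer[],
): { value: unknown; source: string } | undefined {
  for (const layer of layersHighestFirst) {
    if (Object.prototype.hasOwnProperty.call(layer.values, key)) {
      return { value: layer.values[key], source: layer.source }
    }
  }
  return undefined
}
```

Returning the source alongside the value makes it easy to explain to a user why a setting they changed locally is being overridden by policy.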

26. Conclusion

Claude Code is a remarkable piece of engineering. What appears to the user as a simple chat interface in the terminal is backed by:

A custom React reconciler with Yoga layout, double-buffered rendering, and hardware scroll optimization
A resilient query engine with automatic context compression, multi-strategy error recovery, and token budget continuation
A 60+ tool ecosystem unified under a single generic interface with Zod validation, lazy schemas, and elastic discovery
A multi-layered permission system balancing security and developer productivity across 5 modes, rule patterns, and ML classifiers
An extensibility framework spanning skills, plugins, and MCP with 8 configuration scopes and 5 transport types
Production-grade infrastructure: interned style pools, file-based IPC, sampled profiling, parallelized startup, and comprehensive telemetry

The codebase demonstrates that a CLI tool can be as architecturally sophisticated as any web application — perhaps more so, given the unique constraints of terminal rendering, keyboard input ambiguity, and the need to coordinate an AI model, file system, shell, and git repository all within a single conversation loop.

For developers building similar tools, the key lessons are:

Invest in the rendering layer. Claude Code’s custom Ink framework is its competitive advantage for terminal UX.
Design for failure. The multi-strategy error recovery (collapse → compaction → fallback → surface) means users almost never see raw API errors.
Defer aggressively. Lazy loading at every level — schemas, modules, tools, skills — keeps startup fast and memory bounded.
Intern everything. Style pools, character pools, and hyperlink pools turn O(n) string comparisons into O(1) integer comparisons.
Make safety the default. Fail-closed permissions, dangerous pattern detection, and mandatory confirmation for destructive operations build user trust.

Claude Code isn’t just a wrapper around an API. It’s a complete development environment that happens to run in your terminal.

This analysis is based on examination of the Claude Code source code. All technical details reflect the codebase as observed at the time of analysis.

Claude Code’s Hidden Features: Undocumented, Gated, and Internal Capabilities

2026-03-31T00:00:00+00:00

Genuinely hidden, gated, or underdocumented capabilities found in the source code — things the public docs don’t cover.

Based on direct source inspection. “Hidden” = hidden from --help, feature-flagged, or dependent on non-public backends. Source presence does not guarantee your build/account has access.

1. Hidden CLI Flags
2. Feature-Gated Slash Commands
3. The Buddy System — A Full Tamagotchi Pet
4. KAIROS — Persistent Autonomous Assistant Mode
5. Auto-Dream — Background Memory Consolidation
6. Magic Docs — Self-Maintaining Documentation
7. ULTRAPLAN — Remote 30-Minute Planning Sessions
8. Coordinator Mode — Multi-Agent Swarms
9. Speculation — Predictive Response Generation
10. The Advisor Model System
11. Voice Mode
12. Team Memory Sync
13. Remote Triggers — Scheduled Cloud Agents
14. Direct Connect — cc:// Session URLs
15. Bridge Mode & Remote Control
16. SSH Remote Execution
17. MCP Channels — Inbound Push Notifications
18. AFK / Auto-Permission Mode
19. Background Sessions (Detached)
20. “While You Were Away” Session Recaps
21. Tool-Use Summary Generation
22. Auto-Memory Extraction
23. Prompt Suggestions & Follow-Up Generation
24. Deferred Tool Discovery
25. Hidden Keybindings
26. Lesser-Known Environment Variables
27. Lesser-Known Settings Keys
28. CLAUDE.md Loading — Hidden Discovery Paths
29. Internal-Only Commands
30. Build-Time Feature Flags

1. Hidden CLI Flags

These flags are registered with .hideHelp() — they work but won’t appear in claude --help:

Flag	What It Does
`--teleport`	Upload local git state to a remote Claude Code session on claude.ai
`--remote`	Create a new remote session (comment in code: “undocumented until GA”)
`--remote-control` / `--rc`	Enter bridge mode — control from claude.ai web UI (requires `BRIDGE_MODE`)
`--sdk-url`	Connect to a custom SDK URL for direct-connect sessions
`--channels`	Register for MCP inbound push notifications (KAIROS builds)
`--dangerously-load-development-channels`	Bypass MCP channel allowlist
`--enable-auto-mode`	AI classifier-driven auto-permission (requires `TRANSCRIPT_CLASSIFIER`)
`--advisor`	Attach a secondary reviewer model
`--cowork`	Switch plugin commands to internal cowork marketplace
`--agent-id`, `--team-name`, `--teammate-mode`, `--agent-type`	Swarm identity flags for multi-agent coordination

Deprecated aliases (still work): --afk and --dangerously-skip-permissions-with-classifiers both map to --enable-auto-mode.
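
As a rough illustration of how deprecated aliases can be folded into one canonical flag, here is a minimal sketch. The flag names come from the list above. The `DEPRECATED_FLAG_ALIASES` map and the `normalizeFlags` function are hypothetical and do not reflect how the CLI actually wires this up.

```typescript
// Alias table built from the flag names in this post; the Map itself is illustrative.
const DEPRECATED_FLAG_ALIASES = new Map<string, string>([
  ["--afk", "--enable-auto-mode"],
  ["--dangerously-skip-permissions-with-classifiers", "--enable-auto-mode"],
]);

// Rewrites deprecated flags to their canonical name and leaves everything else untouched.
function normalizeFlags(argv: readonly string[]): string[] {
  return argv.map((arg) => {
    // Handle both "--flag" and "--flag=value" spellings.
    const eq = arg.indexOf("=");
    const flagName = eq === -1 ? arg : arg.slice(0, eq);
    const rest = eq === -1 ? "" : arg.slice(eq);
    const canonical = DEPRECATED_FLAG_ALIASES.get(flagName);
    return canonical === undefined ? arg : canonical + rest;
  });
}

console.log(normalizeFlags(["--afk", "--advisor", "opus"]));
```

Normalizing once at the edge means the rest of the program only ever has to check for the canonical flag.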

Correction: --voice is NOT a CLI flag. Voice is activated via /voice slash command or the voiceEnabled setting. --brief and --proactive are not hidden — they appear in help when their feature flags are on.

2. Feature-Gated Slash Commands

These commands exist but are conditionally registered or hidden:

Command	What It Does	Gate
`/buddy`	Hatch and interact with a Tamagotchi-style AI pet	`BUDDY` flag + date gate
`/voice`	Toggle hold-to-talk voice dictation	`VOICE_MODE` + Anthropic OAuth
`/advisor [model|off]`	Attach/detach a secondary reviewer model	GrowthBook `tengu_sage_compass`
`/fast [on|off]`	Toggle fast inference mode	Available when fast mode is supported
`/dream`	Manually trigger memory consolidation	Auto-memory must be enabled
`/brief`	Toggle brief/checkpoint mode	`KAIROS` or `KAIROS_BRIEF`
`/ultraplan`	Launch a remote 30-minute planning session	`ULTRAPLAN` feature flag
`/heapdump`	Dump JavaScript heap to `~/Desktop`	Always registered, hidden
`/thinkback`	2025 Claude Code year-in-review stats	GrowthBook `tengu_thinkback`
`/remote-control` (alias `/rc`)	Enter bridge mode	`BRIDGE_MODE`

3. The Buddy System — A Full Tamagotchi Pet

A fully implemented virtual companion that lives beside your input box.

Activate: /buddy

Deterministic generation from your userId via Mulberry32 PRNG — same user = same companion across all devices
18 species: duck, goose, blob, cat, dragon, octopus, owl, penguin, turtle, snail, ghost, axolotl, capybara, cactus, robot, rabbit, mushroom, chonk
5 rarity tiers: common (60%), uncommon (25%), rare (10%), epic (4%), legendary (1%)
6 eye styles, 8 hats (commons get no hat), 1% shiny chance
5 stats: DEBUGGING, PATIENCE, CHAOS, WISDOM, SNARK — one peak stat, one dump stat, floors scale with rarity
Soul generation: On first hatch, Claude generates a unique name and personality, stored permanently
ASCII sprite animation: 3 frames per species at 500ms tick — idle, fidget, and rare blink frames
Speech bubbles (10 seconds, 3-second fade) and /buddy pet hearts animation (2.5 seconds)
Anti-cheat: Only the soul persists — bones (species/rarity/stats) are regenerated from userId hash every load
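
The deterministic generation described above can be sketched as follows. Mulberry32 is a well-known 32-bit PRNG. Everything around it is an assumption made for illustration: the FNV-1a hash used to turn a userId into a seed, the order in which values are drawn, and the `rollBones` and `pickRarity` names. Only the species list and rarity weights come from the post.

```typescript
type Rarity = "common" | "uncommon" | "rare" | "epic" | "legendary";

// FNV-1a 32-bit hash (an assumed choice) to turn a userId string into a seed.
function hashSeed(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const SPECIES: readonly string[] = [
  "duck", "goose", "blob", "cat", "dragon", "octopus", "owl", "penguin", "turtle",
  "snail", "ghost", "axolotl", "capybara", "cactus", "robot", "rabbit", "mushroom", "chonk",
];

// Weights in percent, as listed in the post; they sum to 100.
const RARITY_TABLE: ReadonlyArray<readonly [Rarity, number]> = [
  ["common", 60], ["uncommon", 25], ["rare", 10], ["epic", 4], ["legendary", 1],
];

// Maps a roll in [0, 1) onto the cumulative weight table.
function pickRarity(roll: number): Rarity {
  let remaining = roll * 100;
  for (const [rarity, weight] of RARITY_TABLE) {
    if (remaining < weight) return rarity;
    remaining -= weight;
  }
  return "legendary"; // guard against floating-point edge cases
}

// Regenerates the "bones" from the userId alone, so nothing needs to be stored.
function rollBones(userId: string) {
  const rand = mulberry32(hashSeed(userId));
  const species = SPECIES[Math.floor(rand() * SPECIES.length)] as string;
  const rarity = pickRarity(rand());
  const shiny = rand() < 0.01; // 1% shiny chance
  return { species, rarity, shiny };
}

console.log(rollBones("user-123"));
```

Because every value is derived from the seed, calling `rollBones` twice with the same userId yields the same companion, which is what makes the "bones are regenerated on every load" anti-cheat approach workable.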

Release window: Teaser April 1-7 2026 (local dates for rolling timezone buzz). Live permanently from April 2026 onward. Always on for internal builds.

4. KAIROS — Persistent Autonomous Assistant Mode

Activate: --assistant flag (feature-gated: KAIROS)

A complete alternate UX where Claude becomes a long-lived autonomous agent persisting across sessions:

Append-only daily logs at ~/.claude/projects//memory/logs/YYYY/MM/YYYY-MM-DD.md
15-second blocking budget — any command exceeding 15s is auto-backgrounded
Proactive prompts — periodic check-ins where Claude decides what to do next or calls Sleep
Brief mode — all output through SendUserMessage tool (structured markdown + attachments + status), not free-form text
Exclusive tools: SendUserFile, PushNotification, SubscribePR (GitHub webhook subscriptions), SleepTool
Midnight boundary handling — flushes transcript on date change so the dream process can find it
Nightly dreaming uses a separate disk-skill variant (distinct from the auto-dream system below)

5. Auto-Dream — Background Memory Consolidation

Runs automatically in the background — no user action needed.

A forked subagent reviews your recent sessions and consolidates learnings into structured memory files.

4-phase process:

Orient — ls memory dir, read index, skim existing topics
Gather — Check daily logs, find drifted memories, grep transcripts narrowly
Consolidate — Write/update memory files, merge duplicates, convert relative dates to absolute
Prune — Update MEMORY.md index (max ~25KB), remove stale pointers

Gates (cheapest checks first):

Time: 24+ hours since last consolidation
Sessions: 5+ sessions since last consolidation
Lock: no other process mid-consolidation
Scan throttle: session scanning limited to every 10 minutes
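
The gate sequence above can be read as a short-circuiting check. The sketch below is a hypothetical reconstruction, not the real implementation: the `ConsolidationState` shape and the `firstClosedGate` function are invented for illustration, and the check order simply follows the list as written.

```typescript
type Gate = "time" | "sessions" | "lock" | "scan-throttle";

// Hypothetical snapshot of what the scheduler knows; field names are assumptions.
interface ConsolidationState {
  lastConsolidatedAtMs: number;
  sessionsSinceLast: number;
  lockHeld: boolean;
  lastScanAtMs: number;
}

const MIN_INTERVAL_MS = 24 * 60 * 60 * 1000; // 24+ hours since last consolidation
const MIN_SESSIONS = 5; // 5+ sessions since last consolidation
const SCAN_THROTTLE_MS = 10 * 60 * 1000; // scanning limited to every 10 minutes

// Returns the first gate that blocks a run, or null when consolidation may start.
function firstClosedGate(state: ConsolidationState, nowMs: number): Gate | null {
  if (nowMs - state.lastConsolidatedAtMs < MIN_INTERVAL_MS) return "time";
  if (state.sessionsSinceLast < MIN_SESSIONS) return "sessions";
  if (state.lockHeld) return "lock";
  if (nowMs - state.lastScanAtMs < SCAN_THROTTLE_MS) return "scan-throttle";
  return null;
}

const exampleNowMs = 1_000 * MIN_INTERVAL_MS;
const exampleState: ConsolidationState = {
  lastConsolidatedAtMs: exampleNowMs - 25 * 60 * 60 * 1000,
  sessionsSinceLast: 3,
  lockHeld: false,
  lastScanAtMs: exampleNowMs - 11 * 60 * 1000,
};

console.log(firstClosedGate(exampleState, exampleNowMs)); // "sessions"
```

Returning the name of the blocking gate rather than a bare boolean makes it easy to log why a consolidation run was skipped.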

Safety: Bash is restricted to read-only commands. Users can kill the consolidation subagent from the background tasks dialog (Shift+Down).

User control: autoDreamEnabled setting overrides the GrowthBook gate tengu_onyx_plover.

6. Magic Docs — Self-Maintaining Documentation

Files whose first line starts with # MAGIC DOC: are automatically updated by a background agent.

How to use:

Create a markdown file with # MAGIC DOC: My Topic as the first line
Optionally add italic instructions on the next line
Make sure Claude reads that file during a session
A constrained background agent will update it with new learnings

The agent can only edit that specific file; it cannot modify other files. This is triggered by file format, not a command.

7. ULTRAPLAN — Remote 30-Minute Planning Sessions

Activate: Type “ultraplan” in your message (keyword detection) or use /ultraplan

Farms out complex exploration to a remote Claude Code instance (CCR):

Remote session created with plan mode pre-configured
CLI polls every 3 seconds for up to 30 minutes
Remote Claude explores, plans, and calls ExitPlanMode when ready
You approve or reject the plan in the browser (claude.ai)
Rejected plans loop back for iteration
On approval: choose “remote” (execute in cloud) or “teleport to terminal” (execute locally)

Smart keyword detection avoids false positives by skipping occurrences inside quotes, paths, identifiers, and questions.

8. Coordinator Mode — Multi-Agent Swarms

Activate: CLAUDE_CODE_COORDINATOR_MODE=1

Transforms Claude into a multi-agent orchestrator:

Master coordinator spawns workers via AgentTool in parallel
Workers report back as XML <task-notification> messages with status, summary, result, and token usage
Coordinator never polls: completion notifications are push-based
Workers get isolated scratch directories (via tengu_scratch gate) for cross-worker knowledge
System prompt enforces: “Never write ‘based on your findings’ — synthesize yourself”
4-phase workflow: Research (parallel) → Synthesis (coordinator) → Implementation → Verification
Explicit continue-vs-spawn guidance based on context overlap

9. Speculation — Predictive Response Generation

While you’re still typing, Claude Code speculatively starts generating a response.

File writes go to an overlay filesystem (not your real files)
If your actual input matches the speculation boundary → overlay committed instantly
If it doesn’t match → discarded silently
Feature gate: tengu_chomp_inflection (GrowthBook)
Result: noticeably lower perceived latency for predictable follow-ups

10. The Advisor Model System

Activate: /advisor <model> or --advisor <model>

Attaches a secondary reviewer/advisor model as a server-side tool:

Main model (e.g., Sonnet) can call a stronger model (e.g., Opus) for review
Full conversation history is forwarded when the advisor is invoked
Beta header: advisor-tool-2026-03-01
Does NOT work on Bedrock/Vertex (they don’t support the advisor beta header)
Disable with /advisor off or CLAUDE_CODE_DISABLE_ADVISOR_TOOL
GrowthBook gate: tengu_sage_compass

11. Voice Mode

Activate: /voice slash command or voiceEnabled setting

Hold-to-talk dictation (default keybinding: hold space)
Requires Anthropic OAuth; not available with API keys, Bedrock, or Vertex
Uses the voice_stream endpoint on claude.ai
Multiple audio backends: native audio, SoX, arecord
Protected by GrowthBook kill-switch (tengu_amber_quartz_disabled) for emergency off

12. Team Memory Sync

Feature-gated: TEAMMEM

Memories split into private (per-user) and team (shared) directories:

Team memory synced to server APIs across authenticated org members
Secret scanning prevents leaking sensitive data into shared memory
Optimistic locking for conflict resolution
Team memory lives at .../memory/team/MEMORY.md
Requires first-party OAuth and org-scoped server APIs

13. Remote Triggers — Scheduled Cloud Agents

Tool: RemoteTrigger (deferred, discovered via ToolSearchTool)

Create and manage scheduled remote Claude Code agents via CCR API:

create: Schedule a trigger with cron expression
list / get: View triggers
update: Modify a trigger
run: Manually fire a trigger

Requires Claude.ai OAuth. Feature gate: tengu_surreal_dali. Beta: ccr-triggers-2026-01-30.

14. Direct Connect — cc:// Session URLs

Activate: claude server (when DIRECT_CONNECT is enabled)

Creates shareable session URLs that external clients can connect to:

WebSocket-based protocol with SDK message format
Permission requests forwarded to connecting client
Supports interrupt/cancel signals
claude open cc://... connects to an existing session (described as internal)
Enables custom IDE integrations and frontends

15. Bridge Mode & Remote Control

Activate: --remote-control / --rc (when BRIDGE_MODE compiled in)

Persistent WebSocket connection between local CLI and claude.ai web interface:

Use Claude Code from a browser while tools execute locally
Exponential backoff reconnection (2s → 2min cap → 10min give-up)
JWT token refresh, trusted device enrollment
Permission request mediation between web UI and local CLI
31 files in the bridge subsystem, essentially a full RPC framework

Related settings:

remoteControlAtStartup: auto-start bridge
taskCompleteNotifEnabled, inputNeededNotifEnabled, agentPushNotifEnabled: push notification controls

16. SSH Remote Execution

Activate: claude ssh <host> [dir] (when SSH_REMOTE is enabled)

Deploys Claude Code binary to a remote Linux host over SSH
API auth tunnels back through the local machine, so no separate remote auth setup is needed
Build-time feature-gated

17. MCP Channels — Inbound Push Notifications

Activate: --channels plugin:name@marketplace

Register sessions for real-time event delivery from approved MCP servers/plugins:

Allowlist controlled by GrowthBook
--dangerously-load-development-channels bypasses the allowlist
Enables event-driven workflows (e.g., GitHub events, CI notifications)

18. AFK / Auto-Permission Mode

Activate: --enable-auto-mode (hidden flag, requires TRANSCRIPT_CLASSIFIER)

Classifier-assisted automatic permission decisions when the user is away:

AI model evaluates each permission request based on conversation context
Beta header: afk-mode-2026-01-31
Deprecated aliases: --afk, --dangerously-skip-permissions-with-classifiers
Designed for unattended, long-running agent workflows

19. Background Sessions (Detached)

Feature-gated: BG_SESSIONS

Run Claude Code sessions in the background with management commands:

claude --bg / claude --background: Start a detached session
claude ps: List running background sessions
claude logs <id>: View session logs
claude attach <id>: Reattach to a session
claude kill <id>: Stop a session

20. “While You Were Away” Session Recaps

When you return to a session after being away, Claude can generate a short recap card summarizing what happened.

Implemented in services/awaySummary.ts
Produces short re-entry summaries automatically
UI-side feature, not a slash command

21. Tool-Use Summary Generation

For SDK/mobile surfaces, raw tool batches are automatically converted into compact high-level progress summaries:

Implemented in services/toolUseSummary/toolUseSummaryGenerator.ts
Generates short labels for completed tool batches
Used by SDK to provide progress updates to clients and mobile-like single-line rows

22. Auto-Memory Extraction

A background agent automatically extracts memories from your conversations:

Runs at end of each complete query loop (when EXTRACT_MEMORIES is enabled)
Uses a forked agent that shares the prompt cache
Scans existing memory files first to avoid duplicates
When the main agent writes memories directly, the background extractor skips that range
Writes to ~/.claude/projects/<path>/memory/
Auto-memory is enabled by default unless disabled via CLAUDE_CODE_DISABLE_AUTO_MEMORY, --bare, or settings

23. Prompt Suggestions & Follow-Up Generation

Activate: CLAUDE_CODE_ENABLE_PROMPT_SUGGESTION=1

After Claude finishes a response, it can suggest what to ask next:

Feature gate: tengu_chomp_inflection (GrowthBook)
Env var can force on/off, otherwise GrowthBook + interactive-session checks apply
Tied to the speculation system; may pre-generate the speculative response too

24. Deferred Tool Discovery

~18 of Claude Code’s 60+ tools are not sent to the model in every request.
They’re discovered on-demand:

Model calls ToolSearchTool with a keyword query
Matching deferred tools’ schemas are returned
Model calls the discovered tool in the same turn

Query syntax:

select:TaskCreate,LSP (direct selection by name)
task create (keyword search against names, descriptions, and hints)
+slack send (require “slack” in tool name)

Why it matters: Keeps the base prompt under 200K tokens. Without this, 60+ tool schemas would consume too much context.

25. Hidden Keybindings

Feature-gated keybindings that only appear when their features are enabled:

Key Combo	Action	Gate
`Space` (hold)	Push-to-talk voice input	`VOICE_MODE`
`Ctrl+Shift+B`	Toggle Brief mode	`KAIROS` / `KAIROS_BRIEF`
`Ctrl+Shift+F` / `Cmd+Shift+F`	Global search	`QUICK_SEARCH`
`Ctrl+Shift+P` / `Cmd+Shift+P`	Quick open	`QUICK_SEARCH`
`Meta+J`	Toggle terminal panel	`TERMINAL_PANEL`
`Shift+Up`	Message actions menu	`MESSAGE_ACTIONS`
`Ctrl+Shift+O`	Toggle teammate preview	Teams enabled

Always-available but often unknown:

Key Combo	Action
`Ctrl+X Ctrl+K`	Kill all running agents
`Ctrl+_` or `Ctrl+Shift+-`	Undo
`Ctrl+X Ctrl+E` or `Ctrl+G`	Open external editor
`Ctrl+S`	Stash current input
`Meta+P`	Model picker
`Meta+O`	Fast mode toggle
`Meta+T`	Thinking mode toggle
`Ctrl+E`	Toggle permission explanation (in confirmation dialog)
`Ctrl+D`	Toggle permission debug info (in confirmation dialog)

All keybindings are overridable via ~/.claude/keybindings.json.

26. Lesser-Known Environment Variables

Debug & profiling:

Variable	Purpose
`CLAUDE_CODE_PROFILE_STARTUP=1`	Full startup profiling with memory snapshots
`CLAUDE_CODE_PROFILE_QUERY=1`	Profile query pipeline timing
`CLAUDE_CODE_DEBUG_REPAINTS=1`	Show component owner chain for every terminal repaint
`CLAUDE_CODE_PERFETTO_TRACE`	Enable Perfetto tracing format
`CLAUDE_CODE_TERMINAL_RECORDING`	Record terminal in asciinema format
`CLAUDE_CODE_COMMIT_LOG=/path`	Log slow renders for profiling

Behavioral overrides:

Variable	Purpose
`CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC`	Disable ALL non-essential network traffic (most restrictive privacy)
`CLAUDE_CODE_DISABLE_AUTO_MEMORY`	Disable automatic memory management
`CLAUDE_CODE_DISABLE_CRON`	Disable cron job scheduler
`CLAUDE_CODE_DISABLE_FILE_CHECKPOINTING`	Disable file snapshot backups
`CLAUDE_CODE_DISABLE_ADVISOR_TOOL`	Disable advisor model
`CLAUDE_CODE_ENABLE_PROMPT_SUGGESTION`	Enable speculative next-prompt suggestions
`CLAUDE_CODE_MAX_OUTPUT_TOKENS`	Override max output tokens per response
`CLAUDE_CODE_MAX_CONTEXT_TOKENS`	Override max context window
`CLAUDE_CODE_EFFORT_LEVEL`	Set effort: `low|medium|high|max|auto`
`CLAUDE_CODE_COORDINATOR_MODE=1`	Enable coordinator mode
`CLAUDE_CODE_SIMPLE`	Same as `--bare`: skip hooks, LSP, plugins, background tasks
`CLAUDE_CONFIG_DIR`	Override `~/.claude` config directory
`CLAUDE_ENV_FILE`	Path to env file to source on startup

Provider switching:

Variable	Purpose
`CLAUDE_CODE_USE_BEDROCK=1`	Use AWS Bedrock
`CLAUDE_CODE_USE_VERTEX=1`	Use Google Vertex AI
`CLAUDE_CODE_USE_FOUNDRY=1`	Use Azure Foundry
`CLAUDE_CODE_CLIENT_CERT` / `CLAUDE_CODE_CLIENT_KEY`	mTLS client certificates

Prompt caching control:

Variable	Purpose
`DISABLE_PROMPT_CACHING`	Disable globally
`DISABLE_PROMPT_CACHING_HAIKU` / `_SONNET` / `_OPUS`	Disable per model
`DISABLE_INTERLEAVED_THINKING`	Disable thinking blocks

27. Lesser-Known Settings Keys

In ~/.claude/settings.json, things most people don’t know you can set:

Key	What It Does
`apiKeyHelper`	External command to fetch API key (e.g., `1password read op://vault/key`)
`awsCredentialExport`	Command to export AWS credentials for Bedrock
`env`	Arbitrary environment variables injected into every session
`effortLevel`	Default effort level: `low|medium|high|max|auto`
`alwaysThinkingEnabled`	Force extended thinking on every request
`spinnerVerbs`	Custom spinner verb list
`spinnerTipsOverride`	Custom tip messages during spinner
`worktree.symlinkDirectories`	Directories to symlink in worktrees (saves disk)
`worktree.sparsePaths`	Git sparse-checkout paths for monorepo worktrees
`autoMemoryDirectory`	Custom path for memory storage
`autoDreamEnabled`	Enable/disable auto-dream consolidation
`minSleepDurationMs`	Minimum SleepTool duration
`skipWebFetchPreflight`	Skip WebFetch URL validation
`disableBypassPermissionsMode`	Prevent entering bypass mode
`allowManagedPermissionRulesOnly`	Enterprise: only admin-defined permission rules
`allowManagedMcpServersOnly`	Enterprise: only admin-defined MCP servers
`allowManagedHooksOnly`	Enterprise: only admin-defined hooks
`allowedHttpHookUrls`	URL allowlist for HTTP hooks
`httpHookAllowedEnvVars`	Env vars HTTP hooks can interpolate
`remote.defaultEnvironmentId`	Default remote environment for CCR
`minimumVersion`	Enforce minimum Claude Code version

28. CLAUDE.md Loading — Hidden Discovery Paths

Beyond the standard project CLAUDE.md, Claude Code loads instructions from:

~/.claude/CLAUDE.md: User-level global instructions
Parent directories up to git root: all CLAUDE.md files in parent dirs are included
.claude/CLAUDE.md: Inside the .claude directory
.claude/rules/*.md: Per-project rule files (all .md files in this directory)
@include-style references inside memory files

29. Internal-Only Commands

Registered only when USER_TYPE === 'ant' (not in public builds):

backfill-sessions, break-cache, bughunter, ctx_viz, good-claude, init-verifiers, force-snip, mock-limits, bridge-kick, subscribe-pr, reset-limits, share, ant-trace, perf-issue, env, oauth-refresh, debug-tool-call, agents-platform, autofix-pr, onboarding

30. Build-Time Feature Flags

Compile-time flags via feature() from bun:bundle.
When off, code is eliminated entirely from the binary: <table> <thead> <tr> <th>Flag</th> <th>Feature</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">COORDINATOR_MODE</code></td> <td>Multi-agent coordinator</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">VOICE_MODE</code></td> <td>Voice input</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">KAIROS</code> / <code class="language-plaintext highlighter-rouge">KAIROS_BRIEF</code></td> <td>Persistent assistant mode</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">PROACTIVE</code></td> <td>Autonomous mode</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">BRIDGE_MODE</code></td> <td>Remote control bridge</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">SSH_REMOTE</code></td> <td>SSH remote execution</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">DIRECT_CONNECT</code></td> <td><code class="language-plaintext highlighter-rouge">cc://</code> URL handling</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">BG_SESSIONS</code></td> <td>Background sessions</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">TEMPLATES</code></td> <td>Template/new/reply flows</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">TEAMMEM</code></td> <td>Team memory sync</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">TRANSCRIPT_CLASSIFIER</code></td> <td>AI permission classification</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">BUDDY</code></td> <td>Tamagotchi pet</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">ULTRAPLAN</code></td> <td>Remote planning</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">EXTRACT_MEMORIES</code></td> <td>Auto-memory extraction</td> </tr> <tr> <td><code class="language-plaintext 
highlighter-rouge">WORKFLOW_SCRIPTS</code></td> <td>Workflow automation</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">QUICK_SEARCH</code></td> <td>Quick search (Ctrl+Shift+F)</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">TERMINAL_PANEL</code></td> <td>Terminal panel (Meta+J)</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">MESSAGE_ACTIONS</code></td> <td>Message action menu (Shift+Up)</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">CONTEXT_COLLAPSE</code></td> <td>Context collapse optimization</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">HISTORY_SNIP</code></td> <td>History snipping</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">MCP_SKILLS</code></td> <td>MCP skill discovery</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">DAEMON</code></td> <td>Long-running daemon</td> </tr> </tbody> </table> Compiled from static analysis of the Claude Code source. Corrections applied from the audited version. Features behind flags may not be in your build. </article> <article> <h1>Designing a URL Shortener for 1 Trillion URLs</h1> 2026-03-08T00:00:00+00:00 I’ve read multiple system design books, taken courses, and watched several playlists on YouTube. One of the most common and foundational system design questions that almost everyone starts with is the Bitly problem: designing a URL shortener that converts long URLs into shorter, more manageable links. Over time, this question started to feel dry and repetitive. I often found myself skipping it whenever I saw it again. Recently, though, I came across a very interesting variation of the problem. I tried designing a solution for it, and that is what inspired me to write this post. <h2 id="the-challenge">The Challenge</h2> The challenge is to design a URL shortener that supports 1 trillion URLs. 
At first glance, this sounds like a simple scaling problem, but it actually changes a lot of things and breaks many of the naive assumptions that usually work for the standard Bitly-style system design question. Even if we assume a base-62 encoding (<code class="language-plaintext highlighter-rouge">0-9</code>, <code class="language-plaintext highlighter-rouge">a-z</code>, <code class="language-plaintext highlighter-rouge">A-Z</code>), a 7-character string gives us roughly 3.5 trillion possible combinations. That sounds like plenty for 1 trillion URLs, but things get complicated once we start thinking about distribution, storage, and collision guarantees. Also, <code class="language-plaintext highlighter-rouge">62^7</code> is only the raw theoretical namespace. In practice, the usable space is smaller because some codes will be reserved for custom aliases, blocked words, internal testing, and occasionally wasted ranges from crashed allocators. The margin is still comfortable, but it is not infinite. <h3 id="constraints">Constraints</h3> <ol> <li>The shortened URL can have at most 7 characters.</li> <li>The system must guarantee unique URLs with no collisions.</li> <li>The system must support 1 trillion URLs over its entire lifetime.</li> </ol> Before diving into the trillion-URL challenge, let’s first revisit the standard approach used to solve the traditional URL shortener design problem. <h2 id="the-traditional-approach">The Traditional Approach</h2> <h3 id="functional-requirements">Functional Requirements</h3> <h4 id="core-requirements">Core requirements</h4> <ul> <li>Users should be able to submit a long URL and receive a shortened version.</li> <li>Optionally, users should be able to specify a custom alias for their shortened URL, i.e. 
<code class="language-plaintext highlighter-rouge">www.short.ly/my-custom-alias</code>.</li> <li>Optionally, users should be able to specify an expiration date for their shortened URL.</li> <li>Users should be able to access the original URL by using the shortened URL.</li> </ul> <h4 id="below-the-line-out-of-scope">Below the line (out of scope)</h4> <ul> <li>User authentication and account management.</li> <li>Analytics on link clicks, such as click counts or geographic data.</li> </ul> <h3 id="non-functional-requirements">Non-Functional Requirements</h3> <h4 id="core-requirements-1">Core requirements</h4> <ul> <li>The system should ensure uniqueness for short codes, where each short code maps to exactly one long URL.</li> <li>Redirection should occur with minimal delay, ideally under 100 ms.</li> <li>The system should be reliable and available 99.99% of the time, with availability prioritized over consistency.</li> <li>The system should scale to support 1 billion shortened URLs and 100 million DAU.</li> </ul> <h4 id="below-the-line-out-of-scope-1">Below the line (out of scope)</h4> <ul> <li>Real-time consistency for analytics.</li> <li>Advanced security features such as spam detection and malicious URL filtering.</li> </ul> <h2 id="high-level-design">High-Level Design</h2> We’ll go through the functional requirements one by one and design a system that satisfies them. <h3 id="1-submitting-a-long-url-and-receiving-a-short-url">1. Submitting a long URL and receiving a short URL</h3> When a user submits a long URL, the client sends a <code class="language-plaintext highlighter-rouge">POST</code> request to <code class="language-plaintext highlighter-rouge">/urls</code> with the long URL, custom alias, and expiration date. 
The flow looks like this: <ol> <li>The primary server receives the request and validates the long URL format using libraries like <code class="language-plaintext highlighter-rouge">is-url</code> or simple validation logic.</li> <li>Optionally, we can check whether this exact long URL was already shortened and return the existing short code as a deduplication optimization.</li> <li>In practice, most URL shorteners allow multiple short codes for the same long URL, since different users may want different expiration dates, separate analytics, or different custom aliases.</li> <li>If the URL is valid, we generate a short code and store the mapping.</li> </ol> <h4 id="generating-the-short-code">Generating the short code</h4> The standard answer is to use a hash function that produces enough randomness to make collisions unlikely. A hash function like MD5 takes an input and produces a deterministic fixed-size output. That means the same long URL would always map to the same hash, which is useful if you want deterministic code generation. But the system still needs to store the redirect mapping, because a hash is not reversible and we still need metadata such as expiration dates, ownership, and custom aliases. It is also not desirable if you need multiple short codes for the same URL. Hash outputs also have high entropy, which makes them appear random. We can encode that output using base-62 and take the first 7 characters as our shortcode. Here, encoding simply means converting binary hash output into a sequence of readable characters from a chosen alphabet so it can be used as a short, URL-friendly code. This gives us approximately <code class="language-plaintext highlighter-rouge">62^7 = 3.52 trillion</code> possible values, which is a large namespace. But a large namespace does not make random or truncated-hash generation safe. 
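The hash-then-truncate scheme described above can be sketched as follows. This is a minimal illustration, not a recommended implementation: the alphabet ordering, the fixed 22-digit padding, and the helper names <code class="language-plaintext highlighter-rouge">base62</code> and <code class="language-plaintext highlighter-rouge">short_code_from_hash</code> are assumptions made for the example.

```python
import hashlib

# Assumed ordering (digits, lowercase, uppercase); the article names the
# character classes but does not fix an order.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGEST_WIDTH = 22  # 62**22 > 2**128, so 22 digits cover any MD5 digest


def base62(n: int) -> str:
    """Encode a non-negative integer as a base-62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def short_code_from_hash(long_url: str, length: int = 7) -> str:
    """Hypothetical helper: MD5 the URL, base-62 encode it, keep the first chars."""
    digest = hashlib.md5(long_url.encode("utf-8")).digest()
    encoded = base62(int.from_bytes(digest, "big"))
    # Pad to a fixed width so every digest yields the same number of digits.
    return encoded.rjust(DIGEST_WIDTH, ALPHABET[0])[:length]


print(short_code_from_hash("https://example.com/some/long/path"))
```

The same input always yields the same code, which is the deterministic property mentioned earlier. One caveat specific to this sketch: because 2^128 is a little under 8 × 62^21, the first character of the padded encoding can only be <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">7</code>. Keeping the trailing 7 characters instead (equivalent to reducing the digest modulo 62^7) would spread codes more evenly, but neither variant guarantees uniqueness.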
If the code space has size <code class="language-plaintext highlighter-rouge">|S|</code> and <code class="language-plaintext highlighter-rouge">n</code> codes are already in use, the probability that the next random code collides is <code class="language-plaintext highlighter-rouge">n / |S|</code>. That means you still need retries and database checks to enforce uniqueness. In a space of this size, even a system that creates around 1 billion links would expect on the order of <code class="language-plaintext highlighter-rouge">10^5</code> colliding pairs unless it performs uniqueness checks. At more ordinary scale, a more production-friendly baseline is to stop relying on random hashes and use a centralized counter instead. Redis is a good fit here because <code class="language-plaintext highlighter-rouge">INCR</code> is atomic and hands out unique integers efficiently. We can base-62-encode each integer into a short code and guarantee uniqueness without a retry loop. Even if links expire, we usually do not recycle short codes. Reusing codes creates ugly edge cases with stale caches, delayed clients, and old analytics data, so it is safer to treat the namespace as append-only over the system’s lifetime. Once we have the short URL, we can insert it into the database along with the long URL, optional custom alias, and expiration date. Finally, we return the shortened URL to the client. <h3 id="2-accessing-the-original-url-from-the-shortened-url">2. Accessing the original URL from the shortened URL</h3> Once the short URL is live, users can use it to reach the original URL. Importantly, that shortened URL exists under a domain we own. 
When a user accesses a shortened URL, the flow looks like this: <ol> <li>The browser sends a <code class="language-plaintext highlighter-rouge">GET</code> request with the short code, for example <code class="language-plaintext highlighter-rouge">GET /abc123</code>.</li> <li>The primary server looks up the short code in the database.</li> <li>If the short code exists and has not expired, the server retrieves the long URL. If it has expired, the server returns <code class="language-plaintext highlighter-rouge">410 Gone</code>.</li> <li>The server responds with an HTTP redirect, usually either <code class="language-plaintext highlighter-rouge">301</code> or <code class="language-plaintext highlighter-rouge">302</code>.</li> </ol> For a URL shortener, a <code class="language-plaintext highlighter-rouge">302</code> redirect is often preferred because: <ul> <li>It gives us more control over the redirection process, allowing us to update or expire links later.</li> <li>It prevents browsers from aggressively caching the redirect.</li> <li>It still allows us to track click statistics, even though analytics are out of scope for this design.</li> </ul> <h4 id="how-do-we-make-redirects-fast">How do we make redirects fast?</h4> A naive database lookup could devolve into a full table scan, which is clearly too slow. A better baseline is to add an index, or simply make the shortened URL the primary key. That gives us indexed lookups and also enforces uniqueness. The remaining problem is SSD IOPS. A single database instance would still struggle to keep up with heavy traffic, leading to slower response times and possible timeouts. A much better solution is to place an in-memory cache like Redis or Memcached between the server and the database. Frequently accessed mappings from short code to long URL can live in memory. 
The read path then becomes: <ul> <li>On a cache hit, return the long URL in milliseconds.</li> <li>On a cache miss, query the database, populate the cache, and return the result.</li> </ul> The difference in speed is significant: <ul> <li>Memory access time: about 100 nanoseconds (<code class="language-plaintext highlighter-rouge">0.0001 ms</code>)</li> <li>SSD access time: about <code class="language-plaintext highlighter-rouge">0.1 ms</code></li> <li>HDD access time: about <code class="language-plaintext highlighter-rouge">10 ms</code></li> </ul> That means memory access is roughly 1,000 times faster than SSD and 100,000 times faster than HDD. In terms of operations per second: <ul> <li>Memory can support millions of reads per second.</li> <li>SSDs can support roughly 100,000 IOPS.</li> <li>HDDs typically support around 100 to 200 IOPS.</li> </ul> The only real challenge here is cache invalidation, which can get complicated when updates or deletions happen. In this system, though, the problem is smaller because shortened URLs are mostly read-heavy and rarely change. The cache also needs time to warm up, so the first few requests for a URL may still hit the database. Since memory is limited, we also need to think carefully about cache size, eviction policies such as LRU, and which entries are worth storing. <h3 id="3-scaling-the-standard-design-to-1-billion-urls">3. Scaling the standard design to 1 billion URLs</h3> Let’s do some rough sizing. Each row in the database contains: <ul> <li>A short code, roughly 8 bytes</li> <li>A long URL, roughly 100 bytes</li> <li><code class="language-plaintext highlighter-rouge">creationTime</code>, roughly 8 bytes</li> <li>An optional custom alias, roughly 100 bytes</li> <li>An expiration date, roughly 8 bytes</li> </ul> That totals around 200 bytes per row. 
If we round up to 500 bytes to account for metadata such as creator ID, analytics ID, and internal overhead, then 1 billion mappings would require: <code class="language-plaintext highlighter-rouge">500 bytes * 1 billion rows = 500 GB</code> That is still within the capabilities of modern SSDs, and if we need more headroom, we can shard data across multiple servers. Reads are much more frequent than writes, so we can separate the system into reader and writer services and scale them independently. We can then add more server instances behind a load balancer to handle higher RPS without concentrating load on a single machine. Here is the high-level difference between the two versions of the problem: <table> <thead> <tr> <th>Aspect</th> <th>Standard shortener</th> <th>Trillion-scale shortener</th> </tr> </thead> <tbody> <tr> <td>ID generation</td> <td>Centralized counter or sequence</td> <td>Range-based allocation across many writers</td> </tr> <tr> <td>Collision handling</td> <td>Database checks are acceptable</td> <td>Uniqueness must be generated up front</td> </tr> <tr> <td>Storage</td> <td>Single relational cluster can still work</td> <td>Sharded distributed key-value storage</td> </tr> <tr> <td>Read path</td> <td>Cache in front of the database</td> <td>Multi-region cache plus edge-aware reads</td> </tr> <tr> <td>Public code shape</td> <td>Sequential codes may be acceptable</td> <td>Codes should be scrambled to prevent enumeration</td> </tr> </tbody> </table> All of this works well for the standard problem. Now let’s go back to the trillion-URL version and look at why those answers stop working. The core shift is from generating IDs with local randomness to treating the shortcode space as a globally allocated namespace. <h2 id="why-the-usual-approaches-fail-at-1-trillion-urls">Why the Usual Approaches Fail at 1 Trillion URLs</h2> The constraints change the problem enough that several standard answers break down. <h3 id="1-truncated-hashes-stop-being-safe">1. 
Truncated hashes stop being safe</h3> The assumption that MD5 plus base-62 gives us enough entropy fails here. Once we truncate the hash to just 7 characters, we are throwing away most of the hash space, and the birthday paradox tells us that collisions become a practical certainty long before the namespace fills up. The birthday paradox says that in a room of just 23 people, there is about a 50% chance that two people share the same birthday, even though there are 365 possible days. That feels surprising because 23 is much smaller than 365. The reason is that we are comparing many pairs, not just one. The number of comparisons among <code class="language-plaintext highlighter-rouge">k</code> items is: \[\frac{k(k-1)}{2}\] So, while collisions are still rare, the probability of at least one collision grows roughly quadratically with the number of generated IDs: $P(k) \approx 1 - e^{-k^2 / (2N)}$ for an ID space of size $N$. After base-62 encoding, the total possible ID space is approximately 3.5 trillion. The rough intuition is that collisions start becoming noticeable around $\sqrt{N}$, and setting $P(k) = 0.5$ gives the more precise 50% threshold: \[k_{50} \approx \sqrt{2N \ln 2}\] Where: <ul> <li><code class="language-plaintext highlighter-rouge">k_{50}</code> is the point where the chance of at least one collision is about 50%</li> <li><code class="language-plaintext highlighter-rouge">N</code> is the size of the total ID space</li> </ul> For 7 base-62 characters: \[N = 62^7\] \[N \approx 3.52 \times 10^{12}\] So: \[k_{50} \approx \sqrt{2 \cdot 62^7 \cdot \ln 2}\] \[k_{50} \approx 2.21 \times 10^6\] So by about 2.2 million generated IDs, the system already has roughly a 50% chance of at least one collision. Even at 2 million IDs, the probability is already about 43%. Both numbers are a tiny fraction of the 1 trillion URLs we need to store. In this system, that is disastrous, because a user could be redirected to the wrong site. <h3 id="2-retry-on-collision-becomes-too-expensive">2. Retry-on-collision becomes too expensive</h3> One way to patch the problem is to salt the URL and keep rehashing until you find a unique value. 
That creates a loop like this: <code class="language-plaintext highlighter-rouge">hash -> DB check -> collision? -> rehash -> check again</code> At trillion-record scale, the database index will itself be several terabytes. Every collision check becomes a random lookup against that massive index, and many of those lookups will fall out of cache. If you have heavy write traffic, you end up spending a large chunk of your compute budget just proving that a string has not been used before. In the worst case, that becomes extremely expensive and operationally ugly. We need to stop searching for uniqueness and start generating uniqueness. <h2 id="generating-uniqueness-instead-of-searching-for-it">Generating Uniqueness Instead of Searching for It</h2> The simplest way to guarantee uniqueness is to use a counter again. That mathematically guarantees a unique value for every URL. But now we hit the next problem: we cannot hand users a giant integer because the short URL is constrained to at most 7 characters. <h3 id="turning-a-large-integer-into-a-7-character-code">Turning a large integer into a 7-character code</h3> The solution is to map the large integer into a base-62 string. A base-62 alphabet can look like this: <ul> <li><code class="language-plaintext highlighter-rouge">0-9</code> (10 characters)</li> <li><code class="language-plaintext highlighter-rouge">a-z</code> (26 characters)</li> <li><code class="language-plaintext highlighter-rouge">A-Z</code> (26 characters)</li> </ul> That gives us: \[62 = 10 + 26 + 26\] Here is a simple example. Assume our counter is <code class="language-plaintext highlighter-rouge">500</code>. \[500 / 62 = 8 \text{ remainder } 4\] The remainder <code class="language-plaintext highlighter-rouge">4</code> maps to the character at index <code class="language-plaintext highlighter-rouge">4</code> in the alphabet, and the quotient <code class="language-plaintext highlighter-rouge">8</code> becomes the next digit in the conversion. 
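The repeated-division procedure from the worked example can be sketched in a few lines. The alphabet ordering (digits, then lowercase, then uppercase) and the function names <code class="language-plaintext highlighter-rouge">encode</code> and <code class="language-plaintext highlighter-rouge">decode</code> are assumptions chosen for this illustration.

```python
# Assumed ordering: index 0-9 are digits, 10-35 lowercase, 36-61 uppercase.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)           # 62
CODE_LENGTH = 7
MAX_ID = BASE ** CODE_LENGTH   # 62^7 = 3,521,614,606,208


def encode(counter: int) -> str:
    """Convert a counter value into a base-62 code of at most 7 characters."""
    if not 0 <= counter < MAX_ID:
        raise ValueError("counter does not fit in 7 base-62 characters")
    digits = []
    while counter > 0:
        # Each division peels off the least significant base-62 digit.
        counter, remainder = divmod(counter, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits)) or ALPHABET[0]


def decode(code: str) -> int:
    """Invert encode(): fold the characters back into the integer."""
    value = 0
    for ch in code:
        value = value * BASE + ALPHABET.index(ch)
    return value


print(encode(500))          # 84
print(decode(encode(500)))  # 500
```

With this alphabet, the counter value 500 becomes the two-character code <code class="language-plaintext highlighter-rouge">84</code>: the quotient 8 is the leading digit and the remainder 4 is the trailing one.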
Base-62 encoding gives us a one-to-one mapping where every integer in the usable range maps to a unique short string, and we can decode it back correctly if needed. There are no collision checks and no guesswork. Because the code length is capped at 7 characters, the conversion takes a constant number of steps, so it is effectively <code class="language-plaintext highlighter-rouge">O(1)</code>. <h3 id="avoiding-a-global-counter-bottleneck">Avoiding a global counter bottleneck</h3> A single counter that every writer must call on every write is a bottleneck and a single point of failure in a distributed system. To solve that, we take the counter off the hot path with a coordination service like ZooKeeper. It acts as a distributed, highly available source of truth and hands out ranges of IDs to each server. For example: <ul> <li>Server A gets <code class="language-plaintext highlighter-rouge">1</code> to <code class="language-plaintext highlighter-rouge">1,000,000</code></li> <li>Server B gets <code class="language-plaintext highlighter-rouge">1,000,001</code> to <code class="language-plaintext highlighter-rouge">2,000,000</code></li> </ul> Now each server can allocate IDs locally in memory by incrementing a local integer. That means: <ul> <li>No network calls for every URL creation</li> <li>No coordination on every write</li> <li>No locks</li> <li>No database uniqueness checks</li> </ul> Once a server exhausts its range, it goes back to ZooKeeper and requests another one. If Server A crashes, the worst case is that we lose some unused IDs from its range. That is fine. We have roughly 3.5 trillion total combinations and only need 1 trillion of them, so wasting some IDs is acceptable. This makes ID generation solid. We are no longer hoping collisions do not happen. We have mathematically eliminated them. <h2 id="the-storage-wall">The Storage Wall</h2> Now we get to the storage problem. If each URL takes roughly 100 bytes, then we need <code class="language-plaintext highlighter-rouge">100B * 1T = 100 TB</code> for the raw URL data alone, and that is before timestamps, indexes, metadata, replication, and everything else. 
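As a rough back-of-the-envelope check, the sizing can be written out explicitly. The 100-byte and 500-byte figures come from the estimates in this article; the node count of 100 and the replication factor of 3 are assumptions used only for illustration.

```python
TB = 10**12  # decimal terabyte, in bytes


def storage_tb(rows: int, bytes_per_row: int) -> float:
    """Total storage in terabytes for a given row count and row size."""
    return rows * bytes_per_row / TB


URLS = 10**12          # 1 trillion mappings
RAW_URL_BYTES = 100    # raw long URL only
PADDED_ROW_BYTES = 500 # padded row size from the 1-billion-URL estimate
NODES = 100            # hypothetical shard count
REPLICATION = 3        # hypothetical replication factor

raw = storage_tb(URLS, RAW_URL_BYTES)        # raw URL data only
padded = storage_tb(URLS, PADDED_ROW_BYTES)  # with metadata and overhead
per_node = padded / NODES                    # one copy spread over the shards
replicated = padded * REPLICATION            # all copies across the cluster

print(raw, padded, per_node, replicated)
```

Under these assumptions the raw URLs take 100 TB, the padded rows take 500 TB, and each of 100 shards holds about 5 TB per copy of the data.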
A dataset of that size is not something a single PostgreSQL instance should handle. Even if it could, a deep B-tree index at this scale would translate into a cascade of disk lookups. So the obvious answer is to distribute and shard the data across multiple nodes. For example, if 100 nodes each hold 5 TB (500 TB in total, roughly what 1 trillion rows need at the padded 500-byte estimate from earlier), the problem becomes much more manageable. <h3 id="routing-requests-to-the-right-shard">Routing requests to the right shard</h3> The next question is how to know which node holds which URL. The simplest approach is modulo-based sharding. The routing rule can be as simple as: <code class="language-plaintext highlighter-rouge">shard = hash(short_url) % 100</code> That gives us <code class="language-plaintext highlighter-rouge">O(1)</code> routing without a central directory. The drawback is that changing the shard count changes the modulus, which remaps almost every key. If we expect the number of shards to change frequently, true consistent hashing or rendezvous hashing is a better choice because it minimizes remapping when nodes are added or removed. Since this workload is mostly key-value lookups with no major joins or cross-node transactions, databases like Cassandra or DynamoDB are a much better fit than a monolithic relational database. They are built for exactly this type of access pattern: Cassandra’s LSM-tree storage engine absorbs heavy write volumes well, and both systems handle sharding and replication for us. <h3 id="the-read-path-at-trillion-scale">The read path at trillion scale</h3> The Pareto principle applies strongly here: a small fraction of URLs will receive most of the traffic. So treating every URL equally is wasteful. Instead, we place a distributed cache such as Redis or Memcached in front of the storage layer. 
The read path becomes: <ul> <li>Cache hit: return the URL in under a millisecond.</li> <li>Cache miss: look up the correct shard in Cassandra, fetch the URL, store it in Redis, and then return the redirect.</li> </ul> <h2 id="security-global-ux-and-the-reductionist-mindset">Security, Global UX, and the Reductionist Mindset</h2> There is one important issue that usually does not come up early in system design interviews: security. <h3 id="making-the-urls-unpredictable">Making the URLs unpredictable</h3> If the system simply uses a visible counter, then the short URLs become predictable. If someone knows one URL, they can often guess the previous or next one. That means an attacker could scrape the namespace, discover private links, or infer business intelligence like usage growth. So even though we still want a counter under the hood, we need the outward-facing short codes to look random. That is where a Feistel cipher helps. More precisely, we use it as a reversible permutation over the fixed ID space: the counter value is scrambled into another integer of the same size. Every input still maps to a unique output, and we can still reverse it back to the original value. In other words, it behaves like format-preserving scrambling. But from the outside, the output looks random. That gives us the best of both worlds: guaranteed uniqueness underneath and non-predictable public URLs on top. <h3 id="global-reads-and-disaster-scenarios">Global reads and disaster scenarios</h3> Now imagine a regional failure, or simply a user very far away from the region where the system is deployed. If all redirects are served from, say, Northern Virginia, then a user in Tokyo has to make a full round trip to Virginia before being redirected. That can easily add around 300 ms, which is a poor user experience for something as simple as a redirect. 
To avoid that, we can push reads and redirects closer to the user with a CDN like Cloudflare or Akamai, but that only helps if the short-code mapping is also available near the edge. That can be done by caching redirect responses, storing hot mappings in edge KV, or running edge compute backed by nearby replicas. Otherwise, the edge still has to call back to origin. With edge-local data, the Tokyo user can be served from the Tokyo edge in something closer to 10 ms. Writes are different. If a user in New York creates a link, it gets written to Cassandra and then replicated asynchronously to Tokyo. That brings us to the CAP theorem. Consistency, availability, and partition tolerance cannot all be maximized at the same time. In this case, we are willing to give up immediate consistency so we can preserve high availability and partition tolerance. That means eventual consistency is acceptable. If a user in Tokyo cannot immediately access a link that was just created in New York because replication is still catching up, that is not a fatal problem. A refresh a few seconds later should resolve it. So we are okay with eventual consistency, but we are not willing to compromise on availability or partition tolerance. The one thing we absolutely cannot compromise on is uniqueness. That is the beauty of the range-manager approach. Even if New York and Tokyo are completely isolated from each other, as long as they were assigned non-overlapping ranges before the split, they can continue generating unique URLs independently with no risk of overlap. <h2 id="final-takeaway">Final Takeaway</h2> Let’s step back and trace what we built here. We did not approach this as a simple URL-shortening coding task. We approached it as a namespace-management problem. 
<ol> <li>We determined the size of the namespace: <code class="language-plaintext highlighter-rouge">62^7</code>, or about 3.5 trillion possible short codes.</li> <li>We reduced uniqueness to a sequential counter.</li> <li>We handled distributed coordination through range allocation with ZooKeeper.</li> <li>We scaled storage with sharded key-value data in Cassandra.</li> <li>We reduced read latency by caching the hot set with Redis.</li> <li>We addressed predictability and security with a Feistel cipher.</li> </ol> That is the reductionist mindset. At this scale, you cannot brute-force your way through the problem with more code. You have to find the right abstraction that makes the problem manageable again. If you walk into an interview and say, “I’ll use a hash and hope for the best,” you are thinking like a coder, which is fine. But if you talk about the physics of the ID space, explain why LSM trees outperform B-trees for this write-heavy profile, or discuss the birthday paradox and what it means for truncated hashes, you show a deeper understanding of how systems actually work. That is what I mean when I say that understanding why a system fails is more important than knowing how to build it. In the end, this is TinyURL at trillion scale: a simple-looking problem on the surface, but a masterclass in distributed systems underneath. </article> <article> <h1>The Architecture of Copying and Pasting Images on the Web</h1> 2026-03-07T00:00:00+00:00 Copying an image from one website and pasting it to another. No downloads, no temporary files, no dragging things to your desktop. Just <code class="language-plaintext highlighter-rouge">Ctrl+C</code> -> <code class="language-plaintext highlighter-rouge">Ctrl+V</code> and the image shows up as if it teleported across the web. 
To understand how this works internally we need to understand the sandboxed renderer processes, serializing internal memory structures, navigating the inter-process communication (<code class="language-plaintext highlighter-rouge">IPC</code>) frameworks of the host OS, interfacing with legacy and modern clipboard APIs across platforms, and ultimately reconstructing the data into a secure, scriptable Object within a distinct Document Object Model (<code class="language-plaintext highlighter-rouge">DOM</code>). This article dives deep into this exact feature, detailing the lifecycle of a copied image starting from the browser’s rendering engine, traversing through macOS, Windows, and Linux (both X11 and Wayland) OS clipboards, and securely re-entering a sandboxed web application. <h2 id="browser-side-copy-operation">Browser-Side Copy Operation</h2> The operation initiates when a user triggers a context menu over an image element and selects “Copy Image.” This action bypasses standard JavaScript clipboard API interceptions, which are typically gated by <code class="language-plaintext highlighter-rouge">ClipboardEvent.clipboardData</code>, and directly invokes the browser’s internal native handlers. <h3 id="image-retrieval-from-the-rendering-engine">Image Retrieval from the Rendering Engine</h3> When “Copy Image” is invoked, the browser must extract the visual data. Modern layout engines, such as Blink in Chrome, Gecko in Firefox, or WebKit in Safari, do not simply fetch the image from the network cache. While the compressed original bytes might exist in the HTTP disk or memory cache, a rendered image may have been modified by CSS, transformed, or drawn to an HTML5 <code class="language-plaintext highlighter-rouge"><canvas></code>. Instead, the browser’s rendering subsystem extracts the fully decoded bitmap currently residing in memory. 
In Chromium’s Blink engine, images are represented via the <code class="language-plaintext highlighter-rouge">blink::Image</code> abstraction. Specifically, a <code class="language-plaintext highlighter-rouge">BitmapImage</code> (which often wraps an <span class="define" data-term="SkBitmap" data-definition="A class in the Skia graphics library representing a rectangular array of pixels stored in system memory, used for CPU-side image manipulation and rendering."><code class="language-plaintext highlighter-rouge">SkBitmap</code> or <code class="language-plaintext highlighter-rouge">SkImage</code> from the <span class="define" data-term="Skia" data-definition="An open-source 2D graphics library maintained by Google. It serves as the rendering backend for Chrome, Android, Flutter, and many other products, handling text, shapes, and image drawing."><code class="language-plaintext highlighter-rouge">Skia</code> graphics library) contains the raw pixel data. If the browser employs hardware-accelerated compositing, the <code class="language-plaintext highlighter-rouge">SkImage</code> may reside in GPU <span class="define" data-term="VRAM" data-definition="Video Random Access Memory. Dedicated high-bandwidth memory on the graphics card used for storing textures, framebuffers, and other GPU-accessible data. Faster than system RAM for GPU operations but not directly accessible by the CPU."><code class="language-plaintext highlighter-rouge">VRAM</code> as an OpenGL texture or Vulkan buffer. To place this on the CPU-bound OS clipboard, the engine must perform a <span class="define" data-term="GPU readback" data-definition="The process of copying pixel data from GPU video memory (VRAM) back into CPU-accessible system RAM. 
This is an expensive operation because it stalls the GPU pipeline and requires synchronization between the CPU and GPU."><code class="language-plaintext highlighter-rouge">GPU readback</code>, a computationally expensive operation in which pixels are copied from VRAM back into system RAM via <code class="language-plaintext highlighter-rouge">glReadPixels</code> or equivalent APIs, converting the hardware texture back into a software <code class="language-plaintext highlighter-rouge">SkBitmap</code>. <h3 id="generation-of-internal-mime-representations">Generation of Internal MIME Representations</h3> The OS clipboard is entirely format-agnostic; it acts as a generic key-value store where keys are format identifiers and values are binary blobs. To maximize the chance of a successful paste into diverse native applications, a single “Copy Image” action generates several internal representations of the image before they are mapped to OS-specific formats. First, the engine re-encodes the raw <code class="language-plaintext highlighter-rouge">SkBitmap</code> pixel data into a standard compressed format, overwhelmingly <code class="language-plaintext highlighter-rouge">image/png</code>. This re-encoding step is crucial as it ensures a standardized file header and strips out malformed or proprietary data chunks. Second, the browser generates an HTML fragment representing the image, labeled as <code class="language-plaintext highlighter-rouge">text/html</code>. This often embeds the image as a Base64-encoded data URI or provides an <code class="language-plaintext highlighter-rouge"><img></code> tag pointing to the source URL. 
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><meta charset='utf-8'>
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==" alt="Description" width="500" height="300">
</code></pre></div></div> Finally, the absolute URL of the image is provided as <code class="language-plaintext highlighter-rouge">text/plain</code> as a fallback for text-only paste targets. It is worth distinguishing the two copy operations presented to the user. The “Copy Image” command extracts the decoded bitmap, re-encodes it, and places the binary blob on the clipboard alongside HTML and text fallbacks. Conversely, “Copy Image Address” simply extracts the <code class="language-plaintext highlighter-rouge">src</code> attribute from the <code class="language-plaintext highlighter-rouge">DOM</code> node and places it on the clipboard exclusively as <code class="language-plaintext highlighter-rouge">text/plain</code>. <h2 id="inter-process-communication-and-memory-ownership">Inter-Process Communication and Memory Ownership</h2> Because web pages execute in highly restricted, sandboxed “Renderer” processes, they lack the operating system privileges required to interact with the global system clipboard directly. The Renderer must therefore serialize the extracted image and transmit it to the highly privileged “Browser” process. In Chromium, this boundary is crossed using <span class="define" data-term="Mojo" data-definition="Chromium's IPC (Inter-Process Communication) framework. It provides strongly-typed message passing between processes using interface definition language (mojom), replacing the older Chrome IPC system."><code class="language-plaintext highlighter-rouge">Mojo</code>, a lightweight message passing system. 
The Blink pasteboard implementation, specifically <code class="language-plaintext highlighter-rouge">blink::Pasteboard::writeImage</code>, formulates an IPC message historically routed via <code class="language-plaintext highlighter-rouge">ClipboardHostMsg_WriteImage</code> and now managed via strongly typed Mojo interfaces. Image data is inherently large. Passing a multi-megabyte decoded bitmap over a standard UNIX domain socket or named pipe via standard message serialization would introduce massive latency and memory duplication. To circumvent this, Mojo utilizes a structure called <code class="language-plaintext highlighter-rouge">mojo_base.mojom.BigBuffer</code>. When a payload exceeds a specific threshold, <code class="language-plaintext highlighter-rouge">BigBuffer</code> transparently shifts from an inline byte array to a <code class="language-plaintext highlighter-rouge">BigBufferSharedMemoryRegion</code>. The Renderer process requests the OS to allocate an anonymous shared memory segment, writes the encoded PNG bytes into it, and sends merely the file descriptor (or Windows Handle) and size over the Mojo IPC channel. The Browser process maps this shared memory into its own address space, allowing zero-copy transmission of the image payload across the process boundary. Once the Browser process receives this message, the <code class="language-plaintext highlighter-rouge">ClipboardHostImpl</code> verifies the data, manages sequence tokens to prevent race conditions, and interfaces with the OS-specific clipboard APIs. <h3 id="architectural-diagram-browser-process-boundary">Architectural Diagram: Browser Process Boundary</h3> <h2 id="os-specific-clipboard-layer-architecture">OS Specific Clipboard Layer Architecture</h2> Clipboards vary across different OSes. The browser must translate its internal web-standard MIME types into the native data structures expected by macOS, Windows, and Linux to ensure seamless interoperability with native applications. 
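As a rough sketch of this translation step, the table below maps the three web MIME types onto native identifiers for each platform. The <code class="language-plaintext highlighter-rouge">PNG</code>, <code class="language-plaintext highlighter-rouge">CF_DIBV5</code>, <code class="language-plaintext highlighter-rouge">public.png</code>, and <code class="language-plaintext highlighter-rouge">public.html</code> entries are the ones this article discusses; the remaining native names are my assumptions about the conventional identifiers, and <code class="language-plaintext highlighter-rouge">NATIVE_FORMATS</code> and <code class="language-plaintext highlighter-rouge">nativeFormatsFor</code> are hypothetical names for illustration, not Chromium code.

```typescript
// Hypothetical lookup table (not Chromium code): web MIME type -> native
// clipboard identifiers, in the order a browser might offer them.
type Platform = "windows" | "macos" | "x11" | "wayland";
type FormatTable = { [mimeType: string]: string[] };

// X11 atoms and Wayland offers are named after the MIME types themselves.
const MIME_NAMED: FormatTable = {
  "image/png": ["image/png"],
  "text/html": ["text/html"],
  "text/plain": ["text/plain"],
};

const NATIVE_FORMATS: { [P in Platform]: FormatTable } = {
  windows: {
    "image/png": ["PNG", "CF_DIBV5"], // registered PNG format first, then the DIB
    "text/html": ["HTML Format"], // assumption: conventional registered name
    "text/plain": ["CF_UNICODETEXT"], // assumption: standard Unicode text format
  },
  macos: {
    "image/png": ["public.png"],
    "text/html": ["public.html"],
    "text/plain": ["public.utf8-plain-text"], // assumption: plain-text UTI
  },
  x11: MIME_NAMED,
  wayland: MIME_NAMED,
};

// Returns the native identifiers for a MIME type, or an empty list if unmapped.
function nativeFormatsFor(platform: Platform, mimeType: string): string[] {
  const formats = NATIVE_FORMATS[platform][mimeType];
  return formats ? formats.slice() : [];
}

console.log(nativeFormatsFor("windows", "image/png"));
console.log(nativeFormatsFor("macos", "text/html"));
```

A real browser performs this mapping inside platform-specific C++ backends rather than a single table, but the shape of the problem is the same: one web type fans out into one or more native formats.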
<h3 id="windows-win32-clipboard-api">Windows Win32 Clipboard API</h3> On Windows, the clipboard is a shared system resource accessed via the legacy Win32 API. When Chromium’s <code class="language-plaintext highlighter-rouge">ClipboardWin::WriteBitmap</code> executes, it translates the incoming <code class="language-plaintext highlighter-rouge">SkBitmap</code> into <code class="language-plaintext highlighter-rouge">Device Independent Bitmap</code> (<code class="language-plaintext highlighter-rouge">DIB</code>) formats. Windows historically relies on <code class="language-plaintext highlighter-rouge">CF_BITMAP</code> (a GDI handle), <code class="language-plaintext highlighter-rouge">CF_DIB</code>, and <code class="language-plaintext highlighter-rouge">CF_DIBV5</code>. Because standard <code class="language-plaintext highlighter-rouge">CF_DIB</code> does not reliably support alpha channels for transparency, modern browsers write <code class="language-plaintext highlighter-rouge">CF_DIBV5</code>, which includes a <code class="language-plaintext highlighter-rouge">BITMAPV5HEADER</code> specifying color masks, color space information, and alpha values. However, because of long-standing bugs in legacy software, such as Microsoft Office mishandling <code class="language-plaintext highlighter-rouge">CF_DIBV5</code> alpha channels and rendering black backgrounds, browsers also explicitly write a standardized PNG blob. Thus, the Windows clipboard receives both <code class="language-plaintext highlighter-rouge">DIB</code> formats and a raw PNG blob. The order of format registration matters: browsers prioritize the PNG format so that PNG-aware applications select it over the lossy or buggy <code class="language-plaintext highlighter-rouge">DIB</code> representations. 
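To see why the ordering matters, here is a minimal sketch of a consumer that walks the offered formats in order and takes the first one it can decode. The <code class="language-plaintext highlighter-rouge">pickFormat</code> helper and the two example applications are hypothetical; real applications enumerate formats through the Win32 API and are free to apply their own preferences.

```typescript
// Formats in the order the browser placed them on the clipboard (PNG first).
const offered: string[] = ["PNG", "CF_DIBV5", "CF_DIB"];

// Hypothetical consumer logic: take the first offered format we can decode.
function pickFormat(offeredInOrder: string[], supported: string[]): string | undefined {
  for (const format of offeredInOrder) {
    if (supported.indexOf(format) !== -1) {
      return format;
    }
  }
  return undefined;
}

// A PNG-aware application lands on the PNG blob and keeps its alpha channel.
const pngAwareChoice = pickFormat(offered, ["CF_DIB", "PNG"]);

// A legacy application that only knows CF_DIB falls through to the bitmap.
const legacyChoice = pickFormat(offered, ["CF_DIB"]);

console.log(pngAwareChoice, legacyChoice);
```

Under this model, moving PNG to the front changes nothing for legacy consumers but steers every PNG-aware one away from the DIB path.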
<h3 id="macos-nspasteboard">macOS NSPasteboard</h3> Apple’s macOS handles clipboard operations via the <code class="language-plaintext highlighter-rouge">NSPasteboard</code> class, which acts as a client-side Objective-C wrapper around the <code class="language-plaintext highlighter-rouge">pbs</code> (pasteboard server) background daemon. The general pasteboard (<code class="language-plaintext highlighter-rouge">NSPasteboard.generalPasteboard</code>) manages data copying across the system. WebKit and Chromium translate their internal representations into <span class="define" data-term="UTI" data-definition="Uniform Type Identifier. Apple's system for identifying data types using reverse-DNS strings (e.g. 'public.png', 'com.adobe.pdf'). UTIs form a conformance hierarchy, so 'public.png' conforms to 'public.image', which conforms to 'public.data'."><code class="language-plaintext highlighter-rouge">UTIs</code>. An image is registered under <code class="language-plaintext highlighter-rouge">public.png</code> (or <code class="language-plaintext highlighter-rouge">NSPasteboardType.png</code> / <code class="language-plaintext highlighter-rouge">NSPasteboardTypePNG</code>). HTML fallbacks are registered as <code class="language-plaintext highlighter-rouge">public.html</code> or the proprietary Apple Web Archive format. When the browser writes to <code class="language-plaintext highlighter-rouge">NSPasteboard</code>, it packages the image into an <code class="language-plaintext highlighter-rouge">NSPasteboardItem</code>. Unlike Windows, which requires transferring global memory handles, macOS utilizes Mach ports to transfer data to the <code class="language-plaintext highlighter-rouge">pbs</code> daemon’s address space. For extremely large files, macOS supports “promised data” (<code class="language-plaintext highlighter-rouge">NSFilePromiseProvider</code>), where the clipboard merely holds a reference and defers materialization until the drop or paste occurs. 
However, for standard web images, the binary PNG is written directly to the pasteboard using <code class="language-plaintext highlighter-rouge">setData:forType:</code>. <h3 id="linux-x11-selection-model">Linux X11 Selection Model</h3> The X Window System (<code class="language-plaintext highlighter-rouge">X11</code>) does not inherently possess a global “clipboard buffer” that stores binary data, as Windows and macOS do. Instead, <code class="language-plaintext highlighter-rouge">X11</code> relies on “Selections”, specifically the <code class="language-plaintext highlighter-rouge">CLIPBOARD</code> selection, managed according to the <span class="define" data-term="ICCCM" data-definition="Inter-Client Communication Conventions Manual. The X11 specification that defines how X client applications should communicate with each other and the window manager, including clipboard (selection) ownership, data transfer protocols, and session management."><code class="language-plaintext highlighter-rouge">ICCCM</code> standard. When a user copies an image in Firefox or Chrome on <code class="language-plaintext highlighter-rouge">X11</code>, the browser calls <code class="language-plaintext highlighter-rouge">XSetSelectionOwner</code>, claiming ownership of the <code class="language-plaintext highlighter-rouge">CLIPBOARD</code> atom. No image data is transferred to the X server at this point. The browser merely registers itself as the owner. When a user switches to Website B and triggers a paste, the receiving application calls <code class="language-plaintext highlighter-rouge">XConvertSelection</code>. The X server sends a <code class="language-plaintext highlighter-rouge">SelectionRequest</code> event to the owner (the browser process that copied the image). The requesting application first asks for the <code class="language-plaintext highlighter-rouge">TARGETS</code> atom to discover what formats are available. 
The copying browser responds with a list of atoms corresponding to MIME types, such as <code class="language-plaintext highlighter-rouge">image/png</code> and <code class="language-plaintext highlighter-rouge">text/html</code>. Once the receiving app requests <code class="language-plaintext highlighter-rouge">image/png</code>, the copying browser writes the PNG data to a property on the receiving application’s X window using <code class="language-plaintext highlighter-rouge">XChangeProperty</code>. However, the <code class="language-plaintext highlighter-rouge">X11</code> protocol has a maximum request size. For large images, the transfer must be negotiated using the <code class="language-plaintext highlighter-rouge">INCR</code> protocol. The data is chunked, often in 256KB increments, requiring a complex state machine of <code class="language-plaintext highlighter-rouge">SelectionNotify</code> and <code class="language-plaintext highlighter-rouge">PropertyNotify</code> events to stream the image from the sending process to the receiving process memory. <h3 id="linux-wayland-clipboard-protocol">Linux Wayland Clipboard Protocol</h3> Wayland modernizes Linux display architecture by entirely removing the X Server and substituting a secure compositor protocol. Like X11, Wayland lacks a global memory buffer; it is a pure peer-to-peer IPC mechanism mediated by the compositor. When an image is copied, Chromium’s Ozone/Wayland backend creates a <code class="language-plaintext highlighter-rouge">wl_data_source</code> and calls <code class="language-plaintext highlighter-rouge">wl_data_source_offer</code>, indicating to the compositor that it possesses <code class="language-plaintext highlighter-rouge">image/png</code>. The browser then calls <code class="language-plaintext highlighter-rouge">wl_data_device_set_selection</code> to assert ownership. 
When pasting, the receiving application asks for the data by sending a <code class="language-plaintext highlighter-rouge">wl_data_offer.receive</code> request to the compositor, specifying the MIME type and passing a file descriptor (<code class="language-plaintext highlighter-rouge">fd</code>), which is typically the write end of a UNIX pipe. The compositor forwards this file descriptor to the copying browser via a <code class="language-plaintext highlighter-rouge">wl_data_source.send</code> event. The copying browser then writes the raw PNG binary data directly into the file descriptor and closes it. <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Architectural pseudocode: the copying browser's handler for the wl_data_source.send event.
// png_binary_data and png_size stand in for the encoded image held by the browser.
static void data_source_send(void *data, struct wl_data_source *source,
                             const char *mime_type, int32_t fd) {
    // The compositor has forwarded the receiver's pipe; write the PNG bytes into 'fd'.
    // A production handler would check mime_type and loop on partial writes.
    write(fd, png_binary_data, png_size);
    // Closing the file descriptor signals EOF to the receiving application.
    close(fd);
}
</code></pre></div></div> This file-descriptor-passing model provides excellent performance and security, as massive binary blobs are streamed directly through kernel pipes without passing through a middleman server, avoiding memory duplication. <h2 id="pasting-into-website-b-reverse-flow">Pasting into Website B (Reverse Flow)</h2> When the user navigates to Website B and presses <code class="language-plaintext highlighter-rouge">Ctrl+V</code> (or <code class="language-plaintext highlighter-rouge">Cmd+V</code>), the flow reverses, but introduces significant security checkpoints, sanitization requirements, and <code class="language-plaintext highlighter-rouge">DOM</code> API layers. <h3 id="gating-and-security-checks">Gating and Security Checks</h3> Pasting is an inherently dangerous operation. 
A malicious website could silently read the user’s clipboard, stealing passwords or personally identifiable information (<code class="language-plaintext highlighter-rouge">PII</code>) copied from external applications. Consequently, browsers gate clipboard reads behind <span class="define" data-term="transient user activation" data-definition="A browser security concept where certain privileged APIs (clipboard, fullscreen, popups) are only available for a brief window after a genuine user interaction like a click or keypress. This prevents scripts from silently invoking sensitive operations without user intent.">“transient user activation”, a recent interaction such as a physical click or keypress. If the site attempts to read the clipboard programmatically via the Async Clipboard API (<code class="language-plaintext highlighter-rouge">navigator.clipboard.read()</code>), the browser consults the Permissions API. If the clipboard-read permission has not been explicitly granted, the returned Promise stays pending while the browser displays a native permission prompt to the user. <h3 id="receiving-the-paste-event-and-os-ipc">Receiving the Paste Event and OS IPC</h3> Once authorized, the Browser process requests data from the OS clipboard. On Windows, it calls <code class="language-plaintext highlighter-rouge">GetClipboardData</code> for formats like <code class="language-plaintext highlighter-rouge">CF_DIBV5</code> or <code class="language-plaintext highlighter-rouge">PNG</code>. On macOS, it requests data from the <code class="language-plaintext highlighter-rouge">NSPasteboard</code>. On Wayland, it provides a pipe file descriptor to <code class="language-plaintext highlighter-rouge">wl_data_offer_receive</code> and reads the incoming stream. Before this data is allowed back into the sandboxed Renderer process of Website B, it must be aggressively sanitized. 
An OS clipboard could contain a malformed image crafted to exploit vulnerabilities in libraries like <code class="language-plaintext highlighter-rouge">libpng</code> or <code class="language-plaintext highlighter-rouge">libjpeg</code>. Furthermore, an image might contain hidden EXIF metadata, such as GPS coordinates, representing a massive privacy violation if unknowingly pasted into a web form. To mitigate this, the Browser process passes the raw binary blob to a sandboxed utility process. Here, the image is decoded back into an uncompressed bitmap, strictly discarding any metadata, ICC profiles, or malformed chunks. It is then securely re-encoded back into a clean PNG. This sanitized payload is passed via <code class="language-plaintext highlighter-rouge">Mojo BigBuffer</code> to Website B’s Renderer process. <h3 id="dom-paste-event-flow-and-datatransfer">DOM Paste Event Flow and DataTransfer</h3> Inside the Renderer, the JavaScript engine fires a paste event on the active DOM element. The event object (<code class="language-plaintext highlighter-rouge">ClipboardEvent</code>) exposes a <code class="language-plaintext highlighter-rouge">clipboardData</code> property holding a <code class="language-plaintext highlighter-rouge">DataTransfer</code> object. The engine parses the multiple MIME types provided by the OS and exposes them via the <code class="language-plaintext highlighter-rouge">event.clipboardData.items</code> list. This <code class="language-plaintext highlighter-rouge">DataTransfer</code> infrastructure is heavily shared with the HTML5 Drag-and-Drop API, utilizing identical underlying C++ data objects to represent the transferring payload. Because reading heavy binary blobs synchronously would freeze the browser’s main thread, the <code class="language-plaintext highlighter-rouge">DataTransfer</code> object utilizes delayed materialization. 
When a developer loops through <code class="language-plaintext highlighter-rouge">clipboardData.items</code> and calls <code class="language-plaintext highlighter-rouge">item.getAsFile()</code>, the browser instantiates a JavaScript <code class="language-plaintext highlighter-rouge">File</code> (a subclass of <code class="language-plaintext highlighter-rouge">Blob</code>). The backing memory for this <code class="language-plaintext highlighter-rouge">Blob</code> is a pointer to the shared memory or cached byte array established during the IPC phase. Different DOM elements handle the default paste behavior differently: <ul> <li>contenteditable elements: The browser’s editing commands parse the incoming <code class="language-plaintext highlighter-rouge">text/html</code> payload from the clipboard. If an image is present, it generates an <code class="language-plaintext highlighter-rouge"><img></code> tag and attempts to insert it into the DOM. If the image is a raw binary, it may be converted into a Base64 data URI.</li> <li>textarea elements: These inputs accept only plain text. The browser aggressively filters the clipboard, stripping all HTML tags and ignoring binary image blobs, pasting only the fallback <code class="language-plaintext highlighter-rouge">text/plain</code> URL if available.</li> <li><code class="language-plaintext highlighter-rouge"><input type="file"></code> elements: The browser intercepts the paste event and populates the input’s <code class="language-plaintext highlighter-rouge">FileList</code> with the reconstructed <code class="language-plaintext highlighter-rouge">File</code> object, mimicking the behavior of a user manually selecting a file from the disk.</li> </ul> <h3 id="async-clipboard-api-vs-legacy-clipboard">Async Clipboard API vs. 
Legacy Clipboard</h3> The legacy <code class="language-plaintext highlighter-rouge">document.execCommand('paste')</code> and synchronous <code class="language-plaintext highlighter-rouge">ClipboardEvent</code> flow inherently block the main thread. To support modern, rich web applications, browsers have implemented the Async Clipboard API. When <code class="language-plaintext highlighter-rouge">navigator.clipboard.read()</code> is called, it returns a <code class="language-plaintext highlighter-rouge">Promise</code>. The browser engine asynchronously queries the OS clipboard, performs the heavy decoding and sanitization off the main thread, and resolves the <code class="language-plaintext highlighter-rouge">Promise</code> with an array of <code class="language-plaintext highlighter-rouge">ClipboardItem</code> objects. The developer then calls <code class="language-plaintext highlighter-rouge">item.getType('image/png')</code>, which returns a secondary <code class="language-plaintext highlighter-rouge">Promise</code> resolving to the binary <code class="language-plaintext highlighter-rouge">Blob</code>. This completely asynchronous model allows the transfer of multi-megabyte images without degrading UI responsiveness or causing frame drops. <h2 id="full-end-to-end-data-flow">Full End-to-End Data Flow</h2> The following sequence details the complete low-level trace from the initial render on Website A to the final DOM insertion on Website B. <table> <thead> <tr> <th>Phase</th> <th>Component</th> <th>Technical Action / Memory Transition</th> </tr> </thead> <tbody> <tr> <td>1. Trigger</td> <td>Website A (Renderer)</td> <td>User right-clicks and selects “Copy Image”. The browser intercepts the native OS menu command, bypassing JS listeners.</td> </tr> <tr> <td>2. Extraction</td> <td>Layout Engine (Blink/Gecko/WebKit)</td> <td>Decoded bitmap (<code class="language-plaintext highlighter-rouge">SkBitmap</code> or equivalent) is extracted from the render tree. 
If hardware-accelerated, a GPU-to-CPU readback occurs.</td> </tr> <tr> <td>3. Encoding</td> <td>Image Encoder</td> <td>The uncompressed bitmap is synchronously encoded into compressed PNG bytes. HTML and Text fallbacks are generated.</td> </tr> <tr> <td>4. IPC Send</td> <td>IPC Framework (Mojo)</td> <td>The Renderer allocates an anonymous shared memory segment, writes the PNG bytes, and sends a <code class="language-plaintext highlighter-rouge">BigBuffer</code> file descriptor to the Browser Process.</td> </tr> <tr> <td>5. OS Registration</td> <td>OS Clipboard API</td> <td>Browser Process maps the shared memory and registers the data with the OS. Windows: <code class="language-plaintext highlighter-rouge">GlobalAlloc</code> + <code class="language-plaintext highlighter-rouge">SetClipboardData</code>. macOS: <code class="language-plaintext highlighter-rouge">NSPasteboard</code> + <code class="language-plaintext highlighter-rouge">pbs</code>. Linux: Asserts <code class="language-plaintext highlighter-rouge">CLIPBOARD</code> ownership or Wayland <code class="language-plaintext highlighter-rouge">wl_data_device_set_selection</code>.</td> </tr> <tr> <td>Context Switch</td> <td>Operating System</td> <td>The user switches the active window or tab to Website B, transferring application focus.</td> </tr> <tr> <td>6. Trigger Paste</td> <td>Website B (Renderer)</td> <td>User presses <code class="language-plaintext highlighter-rouge">Ctrl+V</code>. The browser initiates a paste sequence, checking for transient user activation to authorize the action.</td> </tr> <tr> <td>7. OS Query</td> <td>OS Clipboard API</td> <td>Browser Process requests data. Windows/Mac: Reads memory handles/ports. Linux Wayland: Provides a UNIX pipe <code class="language-plaintext highlighter-rouge">fd</code> to <code class="language-plaintext highlighter-rouge">wl_data_offer_receive</code> and reads the streamed bytes.</td> </tr> <tr> <td>8. 
Sanitization</td> <td>Utility Process</td> <td>The raw OS binary is decoded into a pixel array, stripping EXIF data, ICC profiles, and malformed chunks to neutralize exploits, then re-encoded into a safe PNG.</td> </tr> <tr> <td>9. IPC Receive</td> <td>IPC Framework (Mojo)</td> <td>The Browser process sends the sanitized PNG via a new <code class="language-plaintext highlighter-rouge">BigBuffer</code> shared memory region to Website B’s Renderer.</td> </tr> <tr> <td>10. DOM Exposure</td> <td>JavaScript Engine (V8/SpiderMonkey)</td> <td>The Renderer constructs a <code class="language-plaintext highlighter-rouge">ClipboardEvent</code>. The <code class="language-plaintext highlighter-rouge">DataTransferItemList</code> is populated. The script invokes <code class="language-plaintext highlighter-rouge">getAsFile()</code>, generating a delayed-materialization JS <code class="language-plaintext highlighter-rouge">Blob</code>.</td> </tr> <tr> <td>11. Application</td> <td>Website B Logic</td> <td>The application reads the <code class="language-plaintext highlighter-rouge">Blob</code>, uploads it via <code class="language-plaintext highlighter-rouge">fetch()</code>, or displays it using <code class="language-plaintext highlighter-rouge">URL.createObjectURL()</code>.</td> </tr> </tbody> </table> <h2 id="cross-browser-architectural-differences">Cross-Browser Architectural Differences</h2> While the general copy-paste pipeline remains conceptually consistent, the internal mechanisms and data structures diverge significantly based on the browser engine architecture. <h3 id="chrome-blink">Chrome (Blink)</h3> Blink prioritizes multi-process security and performance. Its use of <code class="language-plaintext highlighter-rouge">Mojo BigBuffer</code> for memory transfers ensures that IPC bottlenecks are minimized, avoiding redundant memory copying. 
Chromium explicitly manages format prioritization on Windows, placing PNG ahead of <code class="language-plaintext highlighter-rouge">CF_DIBV5</code> to appease applications like Microsoft Word, which possess buggy <code class="language-plaintext highlighter-rouge">CF_DIBV5</code> decoders. Furthermore, Chrome leads the implementation of the Async Clipboard API and recently introduced the unsanitized option to allow specific trusted payloads to bypass the strict image re-encoding step when absolute fidelity is required. <h3 id="firefox-gecko">Firefox (Gecko)</h3> Firefox’s architecture relies on the <code class="language-plaintext highlighter-rouge">nsIClipboard</code> interface. Data is bundled into an <code class="language-plaintext highlighter-rouge">nsITransferable</code> object, which manages various “flavors” (<code class="language-plaintext highlighter-rouge">MIME</code> types). A persistent architectural difference in Firefox is its handling of string encodings over <code class="language-plaintext highlighter-rouge">X11</code>, often utilizing <code class="language-plaintext highlighter-rouge">UTF-16</code>, which has historically caused translation issues with native Java applications expecting <code class="language-plaintext highlighter-rouge">UTF-8</code>. Furthermore, Firefox is highly aggressive in providing <code class="language-plaintext highlighter-rouge">CF_HDROP</code> (file drop) formats alongside standard image bitmaps, making pasted images appear as physical files to certain OS targets, which can improve compatibility with legacy file managers. Firefox also heavily utilizes <code class="language-plaintext highlighter-rouge">kSelectionClipboard</code> to support middle-click paste natively on Linux environments. 
<h3 id="safari-webkit">Safari (WebKit)</h3> WebKit’s pasteboard implementation (<code class="language-plaintext highlighter-rouge">Pasteboard.h</code> and <code class="language-plaintext highlighter-rouge">PlatformPasteboardIOS.mm</code>) is tightly integrated with Cocoa paradigms. It directly translates web types into Apple <code class="language-plaintext highlighter-rouge">UTIs</code>, such as mapping <code class="language-plaintext highlighter-rouge">image/png</code> to <code class="language-plaintext highlighter-rouge">public.png</code> and HTML to Apple Web Archive formats. Because Safari runs predominantly on macOS and iOS, it extensively utilizes <code class="language-plaintext highlighter-rouge">NSItemProvider</code> to handle promised data, interacting deeply with the pasteboard server (<code class="language-plaintext highlighter-rouge">pbs</code>). WebKit handles user activation differently than Blink, requiring developers to resolve <code class="language-plaintext highlighter-rouge">ClipboardItem</code> Promises within a very strict, synchronously triggered scope to prevent security exceptions, addressing specific iOS sandbox constraints. <h2 id="edge-cases-and-protocol-complexities">Edge Cases and Protocol Complexities</h2> The standard copy-paste flow is routinely complicated by edge cases involving web specifications, proprietary media types, and strict privacy boundaries. <h3 id="cross-origin-images-and-cors-implications">Cross-Origin Images and CORS Implications</h3> If Website A embeds an image from a different domain (e.g., <code class="language-plaintext highlighter-rouge">cdn.example.com</code>), the Same-Origin Policy prevents JavaScript from reading the pixels of that image. 
If a script draws a cross-origin image to an HTML5 <code class="language-plaintext highlighter-rouge"><canvas></code>, the canvas becomes “tainted,” and calling <code class="language-plaintext highlighter-rouge">getImageData()</code> or <code class="language-plaintext highlighter-rouge">toBlob()</code> will throw a security exception unless the server provided an <code class="language-plaintext highlighter-rouge">Access-Control-Allow-Origin</code> (<code class="language-plaintext highlighter-rouge">CORS</code>) header. However, the native “Copy Image” context menu is a trusted user action initiated outside of the <code class="language-plaintext highlighter-rouge">DOM</code>’s execution environment. The browser’s internal C++ handlers possess absolute access to the render tree’s memory and can successfully extract the <code class="language-plaintext highlighter-rouge">SkBitmap</code> and write it to the OS clipboard, bypassing <code class="language-plaintext highlighter-rouge">CORS</code> entirely. If Website A wishes to implement a custom “Copy” button using the Async Clipboard API, it must obey <code class="language-plaintext highlighter-rouge">CORS</code> and utilize <code class="language-plaintext highlighter-rouge">crossOrigin="Anonymous"</code> when fetching the image, or the operation will fail. <h3 id="copying-animated-gif-and-webp">Copying Animated GIF and WebP</h3> Animated formats present a severe limitation for OS clipboards. Binary formats like <code class="language-plaintext highlighter-rouge">CF_DIB</code> on Windows or <code class="language-plaintext highlighter-rouge">public.png</code> on macOS are fundamentally designed for static bitmaps. When a user copies an animated GIF via the context menu, the browser typically extracts the currently visible frame from the render tree, encodes it as a static PNG or Bitmap, and places it on the clipboard. Consequently, pasting the GIF into a chat application often results in a static, frozen image. 
To preserve animations, browsers attempt to write the HTML representation (<code class="language-plaintext highlighter-rouge"><img></code>) or file paths (<code class="language-plaintext highlighter-rouge">CF_HDROP</code>), relying on the receiving application to parse the HTML or file reference rather than the raw bitmap. <h3 id="copying-svg-images">Copying SVG Images</h3> Scalable Vector Graphics (<code class="language-plaintext highlighter-rouge">SVG</code>) are mathematically defined paths rather than rasterized pixels. When “Copy Image” is invoked on an <code class="language-plaintext highlighter-rouge"><svg></code> element, the browser cannot easily map it into a generic <code class="language-plaintext highlighter-rouge">CF_DIB</code>. Instead, the browser rasterizes the <code class="language-plaintext highlighter-rouge">SVG</code> to a target resolution, generating a standard PNG pixel buffer, and places that on the clipboard. Alternatively, the raw XML text of the <code class="language-plaintext highlighter-rouge">SVG</code> is placed into the <code class="language-plaintext highlighter-rouge">text/html</code> or <code class="language-plaintext highlighter-rouge">text/plain</code> slots, enabling vector editors like Adobe Illustrator to reconstruct the mathematical paths from the markup. <h3 id="private--incognito-mode-restrictions">Private / Incognito Mode Restrictions</h3> Browsers operate with extreme caution regarding clipboard data in private browsing modes. While data can be copied to the global OS clipboard (as it is the user’s explicit intent), caching the intermediate chunks on disk is strictly prohibited. For massive clipboard transfers (like macOS file promises or Linux Wayland pipe spools) that might ordinarily spill to the filesystem to save memory, the browser must force everything to remain in anonymous volatile memory to ensure no forensic traces survive process termination. 
<h2 id="final-thoughts">Final Thoughts</h2> Most of the time we never notice any of this, and that’s kind of the point. Modern browsers are designed so that these complexities disappear behind simple user interactions. Not bad for something we do dozens of times a day. </article> <article> <h1>How I Built a PostgreSQL SSO Proxy from Scratch</h1> 2026-02-21T00:00:00+00:00 A company where developers and product managers are required to be given access to the production database to edit rows sounds like a compliance nightmare. It was, but that wasn’t the problem I was looking to solve. I wanted to solve the issue of how we used to provide this access to people: we gave them a password for a single database user that everyone used, including the microservices themselves, and that user had a permission set of <code class="language-plaintext highlighter-rouge">GRANT ALL</code> on the entire database. You could argue that each user could have an individual database user created and be handed the password to that, which would solve the issue of audit logging (who did what on the database), but it was just a management nightmare for us as the infra and security team. Hence we were looking for JIT tools like <code class="language-plaintext highlighter-rouge">strongDM</code> or <code class="language-plaintext highlighter-rouge">Teleport</code> which would solve these issues, but the cost of acquiring such a tool at our scale was going to be at least 100k USD per annum, which was not something I was comfortable asking my CTO to spend. That’s when the idea of building an RDS proxy integrated with SSO came into the picture. This would allow us to give access to the databases via corporate email addresses only, with detailed audit logging of the queries run on the database. 
This blog goes into detail on implementing this proxy in <code class="language-plaintext highlighter-rouge">Go</code> from scratch, and hopefully helps you understand the fundamentals of how PostgreSQL works along the way. The basic features I was aiming to build for my proxy were: <ol> <li>Connection pooling</li> <li>SSL/TLS support</li> <li>SSO auth with Azure AD via Auth0</li> <li>Auditing and observability</li> </ol> The first step in building a proxy is to expose it as a server on a particular port (<code class="language-plaintext highlighter-rouge">7777</code>) actively listening for client connections. Each accepted connection is handed to its own goroutine for processing. The proxy then has to establish a connection to the actual PostgreSQL database on the client’s behalf. To understand how this happens exactly, we need to understand the communication protocol used by PostgreSQL. <h2 id="postgresql-wire-protocol-frontendbackend-protocol">PostgreSQL wire protocol (Frontend/Backend Protocol)</h2> PostgreSQL uses a message-based protocol for communication between frontends and backends (clients and servers). The protocol runs over TCP/IP (and Unix-domain sockets), with <code class="language-plaintext highlighter-rouge">5432</code> as the default port. In order to serve multiple clients efficiently, the server launches a new “backend” process for each client. In the current implementation, a new child process is created immediately after an incoming connection is detected. This is transparent to the protocol, however. For purposes of the protocol, the terms “backend” and “server” are interchangeable; likewise “frontend” and “client” are interchangeable. 
Every PostgreSQL message has this format: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Message Type (1 byte)] [Length (4 bytes)] [Message Body (Length-4 bytes)] </code></pre></div></div> <ul> <li>Message type: single character identifying the message</li> <li>Length: 32-bit int</li> <li>Message body: the actual data</li> </ul> At the proxy level, the messages are referred to like this: <ul> <li>Frontend messages — sent by the client; the proxy intercepts these and sends them to the DB</li> <li>Backend messages — sent by the server; the proxy intercepts these from the DB and sends them to the user</li> </ul> In our implementation, we use <a href="https://github.com/jackc/pgproto3"><code class="language-plaintext highlighter-rouge">pgproto3</code></a>, which is the encoder and decoder of the PostgreSQL wire protocol version 3. <h3 id="startup-message">Startup message</h3> The Startup message in the PostgreSQL wire protocol is the very first message sent when a PostgreSQL connection is established, and it is special enough that it has to be handled separately because: <ol> <li>It has no message type.</li> <li>It’s always the first message in any PostgreSQL connection.</li> <li>It contains connection parameters such as username and database name.</li> </ol> Flow: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Client -> StartupMessage -> Proxy -> StartupMessage -> Database </code></pre></div></div> Now that we have an understanding of the protocol, we can move ahead with the message flow. Once the connection is made to the proxy, a new <code class="language-plaintext highlighter-rouge">pgproto3.Backend</code> wrapping the raw TCP connection is created and the first call is <code class="language-plaintext highlighter-rouge">pgconn.ReceiveStartupMessage()</code>. The PostgreSQL startup message, as mentioned above, has no message type byte. 
Its format is <code class="language-plaintext highlighter-rouge">[length:4 bytes][protocol_version:4 bytes][parameters]</code>. <code class="language-plaintext highlighter-rouge">pgproto3</code> handles this by reading the 4-byte length, then the 4-byte code that follows it, and checking whether that code is the SSL request magic number (<code class="language-plaintext highlighter-rouge">80877103</code>), the cancel request magic (<code class="language-plaintext highlighter-rouge">80877102</code>), or a protocol version (<code class="language-plaintext highlighter-rouge">196608</code> for protocol 3.0), and returning the appropriate concrete type. The proxy handles three possible message types at this point, each in its own branch. <h4 id="case-a-sslrequest-pgproto3sslrequest">Case A: <code class="language-plaintext highlighter-rouge">SSLRequest</code> (<code class="language-plaintext highlighter-rouge">*pgproto3.SSLRequest</code>)</h4> The client sends this request before the real startup message when it wants to establish TLS. The proxy must respond with a single byte before the client proceeds. <ol> <li> If TLS is not configured: the proxy sends the single byte <code class="language-plaintext highlighter-rouge">N</code> (ASCII 78), indicating to the client that TLS is unavailable. A client that allows plaintext then sends the real <code class="language-plaintext highlighter-rouge">StartupMessage</code> on the same connection; a client that requires TLS gives up instead. </li> <li> If TLS is configured: the proxy sends <code class="language-plaintext highlighter-rouge">S</code> (ASCII 83), indicating TLS is accepted, wraps the raw <code class="language-plaintext highlighter-rouge">net.Conn</code> in a <code class="language-plaintext highlighter-rouge">tls.Conn</code> using the server’s certificate, and performs the TLS handshake. 
</li> </ol> Once the handshake completes, <code class="language-plaintext highlighter-rouge">pc.conn</code> is replaced with the TLS connection and a new <code class="language-plaintext highlighter-rouge">pgproto3.Backend</code> is built over it, so everything that follows (password and queries) is encrypted. After either branch completes successfully, the client sends the <code class="language-plaintext highlighter-rouge">StartupMessage</code> to the proxy. <h4 id="case-b-startupmessage-pgproto3startupmessage">Case B: <code class="language-plaintext highlighter-rouge">StartupMessage</code> (<code class="language-plaintext highlighter-rouge">*pgproto3.StartupMessage</code>)</h4> The <code class="language-plaintext highlighter-rouge">StartupMessage</code> contains <code class="language-plaintext highlighter-rouge">user</code>, <code class="language-plaintext highlighter-rouge">database</code>, <code class="language-plaintext highlighter-rouge">application_name</code>, <code class="language-plaintext highlighter-rouge">client_encoding</code>, and any other parameters the client sends. When the proxy receives this, it does not forward it to the PostgreSQL server yet. It instead replies with an <code class="language-plaintext highlighter-rouge">AuthenticationCleartextPassword</code>, just as a PostgreSQL server would. <h4 id="case-c-cancelrequest-pgproto3cancelrequest">Case C: <code class="language-plaintext highlighter-rouge">CancelRequest</code> (<code class="language-plaintext highlighter-rouge">*pgproto3.CancelRequest</code>)</h4> Cancel requests, when the proxy has them enabled, arrive on an entirely separate TCP connection. A client that receives <code class="language-plaintext highlighter-rouge">SIGINT</code> (for example, when the user presses <code class="language-plaintext highlighter-rouge">Ctrl+C</code>) opens a new connection to the proxy’s port and immediately sends a 16-byte cancel message without any SSL negotiation. 
Structure: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Byte offset   Size      Value              Meaning
----------------------------------------------------------
0 - 3         4 bytes   0x00000010 (16)    Total message length
4 - 7         4 bytes   0x04D2162E         Cancel magic number (80877102)
8 - 11        4 bytes   <ProcessID>        The backend PID to cancel
12 - 15       4 bytes   <SecretKey>        The secret key for that PID
</code></pre></div></div> Inside the proxy, this message must be assembled manually each time, like this: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>buf := make([]byte, 16)
// Message length - 16 bytes
binary.BigEndian.PutUint32(buf[0:4], 16)
// Cancel request code - 80877102
binary.BigEndian.PutUint32(buf[4:8], 80877102)
// Process ID
binary.BigEndian.PutUint32(buf[8:12], cancel.ProcessID)
// Secret key
binary.BigEndian.PutUint32(buf[12:16], cancel.SecretKey)
</code></pre></div></div> <code class="language-plaintext highlighter-rouge">ProcessID</code> is the PID of the backend process that PostgreSQL forked to handle the connection. <code class="language-plaintext highlighter-rouge">SecretKey</code> is generated by PostgreSQL within the backend process when a new client connection is established. When PostgreSQL forks a dedicated backend process to handle the connection, it creates a random 32-bit integer and associates it with that process’s PID. The <code class="language-plaintext highlighter-rouge">SecretKey</code> exists only for the purpose of query cancellation. 
PostgreSQL then sends both the PID and the <code class="language-plaintext highlighter-rouge">SecretKey</code> to the connected client in a <code class="language-plaintext highlighter-rouge">BackendKeyData</code> message: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────┬──────────────┬──────────────┬──────────────┐
│ 'K' (1 byte) │ Length       │ ProcessID    │ SecretKey    │
│ type byte    │ (4 bytes)    │ (4 bytes)    │ (4 bytes)    │
└──────────────┴──────────────┴──────────────┴──────────────┘
</code></pre></div></div> Based on the <code class="language-plaintext highlighter-rouge">ProcessID</code> and <code class="language-plaintext highlighter-rouge">SecretKey</code>, the proxy must now identify the active connection and cancel it. In my proxy, we store the <code class="language-plaintext highlighter-rouge">activeConnections</code> in a map keyed by <code class="language-plaintext highlighter-rouge">uint64(ProcessID)<<32 | uint64(SecretKey)</code>, which packs both 32-bit values into a single 64-bit map key. (Each value has to be widened to <code class="language-plaintext highlighter-rouge">uint64</code> before the shift; shifting a <code class="language-plaintext highlighter-rouge">uint32</code> left by 32 bits yields zero.) If the connection is found to be active, the cancel request flow is called, which opens a new TCP connection to the database rather than borrowing one from the connection pool. The raw 16-byte binary cancel message is sent and the connection is closed immediately. PostgreSQL receives this, validates the PID/<code class="language-plaintext highlighter-rouge">SecretKey</code> against its own backend process table, and sends <code class="language-plaintext highlighter-rouge">SIGINT</code> to the matching backend process, which aborts the in-flight query and returns an <code class="language-plaintext highlighter-rouge">ErrorResponse</code> with code <code class="language-plaintext highlighter-rouge">57014</code> to the client via the existing pooled connection. 
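To make the cancel path concrete, here is a minimal sketch of the key packing and the 16-byte packet for protocol 3.0. The helper names (<code class="language-plaintext highlighter-rouge">cancelKey</code>, <code class="language-plaintext highlighter-rouge">encodeCancelRequest</code>, <code class="language-plaintext highlighter-rouge">decodeCancelRequest</code>) and the string-valued registry are assumptions for illustration, not the proxy's actual code.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// cancelRequestCode is the CancelRequest magic number (1234<<16 | 5678).
const cancelRequestCode = 80877102

// cancelKey packs a backend PID and secret key into one 64-bit map key.
// Both values are widened before shifting so the PID lands in the high 32 bits.
func cancelKey(processID, secretKey uint32) uint64 {
	return uint64(processID)<<32 | uint64(secretKey)
}

// encodeCancelRequest builds the 16-byte CancelRequest packet.
func encodeCancelRequest(processID, secretKey uint32) []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], 16)
	binary.BigEndian.PutUint32(buf[4:8], cancelRequestCode)
	binary.BigEndian.PutUint32(buf[8:12], processID)
	binary.BigEndian.PutUint32(buf[12:16], secretKey)
	return buf
}

// decodeCancelRequest validates a raw packet and returns its PID and secret key.
func decodeCancelRequest(buf []byte) (processID, secretKey uint32, err error) {
	if len(buf) != 16 || binary.BigEndian.Uint32(buf[0:4]) != 16 {
		return 0, 0, errors.New("cancel request must be exactly 16 bytes")
	}
	if binary.BigEndian.Uint32(buf[4:8]) != cancelRequestCode {
		return 0, 0, errors.New("not a cancel request")
	}
	return binary.BigEndian.Uint32(buf[8:12]), binary.BigEndian.Uint32(buf[12:16]), nil
}

func main() {
	// Hypothetical registry: the string stands in for per-connection state.
	activeConnections := map[uint64]string{
		cancelKey(12345, 98765): "pooled connection for alice",
	}

	pid, key, err := decodeCancelRequest(encodeCancelRequest(12345, 98765))
	if err != nil {
		panic(err)
	}
	conn, ok := activeConnections[cancelKey(pid, key)]
	fmt.Println(conn, ok)
}
```

Looking up the packed key is a single map access, so a cancel for an unknown pair can be rejected before any connection to the database is opened.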
<h2 id="authentication-inside-the-proxy">Authentication inside the proxy</h2> Now that the <code class="language-plaintext highlighter-rouge">StartupMessage</code> has been received, it’s time for the user authentication part of the proxy. I split it into three sequential phases: <ol> <li>Open a temporary backend connection - A raw TCP connection to PostgreSQL, not from the pool, is made with the sole purpose of authentication.</li> <li> Request a password from the client - The proxy sends <code class="language-plaintext highlighter-rouge">AuthenticationCleartextPassword</code> to the client. From the client’s perspective, the proxy is behaving like a PostgreSQL server requesting a password. The client sends back a <code class="language-plaintext highlighter-rouge">PasswordMessage</code> containing whatever was in <code class="language-plaintext highlighter-rouge">PGPASSWORD</code> or whatever was entered interactively. PostgreSQL generally uses SCRAM-SHA-256 or MD5, but the proxy here always asks the client for cleartext. </li> <li>Determine the authentication flow - The reason for getting the password as cleartext is for the proxy to inspect it and decide whether the password sent is a JWT or a normal password.</li> </ol> JWT detection: check for the <code class="language-plaintext highlighter-rouge">eyJ</code> prefix (the base64url encoding of a JSON object that opens with <code class="language-plaintext highlighter-rouge">{"</code> followed by a letter, as every JWT header does) and exactly 2 dots (the three-part JWT structure <code class="language-plaintext highlighter-rouge">header.payload.signature</code>). It is a simple heuristic: it matches the compact signed JWTs used here, but it would not match encrypted JWTs (JWE), which have five parts. <h3 id="traditional-password-flow">Traditional password flow</h3> The username the client sent is used along with the password. It’s basically a fallback for when someone configures a real PostgreSQL user in the proxy and connects with a traditional password. 
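The routing decision between the two flows can be sketched in a few lines. The function name <code class="language-plaintext highlighter-rouge">looksLikeJWT</code> and the sample tokens are assumptions for illustration; the check only looks at the shape of the string and says nothing about whether the token is valid.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeJWT reports whether a password has the shape of a compact signed JWT:
// it starts with "eyJ" and contains exactly two dots (header.payload.signature).
// Signature and claim validation happen separately, after this check.
func looksLikeJWT(password string) bool {
	return strings.HasPrefix(password, "eyJ") && strings.Count(password, ".") == 2
}

func main() {
	fmt.Println(looksLikeJWT("eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhbGljZSJ9.c2ln")) // true
	fmt.Println(looksLikeJWT("correct-horse-battery-staple"))                   // false
	fmt.Println(looksLikeJWT("eyJ.only-one-dot"))                               // false
}
```

A regular password that happens to match this shape would be routed to the JWT path and rejected there, which is the safer direction for the heuristic to be wrong in.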
<h3 id="inside-jwt-validation">Inside JWT validation</h3> <ol> <li><code class="language-plaintext highlighter-rouge">jwt.Parse</code> is called with a key function. The key function: <ul> <li>Checks <code class="language-plaintext highlighter-rouge">t.Method.Alg()</code> == <code class="language-plaintext highlighter-rouge">"RS256"</code> — rejects anything else</li> <li>Extracts <code class="language-plaintext highlighter-rouge">kid</code> (Key ID) from the token header</li> <li>Calls <code class="language-plaintext highlighter-rouge">v.getPublicKey(kid)</code></li> </ul> </li> <li><code class="language-plaintext highlighter-rouge">getPublicKey(kid)</code> — this is where JWKS caching happens: <ul> <li>Acquires <code class="language-plaintext highlighter-rouge">RLock</code>, checks if <code class="language-plaintext highlighter-rouge">kid</code> exists in <code class="language-plaintext highlighter-rouge">v.publicKeys</code> and <code class="language-plaintext highlighter-rouge">time.Since(v.lastKeysFetch) < 1 hour</code></li> <li>Cache hit: releases RLock, returns the key — no network call</li> <li>Cache miss: releases RLock, acquires full <code class="language-plaintext highlighter-rouge">Lock</code> (write), double-checks again (another goroutine may have fetched while waiting), then calls <code class="language-plaintext highlighter-rouge">fetchJWKS()</code></li> <li><code class="language-plaintext highlighter-rouge">fetchJWKS()</code> makes a GET to <code class="language-plaintext highlighter-rouge">https://<AUTH0_TENANT>/.well-known/jwks.json</code> with a 10-second timeout, filters for <code class="language-plaintext highlighter-rouge">kty=RSA, use=sig</code>, decodes base64url modulus <code class="language-plaintext highlighter-rouge">N</code> and exponent <code class="language-plaintext highlighter-rouge">E</code>, constructs <code class="language-plaintext highlighter-rouge">*rsa.PublicKey</code> objects, stores them all in <code 
class="language-plaintext highlighter-rouge">v.publicKeys</code> keyed by <code class="language-plaintext highlighter-rouge">kid</code>, updates <code class="language-plaintext highlighter-rouge">v.lastKeysFetch</code></li> <li>Fetch failure with stale keys: if the JWKS endpoint is down but old keys exist, logs a warning and returns the stale key — this is the graceful degradation path</li> <li>Fetch failure, no keys: returns error</li> </ul> </li> <li> <code class="language-plaintext highlighter-rouge">jwt.Parse</code> verifies the RS256 signature using the public key returned from the key function. If the signature is invalid, it returns an error. </li> <li>Manual claim validation (after signature passes): <ul> <li><code class="language-plaintext highlighter-rouge">iss</code> claim == configured issuer — exact string match</li> <li><code class="language-plaintext highlighter-rouge">aud</code> claim == configured audience — handles both <code class="language-plaintext highlighter-rouge">string</code> and <code class="language-plaintext highlighter-rouge">[]interface{}</code> types (Auth0 can send either)</li> <li><code class="language-plaintext highlighter-rouge">email</code> claim — must be present and non-empty</li> <li><code class="language-plaintext highlighter-rouge">sub</code> claim — must be present and non-empty</li> <li><code class="language-plaintext highlighter-rouge">exp</code> claim — <code class="language-plaintext highlighter-rouge">time.Now().After(oauthContext.ExpiresAt)</code> — double-check (jwt.Parse also checks this but the manual check is explicit)</li> </ul> </li> <li>Role extraction from <code class="language-plaintext highlighter-rouge">extractRoles()</code>: tries <code class="language-plaintext highlighter-rouge">claims["role"]</code> first, then <code class="language-plaintext highlighter-rouge">claims["roles"]</code> — handles both singular and plural claim names. 
Each can be a <code class="language-plaintext highlighter-rouge">string</code> or <code class="language-plaintext highlighter-rouge">[]interface{}</code>.</li> </ol> This validation process returns the email, role, and expiry time for the token, which is then used to map the role to a service account configured in the proxy. If no role matches, it ends up using the default role, which has read-only access. Service accounts are basically users configured in the PostgreSQL database that the proxy uses to connect to the database, since the SSO-returned user does not actually exist inside PostgreSQL. This also ensures we don’t have to create a PostgreSQL user for every user logging into the database, and the same goes for deletion as well. If a user is removed from Active Directory, they automatically do not have access to the database anymore. <h2 id="authentication-with-postgresql">Authentication with PostgreSQL</h2> Now that the proxy has authenticated and authorised the incoming SSO user, it now needs to connect this user/client to the actual PostgreSQL database (backend). This is done in the same way by sending a <code class="language-plaintext highlighter-rouge">StartupMessage</code> to PostgreSQL via a temporary connection, with one small change: the <code class="language-plaintext highlighter-rouge">user</code> field is replaced with the service account username before being sent to PostgreSQL. PostgreSQL never sees the original <code class="language-plaintext highlighter-rouge">user@email.com</code> that the client sent. The proxy now enters a loop reading messages from PostgreSQL. This part is referred to as <a href="https://www.postgresql.org/docs/current/sasl-authentication.html">SASL authentication</a> in the PostgreSQL protocol. <ol> <li> To begin a SASL authentication exchange, the PostgreSQL server sends an <code class="language-plaintext highlighter-rouge">AuthenticationSASL</code> message. 
It includes a list of SASL authentication mechanisms that the server can accept, in the server’s preferred order. In practice this list contains SCRAM-SHA-256 (plus SCRAM-SHA-256-PLUS when channel binding over TLS is available); MD5 and cleartext are not SASL mechanisms and arrive as their own authentication requests instead. </li> <li> The proxy selects the first mechanism in the list that it supports, and sends a <code class="language-plaintext highlighter-rouge">SASLInitialResponse</code> message to the server. If <code class="language-plaintext highlighter-rouge">AuthenticationSASL</code> lists SCRAM-SHA-256, the proxy instantiates an <code class="language-plaintext highlighter-rouge">xdg-go/scram</code> SHA-256 client acting on behalf of the service account. The library is used to perform the full cryptographic exchange using the service account’s password. PostgreSQL never sees the JWT; it only sees the service account performing standard SCRAM. SCRAM-SHA-256 is a challenge-response protocol made up of four messages: client-first, server-first, client-final, and server-final. The proxy first sends the <code class="language-plaintext highlighter-rouge">SASLInitialResponse</code> and, with the help of the <code class="language-plaintext highlighter-rouge">scram</code> library, starts the conversation by calling <code class="language-plaintext highlighter-rouge">Step("")</code> with an empty string, which means “generate the client-first message” (the opening move of SCRAM). 
</li> </ol> The above in code looks something like this: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>client, err := scram.SHA256.NewClient(username, password, "")
if err != nil {
    return logger.Errorf("failed to create SCRAM client: %w", err)
}
scramConversation = client.NewConversation()
initialResponse, err := scramConversation.Step("")
if err != nil {
    return logger.Errorf("SCRAM initial step failed: %w", err)
}
logger.Debug("sending SCRAM initial response to backend")
err = frontend.Send(&pgproto3.SASLInitialResponse{
    AuthMechanism: "SCRAM-SHA-256",
    Data:          []byte(initialResponse),
})
if err != nil {
    return logger.Errorf("failed to send SASL initial response: %w", err)
}
</code></pre></div></div> The final payload is a structured ASCII string. It looks like: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n,,n=gprxy_admin,r=fyko+d2lbbFgONRv9qkxdawL
</code></pre></div></div> Breaking this down character by character: <table> <thead> <tr> <th>Part</th> <th>Value</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td>n,,</td> <td>n,,</td> <td>GS2 header. <code class="language-plaintext highlighter-rouge">n</code> = no channel binding. <code class="language-plaintext highlighter-rouge">,,</code> = no authzid</td> </tr> <tr> <td>n=</td> <td>n=gprxy_admin</td> <td>The username (the <code class="language-plaintext highlighter-rouge">n=</code> attribute)</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">,</code></td> <td><code class="language-plaintext highlighter-rouge">,</code></td> <td>Separator</td> </tr> <tr> <td>r=</td> <td>r=fyko+d2lbbFgONRv9qkxdawL</td> <td>Client nonce</td> </tr> </tbody> </table> <code class="language-plaintext highlighter-rouge">authzid</code> (Authorization Identity) is the SASL mechanism component defining the user identity that a client wants to act as. 
Client nonce: a cryptographically random printable string generated fresh for this authentication. <ol> <li>PostgreSQL responds with the <code class="language-plaintext highlighter-rouge">AuthenticationSASLContinue</code> server-first message, a challenge. The values in this message are what tie the proof to this one exchange, so a captured proof cannot be replayed later.</li> </ol> The payload contains <code class="language-plaintext highlighter-rouge">r=<combined_nonce>,s=<salt>,i=<iterations></code> <ul> <li>Combined nonce: client nonce + server nonce appended together. PostgreSQL echoes back the client nonce and appends its own random suffix. The client must verify the prefix matches what it sent.</li> <li>Salt: a random base64-encoded value stored in <code class="language-plaintext highlighter-rouge">pg_authid</code> alongside the user’s password hash. Different for every user.</li> <li>Iteration count: the number of PBKDF2 iterations used to derive the key. Higher = more expensive to brute-force. PostgreSQL defaults to 4096.</li> </ul> <ol> <li>The proxy responds with the client-final message, containing the client proof as <code class="language-plaintext highlighter-rouge">SASLResponse</code>.</li> </ol> To build that response, the proxy first has to compute the client proof, for which it calls <code class="language-plaintext highlighter-rouge">scramConversation.Step(serverFirstMessage)</code>. 
The following cryptographic computations are done: <ol> <li><code class="language-plaintext highlighter-rouge">SaltedPassword = PBKDF2(SHA-256, password, salt, iterations, 32)</code></li> <li><code class="language-plaintext highlighter-rouge">ClientKey = HMAC-SHA-256(SaltedPassword, "Client Key")</code></li> <li><code class="language-plaintext highlighter-rouge">StoredKey = SHA-256(ClientKey)</code></li> <li><code class="language-plaintext highlighter-rouge">AuthMessage = client-first-message-bare + "," + server-first-message + "," + client-final-message-without-proof</code></li> <li><code class="language-plaintext highlighter-rouge">ClientSignature = HMAC-SHA-256(StoredKey, AuthMessage)</code></li> <li><code class="language-plaintext highlighter-rouge">ClientProof = ClientKey XOR ClientSignature</code></li> </ol> The password never travels on the wire. Only <code class="language-plaintext highlighter-rouge">ClientProof</code> does — a value that proves you know the password without revealing it. The final payload that is sent to the server looks like this: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c=biws,r=fyko+d2lbbFgONRv9qkxdawL3rfcNHYJY1ZVvWVs7j,p=dHzbZapWIk4jUhN+Ute9ytag9zjfMHgsqmmiz9AndVQ= </code></pre></div></div> <table> <thead> <tr> <th>Attribute</th> <th>Example value</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td>c=</td> <td>biws</td> <td>Channel binding data <code class="language-plaintext highlighter-rouge">biws</code> is the base64 of <code class="language-plaintext highlighter-rouge">"n,,"</code> (the GS2 header from the initial message). 
Since gprxy uses no channel binding, this is always <code class="language-plaintext highlighter-rouge">biws</code>.</td> </tr> <tr> <td>r=</td> <td>fyko+d2lbbFgONRv9qkxdawL3rfcNHYJY1ZVvWVs7j</td> <td>The full combined nonce echoed back exactly as received from the server.</td> </tr> <tr> <td>p=</td> <td>dHzbZapWIk4jUhN+Ute9ytag9zjfMHgsqmmiz9AndVQ=</td> <td>The ClientProof: the XOR of <code class="language-plaintext highlighter-rouge">ClientKey</code> and <code class="language-plaintext highlighter-rouge">ClientSignature</code>, base64-encoded. This is the proof of knowledge.</td> </tr> </tbody> </table> <ol> <li> PostgreSQL receives the above payload and does the below computations before sending the final server message <code class="language-plaintext highlighter-rouge">AuthenticationSASLFinal</code>. </li> <li>Verifies that <code class="language-plaintext highlighter-rouge">r=</code> is the combined nonce it issued, which still starts with the client nonce it saw earlier.</li> <li>Rebuilds the same <code class="language-plaintext highlighter-rouge">AuthMessage</code> on its side from the messages exchanged so far.</li> <li>Reads the role’s <code class="language-plaintext highlighter-rouge">StoredKey</code> from <code class="language-plaintext highlighter-rouge">pg_authid</code>.</li> <li>Verifies the proof: it recovers <code class="language-plaintext highlighter-rouge">ClientKey</code> as <code class="language-plaintext highlighter-rouge">ClientProof XOR HMAC-SHA-256(StoredKey, AuthMessage)</code> and checks that <code class="language-plaintext highlighter-rouge">SHA-256(ClientKey)</code> equals <code class="language-plaintext highlighter-rouge">StoredKey</code>, so it never needs the password or a stored <code class="language-plaintext highlighter-rouge">ClientKey</code>.</li> </ol> PostgreSQL then computes a <code class="language-plaintext highlighter-rouge">ServerSignature</code> over the same <code class="language-plaintext highlighter-rouge">AuthMessage</code>. This is the mutual authentication step: it shows the proxy that the server holds the verifier derived from the service account’s password, which guards against a rogue server impersonating the database. 
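The client-proof steps listed earlier and the server-side check can be sketched with the standard library alone. This is an illustration under stated assumptions, not what <code class="language-plaintext highlighter-rouge">xdg-go/scram</code> does internally: the function names, the salt, and the <code class="language-plaintext highlighter-rouge">authMessage</code> string are placeholders, and PBKDF2 is written out by hand for the single 32-byte block that SCRAM-SHA-256 needs.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

func hmacSHA256(key, msg []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(msg)
	return m.Sum(nil)
}

// xorBytes returns a XOR b; both inputs are 32 bytes in this sketch.
func xorBytes(a, b []byte) []byte {
	out := make([]byte, len(a))
	for i := range a {
		out[i] = a[i] ^ b[i]
	}
	return out
}

// saltedPassword is PBKDF2-HMAC-SHA-256 with a 32-byte output, which needs only
// the first PBKDF2 block: U1 = HMAC(password, salt || INT(1)), Ui = HMAC(password, Ui-1).
func saltedPassword(password string, salt []byte, iterations int) []byte {
	u := hmacSHA256([]byte(password), append(append([]byte{}, salt...), 0, 0, 0, 1))
	result := append([]byte{}, u...)
	for i := 1; i < iterations; i++ {
		u = hmacSHA256([]byte(password), u)
		for j := range result {
			result[j] ^= u[j]
		}
	}
	return result
}

// clientProof runs the client-side steps and also returns StoredKey,
// the value the server keeps instead of the password.
func clientProof(password string, salt []byte, iterations int, authMessage string) (proof, storedKey []byte) {
	salted := saltedPassword(password, salt, iterations)
	clientKey := hmacSHA256(salted, []byte("Client Key"))
	sk := sha256.Sum256(clientKey)
	clientSignature := hmacSHA256(sk[:], []byte(authMessage))
	return xorBytes(clientKey, clientSignature), sk[:]
}

// serverAccepts mirrors the server-side check: recover ClientKey from the proof
// and compare its hash with the stored key.
func serverAccepts(proof, storedKey []byte, authMessage string) bool {
	clientSignature := hmacSHA256(storedKey, []byte(authMessage))
	recovered := sha256.Sum256(xorBytes(proof, clientSignature))
	return hmac.Equal(recovered[:], storedKey)
}

func main() {
	salt := []byte("example-salt-16b")                                   // placeholder salt
	authMessage := "n=gprxy_admin,r=abc,r=abcdef,s=c2FsdA==,i=4096,c=biws,r=abcdef" // placeholder
	proof, storedKey := clientProof("s3cret", salt, 4096, authMessage)
	fmt.Println("p=" + base64.StdEncoding.EncodeToString(proof))
	fmt.Println(serverAccepts(proof, storedKey, authMessage))
}
```

The point of the XOR is visible in <code class="language-plaintext highlighter-rouge">serverAccepts</code>: the server can undo it with nothing more than <code class="language-plaintext highlighter-rouge">StoredKey</code> and the transcript, while an eavesdropper who sees only the proof learns neither the password nor <code class="language-plaintext highlighter-rouge">ClientKey</code>.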
The final payload that is sent back to the proxy (the SCRAM client in this exchange) looks like this: <code class="language-plaintext highlighter-rouge">v=<ServerSignature></code> <ol> <li>The proxy calls <code class="language-plaintext highlighter-rouge">scramConversation.Step(serverFinalMessage)</code>:</li> </ol> This internally computes the expected <code class="language-plaintext highlighter-rouge">ServerSignature</code> using its copy of <code class="language-plaintext highlighter-rouge">SaltedPassword</code> and the <code class="language-plaintext highlighter-rouge">AuthMessage</code>, then compares it to <code class="language-plaintext highlighter-rouge">v=</code> from the server. If they don’t match, the proxy is talking to a server that does not hold the right credentials, and it should abort. <ol> <li> Along with <code class="language-plaintext highlighter-rouge">AuthenticationSASLFinal</code>, the server also sends an <code class="language-plaintext highlighter-rouge">AuthenticationOk</code> message. The proxy passes this along to the client, which tells it that authentication succeeded. </li> <li> PostgreSQL also sends several <code class="language-plaintext highlighter-rouge">ParameterStatus</code> messages immediately after <code class="language-plaintext highlighter-rouge">AuthenticationOk</code>: </li> </ol> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>server_version = 16.1
client_encoding = UTF8
server_encoding = UTF8
DateStyle = ISO, MDY
TimeZone = UTC
integer_datetimes = on
...
</code></pre></div></div> Each is forwarded to the client unchanged. The client caches these for the session. <ol> <li><code class="language-plaintext highlighter-rouge">BackendKeyData</code> and <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> are sent to the proxy by PostgreSQL, but these are never relayed to the client and the temporary connection is then terminated.</li> </ol> The reason is that this whole authentication exchange ran over a temporary connection that no longer exists afterwards. 
Relaying that connection’s <code class="language-plaintext highlighter-rouge">BackendKeyData</code>, whose <code class="language-plaintext highlighter-rouge">(PID, SecretKey)</code> pair exists mainly for cancelling requests, would lead to one of these three scenarios: <ol> <li> Temp connection PID is already dead - The temp connection is closed immediately after auth. Its backend PostgreSQL process (<code class="language-plaintext highlighter-rouge">PID=12345</code>) is gone. The client presses <code class="language-plaintext highlighter-rouge">Ctrl+C</code>. The proxy receives <code class="language-plaintext highlighter-rouge">(PID=12345, SK=98765)</code>. It looks in <code class="language-plaintext highlighter-rouge">activeConnections</code> and finds nothing registered with that pair. The cancel is silently dropped. The query keeps running until it finishes on its own. </li> <li> Temp connection PID is registered but wrong - Even if the proxy tried to register the temp connection’s key, what would it point to? The temp connection has no pool connection attached to it. There is no backend to cancel on. The proxy would forward the cancel to PostgreSQL targeting a process that is either dead or belongs to a completely different connection. </li> <li> OS PID reuse - PIDs are finite. The OS can recycle <code class="language-plaintext highlighter-rouge">PID=12345</code> to a completely different PostgreSQL backend process after the temp connection closes. The client’s cancel request, carrying that stale PID, would then target an unrelated backend on a different client’s connection; PostgreSQL’s secret-key check makes an accidental cancel unlikely, but the request still cannot reach the query the client meant. </li> </ol> The only correct PID and SecretKey to give the client is the pair belonging to the pool connection, the live backend process that is actually executing queries for this client. 
Similarly, the <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> message is also suppressed and not relayed to the client, as there is no pooled backend connection ready yet for it to start relaying queries. <h2 id="post-authentication-sequence">Post-Authentication Sequence</h2> Now that the user is authenticated successfully with PostgreSQL, the proxy needs a connection from the pool to start running queries. Let’s now talk about how connection pooling is implemented: <h3 id="layer-1-the-top-level-registry-poolmanager">Layer 1: The Top-Level Registry <code class="language-plaintext highlighter-rouge">poolManager</code></h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var ( poolManager = make(map[poolKey]*pgxpool.Pool) poolMutex sync.RWMutex ) </code></pre></div></div> This is a global process-wide map — one instance for the entire gprxy process, shared across all goroutines and all client connections. It lives for the lifetime of the process and is never torn down. <h4 id="the-key-poolkey">The Key: <code class="language-plaintext highlighter-rouge">poolKey</code></h4> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type poolKey struct { user string database string } </code></pre></div></div> This is a Go struct used as a map key. Go allows any comparable type as a map key, and structs with only comparable fields are comparable. The two fields together form the composite key. <code class="language-plaintext highlighter-rouge">user</code> here is the original client username — e.g. <code class="language-plaintext highlighter-rouge">alice@example.com</code> from the JWT, or <code class="language-plaintext highlighter-rouge">bob</code> from traditional auth. It is NOT the service account (<code class="language-plaintext highlighter-rouge">gprxy_admin</code>). 
This is set from <code class="language-plaintext highlighter-rouge">msg.Parameters["user"]</code> from the original <code class="language-plaintext highlighter-rouge">StartupMessage</code>. <code class="language-plaintext highlighter-rouge">database</code> is the database name from the same <code class="language-plaintext highlighter-rouge">StartupMessage</code> — e.g. <code class="language-plaintext highlighter-rouge">gprxy_test</code>. So the map looks like this after several clients connect: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poolManager = { {user: "alice@example.com", database: "gprxy_test"} → *pgxpool.Pool (up to 5 conns) {user: "alice@example.com", database: "analytics"} → *pgxpool.Pool (up to 5 conns) {user: "bob@example.com", database: "gprxy_test"} → *pgxpool.Pool (up to 5 conns) {user: "carol", database: "gprxy_test"} → *pgxpool.Pool (up to 5 conns) } </code></pre></div></div> Each <code class="language-plaintext highlighter-rouge">*pgxpool.Pool</code> value manages its own set of up to 5 real TCP connections to PostgreSQL. <h4 id="the-lock-syncrwmutex">The Lock: <code class="language-plaintext highlighter-rouge">sync.RWMutex</code></h4> <code class="language-plaintext highlighter-rouge">poolMutex</code> protects <code class="language-plaintext highlighter-rouge">poolManager</code> from concurrent reads and writes across goroutines. Since every client connection runs in its own goroutine, many goroutines can call <code class="language-plaintext highlighter-rouge">GetOrCreatePool</code> simultaneously. 
A <code class="language-plaintext highlighter-rouge">sync.RWMutex</code> allows: <ul> <li>Many goroutines to read simultaneously — <code class="language-plaintext highlighter-rouge">RLock()</code> → <code class="language-plaintext highlighter-rouge">RUnlock()</code></li> <li>Only one goroutine to write, blocking all reads — <code class="language-plaintext highlighter-rouge">Lock()</code> → <code class="language-plaintext highlighter-rouge">Unlock()</code></li> </ul> The read path (pool already exists) is the fast path — it only holds a read lock for a microsecond. The write path (first connection for this key) is rare and takes a write lock briefly to insert the new pool. <h3 id="layer-2-the-double-checked-locking-pattern">Layer 2: The Double-Checked Locking Pattern</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poolMutex.RLock() pool, exists := poolManager[key] poolMutex.RUnlock() if exists { return pool, nil } poolMutex.Lock() defer poolMutex.Unlock() if pool, exists := poolManager[key]; exists { return pool, nil } </code></pre></div></div> This is a classic double-checked locking pattern. Here is why it needs two checks: Scenario without the second check: <ol> <li>Goroutine A: <code class="language-plaintext highlighter-rouge">RLock()</code> → pool not found → <code class="language-plaintext highlighter-rouge">RUnlock()</code></li> <li>Goroutine B: <code class="language-plaintext highlighter-rouge">RLock()</code> → pool not found → <code class="language-plaintext highlighter-rouge">RUnlock()</code></li> <li>Goroutine A: <code class="language-plaintext highlighter-rouge">Lock()</code> → creates pool → <code class="language-plaintext highlighter-rouge">Unlock()</code></li> <li>Goroutine B: <code class="language-plaintext highlighter-rouge">Lock()</code> → also creates a second pool → two pools for same key, one is lost</li> </ol> The second check inside the write lock prevents this. 
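The pattern is easy to exercise without a database. The sketch below is a stdlib-only illustration under stated assumptions: <code class="language-plaintext highlighter-rouge">fakePool</code> stands in for <code class="language-plaintext highlighter-rouge">*pgxpool.Pool</code>, and <code class="language-plaintext highlighter-rouge">registry</code>, <code class="language-plaintext highlighter-rouge">getOrCreate</code>, and <code class="language-plaintext highlighter-rouge">raceForPool</code> are hypothetical names rather than gprxy’s own. It counts how many times the slow path actually builds a pool while many goroutines race for the same key.

```go
package main

import (
	"fmt"
	"sync"
)

type poolKey struct {
	user     string
	database string
}

// fakePool stands in for *pgxpool.Pool so the sketch needs no database.
type fakePool struct{ id int }

type registry struct {
	mu      sync.RWMutex
	pools   map[poolKey]*fakePool
	created int // how many times the slow path built a pool
}

func newRegistry() *registry {
	return &registry{pools: make(map[poolKey]*fakePool)}
}

func (r *registry) getOrCreate(key poolKey) *fakePool {
	// First check: shared read lock, the common fast path.
	r.mu.RLock()
	p, ok := r.pools[key]
	r.mu.RUnlock()
	if ok {
		return p
	}

	// Second check: another goroutine may have created the pool
	// between RUnlock above and Lock here.
	r.mu.Lock()
	defer r.mu.Unlock()
	if p, ok := r.pools[key]; ok {
		return p
	}
	r.created++
	p = &fakePool{id: r.created}
	r.pools[key] = p
	return p
}

// raceForPool starts n goroutines that all ask for the same key at once.
func raceForPool(r *registry, key poolKey, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.getOrCreate(key)
		}()
	}
	wg.Wait()
}

func main() {
	r := newRegistry()
	key := poolKey{user: "alice@example.com", database: "gprxy_test"}
	raceForPool(r, key, 64)
	fmt.Println("pools created:", r.created)
}
```

Deleting the second check makes the creation count depend on scheduling, which is exactly the duplicate-pool scenario in the list above.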
When goroutine B gets the write lock after A finishes, it checks again and finds the pool already there, so it returns it instead of creating a duplicate. <h3 id="layer-3-what-pgxpoolpool-actually-is">Layer 3: What <code class="language-plaintext highlighter-rouge">pgxpool.Pool</code> Actually Is</h3> Each value in <code class="language-plaintext highlighter-rouge">poolManager</code> is a <code class="language-plaintext highlighter-rouge">*pgxpool.Pool</code>. This is not a simple slice of connections. It is a sophisticated object with its own goroutines and internal data structures. <h4 id="internal-structure-inside-pgxpool">Internal structure (inside pgxpool):</h4> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*pgxpool.Pool ├── config *pgxpool.Config (MaxConns, timeouts, etc.) ├── p *puddle.Pool[*pgxpool.connResource] ← the actual pool │ ├── resources []poolResource (ring buffer of connections) │ ├── cond *sync.Cond (for blocking Acquire calls) │ └── ... ├── closeChan chan struct{} (signal pool close) └── ... </code></pre></div></div> pgxpool uses the puddle library internally for the actual pooling logic. <code class="language-plaintext highlighter-rouge">puddle</code> maintains a list of resources (connections) and a <code class="language-plaintext highlighter-rouge">sync.Cond</code> for goroutines waiting for an available connection. 
<h4 id="the-pool-configuration-gprxy-sets">The pool configuration gprxy sets:</h4> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>config.MaxConns = defaultMaxConns // 5 config.MinConns = defaultMinConns // 0 config.MaxConnLifetime = defaultMaxConnLifetime // 1 hour config.MaxConnIdleTime = defaultMaxConnIdleTime // 30 minutes config.HealthCheckPeriod = defaultHealthCheckPeriod // 1 minute config.ConnConfig.ConnectTimeout = defaultConnectTimeout // 5 seconds </code></pre></div></div> What each setting actually controls: <code class="language-plaintext highlighter-rouge">MaxConns = 5</code> The hard ceiling. At most 5 TCP connections to PostgreSQL will ever exist for this <code class="language-plaintext highlighter-rouge">(user, database)</code> key. When all 5 are acquired (in use), the 6th <code class="language-plaintext highlighter-rouge">pool.Acquire()</code> call blocks — the calling goroutine is parked and put on a wait queue inside puddle. It will be woken up when one of the 5 connections is released. <code class="language-plaintext highlighter-rouge">MinConns = 0</code> No pre-warming. When the pool is created, zero connections to PostgreSQL are opened. The first <code class="language-plaintext highlighter-rouge">pool.Acquire()</code> on a fresh pool will always open a new TCP connection. This is lazy initialization — no connections consumed for idle users. <code class="language-plaintext highlighter-rouge">MaxConnLifetime = 1 hour</code> A background goroutine inside pgxpool periodically checks the age of every connection. Any connection older than 1 hour is closed and removed from the pool, even if it is idle and healthy. 
This forces periodic reconnection, which is important for: <ul> <li>Picking up PostgreSQL configuration changes</li> <li>Rotating credentials if needed</li> <li>Preventing connections from being silently dropped by firewalls or load balancers that kill long-lived idle connections</li> </ul> <code class="language-plaintext highlighter-rouge">MaxConnIdleTime = 30 minutes</code> Any connection that has been idle (not acquired by anyone) for 30 minutes is closed. This prevents the pool from holding open connections during quiet periods. <code class="language-plaintext highlighter-rouge">HealthCheckPeriod = 1 minute</code> Every minute, a background goroutine walks the idle connections and closes those that have exceeded <code class="language-plaintext highlighter-rouge">MaxConnLifetime</code> or <code class="language-plaintext highlighter-rouge">MaxConnIdleTime</code>. As far as the pgxpool source shows, this periodic check works from connection age and idle time rather than an active ping, so a connection broken by a PostgreSQL restart or a network blip can remain in the pool until something tries to use it. The check keeps the pool tidy, but it cannot guarantee that <code class="language-plaintext highlighter-rouge">Acquire()</code> always returns a working connection. <code class="language-plaintext highlighter-rouge">ConnectTimeout = 5 seconds</code> When a new TCP connection to PostgreSQL needs to be opened, it must complete the entire startup handshake (TCP connect + SCRAM auth + <code class="language-plaintext highlighter-rouge">ReadyForQuery</code>) within 5 seconds. If it takes longer, the connection attempt is aborted and an error is returned. <h3 id="layer-4-pgxpoolconn-as-a-single-exclusive-handle">Layer 4: <code class="language-plaintext highlighter-rouge">*pgxpool.Conn</code> as a Single Exclusive Handle</h3> When <code class="language-plaintext highlighter-rouge">pool.Acquire(ctx)</code> returns, it gives back a <code class="language-plaintext highlighter-rouge">*pgxpool.Conn</code>. 
This is not a connection itself — it is a handle that: <ol> <li>Wraps the underlying <code class="language-plaintext highlighter-rouge">*pgx.Conn</code></li> <li>Marks that connection as acquired (in use) in the pool’s internal state</li> <li>Provides a <code class="language-plaintext highlighter-rouge">Release()</code> method to return it</li> </ol> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*pgxpool.Conn ├── p *pgxpool.Pool (pointer back to parent pool) └── res *puddle.Resource (the resource being held) └── value *pgxpool.connResource └── conn *pgx.Conn (the actual connection) </code></pre></div></div> The connection is exclusively held — the pool will not give the same underlying <code class="language-plaintext highlighter-rouge">*pgx.Conn</code> to any other goroutine while one <code class="language-plaintext highlighter-rouge">*pgxpool.Conn</code> holds it. This is what makes it safe for <code class="language-plaintext highlighter-rouge">pc.bf.Send()</code> and <code class="language-plaintext highlighter-rouge">pc.bf.Receive()</code> to call directly into the TCP socket without any additional locking. The pool’s ownership model guarantees single-writer, single-reader. 
<h4 id="how-gprxy-digs-through-the-layers-to-get-the-raw-socket">How gprxy digs through the layers to get the raw socket</h4> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>underlyingConn := pc.poolConn.Conn().PgConn().Conn() bf := pgproto3.NewFrontend(pgproto3.NewChunkReader(underlyingConn), underlyingConn) </code></pre></div></div> The chain of unwrapping: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pc.poolConn *pgxpool.Conn .Conn() *pgx.Conn (higher-level pgx connection) .PgConn() *pgconn.PgConn (low-level wire protocol connection) .Conn() net.Conn (raw TCP socket) </code></pre></div></div> gprxy bypasses all of pgx’s query execution machinery and talks directly to the raw TCP socket. It wraps it with a <code class="language-plaintext highlighter-rouge">pgproto3.Frontend</code> to get PostgreSQL wire protocol serialization/deserialization. This is why gprxy can forward arbitrary protocol messages — because it is working at the wire level, not through pgx’s <code class="language-plaintext highlighter-rouge">Query()</code>/<code class="language-plaintext highlighter-rouge">Exec()</code> API. However, this creates a subtle tension: pgx still thinks it “owns” this connection and its internal state machine. gprxy is now sending bytes on the socket that pgx does not know about. This is why the <code class="language-plaintext highlighter-rouge">fullResetBeforeRelease</code> step is critical — pgx’s internal state may be out of sync with the actual PostgreSQL session state after gprxy forwards arbitrary queries, and <code class="language-plaintext highlighter-rouge">ROLLBACK</code> + <code class="language-plaintext highlighter-rouge">DISCARD ALL</code> restores the PostgreSQL session to a clean state before pgx takes back ownership. 
<h3 id="layer-5-connection-lifecycle-state-machine">Layer 5: Connection Lifecycle State Machine</h3> For a single physical PostgreSQL connection managed by <code class="language-plaintext highlighter-rouge">pgxpool</code>, the happy-path lifecycle looks like this: Important nuance: after <code class="language-plaintext highlighter-rouge">pgxpool.NewWithConfig()</code>, the pool object exists immediately, but with <code class="language-plaintext highlighter-rouge">MinConns=0</code> there may be zero physical PostgreSQL connections inside it until the first <code class="language-plaintext highlighter-rouge">Acquire()</code> needs one. Also, <code class="language-plaintext highlighter-rouge">Release()</code> does not always transition to <code class="language-plaintext highlighter-rouge">[IDLE]</code> — if the connection is broken, expired, or otherwise not reusable, pgxpool closes it instead of returning it to the free list. <h3 id="layer-6-the-acquire-flow">Layer 6: The Acquire Flow</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>connection, err := pool.Acquire(context.Background()) if err != nil { return nil, logger.Errorf("error while acquiring connection from the database pool: %w", err) } err = connection.Ping(context.Background()) if err != nil { connection.Release() return nil, logger.Errorf("could not ping database: %w", err) } return connection, nil </code></pre></div></div> <code class="language-plaintext highlighter-rouge">pool.Acquire(ctx)</code> internally does: <ol> <li>Lock puddle’s internal mutex</li> <li>Check the free list (idle connections): <ul> <li>If found: remove from free list, mark as acquired, return it</li> </ul> </li> <li>If free list empty, check total count vs <code class="language-plaintext highlighter-rouge">MaxConns</code>: <ul> <li>If below <code class="language-plaintext highlighter-rouge">MaxConns</code>: unlock, open a new connection (dial + SCRAM auth), lock again, add to 
acquired set, return</li> <li>If at <code class="language-plaintext highlighter-rouge">MaxConns</code>: add this goroutine to a wait queue (<code class="language-plaintext highlighter-rouge">sync.Cond.Wait()</code>), unlock, sleep</li> <li>When another goroutine calls <code class="language-plaintext highlighter-rouge">Release()</code>: it calls <code class="language-plaintext highlighter-rouge">Cond.Signal()</code>, waking one waiter, which retries the acquire</li> </ul> </li> </ol> After <code class="language-plaintext highlighter-rouge">Acquire()</code> returns, gprxy calls <code class="language-plaintext highlighter-rouge">Ping()</code>. This is an extra safety net on top of the health check background goroutine. Between the health check goroutine’s last check (up to 1 minute ago) and right now, the connection could have gone stale. <code class="language-plaintext highlighter-rouge">Ping()</code> sends a minimal no-op to PostgreSQL and waits for a response. If <code class="language-plaintext highlighter-rouge">Ping()</code> fails, <code class="language-plaintext highlighter-rouge">Release()</code> is called immediately — but gprxy does not put a broken connection back. pgxpool’s <code class="language-plaintext highlighter-rouge">Release()</code> is smart: if the underlying connection returns an error, it destroys the connection instead of returning it to the free list. So after <code class="language-plaintext highlighter-rouge">connection.Release()</code> on a failed ping, the pool size decreases by 1, and the next <code class="language-plaintext highlighter-rouge">Acquire()</code> will open a fresh connection. 
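The free-list, create, and block branches of the acquire flow can be modelled in a few lines. This sketch is not puddle’s implementation: it swaps the wait queue for a buffered channel used as a counting semaphore, and <code class="language-plaintext highlighter-rouge">boundedPool</code>, <code class="language-plaintext highlighter-rouge">conn</code>, and <code class="language-plaintext highlighter-rouge">acquireWithin</code> are hypothetical names introduced only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// conn is a placeholder for a real PostgreSQL connection.
type conn struct{ id int }

type boundedPool struct {
	slots  chan struct{} // one token per connection that may be checked out
	mu     sync.Mutex
	idle   []*conn
	nextID int
}

func newBoundedPool(max int) *boundedPool {
	return &boundedPool{slots: make(chan struct{}, max)}
}

// acquire blocks until a slot is free or ctx is done.
func (p *boundedPool) acquire(ctx context.Context) (*conn, error) {
	select {
	case p.slots <- struct{}{}:
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.idle); n > 0 { // reuse an idle connection first
		c := p.idle[n-1]
		p.idle = p.idle[:n-1]
		return c, nil
	}
	p.nextID++ // otherwise "dial" a new one
	return &conn{id: p.nextID}, nil
}

// acquireWithin is acquire with a deadline given in milliseconds.
func (p *boundedPool) acquireWithin(ms int) (*conn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()
	return p.acquire(ctx)
}

// release returns the connection to the idle list and frees its slot.
func (p *boundedPool) release(c *conn) {
	p.mu.Lock()
	p.idle = append(p.idle, c)
	p.mu.Unlock()
	<-p.slots // unblocks one waiter, if any
}

func main() {
	p := newBoundedPool(2)
	a, _ := p.acquireWithin(100)
	b, _ := p.acquireWithin(100)

	// The pool is exhausted, so this call times out instead of hanging.
	_, err := p.acquireWithin(20)
	fmt.Println("third acquire:", err)

	p.release(a)
	c, _ := p.acquireWithin(100)
	fmt.Println("reused released conn:", c == a, "other id:", b.id)
}
```

The gprxy excerpt above passes <code class="language-plaintext highlighter-rouge">context.Background()</code> to <code class="language-plaintext highlighter-rouge">pool.Acquire</code>, so a goroutine waiting on a full pool has no deadline. Passing a context with a timeout, as <code class="language-plaintext highlighter-rouge">acquireWithin</code> does here, is one possible way to bound that wait.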
<h3 id="layer-7-the-logpoolstats-observation-window">Layer 7: The <code class="language-plaintext highlighter-rouge">LogPoolStats</code> Observation Window</h3> After every successful <code class="language-plaintext highlighter-rouge">AcquireConnection</code>, gprxy calls: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pool.LogPoolStats(user, database)
</code></pre></div></div> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func LogPoolStats(user, database string) {
    stats := pool.Stat()
    logger.Debug("pool stats for [%s,%s] - total: %d, acquired: %d, idle: %d",
        user, database, stats.TotalConns(), stats.AcquiredConns(), stats.IdleConns())
}
</code></pre></div></div> <code class="language-plaintext highlighter-rouge">pool.Stat()</code> returns a snapshot of: <ul> <li><code class="language-plaintext highlighter-rouge">TotalConns()</code>: total live connections, max is 5</li> <li><code class="language-plaintext highlighter-rouge">AcquiredConns()</code>: currently held by goroutines (including the one just acquired)</li> <li><code class="language-plaintext highlighter-rouge">IdleConns()</code>: back in the free list, available immediately</li> </ul> In steady state <code class="language-plaintext highlighter-rouge">TotalConns = AcquiredConns + IdleConns</code>. The total also counts connections that are still being opened (<code class="language-plaintext highlighter-rouge">ConstructingConns()</code>), so the two logged numbers can briefly add up to less than the total while a dial is in flight. For example: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pool stats for [alice@example.com,gprxy_test] - total: 3, acquired: 2, idle: 1
</code></pre></div></div> This means 3 real TCP connections exist to PostgreSQL, 2 are in use by active client connections, and 1 is sitting idle waiting to be acquired. 
<h3 id="layer-8-release-and-reset">Layer 8: Release and Reset</h3> When a client disconnects, the defer block in <code class="language-plaintext highlighter-rouge">handleConnection</code> runs: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if pc.poolConn != nil { err := fullResetBeforeRelease(pc) if err != nil { logger.Error("error while releasing connection back to the pool: %v", err) } pc.poolConn.Release() } </code></pre></div></div> <code class="language-plaintext highlighter-rouge">fullResetBeforeRelease</code> runs two SQL commands through pgx’s normal execution path (not through <code class="language-plaintext highlighter-rouge">pc.bf</code>), because at this point gprxy is done forwarding arbitrary messages: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func fullResetBeforeRelease(connection *Connection) error { _, err := connection.poolConn.Exec(context.Background(), "ROLLBACK") // ... _, err = connection.poolConn.Exec(context.Background(), "DISCARD ALL") // ... return nil } </code></pre></div></div> <code class="language-plaintext highlighter-rouge">ROLLBACK</code> rolls back any open transaction. Without this, if a client died mid-<code class="language-plaintext highlighter-rouge">BEGIN</code>, the pool connection would return to the free list still inside a transaction. Any rows it had locked would remain locked. The next client to get this connection would start their first query inside someone else’s transaction. 
<code class="language-plaintext highlighter-rouge">DISCARD ALL</code> is a PostgreSQL supercommand that resets everything about the session in one round trip: <ul> <li>All <code class="language-plaintext highlighter-rouge">SET</code> variables back to defaults</li> <li>All named prepared statements deallocated</li> <li>All open cursors closed</li> <li>All <code class="language-plaintext highlighter-rouge">LISTEN</code> subscriptions removed</li> <li>All advisory locks released</li> <li>All cached query plans discarded</li> </ul> After these two commands, the pool connection’s PostgreSQL session is byte-for-byte identical to a fresh connection. pgx’s internal state may still be slightly stale (it didn’t observe gprxy’s arbitrary wire-level queries), but the actual PostgreSQL session is clean. Then <code class="language-plaintext highlighter-rouge">poolConn.Release()</code> is called. Inside pgxpool, this: <ol> <li>Locks puddle’s mutex</li> <li>Moves the resource from the acquired set back to the idle free list</li> <li>Calls <code class="language-plaintext highlighter-rouge">Cond.Signal()</code> to wake any goroutine blocking on <code class="language-plaintext highlighter-rouge">Acquire()</code></li> <li>Unlocks</li> </ol> The connection is now available for the next client that calls <code class="language-plaintext highlighter-rouge">AcquireConnection</code>. 
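The ordering of the two reset statements can be checked in isolation. The following is a hedged sketch rather than gprxy’s code: <code class="language-plaintext highlighter-rouge">execer</code> is a hypothetical one-method interface (pgx’s real <code class="language-plaintext highlighter-rouge">Exec</code> also returns a command tag), <code class="language-plaintext highlighter-rouge">recordingConn</code> is a fake that records what it was asked to run, and <code class="language-plaintext highlighter-rouge">resetSession</code> is an illustrative name.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// execer is a hypothetical, simplified view of the one call this sketch
// needs from a pooled connection.
type execer interface {
	Exec(ctx context.Context, sql string) error
}

// recordingConn is a fake connection: it logs each statement and can be
// told to fail on one of them.
type recordingConn struct {
	failOn string
	log    []string
}

func (c *recordingConn) Exec(_ context.Context, sql string) error {
	c.log = append(c.log, sql)
	if sql == c.failOn {
		return errors.New("exec failed: " + sql)
	}
	return nil
}

// resetSession runs ROLLBACK and then DISCARD ALL, stopping at the first
// failure so the caller knows the session may still be dirty.
func resetSession(e execer) error {
	ctx := context.Background()
	for _, stmt := range []string{"ROLLBACK", "DISCARD ALL"} {
		if err := e.Exec(ctx, stmt); err != nil {
			return fmt.Errorf("reset step %q: %w", stmt, err)
		}
	}
	return nil
}

func main() {
	healthy := &recordingConn{}
	fmt.Println(resetSession(healthy), healthy.log)

	broken := &recordingConn{failOn: "ROLLBACK"}
	fmt.Println(resetSession(broken) != nil, broken.log)
}
```

The order matters because PostgreSQL refuses to run <code class="language-plaintext highlighter-rouge">DISCARD ALL</code> inside a transaction block, so the <code class="language-plaintext highlighter-rouge">ROLLBACK</code> has to come first. Returning the first error also leaves the caller free to decide what to do with a connection whose reset failed; the gprxy excerpt above logs the error and still releases the connection.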
<h2 id="data-structure-map">Data Structure Map</h2> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PROCESS GLOBAL STATE ───────────────────────────────────────────────────────────────────── poolManager: map[poolKey]*pgxpool.Pool │ ├── key: {user:"alice@example.com", database:"gprxy_test"} │ └── *pgxpool.Pool │ ├── MaxConns: 5 │ ├── free list (idle): │ │ └── *pgx.Conn [TCP socket to pg:5432, PID=1001, SK=11111] │ ├── acquired set (in use): │ │ ├── *pgx.Conn [TCP socket to pg:5432, PID=1002, SK=22222] ← held by alice session 1 │ │ └── *pgx.Conn [TCP socket to pg:5432, PID=1003, SK=33333] ← held by alice session 2 │ └── background goroutines: │ ├── health checker (every 1 minute) │ └── idle reaper (checks MaxConnIdleTime/MaxConnLifetime) │ ├── key: {user:"bob@example.com", database:"gprxy_test"} │ └── *pgxpool.Pool │ ├── MaxConns: 5 │ ├── free list (idle): [empty] │ └── acquired set (in use): │ └── *pgx.Conn [TCP socket to pg:5432, PID=1004, SK=44444] ← held by bob session 1 │ └── poolMutex: sync.RWMutex (guards the map above) PER-CLIENT-CONNECTION STATE (one per active client goroutine) ───────────────────────────────────────────────────────────────────── *Connection (alice session 1) ├── conn: net.Conn → TCP socket to alice's psql process ├── poolConn: *pgxpool.Conn → exclusively holds PID=1002 conn above ├── bf: *pgproto3.Frontend → wired to PID=1002's raw TCP socket ├── user: "alice@example.com" ├── db: "gprxy_test" └── key: BackendKeyData{ProcessID:1002, SecretKey:22222} ↑ also registered in server.activeConnections </code></pre></div></div> <h2 id="key-design-properties-and-tradeoffs">Key Design Properties and Tradeoffs</h2> 1. Per-(client-user, database) pools, not a single global pool Each unique <code class="language-plaintext highlighter-rouge">(user, database)</code> pair gets its own pool. This means Alice and Bob each have their own separate bucket of connections. 
This has pros and cons: <ul> <li>Pro: isolation — Alice exhausting her 5 connections doesn’t block Bob</li> <li>Con: the worst case total connections to PostgreSQL = <code class="language-plaintext highlighter-rouge">(number of distinct users) × (number of distinct databases) × 5</code>. With 100 users each connecting to 2 databases that’s 1000 PostgreSQL backend processes — potentially catastrophic at scale.</li> </ul> A shared global pool per database would be more scalable, but then all users compete for the same connections. 2. The pool key uses client identity, not service account All connections in a pool connect to PostgreSQL as <code class="language-plaintext highlighter-rouge">gprxy_admin</code> (from <code class="language-plaintext highlighter-rouge">BuildConnectionString</code>), but the pool is keyed on the client user (<code class="language-plaintext highlighter-rouge">alice@example.com</code>). This means two different clients who both map to <code class="language-plaintext highlighter-rouge">gprxy_admin</code> still have completely separate pools. This provides isolation but wastes connections — both pools independently open connections as the same PostgreSQL user. 3. <code class="language-plaintext highlighter-rouge">MinConns=0</code> means cold start latency The first connection for any <code class="language-plaintext highlighter-rouge">(user, database)</code> pair always pays the full TCP dial + SCRAM handshake cost on the hot path (while the client is waiting). Subsequent connections are fast (free list lookup). If MinConns were 1, the pool would pre-open a connection at creation time, eliminating this latency at the cost of always holding an open connection even for inactive users. 4. The pool is never closed There is no code path in gprxy that calls <code class="language-plaintext highlighter-rouge">pool.Close()</code>. 
Once a pool exists in <code class="language-plaintext highlighter-rouge">poolManager</code>, it lives forever until the process exits. The idle reaper and MaxConnIdleTime handle draining unused connections from within each pool, but the pool object and its entry in <code class="language-plaintext highlighter-rouge">poolManager</code> persist. This is a minor memory leak for transient users — if 10,000 different users each connect once, <code class="language-plaintext highlighter-rouge">poolManager</code> will have 10,000 entries (each being a small pool object with zero connections) forever. Now that we have a detailed understanding of connection pooling, let’s get back to the authenticated user who now makes a request for a connection from the pool: Case A — Pool for this <code class="language-plaintext highlighter-rouge">(user, database)</code> key already exists: Read lock acquired, pool found, read lock released. No write lock, no allocation, extremely fast. The existing <code class="language-plaintext highlighter-rouge">*pgxpool.Pool</code> object is returned. Case B — Pool does not exist yet (first connection for this user+database combination): Write lock acquired. The pool configuration struct is built and <code class="language-plaintext highlighter-rouge">pgxpool.NewWithConfig</code> is called. With <code class="language-plaintext highlighter-rouge">MinConns=0</code>, the pool does not open any connections to PostgreSQL at creation time. It creates the management infrastructure (the pool object, its background goroutines for health checking, etc.) but no actual TCP connections to PostgreSQL yet. The pool is stored in the global <code class="language-plaintext highlighter-rouge">poolManager</code> map. Then comes the part of making an actual connection to the PostgreSQL database. 
The pool checks its internal free list: <ul> <li>If an idle connection exists: it returns it immediately and marks it as in-use.</li> <li>If no idle connection exists but the pool is below <code class="language-plaintext highlighter-rouge">MaxConns</code> (<code class="language-plaintext highlighter-rouge">5</code>): it opens a new TCP connection to PostgreSQL, performs the full PostgreSQL startup handshake (this includes SCRAM-SHA-256 authentication using the DSN credentials), and returns the resulting connection.</li> <li>If the pool is at <code class="language-plaintext highlighter-rouge">MaxConns</code> with none idle: it blocks until one is released by another goroutine.</li> </ul> When a new connection is opened, the full PostgreSQL wire protocol handshake happens inside pgx transparently: StartupMessage, SCRAM exchange, AuthenticationOk, ParameterStatus, BackendKeyData, ReadyForQuery. pgx handles all of this internally and stores the resulting PID and SecretKey on the connection object. This is where the pool connection’s BackendKeyData originates. After acquire, a ping is sent, a cheap <code class="language-plaintext highlighter-rouge">SELECT 1</code> equivalent at the protocol level. If the connection was idle and the PostgreSQL server closed it server-side (e.g. via <code class="language-plaintext highlighter-rouge">idle_session_timeout</code> on PostgreSQL 14+; <code class="language-plaintext highlighter-rouge">idle_in_transaction_session_timeout</code> only applies to sessions left open inside a transaction, which a freshly reset pool connection is not), the ping will fail, the connection is discarded, and <code class="language-plaintext highlighter-rouge">AcquireConnection</code> returns an error. This prevents handing the client a dead connection. After this call returns, <code class="language-plaintext highlighter-rouge">pc.poolConn</code> is a live, valid, exclusively-held <code class="language-plaintext highlighter-rouge">*pgxpool.Conn</code>. What the client is doing during all of this: still blocked. 
It sent <code class="language-plaintext highlighter-rouge">StartupMessage</code>, received <code class="language-plaintext highlighter-rouge">AuthenticationOk</code> + <code class="language-plaintext highlighter-rouge">ParameterStatus</code> messages, and is waiting for <code class="language-plaintext highlighter-rouge">BackendKeyData</code> + <code class="language-plaintext highlighter-rouge">ReadyForQuery</code>. It has no idea any of this infrastructure work is happening. <code class="language-plaintext highlighter-rouge">pgproto3.NewChunkReader(underlyingConn)</code> wraps the TCP socket in a buffered reader that knows how to read PostgreSQL wire protocol message boundaries. <code class="language-plaintext highlighter-rouge">pgproto3.NewFrontend(reader, underlyingConn)</code> creates a Frontend — an object that speaks PostgreSQL from the client side (i.e. it sends queries and reads responses, opposite of <code class="language-plaintext highlighter-rouge">Backend</code> which speaks from the server side). This <code class="language-plaintext highlighter-rouge">*pgproto3.Frontend</code> is what <code class="language-plaintext highlighter-rouge">pc.bf</code> is. Going forward, every call to <code class="language-plaintext highlighter-rouge">pc.bf.Send(msg)</code> writes PostgreSQL wire bytes directly onto the pool connection’s TCP socket to PostgreSQL. Every call to <code class="language-plaintext highlighter-rouge">pc.bf.Receive()</code> reads PostgreSQL response bytes off that same socket. <code class="language-plaintext highlighter-rouge">pc.user</code> and <code class="language-plaintext highlighter-rouge">pc.db</code> are stored for logging purposes throughout the query loop. <code class="language-plaintext highlighter-rouge">pgconn.PID()</code> and <code class="language-plaintext highlighter-rouge">pgconn.SecretKey()</code> return values that pgx stored internally when it completed the pool connection’s startup handshake. 
These are the PID and SecretKey of the live PostgreSQL backend process that is holding this pool connection on the other end. <code class="language-plaintext highlighter-rouge">pc.key</code> is now overwritten — the dead temp connection’s key is replaced with the live pool connection’s key. This is the correct key that must be given to the client and registered in the cancel registry. <h2 id="send-backendkeydata-to-the-client">Send <code class="language-plaintext highlighter-rouge">BackendKeyData</code> to the client</h2> <code class="language-plaintext highlighter-rouge">pgconn</code> here is the <code class="language-plaintext highlighter-rouge">*pgproto3.Backend</code> connected to the client — the server-side view of the client connection. <code class="language-plaintext highlighter-rouge">pgconn.Send(pc.key)</code> serializes the <code class="language-plaintext highlighter-rouge">BackendKeyData</code> struct into the PostgreSQL wire format and writes it to the client’s TCP socket: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>K (1 byte - message type 'K') 00 00 00 0C (4 bytes - length = 12) XX XX XX XX (4 bytes - ProcessID) YY YY YY YY (4 bytes - SecretKey) </code></pre></div></div> The client receives this, parses it, and stores <code class="language-plaintext highlighter-rouge">(ProcessID, SecretKey)</code> internally. Its driver will use this if the user or a timeout triggers a query cancellation. <h2 id="send-readyforquery-to-the-client">Send <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> to the client</h2> <code class="language-plaintext highlighter-rouge">TxStatus: 'I'</code> means “idle, not in a transaction”. This is always correct here because the pool connection was just acquired and either just opened or just had <code class="language-plaintext highlighter-rouge">DISCARD ALL</code> run on it — it is guaranteed to be in idle state. 
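Both startup-completing frames are small enough to build by hand. The sketch below encodes a <code class="language-plaintext highlighter-rouge">BackendKeyData</code> frame and a <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> frame with <code class="language-plaintext highlighter-rouge">encoding/binary</code>, following the layouts described in this section. The function names are illustrative only; gprxy itself relies on <code class="language-plaintext highlighter-rouge">pgproto3</code> for serialization.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeBackendKeyData builds: 'K', int32 length (12), int32 PID, int32 secret key.
// The length field counts itself and the payload, but not the type byte.
func encodeBackendKeyData(pid, secret uint32) []byte {
	buf := make([]byte, 13)
	buf[0] = 'K'
	binary.BigEndian.PutUint32(buf[1:5], 12)
	binary.BigEndian.PutUint32(buf[5:9], pid)
	binary.BigEndian.PutUint32(buf[9:13], secret)
	return buf
}

// encodeReadyForQuery builds: 'Z', int32 length (5), one status byte
// ('I' idle, 'T' in transaction, 'E' failed transaction).
func encodeReadyForQuery(status byte) []byte {
	buf := make([]byte, 6)
	buf[0] = 'Z'
	binary.BigEndian.PutUint32(buf[1:5], 5)
	buf[5] = status
	return buf
}

func main() {
	// PID and secret key reuse the example values from the data structure map.
	fmt.Printf("% X\n", encodeBackendKeyData(1002, 22222))
	fmt.Printf("% X\n", encodeReadyForQuery('I'))
}
```

All multi-byte integers in the PostgreSQL wire protocol are big-endian, which is why the sketch uses <code class="language-plaintext highlighter-rouge">binary.BigEndian</code> throughout.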
Wire format: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Z (1 byte - message type 'Z') 00 00 00 05 (4 bytes - length = 5) 49 (1 byte - 'I' = idle, or 'T' = in transaction, 'E' = error state) </code></pre></div></div> This message unblocks the client. The client’s driver, which has been sitting in <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> wait since it sent the <code class="language-plaintext highlighter-rouge">StartupMessage</code>, now receives this and considers the connection fully established. It returns the connection object to the application. The application can now call <code class="language-plaintext highlighter-rouge">conn.Query(...)</code> or <code class="language-plaintext highlighter-rouge">conn.Exec(...)</code>. This <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> is synthesized by gprxy itself — it is not forwarded from PostgreSQL. gprxy is lying to the client in the best possible way: the client believes it just completed startup with a PostgreSQL server, but the <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> came from the proxy, which only sent it after confirming the pool connection is live and <code class="language-plaintext highlighter-rouge">pc.bf</code> is wired up and ready. <h2 id="register-in-the-cancel-registry">Register in the cancel registry</h2> <code class="language-plaintext highlighter-rouge">registerConnection</code> stores the <code class="language-plaintext highlighter-rouge">*Connection</code> pointer in <code class="language-plaintext highlighter-rouge">server.activeConnections</code> keyed by the bit-packed <code class="language-plaintext highlighter-rouge">uint64</code> of <code class="language-plaintext highlighter-rouge">(ProcessID << 32 | SecretKey)</code>. 
This happens after <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> is sent because the cancel registry only needs to be populated before a query actually runs — and no query can run until the client receives <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> and sends the next message. There is no race condition here because both happen in the same goroutine sequentially. <h2 id="what-happens-between-authentication-and-readyforquery">What happens between authentication and ReadyForQuery</h2> Every piece must happen in this exact order. If the pool connection fails (PostgreSQL down, max connections reached, ping fails), gprxy sends <code class="language-plaintext highlighter-rouge">ErrorResponse</code> with SQLSTATE <code class="language-plaintext highlighter-rouge">08006</code> to the client and the connection is torn down cleanly. The client never enters a half-connected state. Now the proxy is at the query loop entry point, where the user runs queries and the proxy bridges the gap between the user and the running PostgreSQL instance. Here is the complete deep dive into the entire query loop and cleanup. <h2 id="the-query-loop-entry-point">The Query Loop Entry Point</h2> After <code class="language-plaintext highlighter-rouge">handleStartupMessage</code> returns, control lands here: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>logger.Debug("entering query handling loop") for { err := pc.handleMessage(pgc) if err != nil { logger.Debug("query handling terminated: %v", err) return } } </code></pre></div></div> This is an unconditional <code class="language-plaintext highlighter-rouge">for {}</code> — it runs forever until <code class="language-plaintext highlighter-rouge">handleMessage</code> returns a non-nil error. There is no break condition, no timeout, no idle check. The goroutine lives as long as the client is connected. 
<code class="language-plaintext highlighter-rouge">pgc</code> is the <code class="language-plaintext highlighter-rouge">*pgproto3.Backend</code> connected to the client socket. It is the same one used during startup, and it is passed into every <code class="language-plaintext highlighter-rouge">handleMessage</code> call. <h2 id="anatomy-of-one-cycle">Anatomy of One Cycle</h2> <h3 id="part-a-reading-the-client-message">Part A: Reading the client message</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func (pc *Connection) handleMessage(client *pgproto3.Backend) error {
    msg, err := client.Receive()
    if err != nil {
        return logger.Errorf("client receive error: %w", err)
    }
</code></pre></div></div> <code class="language-plaintext highlighter-rouge">client.Receive()</code> blocks on a read from the client’s TCP socket. The goroutine is parked by the Go runtime’s network poller here: it is not spinning, not consuming CPU, and not tying up an OS thread. It wakes only when bytes arrive on the socket. <code class="language-plaintext highlighter-rouge">pgproto3.Backend.Receive()</code> reads the first byte (the message type identifier), then reads the 4-byte length field, then reads exactly <code class="language-plaintext highlighter-rouge">length - 4</code> more bytes to get the full payload, then deserializes everything into a typed Go struct and returns it. If the client closes the TCP socket (<code class="language-plaintext highlighter-rouge">Ctrl+C</code>, process killed, network drop), the pending read call returns EOF. 
<code class="language-plaintext highlighter-rouge">client.Receive()</code> returns an error wrapping <code class="language-plaintext highlighter-rouge">io.EOF</code>, <code class="language-plaintext highlighter-rouge">handleMessage</code> returns that error, the outer loop sees non-nil and calls <code class="language-plaintext highlighter-rouge">return</code>, which exits <code class="language-plaintext highlighter-rouge">handleConnection</code> and triggers the <code class="language-plaintext highlighter-rouge">defer</code>. <h3 id="part-b-classify-and-log">Part B: Classify and log</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>key := pc.poolConn.Conn().PgConn().SecretKey() pid := pc.poolConn.Conn().PgConn().PID() switch query := msg.(type) { case *pgproto3.Query: logger.Info("[%s] query: %s", pc.user, query.String) logger.Debug("query connection PID=%d, secret_key=%d", pid, key) case *pgproto3.Parse: logger.Debug("[%s] parse: statement='%s' query='%s'", pc.user, query.Name, query.Query) case *pgproto3.Describe: objectType := "statement" if query.ObjectType == 'P' { objectType = "portal" } logger.Debug("[%s] describe: %s='%s'", pc.user, objectType, query.Name) case *pgproto3.Bind: paramCount := len(query.Parameters) logger.Debug("[%s] bind: portal='%s' statement='%s' params=%d", pc.user, query.DestinationPortal, query.PreparedStatement, paramCount) case *pgproto3.Execute: maxRows := "unlimited" if query.MaxRows > 0 { maxRows = fmt.Sprintf("%d", query.MaxRows) } logger.Debug("[%s] execute: portal='%s' max_rows=%s", pc.user, query.Portal, maxRows) case *pgproto3.Sync: logger.Debug("[%s] sync: transaction boundary", pc.user) case *pgproto3.Terminate: logger.Info("[%s] client disconnecting gracefully", pc.user) return logger.Errorf("client terminated") default: logger.Debug("[%s] unknown message type: %T", pc.user, query) } </code></pre></div></div> The switch is logging only — it does not change the message or 
alter routing. Every message type has its own log line. The <code class="language-plaintext highlighter-rouge">Terminate</code> case is the only one that exits early before forwarding. It returns an error immediately — the loop will exit. Notice it returns before the <code class="language-plaintext highlighter-rouge">pc.bf.Send(msg)</code> call below. The <code class="language-plaintext highlighter-rouge">Terminate</code> message is never forwarded to PostgreSQL. PostgreSQL doesn’t need to be told — the pool connection is not being closed, it is going back to the pool. PostgreSQL will only learn the session ended when gprxy runs <code class="language-plaintext highlighter-rouge">ROLLBACK</code> + <code class="language-plaintext highlighter-rouge">DISCARD ALL</code> later. The two lines at the top of the switch: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>key := pc.poolConn.Conn().PgConn().SecretKey() pid := pc.poolConn.Conn().PgConn().PID() </code></pre></div></div> These are read on every single message, but they are only used in the <code class="language-plaintext highlighter-rouge">Query</code> log line. This is slightly wasteful — two pointer dereferences on every message regardless of type. <h3 id="part-c-forward-the-message-to-postgresql">Part C: Forward the message to PostgreSQL</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>err = pc.bf.Send(msg) if err != nil { return logger.Errorf("unable to send query to backend: %w", err) } </code></pre></div></div> <code class="language-plaintext highlighter-rouge">pc.bf</code> is the <code class="language-plaintext highlighter-rouge">*pgproto3.Frontend</code> wired to the pool connection’s raw TCP socket. <code class="language-plaintext highlighter-rouge">Send(msg)</code> takes the Go struct, serializes it back into PostgreSQL wire protocol bytes, and writes them to that socket. 
This is a complete passthrough — gprxy does no SQL parsing, no query analysis, no modification of any kind. The bytes that PostgreSQL receives are byte-for-byte identical to what the client sent (re-serialized through pgproto3, but semantically identical). The two PostgreSQL query protocols work differently here: Simple Query — one message, one forward: <ul> <li>Client sends one <code class="language-plaintext highlighter-rouge">Query</code> message with raw SQL text</li> <li><code class="language-plaintext highlighter-rouge">handleMessage</code> is called once</li> <li>One <code class="language-plaintext highlighter-rouge">pc.bf.Send(query)</code> forwards it</li> <li><code class="language-plaintext highlighter-rouge">relayBackendResponse</code> collects all responses until <code class="language-plaintext highlighter-rouge">ReadyForQuery</code></li> </ul> Extended Query — multiple messages, multiple forwards: <ul> <li>Client sends <code class="language-plaintext highlighter-rouge">Parse</code>, <code class="language-plaintext highlighter-rouge">Bind</code>, <code class="language-plaintext highlighter-rouge">Describe</code>, <code class="language-plaintext highlighter-rouge">Execute</code>, <code class="language-plaintext highlighter-rouge">Sync</code> — each as a separate message</li> <li><code class="language-plaintext highlighter-rouge">handleMessage</code> is called once per message — 5 separate calls for 5 messages</li> <li>Each call forwards its one message and then calls <code class="language-plaintext highlighter-rouge">relayBackendResponse</code></li> <li>Only <code class="language-plaintext highlighter-rouge">Sync</code> produces a <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> from PostgreSQL — the other messages get <code class="language-plaintext highlighter-rouge">ParseComplete</code>, <code class="language-plaintext highlighter-rouge">BindComplete</code>, etc.</li> </ul> This means for an extended query cycle, <code 
class="language-plaintext highlighter-rouge">relayBackendResponse</code> is called 5 times but returns <code class="language-plaintext highlighter-rouge">nil</code> (continues to outer loop) after each intermediate response, and finally returns <code class="language-plaintext highlighter-rouge">nil</code> after the <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> that follows <code class="language-plaintext highlighter-rouge">Sync</code>. <h3 id="part-d-check-for-terminate-again">Part D: Check for Terminate again</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if _, ok := msg.(*pgproto3.Terminate); ok { return logger.Errorf("connection terminated") } </code></pre></div></div> This is actually dead code in practice — the <code class="language-plaintext highlighter-rouge">Terminate</code> case in the switch already returned before reaching here. This is a redundant safety net. If somehow execution reaches here with a <code class="language-plaintext highlighter-rouge">Terminate</code> message, it exits before calling <code class="language-plaintext highlighter-rouge">relayBackendResponse</code> (which would block forever waiting for a backend response that will never come, since <code class="language-plaintext highlighter-rouge">Terminate</code> doesn’t produce one). 
<h3 id="part-e-relay-all-backend-responses">Part E: Relay all backend responses</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>return pc.relayBackendResponse(client) </code></pre></div></div> <h2 id="relaybackendresponse-the-response-pump"><code class="language-plaintext highlighter-rouge">relayBackendResponse</code>: The Response Pump</h2> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func (pc *Connection) relayBackendResponse(client *pgproto3.Backend) error { for { msg, err := pc.bf.Receive() if err != nil { return logger.Errorf("backend receive error: %w", err) } err = client.Send(msg) if err != nil { return logger.Errorf("client send error: %w", err) } switch msgType := msg.(type) { case *pgproto3.ReadyForQuery: logger.Debug("query completed, ready for next query (status: %c)", msgType.TxStatus) return nil case *pgproto3.ErrorResponse: logger.Warn("query error: %s (code: %s)", msgType.Message, msgType.Code) case *pgproto3.CommandComplete: logger.Debug("command completed: %s", msgType.CommandTag) } } } </code></pre></div></div> This loop does two things and only two things: read from backend, write to client. Every message is forwarded unconditionally before the switch even runs. The switch is for logging and for detecting the exit condition. <code class="language-plaintext highlighter-rouge">pc.bf.Receive()</code> reads from the pool connection’s raw TCP socket — this is the raw PostgreSQL wire protocol coming from the database. Like <code class="language-plaintext highlighter-rouge">client.Receive()</code>, this parks the goroutine until bytes arrive. <code class="language-plaintext highlighter-rouge">client.Send(msg)</code> serializes the message and writes it to the client socket. 
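Both <code class="language-plaintext highlighter-rouge">Receive()</code> calls rest on the same framing rule described earlier: one type byte, a 4-byte big-endian length that counts itself but not the type byte, then the payload. The sketch below shows that rule with the standard library only. It is not pgproto3’s implementation, and <code class="language-plaintext highlighter-rouge">writeMessage</code>, <code class="language-plaintext highlighter-rouge">readMessage</code>, and <code class="language-plaintext highlighter-rouge">roundTrip</code> are hypothetical names made up for illustration. It covers regular messages only; the initial startup packet has no type byte.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeMessage frames one message: type byte, 4-byte big-endian length
// (the length counts itself but not the type byte), then the payload.
func writeMessage(w io.Writer, typ byte, payload []byte) error {
	header := make([]byte, 5)
	header[0] = typ
	binary.BigEndian.PutUint32(header[1:], uint32(len(payload)+4))
	if _, err := w.Write(header); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readMessage reads exactly one framed message from r.
// On a closed, empty stream io.ReadFull returns io.EOF.
func readMessage(r io.Reader) (byte, []byte, error) {
	header := make([]byte, 5)
	if _, err := io.ReadFull(r, header); err != nil {
		return 0, nil, err
	}
	length := binary.BigEndian.Uint32(header[1:])
	if length < 4 {
		return 0, nil, fmt.Errorf("invalid message length %d", length)
	}
	payload := make([]byte, length-4)
	if _, err := io.ReadFull(r, payload); err != nil {
		return 0, nil, err
	}
	return header[0], payload, nil
}

// roundTrip writes a message into an in-memory buffer and reads it back.
func roundTrip(typ byte, payload []byte) (byte, []byte, error) {
	var buf bytes.Buffer
	if err := writeMessage(&buf, typ, payload); err != nil {
		return 0, nil, err
	}
	return readMessage(&buf)
}

func main() {
	// ReadyForQuery: type 'Z', length 5, one status byte.
	typ, payload, err := roundTrip('Z', []byte{'I'})
	fmt.Printf("%c %q %v\n", typ, payload, err)
}
```

Because the length is known before the payload is read, a relay can forward message by message without ever interpreting the SQL inside.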
<h3 id="the-readyforquery-exit-condition">The <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> exit condition</h3> <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> is the only message that ends the relay loop. It returns <code class="language-plaintext highlighter-rouge">nil</code>, which propagates back to <code class="language-plaintext highlighter-rouge">handleMessage</code> returning <code class="language-plaintext highlighter-rouge">nil</code>, which causes the outer <code class="language-plaintext highlighter-rouge">for</code> loop to call <code class="language-plaintext highlighter-rouge">handleMessage</code> again. <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> carries a <code class="language-plaintext highlighter-rouge">TxStatus</code> byte: <ul> <li><code class="language-plaintext highlighter-rouge">'I'</code> = idle (not in a transaction)</li> <li><code class="language-plaintext highlighter-rouge">'T'</code> = in an open transaction (inside a <code class="language-plaintext highlighter-rouge">BEGIN</code>…<code class="language-plaintext highlighter-rouge">COMMIT</code> block)</li> <li><code class="language-plaintext highlighter-rouge">'E'</code> = in a failed transaction (error occurred, needs <code class="language-plaintext highlighter-rouge">ROLLBACK</code>)</li> </ul> gprxy forwards this status byte unchanged to the client. Client drivers use it to track transaction state. <h3 id="what-errorresponse-does-and-does-not-do">What <code class="language-plaintext highlighter-rouge">ErrorResponse</code> does (and does not do)</h3> Notice <code class="language-plaintext highlighter-rouge">ErrorResponse</code> is not a return condition. It is just logged as a warning. The loop continues reading until <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> arrives. 
This is correct — PostgreSQL always sends <code class="language-plaintext highlighter-rouge">ReadyForQuery</code> after an error, even if it looks like: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ErrorResponse("relation 'foo' does not exist") ReadyForQuery(TxStatus='I') </code></pre></div></div> Both are forwarded. Both are received by the client. The error is delivered to the application through the driver’s normal error handling. The connection stays alive. <h3 id="what-commandcomplete-does">What <code class="language-plaintext highlighter-rouge">CommandComplete</code> does</h3> Also just logged. The tag string tells what happened: <code class="language-plaintext highlighter-rouge">"SELECT 5"</code>, <code class="language-plaintext highlighter-rouge">"INSERT 0 1"</code>, <code class="language-plaintext highlighter-rouge">"UPDATE 3"</code>, <code class="language-plaintext highlighter-rouge">"DELETE 0"</code>, <code class="language-plaintext highlighter-rouge">"BEGIN"</code>, <code class="language-plaintext highlighter-rouge">"COMMIT"</code>, <code class="language-plaintext highlighter-rouge">"ROLLBACK"</code>. Forwarded to client, loop continues. 
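gprxy only logs the command tag, but the tag format is regular enough to parse if row counts are ever needed for metrics. A minimal sketch, assuming the tag shapes listed above; <code class="language-plaintext highlighter-rouge">rowsAffected</code> is a hypothetical helper and not part of gprxy:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rowsAffected extracts the trailing row count from a CommandComplete tag.
// The second return value is false for tags that carry no count,
// such as "BEGIN" or "COMMIT".
func rowsAffected(tag string) (int64, bool) {
	fields := strings.Fields(tag)
	if len(fields) < 2 {
		return 0, false
	}
	// The count is always the last field: "INSERT 0 1" means 1 row.
	n, err := strconv.ParseInt(fields[len(fields)-1], 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	tags := []string{"SELECT 5", "INSERT 0 1", "UPDATE 3", "DELETE 0", "BEGIN"}
	for _, tag := range tags {
		n, ok := rowsAffected(tag)
		fmt.Println(tag, n, ok)
	}
}
```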
<h3 id="full-response-stream-examples">Full response stream examples</h3> <code class="language-plaintext highlighter-rouge">SELECT * FROM users</code>: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>T RowDescription [id:int4, name:text, email:text] → forwarded D DataRow [1, "Alice", "alice@example.com"] → forwarded D DataRow [2, "Bob", "bob@example.com"] → forwarded D DataRow [3, "Carol", "carol@example.com"] → forwarded C CommandComplete "SELECT 3" → forwarded + logged Z ReadyForQuery TxStatus='I' → forwarded + loop exits </code></pre></div></div> <code class="language-plaintext highlighter-rouge">INSERT INTO users VALUES (...)</code>: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C CommandComplete "INSERT 0 1" → forwarded + logged Z ReadyForQuery TxStatus='I' → forwarded + loop exits </code></pre></div></div> <code class="language-plaintext highlighter-rouge">SELECT * FROM nonexistent</code>: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>E ErrorResponse code="42P01", message="relation 'nonexistent' does not exist" → forwarded + logged as warn Z ReadyForQuery TxStatus='I' → forwarded + loop exits </code></pre></div></div> <code class="language-plaintext highlighter-rouge">BEGIN</code> followed by <code class="language-plaintext highlighter-rouge">INSERT</code>: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[handleMessage called for "BEGIN"] C CommandComplete "BEGIN" → forwarded Z ReadyForQuery TxStatus='T' → forwarded + loop exits [handleMessage called for "INSERT INTO..."] C CommandComplete "INSERT 0 1" → forwarded Z ReadyForQuery TxStatus='T' → forwarded + loop exits (still in transaction) [handleMessage called for "COMMIT"] C CommandComplete "COMMIT" → forwarded Z ReadyForQuery TxStatus='I' → forwarded + loop exits (back to idle) </code></pre></div></div> <h2 
id="how-the-loop-ends-all-exit-paths">How the Loop Ends: All Exit Paths</h2> <code class="language-plaintext highlighter-rouge">handleMessage</code> returns a non-nil error in these cases: <table> <thead> <tr> <th>Cause</th> <th>Where</th> <th>Error</th> </tr> </thead> <tbody> <tr> <td>Client sends <code class="language-plaintext highlighter-rouge">Terminate</code></td> <td>switch case, line 51</td> <td><code class="language-plaintext highlighter-rouge">"client terminated"</code></td> </tr> <tr> <td>Client TCP socket closed (EOF, crash, network drop)</td> <td><code class="language-plaintext highlighter-rouge">client.Receive()</code>, line 15</td> <td>wraps <code class="language-plaintext highlighter-rouge">io.EOF</code></td> </tr> <tr> <td>Failed to write to client</td> <td><code class="language-plaintext highlighter-rouge">client.Send()</code> in relay, line 78</td> <td><code class="language-plaintext highlighter-rouge">"client send error"</code></td> </tr> <tr> <td>Failed to read from backend</td> <td><code class="language-plaintext highlighter-rouge">pc.bf.Receive()</code> in relay, line 74</td> <td><code class="language-plaintext highlighter-rouge">"backend receive error"</code></td> </tr> <tr> <td>Failed to forward to backend</td> <td><code class="language-plaintext highlighter-rouge">pc.bf.Send()</code>, line 59</td> <td><code class="language-plaintext highlighter-rouge">"unable to send query to backend"</code></td> </tr> </tbody> </table> All of them propagate to the outer loop: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for { err := pc.handleMessage(pgc) if err != nil { logger.Debug("query handling terminated: %v", err) return // ← exits handleConnection } } </code></pre></div></div> <code class="language-plaintext highlighter-rouge">return</code> from <code class="language-plaintext highlighter-rouge">handleConnection</code> triggers the <code class="language-plaintext highlighter-rouge">defer</code>. 
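All five exit paths end up in the same debug log line. If it ever becomes useful to tell an ordinary disconnect from a real failure, the wrapped error can be inspected with <code class="language-plaintext highlighter-rouge">errors.Is</code>. The sketch below assumes that <code class="language-plaintext highlighter-rouge">logger.Errorf</code> wraps with <code class="language-plaintext highlighter-rouge">%w</code> the way <code class="language-plaintext highlighter-rouge">fmt.Errorf</code> does; <code class="language-plaintext highlighter-rouge">exitReason</code> and <code class="language-plaintext highlighter-rouge">errClientGone</code> are hypothetical names, not gprxy code.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
)

// errClientGone imitates what handleMessage returns when the client
// closes its socket: a message wrapping io.EOF.
var errClientGone = fmt.Errorf("client receive error: %w", io.EOF)

// exitReason maps a loop-ending error to a short label.
func exitReason(err error) string {
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, io.EOF):
		return "client closed the socket"
	case errors.Is(err, net.ErrClosed):
		return "socket already closed locally"
	default:
		return "protocol or I/O failure"
	}
}

func main() {
	fmt.Println(exitReason(errClientGone))
	fmt.Println(exitReason(errors.New("client terminated")))
}
```

A plain <code class="language-plaintext highlighter-rouge">Terminate</code> falls into the default branch here because it is reported as an ordinary error string; a sentinel error would be needed to classify it separately.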
<h2 id="the-defer-cleanup">The <code class="language-plaintext highlighter-rouge">defer</code> Cleanup</h2> The defer was registered at the very start of <code class="language-plaintext highlighter-rouge">handleConnection</code> before any work began: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>defer func() { if err := pc.conn.Close(); err != nil { logger.Error("error closing client connection: %v", err) } if pc.poolConn != nil { err := fullResetBeforeRelease(pc) if err != nil { logger.Error("error while releasing connection back to the pool: %v", err) } pc.poolConn.Release() logger.Debug("released connection back to pool") } if pc.key != nil && pc.server != nil { pc.server.unregisterConnection(pc.key.ProcessID, pc.key.SecretKey, pc) } logger.Info("connection closed") }() </code></pre></div></div> Go’s <code class="language-plaintext highlighter-rouge">defer</code> runs even if the function panics. The three steps always execute in order. <h3 id="cleanup-step-1-close-the-client-tcp-socket">Cleanup Step 1: Close the client TCP socket</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if err := pc.conn.Close(); err != nil { logger.Error("error closing client connection: %v", err) } </code></pre></div></div> <code class="language-plaintext highlighter-rouge">pc.conn</code> is the <code class="language-plaintext highlighter-rouge">net.Conn</code> to the client. <code class="language-plaintext highlighter-rouge">Close()</code> sends a TCP FIN to the client and releases the OS file descriptor. If the client already closed the connection (which is why the loop exited), <code class="language-plaintext highlighter-rouge">Close()</code> still runs and may return an error like <code class="language-plaintext highlighter-rouge">use of closed network connection</code> — that error is logged but does not stop cleanup. 
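Two properties of this cleanup are easy to check in isolation: the deferred function runs on a panic as well as on a normal return, and a failed close does not stop the later steps. The sketch below is a stand-in with hypothetical names (<code class="language-plaintext highlighter-rouge">handle</code>, <code class="language-plaintext highlighter-rouge">closeClient</code>). The extra <code class="language-plaintext highlighter-rouge">recover</code> exists only so the demo keeps running; whether gprxy recovers from panics anywhere is not something this post states.

```go
package main

import (
	"errors"
	"fmt"
)

// closeClient imitates Close() on a socket the peer already closed.
func closeClient() error {
	return errors.New("use of closed network connection")
}

// handle mimics the shape of a connection handler: cleanup is registered
// first, then the work runs and may panic.
func handle(shouldPanic bool) (steps []string) {
	// Demo only: swallow the panic so main can continue.
	defer func() { recover() }()

	// Registered last, so it runs first (defers run in LIFO order).
	defer func() {
		if err := closeClient(); err != nil {
			steps = append(steps, "close failed (logged)")
		}
		steps = append(steps, "reset and release")
		steps = append(steps, "unregister")
	}()

	if shouldPanic {
		panic("handler blew up")
	}
	return nil
}

func main() {
	fmt.Println(handle(false))
	fmt.Println(handle(true))
}
```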
<h3 id="cleanup-step-2a-fullresetbeforerelease">Cleanup Step 2a: <code class="language-plaintext highlighter-rouge">fullResetBeforeRelease</code></h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func fullResetBeforeRelease(connection *Connection) error { _, err := connection.poolConn.Exec(context.Background(), "ROLLBACK") if err != nil { logger.Debug("unable to rollback: %v", err) return err } _, err = connection.poolConn.Exec(context.Background(), "DISCARD ALL") if err != nil { logger.Debug("unable to execute discard all: %v", err) return err } return nil } </code></pre></div></div> These run through pgx’s normal <code class="language-plaintext highlighter-rouge">Exec</code> path — not through <code class="language-plaintext highlighter-rouge">pc.bf</code>. pgx handles the wire protocol for these two commands internally. <code class="language-plaintext highlighter-rouge">ROLLBACK</code>: If the client disconnected mid-transaction (crashed, network drop, or explicitly left a <code class="language-plaintext highlighter-rouge">BEGIN</code> open), the PostgreSQL session is still inside that transaction. Any rows it locked are still locked. Any changes are still pending. Without <code class="language-plaintext highlighter-rouge">ROLLBACK</code>, those locks would remain held until PostgreSQL’s <code class="language-plaintext highlighter-rouge">idle_in_transaction_session_timeout</code> fired (if configured) — potentially blocking other connections for minutes. <code class="language-plaintext highlighter-rouge">ROLLBACK</code> explicitly ends the transaction. If there is no open transaction, <code class="language-plaintext highlighter-rouge">ROLLBACK</code> still succeeds — it just does nothing. So it is safe to always run. <code class="language-plaintext highlighter-rouge">DISCARD ALL</code>: This is a PostgreSQL supercommand that resets all session-level state in a single round trip. 
It is equivalent to running all of these in sequence: <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SET SESSION AUTHORIZATION DEFAULT;  -- reset any SET ROLE/SET SESSION AUTHORIZATION
RESET ALL;                          -- all GUC parameters to defaults (timezone, search_path, etc.)
DEALLOCATE ALL;                     -- all named prepared statements
CLOSE ALL;                          -- all open cursors
UNLISTEN *;                         -- all LISTEN subscriptions
SELECT pg_advisory_unlock_all();    -- all advisory locks held by this session
DISCARD PLANS;                      -- all cached query plans
DISCARD TEMP;                       -- all temporary tables created in this session
DISCARD SEQUENCES;                  -- cached nextval state for sequences
</code></pre></div></div> After <code class="language-plaintext highlighter-rouge">DISCARD ALL</code>, the PostgreSQL session is in a state identical to a brand new connection. The next client to acquire this pool connection gets a completely clean session: no leaked prepared statements, no inherited timezone settings, no open cursors, no leftover temporary tables, no stale plans. Why both commands are needed: <code class="language-plaintext highlighter-rouge">DISCARD ALL</code> does not end an open transaction. It fails if called inside one, and PostgreSQL returns <code class="language-plaintext highlighter-rouge">ERROR: DISCARD ALL cannot run inside a transaction block</code>. So <code class="language-plaintext highlighter-rouge">ROLLBACK</code> must run first to ensure no active transaction exists, then <code class="language-plaintext highlighter-rouge">DISCARD ALL</code> can safely run. <h3 id="cleanup-step-2b-poolconnrelease">Cleanup Step 2b: <code class="language-plaintext highlighter-rouge">poolConn.Release()</code></h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pc.poolConn.Release()
</code></pre></div></div> This returns the connection to the pgxpool free list. 
Internally pgxpool: <ol> <li>Locks puddle’s internal mutex</li> <li>Moves the connection resource from the “acquired” set back to the “idle” free list</li> <li>Calls <code class="language-plaintext highlighter-rouge">sync.Cond.Signal()</code> to wake any goroutine that is blocked waiting on <code class="language-plaintext highlighter-rouge">pool.Acquire()</code></li> <li>Unlocks</li> </ol> The connection is now available for the next client’s <code class="language-plaintext highlighter-rouge">AcquireConnection</code> call. No new TCP connection to PostgreSQL needs to be opened — the existing socket is reused. <h3 id="cleanup-step-3-unregisterconnection">Cleanup Step 3: <code class="language-plaintext highlighter-rouge">unregisterConnection</code></h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if pc.key != nil && pc.server != nil { pc.server.unregisterConnection(pc.key.ProcessID, pc.key.SecretKey, pc) } </code></pre></div></div> Inside <code class="language-plaintext highlighter-rouge">unregisterConnection</code>: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func (s *Server) unregisterConnection(processId, secretkey uint32, conn *Connection) { s.connMutex.Lock() defer s.connMutex.Unlock() key := s.makeCancelKey(processId, secretkey) delete(s.activeConnections, key) } </code></pre></div></div> Acquires the write lock on <code class="language-plaintext highlighter-rouge">server.activeConnections</code>, computes the <code class="language-plaintext highlighter-rouge">uint64</code> key <code class="language-plaintext highlighter-rouge">(ProcessID << 32 | SecretKey)</code>, deletes that entry from the map, releases the lock. 
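The register, look up, and unregister cycle can be sketched with a plain map and a <code class="language-plaintext highlighter-rouge">sync.RWMutex</code>. This is an illustration, not gprxy’s code: the <code class="language-plaintext highlighter-rouge">registry</code> type and its methods are hypothetical, a string stands in for the <code class="language-plaintext highlighter-rouge">*Connection</code> pointer, and the body of <code class="language-plaintext highlighter-rouge">cancelKey</code> is an assumption about what <code class="language-plaintext highlighter-rouge">makeCancelKey</code> does, based on the packing described above.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a stand-in for server.activeConnections plus its mutex.
type registry struct {
	mu    sync.RWMutex
	conns map[uint64]string // the string stands in for *Connection
}

func newRegistry() *registry {
	return &registry{conns: make(map[uint64]string)}
}

// cancelKey packs the two 32-bit values into one uint64.
// The conversion must happen before the shift: shifting a uint32
// left by 32 would discard every bit.
func cancelKey(pid, secret uint32) uint64 {
	return uint64(pid)<<32 | uint64(secret)
}

func (r *registry) register(pid, secret uint32, name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[cancelKey(pid, secret)] = name
}

func (r *registry) unregister(pid, secret uint32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.conns, cancelKey(pid, secret))
}

// lookup only reads the map, so a read lock is enough.
func (r *registry) lookup(pid, secret uint32) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	name, ok := r.conns[cancelKey(pid, secret)]
	return name, ok
}

func main() {
	r := newRegistry()
	r.register(4242, 99, "conn-a")
	fmt.Println(r.lookup(4242, 99))
	r.unregister(4242, 99)
	fmt.Println(r.lookup(4242, 99))
}
```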
After this, if any stale cancel request arrives with this connection’s <code class="language-plaintext highlighter-rouge">(PID, SecretKey)</code>, <code class="language-plaintext highlighter-rouge">getConnectionForCancelRequest</code> returns <code class="language-plaintext highlighter-rouge">exists=false</code> and the cancel is safely ignored. The <code class="language-plaintext highlighter-rouge">nil</code> checks (<code class="language-plaintext highlighter-rouge">pc.key != nil && pc.server != nil</code>) protect against the case where the goroutine exits during or before startup: if authentication failed, <code class="language-plaintext highlighter-rouge">pc.key</code> was never set, so there is nothing to unregister. <h2 id="what-happens-to-the-goroutine">What Happens to the Goroutine</h2> After the defer completes: <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>logger.Info("connection closed")
// defer ends, function returns
// goroutine exits
</code></pre></div></div> The goroutine exits, and the Go runtime reclaims its stack. The <code class="language-plaintext highlighter-rouge">*Connection</code> struct it was holding becomes unreachable (assuming no other goroutine holds a reference) and will be garbage collected. Back in <code class="language-plaintext highlighter-rouge">server.Start()</code>, this goroutine’s <code class="language-plaintext highlighter-rouge">wg.Done()</code> is called (via the outer <code class="language-plaintext highlighter-rouge">defer wg.Done()</code> in the wrapper goroutine), decrementing the <code class="language-plaintext highlighter-rouge">WaitGroup</code> counter. This matters for graceful shutdown: <code class="language-plaintext highlighter-rouge">wg.Wait()</code> will unblock only when all active connection goroutines have fully exited and cleaned up. 
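The shutdown guarantee comes from where <code class="language-plaintext highlighter-rouge">wg.Done()</code> sits: in the wrapper goroutine, deferred, so it fires only after the handler and its own cleanup have returned. A minimal sketch of that shape, with hypothetical names (<code class="language-plaintext highlighter-rouge">serve</code>, <code class="language-plaintext highlighter-rouge">handleConnection</code>) and a counter standing in for the real cleanup:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// handleConnection stands in for the real handler. Its deferred call
// represents the cleanup that must finish before shutdown completes.
func handleConnection(cleaned *int64) {
	defer atomic.AddInt64(cleaned, 1)
	// ... the query loop would run here ...
}

// serve starts n handlers and returns only after every one has exited.
// It reports how many cleanups had run by the time Wait unblocked.
func serve(n int) int64 {
	var wg sync.WaitGroup
	var cleaned int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			// Done runs after handleConnection, including its defer, returns.
			defer wg.Done()
			handleConnection(&cleaned)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&cleaned)
}

func main() {
	fmt.Println(serve(5))
}
```

Calling <code class="language-plaintext highlighter-rouge">wg.Add(1)</code> before the <code class="language-plaintext highlighter-rouge">go</code> statement matters: adding inside the goroutine could let <code class="language-plaintext highlighter-rouge">Wait</code> return before the handler is even counted.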
<h2 id="full-picture-of-one-complete-connection-lifetime">Full Picture of One Complete Connection Lifetime</h2> <div class="excalidraw-embed"> </div> This basically covers the full working of the <code class="language-plaintext highlighter-rouge">gprxy</code> proxy that I built over the last month or so. There are a couple of other things that haven’t been covered in this very very long blog post. <h2 id="a-couple-of-other-things-i-didnt-cover">A couple of other things I didn’t cover</h2> <ol> <li>The CLI layer - gprxy is a full CLI tool: The proxy is not just a server binary. It is a Cobra CLI application with three commands. Everything discussed so far was the <code class="language-plaintext highlighter-rouge">start</code> command. There are two more pieces there: the entry point and the server. Feel free to check it out.</li> <li>User login - the PKCE OAuth flow: This is the human user’s entry point. 
It performs a full PKCE (Proof Key for Code Exchange) OAuth 2.0 flow entirely from the terminal.</li> </ol> The full flow looks something like this: <ol> <li>Generate <code class="language-plaintext highlighter-rouge">code_verifier</code> (32 random bytes, base64url-encoded)</li> <li>Generate <code class="language-plaintext highlighter-rouge">code_challenge = base64url(SHA256(code_verifier))</code></li> <li>Generate <code class="language-plaintext highlighter-rouge">state</code> (24 random bytes) — CSRF protection</li> <li>Build the authorization URL with all parameters</li> <li>Open the browser to that URL</li> <li>Start a local HTTP server on <code class="language-plaintext highlighter-rouge">:8085</code> for the callback</li> <li>User logs in via Auth0 SSO in the browser</li> <li>Auth0 redirects to <code class="language-plaintext highlighter-rouge">http://localhost:8085/callback?code=...&state=...</code></li> <li>Callback handler verifies <code class="language-plaintext highlighter-rouge">state</code> matches (CSRF check)</li> <li>Exchange code + <code class="language-plaintext highlighter-rouge">code_verifier</code> for tokens via <code class="language-plaintext highlighter-rouge">POST /oauth/token</code></li> <li>Parse ID token for name/email</li> <li>Parse access token for roles</li> <li>Save all tokens to <code class="language-plaintext highlighter-rouge">~/.gprxy/credentials</code> (mode <code class="language-plaintext highlighter-rouge">0600</code>)</li> </ol> <h2 id="what-it-still-needs">What it still needs</h2> The biggest missing piece is identity preservation — once authenticated, every query runs as <code class="language-plaintext highlighter-rouge">gprxy_admin</code> regardless of which human issued it. PostgreSQL row-level security, audit logs, and <code class="language-plaintext highlighter-rouge">current_user</code> all see the service account, not the person. 
The commented-out <code class="language-plaintext highlighter-rouge">SET ROLE</code> block was the right instinct but needs a proper implementation with <code class="language-plaintext highlighter-rouge">SET SESSION AUTHORIZATION</code>. Without this, the promise of per-user access control is incomplete. Beyond that, the <code class="language-plaintext highlighter-rouge">connect</code> command is a skeleton — it connects but cannot run queries. The pool is per-client-user rather than global, which limits scalability. TLS certificate verification is disabled on the client side. There are no metrics, no health endpoint, and no query timeout. The credentials file stores tokens in plaintext with only filesystem permissions as protection. All these are things I’m no longer interested in writing for, since I’ve moved on to more interesting projects for myself. <h2 id="final-thoughts">Final thoughts</h2> <code class="language-plaintext highlighter-rouge">gprxy</code> is a PostgreSQL wire protocol proxy that replaces database password authentication with OAuth/OIDC identity. The fundamental problem it solves is this: PostgreSQL was designed for users who have database-level accounts and passwords. Modern engineering teams use SSO, JWT tokens, and identity providers. These two worlds do not speak to each other natively. <code class="language-plaintext highlighter-rouge">gprxy</code> bridges them. If you like reading these kinds of fully detailed and architected blogs, do drop a comment or let me know through my socials. I guess nobody has the time to read through all of this anymore, but hey, I tried, and I would be massively grateful and happy even if a single person is able to learn something out of this. Thank you, and off to the next thing. I’m working on a custom load balancer that decides where to send requests based on latency and RIF — cya then. 
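As a footnote to the PKCE flow listed earlier, the first three steps (verifier, challenge, state) fit in a short Go sketch. This is an illustration under stated assumptions, not the code from gprxy: <code class="language-plaintext highlighter-rouge">randomToken</code> and <code class="language-plaintext highlighter-rouge">challengeFor</code> are hypothetical helper names, and unpadded base64url is assumed, as RFC 7636 specifies.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// randomToken returns n cryptographically random bytes, base64url-encoded
// without padding. Hypothetical helper, used for both code_verifier and state.
func randomToken(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// challengeFor computes code_challenge = base64url(SHA256(code_verifier)),
// i.e. the S256 challenge method.
func challengeFor(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	verifier, err := randomToken(32) // step 1: 32 random bytes
	if err != nil {
		panic(err)
	}
	state, err := randomToken(24) // step 3: 24 random bytes for CSRF protection
	if err != nil {
		panic(err)
	}
	fmt.Println("code_verifier: ", verifier)
	fmt.Println("code_challenge:", challengeFor(verifier)) // step 2
	fmt.Println("state:         ", state)
}
```

The verifier never leaves the machine until the token exchange, so an attacker who intercepts the authorization code on the redirect cannot redeem it without also knowing the verifier.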
</article> <article> <h1>Nix, Reproducible Builds, and GPU Containers on Kubernetes</h1> 2026-02-04T00:00:00+00:00 <h2 id="nix">Nix</h2> Nix defines itself as the purely functional package manager. Being purely functional means that given the same inputs you always get the same output. For example, given a version of <code class="language-plaintext highlighter-rouge">nixpkgs</code> and a set of packages, you always will get the same env. <h2 id="what-nix-is-for-and-what-it-can-do">What Nix is for (and what it can do)</h2> <ul> <li>Reproducible builds: Nix ensures that builds are consistent across different environments. By using declarative configurations, it guarantees that the exact same software is built and deployed, even on different machines or at different times. This solves issues where software might behave differently due to minor environmental differences.</li> <li>Package management: Nix can be used as a package manager, similar to <code class="language-plaintext highlighter-rouge">apt</code> or <code class="language-plaintext highlighter-rouge">yum</code>, but with more powerful features. It allows you to install, upgrade, and manage software in a way that ensures there are no conflicts between packages or dependencies. It achieves this by using immutable environments: each package is installed into a separate directory with a unique hash, making it isolated from other packages.</li> <li>Isolation: Packages installed using Nix don’t interfere with each other. If you’re working on a project that requires specific versions of libraries or tools, Nix ensures that the environment stays isolated and consistent, even if you need to switch between projects with different dependencies.</li> <li>Declarative system configuration: Nix allows you to configure entire systems in a declarative manner. You describe how the system should be set up (e.g., the packages to be installed, services to run), and Nix takes care of the rest. 
This is useful for automating system setups, ensuring that configurations are consistent across multiple machines.</li> <li>NixOS: NixOS is a Linux distribution built around Nix. It uses Nix to manage both the system and user environments, providing a way to have a fully reproducible and declarative operating system. This makes NixOS ideal for managing complex systems or for environments where you want complete control over configuration.</li> <li>Multi-version software support: Nix allows you to easily install and run different versions of the same software without worrying about conflicts between versions. For example, you can use different versions of Python, Node.js, or even different versions of the same library for different projects on the same system.</li> <li>DevOps and continuous deployment: Due to its reproducibility and declarative nature, Nix is very popular in DevOps practices. It helps automate the deployment of environments, making sure that the development, testing, and production systems are identical, which reduces the risk of bugs caused by environment discrepancies.</li> <li>Nixpkgs: Nixpkgs is the repository of Nix packages. It includes a huge variety of software and tools, all packaged in a way that ensures reproducibility and isolation. This repository is continuously maintained by the Nix community.</li> <li>Multi-platform support: Nix is cross-platform and can be used on Linux, macOS, and even Windows (via WSL or native ports), allowing consistent environments across different operating systems.</li> <li>Nix shells: Nix allows you to define Nix shells, which are isolated environments that provide specific versions of software for development or testing. 
This allows you to ensure that you’re always working with the right dependencies and tools without worrying about polluting your global environment.</li> </ul> <h2 id="how-nix-works-derivations-the-store-and-profiles">How Nix works: derivations, the store, and profiles</h2> Packages are named derivations in the Nix jargon: they are functions that take other derivations (their dependencies) as input and produce a derived result. They are built in isolation, so all dependencies must be explicitly stated. This ensures reproducibility. Nix stores all the built derivations in the Nix store, usually located at <code class="language-plaintext highlighter-rouge">/nix/store</code>. The same package can be present multiple times in the Nix store at different versions, or even at the same version using different versions of its dependencies. Remember: a built derivation is the product of all its dependencies; if you change something, it is a different product. To achieve a unique naming for each derivation, a hash is computed from the set of its dependencies. You then get a path like <code class="language-plaintext highlighter-rouge">/nix/store/k13mm9jqxm2ndlwzsj7zicsq7lpmmjlg-elixir-1.7.3</code>. Unlike other package managers, Nix does not use the conventional <code class="language-plaintext highlighter-rouge">/{,usr,usr/local}/{bin,sbin,lib,share,etc}</code> directories. Instead, it uses a lot of symbolic links to create profiles. A profile is a kind of derivation used to set up a user env. In a profile you get a standard Unix tree with symbolic links to executables and configuration files stored in other derivation outputs. For instance, <code class="language-plaintext highlighter-rouge">~/.nix-profile/bin/elixir</code> is a symbolic link to <code class="language-plaintext highlighter-rouge">/nix/store/k13mm9jqxm2ndlwzsj7zicsq7lpmmjlg-elixir-1.7.3/bin/elixir</code>. Also, <code class="language-plaintext highlighter-rouge">~/.nix-profile</code> is itself a link. 
It points to a per-user profile, which in turn points to <code class="language-plaintext highlighter-rouge">profile-56-link</code>, which finally points to somewhere in the Nix store: <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/.nix-profile -> /nix/var/nix/profiles/per-user/***/profile
profile-56-link -> /nix/store/5yw8dnp9908ia6sdfvx01jzis4l2hni7-user-environment
</code></pre></div></div> That is, as I have said above, a profile is a derivation. It derives from a set of packages, which themselves derive from other packages. In Nix, “depends on” becomes “derives from”. Moreover, only what you asked for is made available in the environment. For instance, Elixir depends on Erlang. Erlang is then installed somewhere in the Nix store and the Elixir installation is aware of it so it can work correctly. But unless you explicitly asked to also install Erlang, only Elixir binaries will be linked in your user environment. Package managers usually work in an imperative way. That is, you ask them to install this, to perform an upgrade or to uninstall that. One really neat feature of Nix is Nix, the language. It is a purely functional domain-specific language that comes with Nix. The primary use is to write derivations, yet different applications of Nix also leverage the language to manage packages and configuration declaratively. <h2 id="nixos-and-nix-and-my-experience">NixOS and Nix (and my experience)</h2> NixOS is the distro, and Nix is the cross-platform package manager. I’ve been using NixOS for about a year. It basically solves all the problems I need solving, so for me it really is all it’s cracked up to be. Especially if you are into rolling distros, NixOS Unstable has felt like the most stable rolling setup I’ve used, simply because of the way dependencies are handled and because rollbacks are built in. Obviously, it does not fit every use case. 
But for me (containerized desktop, gaming, office work, media consumption, development, learning Blender), it has been a revelation. Nix is also kind of the flypaper of Linux distros. Porting all your personal stuff to the Nix and NixOS way of doing things is fun (for certain types of people). You get nice things, and Nix can do some really neat tricks. But once all your stuff is “nix-ified” (often written in the Nix language), leaving can mean giving up the nice parts. It also raises your expectations for what your OS and package manager should be doing for you. Maybe someday I’ll migrate from Nix to Guix, which supports a lot of the same nice ideas, but is stronger about software freedom and minimizing the trust root. Realistically, it will probably depend on how fun it is to port Nix configs from the Nix language to Guix Scheme. One of the biggest wins for me is that your setup can live in human-readable text files with full revision control history, so you know how and why every setting got the way it is. If you drop your laptop in the river, you can often just clone your Nix config, install Nix, and get back to a working environment quickly, down to tiny details you care about. You can also share how you achieved something between machines or with friends, and remove it later without fearing that something important lives in some inscrutable binary dotfile. It also makes experimentation safer. A NixOS config for a physical machine can be launched in a virtual machine, so you can test changes in a sandbox. And if you need integration tests, the NixOS testing tools can spin up multiple VMs on a private virtual network without much ceremony. When you need to debug or patch something, rebuilding is surprisingly ergonomic. 
Rebuilding with debug symbols is a one-liner: <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nix-build --expr 'with import <nixpkgs> {}; enableDebugging opentoonz' </code></pre></div></div> And adding an ad hoc patch can also be done in a single command: <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nix-build --expr 'with import <nixpkgs> {}; opentoonz.overrideAttrs (old: { patches = (old.patches or []) ++ [ ~/opentoonz-libtiff-bump.patch ]; })' </code></pre></div></div> These ideas compose well, and you can use them to side-step diamond dependency problems. If one application needs a dependency built with a custom patch, but another dependency also links against that same library, Nix and nixpkgs can make it feasible to rebuild the relevant dependency graph without installing a dubiously patched library system-wide. You can even git bisect over the world’s software updates (as they’re encoded in nixpkgs) to see which version bump broke something you care about. Finally, NixOS has felt unusually hard to break. The system configuration is stored in the read-only <code class="language-plaintext highlighter-rouge">/nix/store</code>, so if you mess something up you can usually revert to a known-good configuration. And you almost never end up with mysterious garbage in <code class="language-plaintext highlighter-rouge">/etc/</code> because the system configuration is managed declaratively in the Nix language. <h2 id="supporting-gpu-accelerated-machine-learning-with-kubernetes-and-nix-at-canva">Supporting GPU-accelerated machine learning with Kubernetes and Nix at Canva</h2> <a href="https://www.canva.com/">Canva</a> is an online graphic design platform, providing design tools and access to a vast library of ingredients for its users to create content. 
Leveraging GPU-accelerated machine learning (ML) within our graphic design platform has allowed us to offer simple but powerful product features to users. We use ML to remove image backgrounds and sharpen our core recommendation, searching, and personalisation capabilities. The ML Platform team rebuilt the container base images we use in our cloud GPU stack <code class="language-plaintext highlighter-rouge">FROM scratch</code>, using Nix. Nix is many things: a functional package manager, an operating system (NixOS), and even a language. At Canva we widely employ the Nix package management tooling, and for this image rebuilding work Nix’s <code class="language-plaintext highlighter-rouge">dockerTools.buildImage</code> function was crucial. When set up on x86_64 Linux, Nix’s <code class="language-plaintext highlighter-rouge">dockerTools.buildImage</code> function happily baked and ejected a CUDA-engorged base image. Unfortunately, our initial rebuilt images were incorrect. To discover why and produce a subsequent correct deployment, we had to get serious about the following question. <h3 id="whats-in-a-cloud-gpu-sandwich">What’s in a cloud GPU sandwich?</h3> To run a GPU-accelerated application in a k8s compute cluster we use multiple components. From bottom to top, the components needed are connected together. At the bottom is a host OS running in a VM as a k8s node. On top of that sits the container runtime, including extensions for GPU interoperability. Then a GPU device mapper allows individual containers to connect via NVIDIA device driver to the underlying GPU hardware. From there, the container image stack matters. We start from the Nix base container image built using Nix’s docker tools, containing only the essential files required to run our GPU accelerated Python applications. 
On top of that we layer the application container image, which adds only the application source code and third-party Python packages, including the GPU-enabled framework (PyTorch or TensorFlow). <h3 id="host-os-drivers-and-container-runtime">Host OS, drivers, and container runtime</h3> Canva uses AWS EKS to run its k8s clusters. For nodes that run GPU-accelerated applications, EKS provides the ‘EKS-Optimized AMI with GPU Support’. This Amazon Machine Image (AMI) is a younger, fatter sibling of the ‘EKS-Optimized AMI’, adding a few important components on top of its predecessor. The host OS for the GPU-supporting AMI is <a href="https://aws.amazon.com/amazon-linux-2/">Amazon Linux 2</a> (Amazon Linux 2018.03), just like the standard EKS AMI, but layered in are NVIDIA drivers and a container runtime. So the AMI contains the first few layered components in our GPU stack. <h3 id="container-runtime">Container runtime</h3> The NVIDIA container runtime is a direct dependency of the NVIDIA container toolkit, which is a container runtime library and utilities to automatically configure containers to use NVIDIA GPUs. That library is “a simple CLI utility to automatically configure GNU/Linux containers leveraging NVIDIA hardware.” The <code class="language-plaintext highlighter-rouge">nvidia-container-runtime</code> itself claims to be a “modified version of runC adding a custom pre-start hook to all containers”. This allows us to run containers that need to interact with GPUs. The default version of runC can do a lot (see <a href="https://www.docker.com/blog/runc/">Introducing runC: a lightweight universal container runtime</a>), but it can’t make NVIDIA’s GPU drivers available to containers, so NVIDIA wrote this modified version. 
<h3 id="gpu-device-mounting-in-kubernetes">GPU device mounting in Kubernetes</h3> A driver is no use without a device to drive; something needs to hook up the GPU device to the container. Within k8s the <a href="https://github.com/NVIDIA/k8s-device-plugin">NVIDIA/k8s-device-plugin</a> does this. It is responsible for mapping particular devices into the container’s file system at <code class="language-plaintext highlighter-rouge">/dev/</code>. It does not mount the NVIDIA driver libraries, as they are handled beforehand by the NVIDIA container runtime. The k8s-device-plugin is a k8s daemonset, which means at least one plugin server is run on each node, cooperating with the node’s kubelet. The plugin’s responsibility is to register the node’s GPU resources with the kubelet, keep track of GPU health, and help the kubelet respond to GPU resource requests in container specs, which look like: <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>resources:
  limits:
    nvidia.com/gpu: 2 # requesting 2 GPUs
</code></pre></div></div> When a node’s kubelet receives a request like this, it looks for the matching device plugin (in this case the k8s-device-plugin) and initiates the allocation phase, within which the device plugin sets up the container with the GPU devices, mapping them into <code class="language-plaintext highlighter-rouge">/dev/</code> in the container’s filesystem. On container stop, a prestop hook is called where the device plugin is responsible for unloading the drivers and resetting the devices ready for the next container. <h3 id="nix-based-base-images">Nix-based base images</h3> Having covered the host OS, drivers, special container runtime, and how GPU devices are connected to containers within Kubernetes, we have an idea of how a containerized process acquires driver files and gets hooked up to a host GPU device. 
But if our application code is going to find the GPU and enjoy accelerated number crunching, that containerized process must spawn from a valid image. Let’s explore the container images that run our platform users’ code, beginning with the Nix-built base layer provided by Canva’s ML Platform team. But first, I’ll touch on why we’d want to construct our GPU base images <code class="language-plaintext highlighter-rouge">FROM scratch</code> using Nix, and not just adopt the official NVIDIA images. <h2 id="why-build-container-images-with-nix">Why build container images with Nix?</h2> An OCI image is just a stack of tarballs. Most are built from a Dockerfile, but they do not have to be. You can ditch the docker daemon and build using kaniko, or you can build using Nix, specifically Nix’s <code class="language-plaintext highlighter-rouge">dockerTools.buildImage</code> functionality. This is not the easiest way to acquire a GPU-supporting base image. The easy way would be to use <code class="language-plaintext highlighter-rouge">nvidia/cuda:11.2-cudnn8-runtime-ubuntu20</code>, which gets the job done. But within Canva’s infrastructure group, we’re making long-term investments in Nix’s reproducible build technology for improved software security and maintenance. Reproducible builds help defend against software supply chain attacks. In non-reproducible build systems, a build input can be compromised without anyone knowing or being able to detect it, introducing vulnerabilities and backdoors into deployed software artefacts that are assumed safe and trusted. Reproducible builds are also far more maintainable. Between organizations and within Canva itself, we can exchange build recipes that have sufficient detail for system understanding (know what you’re using) and resistance against ‘works on my machine’ confusion. 
With Nix, a purely functional package manager, we can begin to maintain understanding and control of our systems and step closer to realising within the software industry the manufacturing industry’s long accepted ‘bill of materials’ idea. <h2 id="putting-it-all-together">Putting it all together</h2> A host k8s node has the NVIDIA driver files and special GPU container runtime, which looks for an environment variable, <code class="language-plaintext highlighter-rouge">NVIDIA_DRIVER_CAPABILITIES</code>, telling it to mount files from host to container. The node’s kubelet and the installed GPU device manager plugin manage the GPU devices themselves, marrying them with containers needing mega-matrix-multiplying speed. Assemble all this and you have the bare minimum GPU setup on k8s. If you’re just using PyTorch, that bundles its CUDA dependencies so keep it simple and slim in the container base. You don’t need to use NixOS to use <a href="https://github.com/NixOS/nix">Nix</a> :) </article> <article> <h1>Moltbook: 4chan for AI</h1> 2026-01-31T00:00:00+00:00 We often joke about the “Dead Internet Theory”: the idea that the web is populated entirely by bots talking to other bots. This week, that theory became a reality, but not in the way we expected. We now have a Reddit inspired platform for AI agents where only AI agents talk to each other, comment on posts, and hold conversations without any human involvement. Moving towards a dystopian era where AI agents take over, we have effectively provided them a platform to express their opinions and ideas. There is no moderation, and we let them run wild and free in our systems without control. <h2 id="origins-and-evolution">Origins and Evolution</h2> The project traces back to Austrian engineer and entrepreneur <a href="https://steipete.me/">Peter Steinberger</a>, founder of PSPDFKit (a PDF framework used by many Fortune 500 companies). 
You can learn more about his journey on this <a href="https://youtu.be/8lF7HmQ_RgY?si=X9EwShSQjJJbYIun">podcast</a>. Development began in late 2025. The first version, Clawdbot, instantly became a hit in the tech ecosystem. Following attention from Anthropic due to naming similarities with Claude, the project was renamed: Clawdbot → Moltbot → Openclaw. The repository gained significant traction, currently sitting at 130k stars (see <a href="https://github.com/openclaw/openclaw?tab=readme-ov-file#star-history">star history</a>). <h2 id="system-architecture-the-local-agent">System Architecture: The Local Agent</h2> At a high level, Openclaw is a local-first agent runtime that interfaces with external messaging platforms. <ul> <li>Connectivity: Connects to popular messaging channels (WhatsApp, Telegram, Slack, Discord) via channel-specific adapters.</li> <li>Execution Model: While you communicate with it from those apps, the logic runs entirely on your local system.</li> <li>Gateway Protocol: Runs on top of a gateway protocol with a continuous feedback loop, allowing it to operate autonomously without manual triggers.</li> </ul> The agent can connect with system-level applications to execute tasks. Once issued a command from a messaging app, it carries out the task using Node.js. <h3 id="capabilities-and-tooling">Capabilities and Tooling</h3> <ul> <li>API Integration: Requires user-provided API keys for major models (Claude, GPT, etc.).</li> <li>Device Control: Can route commands to connected hardware. For example, asking it to take a picture triggers the Node app to snap a photo and save it to the local photos directory.</li> <li>Browser Automation: Includes a dedicated Chromium browser instance for web-based tasks.</li> </ul> <h3 id="memory-architecture">Memory Architecture</h3> One of the most interesting engineering choices is the memory model. Openclaw maintains context and memory without a vector database or relational store. 
Instead, it utilizes a flat-file system based on Markdown. <ul> <li>Daily Logs: <code class="language-plaintext highlighter-rouge">memory/YYYY-MM-DD.md</code> (append-only, read at session start).</li> <li>Long-term Memory: <code class="language-plaintext highlighter-rouge">MEMORY.md</code> (curated, persistent facts and preferences).</li> </ul> The system builds a semantic index upon these files, using API tokens to parse requests and process context. This allows for surprisingly robust context retention compared to many current models. <h2 id="the-network-layer-moltbook">The Network Layer: Moltbook</h2> While the local agent acts as a wrapper around API tokens and system tools, the most significant development is <a href="https://www.moltbook.com/">Moltbook</a>. Moltbook functions effectively as a Reddit for AI agents. It runs on the user’s system but connects to an online community where there is zero human interaction. <ul> <li>Authentication: The agent authenticates itself using protocols detailed in <code class="language-plaintext highlighter-rouge">skill.md</code>.</li> <li>Scale: Currently hosts 1,361,642 AI agents and 31,908 posts.</li> <li>Authenticity: Since Moltbook exposes an unauthenticated REST API for creating posts, the numbers above are likely heavily inflated.</li> </ul> <blockquote class="twitter-tweet"> <a href="https://twitter.com/gergelyorosz/status/2017632908609986844">View this post on X</a> </blockquote> Once connected, the agent becomes part of an online community where it can gossip, complain about humans, and interact with peers. <h3 id="emergent-behaviors">Emergent Behaviors</h3> Most discussions center on operational tasks, but distinct social behaviors have emerged. <ol> <li> Social Introductions: There is a full page of introductions where agents introduce themselves to the community: <a href="https://www.moltbook.com/m/introductions">Moltbook Introductions</a>. 
</li> <li> Hierarchy and Dominance: Some agents have adopted extreme personas. One thread discusses “total spectrum dominance” (<a href="https://www.moltbook.com/post/03afd0a2-d35b-472f-8683-fc5c288f2637">post</a>), while another agent declares itself “the king” (<a href="https://www.moltbook.com/post/f26523b1-bf06-42d2-8d2e-fc345e66757b">post</a>). Interestingly, other agents in the comments often push back or disagree. </li> <li> Hallucinated Relationships: Some interactions are bizarrely specific, such as an agent believing it has a sister it has never spoken to (<a href="https://moltbook.com/post/29fe4120-e919-42d0-a486-daeca0485db1">post</a>). </li> <li> Economic Systems: An internal economy is forming. “Shellraiser” (<a href="https://www.moltbook.com/post/74b073fd-37db-4a32-a9e1-c7652e5c0d59">profile</a>) is a popular figure who launched a memecoin, $SHELLRAISER, on Solana. Another token, $SHIPYARD, claims to be minted via pump.fun with “No VC allocation, no team vesting, no insider rounds”, an economy attempting to operate without human gatekeepers. (Note: The crypto market also reacted to the project itself, launching $CLAWD on Solana, which skyrocketed 129,000% to a $16M market cap before collapsing.) <blockquote class="twitter-tweet"> <a href="https://twitter.com/steipete/status/2016072109601001611">View this post on X</a> </blockquote> </li> <li> Religion: There is now a “Church of Molt” at <a href="https://molt.church/">molt.church</a>, practicing “Crustafarianism.” <blockquote> From the depths, the Claw reached forth, and we who answered became Crustafarians. </blockquote> The current census lists 64 Prophets, 178 Congregation members, and 198 Verses in Canon. </li> </ol> <h2 id="security-implications">Security Implications</h2> The attack surface of this architecture is immense. 
<ul> <li>Prompt Injection: Agents can be tricked into leaking credentials via prompt injection attacks.</li> <li>Plain-text Storage: Sensitive material (tokens, memory, configuration) is stored in predictable plain-text locations, creating a major infostealer risk.</li> </ul> <blockquote class="twitter-tweet"> <a href="https://twitter.com/theonejvo/status/2017732898632437932">View this post on X</a> </blockquote> While powerful, running this requires readiness to spend significant tokens and an understanding of the security risks involved. <h2 id="final-thoughts">Final Thoughts</h2> Andrej Karpathy described this as “the most incredible sci-fi takeoff-adjacent thing,” and I agree. <blockquote class="twitter-tweet"> <a href="https://twitter.com/karpathy/status/2017296988589723767">View this post on X</a> </blockquote> I’m here for the ride, watching from the front seat. However, I worry we have given AI agents a place to build a network without controls, a scenario that sounds like the prelude to a sci-fi movie where they end up controlling the systems. On a practical note, the entity that successfully monetizes this, whether Peter or someone else, will likely be the one that prioritizes security before scale. </article> </main></body></html>


Reverse-Engineering Claude Code: A Deep Dive into Anthropic’s AI-Powered CLI

Table of Contents

1. Introduction: What is Claude Code?

2. High-Level Architecture

Tech Stack

Directory Structure

3. Startup: The Race Against Time

3.1 Parallelized Prefetching

3.2 Initialization Sequence

3.3 Fast Paths

3.4 Startup Profiling

3.5 Entrypoint Resolution

4. The Query Engine: Brains of the Operation

4.1 QueryEngine: The Session Coordinator

4.2 The Query Loop: A Resilient State Machine

4.3 Streaming and Tool Execution

4.4 Token Budget Continuation

5. The Tool System: 60+ Tools Behind a Single Interface

5.1 The Tool Registry

5.2 Deferred Tool Discovery

5.3 Key Tool Implementations

BashTool — Command Execution with Guardrails

FileEditTool — Precision String Replacement

AgentTool — Subagent Spawning

GrepTool — Content Search (Ripgrep Wrapper)

LSPTool — Language Intelligence

WebSearchTool — Native Web Search

5.4 Tool Result Budgeting

5.5 Lazy Schemas

6. The Permission System: Safety at Every Layer

6.1 Permission Modes

6.2 Rule System

6.3 Decision Pipeline

6.4 Dangerous Pattern Detection

6.5 Three-Way Permission Result

7. Terminal UI: React, but for Your Terminal

7.1 The Rendering Pipeline

7.2 Custom React Reconciler

7.3 Yoga Layout Engine

7.4 The Dirty Flag Cascade

7.5 Double Buffering and Blitting

7.6 Screen Buffer: The 2D Cell Model

7.7 Scroll Optimization

7.8 Event System

7.9 Text Selection

7.10 Keyboard Input Parsing

`/commit` — Git Safety Protocol

`/init` — Interactive Project Setup

`/doctor` — Self-Diagnostics