Performance · Protocols

Built for Speed

Real benchmark numbers from real hardware. All 10 protocols benchmarked end-to-end: client → gateway → backend, with direct baselines for comparison.

102,183

HTTP/1.1 RPS

101,317

HTTP/1.1 + TLS RPS

108,841

TCP Proxy RPS

~49%

TCP Overhead

Mutex Locks (hot path)

O(1)

Route Lookup Complexity

Benchmark Methodology

How We Measure

Hardware

Apple Silicon M-series MacBook Pro. ARM64 native binary compiled with --release and jemalloc allocator.

Load Generator

proto_bench — a custom multi-protocol load generator. 200 concurrent connections, 10-second duration, 64-byte echo payload. Localhost loopback network.

Backend

Minimal echo server returning a small fixed response. Isolates gateway overhead from backend latency. Direct backend numbers measured bypassing Ferrum Edge.

⚠️

Benchmark context: Localhost benchmarks represent peak theoretical throughput. Production performance depends on network latency, TLS handshake costs, plugin configuration, hardware, and backend response times. Always benchmark against your own workload.

Results

Protocol Performance Comparison

Figure: Gateway RPS vs. Direct RPS by protocol (higher is better). The plotted values are the Gateway RPS and Direct RPS columns of the table that follows.

Protocol	Gateway RPS	Direct RPS	Overhead %	Notes
HTTP/1.1	102,183	209,910	~51%	reqwest connection pool with keep-alive
HTTP/1.1 + TLS	101,317	209,361	~52%	TLS termination at gateway, plain HTTP to backend
HTTP/2	108,138	355,544	~70%	hyper-native H2 pool with two-phase ready() multiplexing
HTTP/3 (QUIC)	53,085	83,592	~37%	QUIC connection pool via quinn
gRPC	68,352	205,927	~67%	H2 multiplexing + protobuf passthrough
WebSocket	103,830	207,507	~50%	Upgrade overhead amortized over many messages
TCP Proxy	108,841	214,113	~49%	Bidirectional copy, minimal per-byte overhead
TCP + TLS	107,340	207,103	~48%	TLS termination + bidirectional copy (cached TLS config)
UDP	82,042	276,526	~70%	Per-datagram session lookup + forwarding
UDP + DTLS	76,107	101,839	~25%	DTLS termination + plain UDP forwarding
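
The Overhead % column is consistent with the share of direct throughput lost when traffic passes through the gateway, that is, (1 - gateway / direct) x 100. That formula is inferred from the numbers rather than stated by the benchmark, and the function name below is illustrative. A minimal sketch that reproduces three of the rows:

```rust
/// Share of direct-to-backend throughput lost when going through the gateway.
/// Assumed formula: (1 - gateway / direct) * 100.
fn overhead_pct(gateway_rps: u64, direct_rps: u64) -> f64 {
    (1.0 - gateway_rps as f64 / direct_rps as f64) * 100.0
}

fn main() {
    // (protocol, gateway RPS, direct RPS) copied from the table above.
    let rows = [
        ("HTTP/1.1", 102_183u64, 209_910u64),
        ("HTTP/2", 108_138, 355_544),
        ("UDP + DTLS", 76_107, 101_839),
    ];
    for (name, gateway, direct) in rows {
        println!("{name}: ~{:.0}%", overhead_pct(gateway, direct));
    }
}
```

Note that a low percentage does not always mean a cheap gateway path: UDP + DTLS shows ~25% mainly because the direct baseline is itself slow.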

Architecture Details

Lock-Free Data Structure Complexity

Hot-path lookups are atomic loads or O(1) average-case hash lookups: no linear scans over the route table and no tree traversals under load.

Component	Data Structure	Lookup	Update	Concurrency
Global Config	`ArcSwap<Config>`	Atomic load	Atomic swap	Lock-free reads, single writer
Route Table	`DashMap<Host, Routes>`	O(1)	O(1) shard-locked	Concurrent reads, shard-locked writes
Plugin Registry	`DashMap<Name, Plugin>`	O(1)	O(1) shard-locked	Concurrent reads
Consumer Store	`DashMap<Key, Consumer>`	O(1)	O(1) shard-locked	Concurrent reads, shard-locked writes
Rate Limit State	`DashMap<Key, Bucket>`	O(1)	Atomic CAS	Lock-free per-key updates
Connection Pool	Per-host `Vec` + `Mutex`	O(1) + lock	O(1) + lock	Short-duration lock on pool access only
Health State	`AtomicBool` per upstream	Atomic load	Atomic store	True lock-free
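
To make the "Atomic CAS" and "Atomic load / Atomic store" entries concrete, here is a minimal sketch using only the Rust standard library. The `Bucket` and `Upstream` types, their fields, and the token-count model are assumptions for illustration, not Ferrum Edge's actual types.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Hypothetical per-key bucket: tokens remaining in the current window.
struct Bucket {
    tokens: AtomicU32,
}

impl Bucket {
    fn new(capacity: u32) -> Self {
        Bucket { tokens: AtomicU32::new(capacity) }
    }

    /// Take one token with a compare-and-swap loop; returns false when empty.
    fn try_acquire(&self) -> bool {
        let mut current = self.tokens.load(Ordering::Acquire);
        loop {
            if current == 0 {
                return false;
            }
            match self.tokens.compare_exchange_weak(
                current,
                current - 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                // The value changed (or the CAS failed spuriously);
                // retry with the value that was actually observed.
                Err(observed) => current = observed,
            }
        }
    }
}

/// Hypothetical upstream record with a lock-free health flag.
struct Upstream {
    healthy: AtomicBool,
}

fn main() {
    let bucket = Bucket::new(2);
    println!("{} {} {}", bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire());

    let upstream = Upstream { healthy: AtomicBool::new(true) };
    upstream.healthy.store(false, Ordering::Release); // health checker marks it down
    println!("healthy: {}", upstream.healthy.load(Ordering::Acquire)); // request path reads it
}
```

No thread ever blocks in either path: a contended CAS simply retries, and the health flag is a single atomic read.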

Per-Request Breakdown

What Happens on Every Request

Config Load — O(1), lock-free

Single atomic pointer load from ArcSwap. Returns a reference-counted Arc to the current config. No mutex, no blocking.

Route Lookup — O(1)

DashMap lookup by host + path prefix. Consistent hashing or longest-prefix match depending on route config. Hash map lookup is O(1) average.
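
One way to picture host lookup followed by longest-prefix matching is the sketch below. It uses a plain `std::collections::HashMap` as a single-threaded stand-in for DashMap, and the `Route` and `RouteTable` types are hypothetical. It also scans the (short) list of routes for the matched host, whereas the gateway is described as using pre-computed indexes.

```rust
use std::collections::HashMap;

/// Hypothetical route entry: a path prefix and the backend it maps to.
struct Route {
    prefix: String,
    backend: String,
}

struct RouteTable {
    // Host -> routes, kept sorted by descending prefix length.
    by_host: HashMap<String, Vec<Route>>,
}

impl RouteTable {
    fn new() -> Self {
        RouteTable { by_host: HashMap::new() }
    }

    fn add(&mut self, host: &str, prefix: &str, backend: &str) {
        let routes = self.by_host.entry(host.to_string()).or_default();
        routes.push(Route { prefix: prefix.to_string(), backend: backend.to_string() });
        // Longest prefix first, so the first match in lookup() is the most specific.
        routes.sort_by(|a, b| b.prefix.len().cmp(&a.prefix.len()));
    }

    fn lookup(&self, host: &str, path: &str) -> Option<&str> {
        let routes = self.by_host.get(host)?; // O(1) average hash lookup
        routes
            .iter()
            .find(|r| path.starts_with(r.prefix.as_str()))
            .map(|r| r.backend.as_str())
    }
}

fn main() {
    let mut table = RouteTable::new();
    table.add("api.example.com", "/", "default-backend");
    table.add("api.example.com", "/users", "users-backend");
    println!("{:?}", table.lookup("api.example.com", "/users/42"));
    println!("{:?}", table.lookup("other.example.com", "/"));
}
```

Sorting happens when a route is added, so the per-request path does no sorting and no allocation.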

Plugin Execution — O(P)

Linear scan over the sorted plugin list (P = number of enabled plugins, typically 1-5). Each plugin executes its hook in priority order.
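
A rough sketch of priority-ordered plugin execution follows. The `Plugin` struct, the "lower priority value runs first" convention, the numeric priorities, and the boolean hook signature are all assumptions for illustration; only the plugin names `key_auth` and `access_control` come from this page.

```rust
/// Hypothetical plugin: a name, a priority (lower runs first), and a hook.
struct Plugin {
    name: &'static str,
    priority: u32,
    // Returns false to reject the request (e.g. failed auth).
    hook: fn(&str) -> bool,
}

/// Sort once at config load, so the request path only iterates.
fn sort_plugins(plugins: &mut Vec<Plugin>) {
    plugins.sort_by_key(|p| p.priority);
}

/// Run each hook in order; O(P) in the number of enabled plugins.
/// On rejection, returns the name of the plugin that said no.
fn run_plugins(plugins: &[Plugin], path: &str) -> Result<(), &'static str> {
    for plugin in plugins {
        if !(plugin.hook)(path) {
            return Err(plugin.name);
        }
    }
    Ok(())
}

fn allow_all(_path: &str) -> bool {
    true
}

fn deny_admin(path: &str) -> bool {
    !path.starts_with("/admin")
}

fn main() {
    let mut plugins = vec![
        Plugin { name: "access_control", priority: 20, hook: deny_admin },
        Plugin { name: "key_auth", priority: 10, hook: allow_all },
    ];
    sort_plugins(&mut plugins);
    println!("{:?}", run_plugins(&plugins, "/users"));
    println!("{:?}", run_plugins(&plugins, "/admin"));
}
```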

Upstream Selection — O(1) to O(N)

Load balancing algorithm selects upstream. Round robin is O(1) with an atomic counter. Least connections requires O(N) scan but N is typically small (<20).
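
The two strategies can be sketched with standard-library atomics as below. `Balancer`, `Upstream`, and their fields are hypothetical names, and the sketch assumes the upstream list is non-empty.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical upstream with an in-flight connection counter.
struct Upstream {
    addr: &'static str,
    active: AtomicUsize,
}

struct Balancer {
    upstreams: Vec<Upstream>, // assumed non-empty
    next: AtomicUsize,
}

impl Balancer {
    /// Round robin: one atomic increment, O(1), no lock.
    fn round_robin(&self) -> &Upstream {
        let i = self.next.fetch_add(1, Ordering::Relaxed);
        &self.upstreams[i % self.upstreams.len()]
    }

    /// Least connections: O(N) scan over the upstream list.
    fn least_connections(&self) -> &Upstream {
        self.upstreams
            .iter()
            .min_by_key(|u| u.active.load(Ordering::Relaxed))
            .expect("at least one upstream")
    }
}

fn main() {
    let balancer = Balancer {
        upstreams: vec![
            Upstream { addr: "10.0.0.1:3000", active: AtomicUsize::new(3) },
            Upstream { addr: "10.0.0.2:3000", active: AtomicUsize::new(1) },
        ],
        next: AtomicUsize::new(0),
    };
    println!("round robin: {}", balancer.round_robin().addr);
    println!("round robin: {}", balancer.round_robin().addr);
    println!("least connections: {}", balancer.least_connections().addr);
}
```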

Connection Pool — O(1) + brief lock

Per-host pool lookup. Brief mutex to pop a connection from the pool (microseconds). HTTP/2 pools are almost always lock-free (multiplex on existing connection).
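
The "brief mutex" pattern looks roughly like this: the lock guards only the pop or push on the idle list, never the I/O done with the connection. `Conn` and `Pool` are hypothetical stand-ins, not the gateway's real types.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for a pooled upstream connection.
#[derive(Debug, PartialEq)]
struct Conn(u32);

/// Per-host pool: a Vec of idle connections behind a Mutex.
struct Pool {
    idle: Mutex<Vec<Conn>>,
}

impl Pool {
    /// Pop an idle connection. The lock is held only for the pop itself
    /// and is released as soon as this call returns.
    fn checkout(&self) -> Option<Conn> {
        self.idle.lock().unwrap().pop()
    }

    /// Return a connection to the pool once the request is done.
    fn checkin(&self, conn: Conn) {
        self.idle.lock().unwrap().push(conn);
    }
}

fn main() {
    let pool = Pool { idle: Mutex::new(vec![Conn(1), Conn(2)]) };
    let conn = pool.checkout().expect("pool starts with idle connections");
    // ... proxy the request over `conn` with no lock held ...
    pool.checkin(conn);
    println!("idle connections: {}", pool.idle.lock().unwrap().len());
}
```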

Proxy + Response — network-bound

Forward request to upstream, await response. This dominates total latency. All Ferrum Edge overhead is in the five steps above and is measured in microseconds.

DIY Benchmarking

Run Your Own Benchmarks

Don't take our word for it. Here's how to reproduce the benchmarks on your own hardware.

```bash
# Install wrk (macOS)
brew install wrk

# Start a minimal echo backend (e.g., using Python).
# Note: this single-threaded server will itself cap throughput;
# swap in a faster echo server for meaningful absolute numbers.
python3 -c "
import http.server, socketserver
class H(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'OK')
    def log_message(self, *args): pass
with socketserver.TCPServer(('', 3000), H) as s:
    s.serve_forever()
" &

# Start Ferrum Edge in file mode
cat > bench.yaml << 'EOF'
proxies:
  - id: "bench"
    name: "Bench Proxy"
    listen_path: "/"
    backend_protocol: http
    backend_host: "localhost"
    backend_port: 3000
    strip_listen_path: true
    auth_mode: single
    plugins: []
EOF
ferrum-edge run --spec bench.yaml -v &

# Give both background servers a moment to start listening
sleep 2

# Benchmark direct (baseline)
wrk -t4 -c200 -d10s http://localhost:3000/

# Benchmark through Ferrum Edge
wrk -t4 -c200 -d10s http://localhost:8000/
```

ℹ️

For TLS benchmarks, configure a self-signed certificate in the proxy config and point wrk at an https:// URL (wrk has no separate TLS flag). For HTTP/2 benchmarks, use h2load from nghttp2 instead of wrk, which only speaks HTTP/1.1.

Scale Testing

Performance at 30,000 Proxies

Real-world multi-tenant simulation: 30k proxies, each secured with key_auth + access_control, each with a unique consumer — config added live while the gateway serves traffic.

13.8%

Throughput degradation, 3k → 30k proxies

49K

RPS baseline at 3,000 proxies

1.0ms

P50 latency — unchanged through 27k proxies

100%

Success rate — zero failures across all 10 batches

Proxies	RPS	Avg (ms)	P50 (ms)	P95 (ms)	P99 (ms)	Max (ms)	% Baseline
3,000	49,236	1.0	1.0	1.5	2.0	24.9	100%
6,000	48,788	1.0	1.0	1.5	2.1	85.0	99%
9,000	48,892	1.0	1.0	1.5	2.1	16.6	99%
12,000	48,448	1.0	1.0	1.5	2.1	24.0	98%
15,000	47,562	1.0	1.0	1.6	2.3	34.0	97%
18,000	47,324	1.0	1.0	1.6	2.2	52.0	96%
21,000	46,656	1.1	1.0	1.6	2.3	32.0	95%
24,000	45,612	1.1	1.0	1.7	2.4	25.6	93%
27,000	44,363	1.1	1.0	1.7	2.5	26.2	90%
30,000	42,434	1.2	1.1	1.9	2.8	113.2	86%

SQLite, release build, Apple Silicon. 50 concurrent workers, 30s per batch. Config updates applied live between batches.
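
The % Baseline column and the 13.8% headline figure follow from dividing each batch's RPS by the 3,000-proxy RPS. A small sketch of that arithmetic, with a hypothetical function name:

```rust
/// Throughput as a percentage of the 3,000-proxy baseline.
fn pct_of_baseline(rps: u64, baseline_rps: u64) -> f64 {
    rps as f64 / baseline_rps as f64 * 100.0
}

fn main() {
    let baseline = 49_236; // RPS at 3,000 proxies, from the table above
    for (proxies, rps) in [(9_000u32, 48_892u64), (27_000, 44_363), (30_000, 42_434)] {
        let pct = pct_of_baseline(rps, baseline);
        println!("{proxies} proxies: {pct:.0}% of baseline, {:.1}% degradation", 100.0 - pct);
    }
}
```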

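Figure: RPS % of Baseline as Config Scales. The plotted values are the % Baseline column of the table above (100% at 3k proxies, 97% at 15k, 95% at 21k, 90% at 27k, 86% at 30k).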

Why This Matters

Most API gateways struggle under large configuration sets because route lookups, plugin resolution, and consumer auth involve linear scans or tree traversals that slow with scale. Ferrum Edge's lock-free architecture uses pre-computed indexes and O(1) HashMap lookups — so adding 27,000 more proxy configs costs only 14% throughput, not 50–80%.

Config updates were applied live between each batch — the atomic ArcSwap config swap means zero downtime, zero failed requests during reload, and no request ever blocks waiting for a new config to land.

In practice, most deployments will never need more than 9,000 proxies, a scale at which Ferrum Edge ran within 1% of its 3,000-proxy baseline throughput in this test.