Built for Speed
Real benchmark numbers from real hardware. All 10 protocols benchmarked end-to-end: client → gateway → backend, with direct baselines for comparison.
How We Measure
Hardware
Apple Silicon M-series MacBook Pro. ARM64 native binary compiled with --release and jemalloc allocator.
Load Generator
proto_bench — a custom multi-protocol load generator. 200 concurrent connections, 10-second duration, 64-byte echo payload. Localhost loopback network.
Backend
Minimal echo server returning a small fixed response. Isolates gateway overhead from backend latency. Direct backend numbers measured bypassing Ferrum Edge.
Protocol Performance Comparison
| Protocol | Gateway RPS | Direct RPS | Overhead % | Notes |
|---|---|---|---|---|
| HTTP/1.1 | 102,183 | 209,910 | ~51% | reqwest connection pool with keep-alive |
| HTTP/1.1 + TLS | 101,317 | 209,361 | ~52% | TLS termination at gateway, plain HTTP to backend |
| HTTP/2 | 108,138 | 355,544 | ~70% | hyper-native H2 pool with two-phase ready() multiplexing |
| HTTP/3 (QUIC) | 53,085 | 83,592 | ~37% | QUIC connection pool via quinn |
| gRPC | 68,352 | 205,927 | ~67% | H2 multiplexing + protobuf passthrough |
| WebSocket | 103,830 | 207,507 | ~50% | Upgrade overhead amortized over many messages |
| TCP Proxy | 108,841 | 214,113 | ~49% | Bidirectional copy, minimal per-byte overhead |
| TCP + TLS | 107,340 | 207,103 | ~48% | TLS termination + bidirectional copy (cached TLS config) |
| UDP | 82,042 | 276,526 | ~70% | Per-datagram session lookup + forwarding |
| UDP + DTLS | 76,107 | 101,839 | ~25% | DTLS termination + plain UDP forwarding |
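The Overhead % column is consistent with the share of direct throughput lost when traffic goes through the gateway, i.e. `1 - gateway / direct`. The short check below recomputes it for a few rows; the function name is ours, and the formula is inferred from the published numbers rather than taken from the benchmark tool itself.

```python
def overhead_pct(gateway_rps: float, direct_rps: float) -> float:
    """Percentage of direct throughput lost when routing through the gateway."""
    return (1.0 - gateway_rps / direct_rps) * 100.0


# Figures copied from the table above: (gateway RPS, direct RPS).
rows = {
    "HTTP/1.1": (102_183, 209_910),
    "HTTP/2": (108_138, 355_544),
    "UDP + DTLS": (76_107, 101_839),
}

for protocol, (gateway, direct) in rows.items():
    print(f"{protocol}: ~{overhead_pct(gateway, direct):.0f}%")
```

Note that a low percentage can reflect a slow direct baseline as much as a fast gateway path, so compare the absolute RPS columns as well.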
Lock-Free Data Structure Complexity
Every keyed lookup on the hot path is O(1) on average: no linear scans over routes or consumers, no tree traversals under load.
| Component | Data Structure | Lookup | Update | Concurrency |
|---|---|---|---|---|
| Global Config | ArcSwap<Config> | Atomic load | Atomic swap | Lock-free reads, single writer |
| Route Table | DashMap<Host, Routes> | O(1) | O(1) shard-locked | Concurrent reads, shard-locked writes |
| Plugin Registry | DashMap<Name, Plugin> | O(1) | O(1) shard-locked | Concurrent reads |
| Consumer Store | DashMap<Key, Consumer> | O(1) | O(1) shard-locked | Concurrent reads, shard-locked writes |
| Rate Limit State | DashMap<Key, Bucket> | O(1) | Atomic CAS | Lock-free per-key updates |
| Connection Pool | Per-host Vec + Mutex | O(1) + lock | O(1) + lock | Short-duration lock on pool access only |
| Health State | AtomicBool per upstream | Atomic load | Atomic store | True lock-free |
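To illustrate why per-key state gives O(1) lookups regardless of how many keys exist, here is a minimal sketch of a keyed rate limiter. It assumes a token-bucket algorithm, which the table does not confirm (it only names a `Bucket` type), and it uses a plain Python dict where the gateway uses a DashMap with atomic compare-and-swap updates. Atomicity is not modeled here; all class and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float
    last_refill: float  # seconds, supplied by the caller


class RateLimiter:
    """Hypothetical per-key token bucket; not Ferrum Edge's actual code."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}  # key -> Bucket, stands in for DashMap<Key, Bucket>

    def allow(self, key: str, now: float) -> bool:
        # One hash lookup per request, independent of the number of keys.
        bucket = self.buckets.setdefault(key, Bucket(self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - bucket.last_refill
        bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.refill_per_sec)
        bucket.last_refill = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

Passing `now` explicitly keeps the sketch deterministic; a real caller would supply a monotonic clock reading.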
What Happens on Every Request
Config Load — O(1), lock-free
Single atomic pointer load from ArcSwap. Returns a reference-counted Arc to the current config. No mutex, no blocking.
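The snapshot-swap idea behind this step can be sketched in a few lines. This is an analogy, not the gateway's implementation: `ConfigCell` is a hypothetical name, and Python reference rebinding stands in for the atomic pointer swap that ArcSwap performs in Rust.

```python
from types import MappingProxyType


class ConfigCell:
    """Holds a reference to an immutable config snapshot (stand-in for ArcSwap<Config>)."""

    def __init__(self, initial: dict):
        self._current = MappingProxyType(dict(initial))

    def load(self):
        # Readers take a reference to the current snapshot; no lock is acquired.
        return self._current

    def store(self, new_config: dict):
        # The writer builds a complete new snapshot, then publishes it by
        # rebinding a single reference. In-flight readers keep the old one.
        self._current = MappingProxyType(dict(new_config))


cell = ConfigCell({"proxies": 3000})
in_flight = cell.load()        # a request that started before the reload
cell.store({"proxies": 6000})  # live config update
print(in_flight["proxies"], cell.load()["proxies"])
```

The request that loaded its snapshot before the update keeps seeing a consistent view, which is why a reload never needs to block readers.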
Route Lookup — O(1)
DashMap lookup by host + path prefix. Consistent hashing or longest-prefix match depending on route config. Hash map lookup is O(1) average.
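One plausible shape for a host-then-prefix lookup is sketched below. The table layout and function names are assumptions for illustration. The host lookup is a hash lookup; the prefix match then walks that host's routes, which are pre-sorted longest first, so this sketch stays constant-time only while the per-host route list is small.

```python
def build_route_table(routes):
    """routes: iterable of (host, path_prefix, proxy_id). Built once at config load."""
    table = {}
    for host, prefix, proxy_id in routes:
        table.setdefault(host, []).append((prefix, proxy_id))
    for entries in table.values():
        # Longest prefix first, so the first match is the most specific one.
        entries.sort(key=lambda entry: len(entry[0]), reverse=True)
    return table


def lookup(table, host, path):
    # Step 1: O(1) average hash lookup by host.
    for prefix, proxy_id in table.get(host, ()):
        # Step 2: the first matching prefix is the longest, thanks to the sort.
        if path.startswith(prefix):
            return proxy_id
    return None


table = build_route_table([
    ("api.example.com", "/", "root"),
    ("api.example.com", "/v1/users", "users"),
    ("api.example.com", "/v1", "v1"),
])
print(lookup(table, "api.example.com", "/v1/users/42"))
```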
Plugin Execution — O(P)
Linear scan over the sorted plugin list, where P is the number of enabled plugins (typically 1-5). Each plugin executes its hook in priority order.
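A minimal sketch of this step, assuming plugins are sorted once when the config is loaded so the request path only iterates. The `Plugin` class, the numeric priorities, and the "lower number runs first" convention are all hypothetical; only the plugin names `key_auth` and `access_control` come from this page.

```python
class Plugin:
    def __init__(self, name, priority, hook):
        self.name = name
        self.priority = priority
        self.hook = hook


def sort_plugins(plugins):
    # Done once at config load, not per request.
    return sorted(plugins, key=lambda p: p.priority)


def run_plugins(sorted_plugins, request):
    # O(P): a single pass over the enabled plugins, in priority order.
    for plugin in sorted_plugins:
        plugin.hook(request)
    return request


chain = sort_plugins([
    Plugin("access_control", 20, lambda req: req["trace"].append("access_control")),
    Plugin("key_auth", 10, lambda req: req["trace"].append("key_auth")),
])
print(run_plugins(chain, {"trace": []})["trace"])
```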
Upstream Selection — O(1) to O(N)
Load balancing algorithm selects upstream. Round robin is O(1) with an atomic counter. Least connections requires an O(N) scan, but N (the number of upstreams) is typically small (<20).
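The cost difference between the two algorithms is easy to see in a sketch. Here `itertools.count` stands in for the atomic counter, and the dict of in-flight counts is a hypothetical representation of per-upstream connection state.

```python
import itertools


class RoundRobin:
    def __init__(self, upstreams):
        self.upstreams = list(upstreams)
        self._counter = itertools.count()  # stand-in for an atomic counter

    def pick(self):
        # O(1): take the next counter value, index modulo the upstream count.
        return self.upstreams[next(self._counter) % len(self.upstreams)]


def least_connections(active):
    """active: dict of upstream -> in-flight connection count."""
    # O(N): every upstream is inspected to find the smallest count.
    return min(active, key=active.get)


rr = RoundRobin(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print([rr.pick() for _ in range(4)])
print(least_connections({"10.0.0.1": 5, "10.0.0.2": 2, "10.0.0.3": 7}))
```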
Connection Pool — O(1) + brief lock
Per-host pool lookup. Brief mutex to pop a connection from the pool (microseconds). HTTP/2 pools are almost always lock-free (multiplex on existing connection).
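The "brief lock" pattern can be sketched as follows: the lock is held only while popping or pushing an idle connection, and opening a new connection happens outside it. `HostPool`, `pool_for`, and the `connect` callback are illustrative names, with `threading.Lock` and a list standing in for the Mutex and Vec.

```python
import threading


class HostPool:
    def __init__(self):
        self._idle = []                # stand-in for the per-host Vec
        self._lock = threading.Lock()  # stand-in for the Mutex

    def checkout(self, connect):
        # Hold the lock only long enough to pop an idle connection.
        with self._lock:
            conn = self._idle.pop() if self._idle else None
        # Opening a new connection is slow, so it happens outside the lock.
        return conn if conn is not None else connect()

    def checkin(self, conn):
        with self._lock:
            self._idle.append(conn)


pools = {}  # host -> HostPool, an O(1) average lookup


def pool_for(host):
    return pools.setdefault(host, HostPool())


pool = pool_for("localhost:3000")
first = pool.checkout(lambda: "conn-1")   # pool empty, so a new one is opened
pool.checkin(first)
print(pool.checkout(lambda: "conn-2"))    # the idle connection is reused
```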
Proxy + Response — network-bound
Forward request to upstream, await response. This dominates total latency. All Ferrum Edge overhead is in the five preceding steps and is measured in microseconds.
Run Your Own Benchmarks
Don't take our word for it. Here's how to reproduce the benchmarks on your own hardware.
# Install wrk (macOS)
brew install wrk
# Start a minimal echo backend (e.g., using Python)
python3 -c "
import http.server, socketserver
class H(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'OK')
    def log_message(self, *args): pass
with socketserver.TCPServer(('', 3000), H) as s:
    s.serve_forever()
" &
# Start Ferrum Edge in file mode
cat > bench.yaml << 'EOF'
proxies:
  - id: "bench"
    name: "Bench Proxy"
    listen_path: "/"
    backend_protocol: http
    backend_host: "localhost"
    backend_port: 3000
    strip_listen_path: true
    auth_mode: single
    plugins: []
EOF
ferrum-edge run --spec bench.yaml -v &
# Benchmark direct (baseline)
wrk -t4 -c200 -d10s http://localhost:3000/
# Benchmark through Ferrum Edge
wrk -t4 -c200 -d10s http://localhost:8000/
wrk --tls. For HTTP/2 benchmarks, use h2load from
nghttp2 instead of wrk.
Performance at 30,000 Proxies
Real-world multi-tenant simulation: 30k proxies, each secured with key_auth + access_control,
each with a unique consumer — config added live while the gateway serves traffic.
| Proxies | RPS | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Max (ms) | % Baseline |
|---|---|---|---|---|---|---|---|
| 3,000 | 49,236 | 1.0 | 1.0 | 1.5 | 2.0 | 24.9 | 100% |
| 6,000 | 48,788 | 1.0 | 1.0 | 1.5 | 2.1 | 85.0 | 99% |
| 9,000 | 48,892 | 1.0 | 1.0 | 1.5 | 2.1 | 16.6 | 99% |
| 12,000 | 48,448 | 1.0 | 1.0 | 1.5 | 2.1 | 24.0 | 98% |
| 15,000 | 47,562 | 1.0 | 1.0 | 1.6 | 2.3 | 34.0 | 97% |
| 18,000 | 47,324 | 1.0 | 1.0 | 1.6 | 2.2 | 52.0 | 96% |
| 21,000 | 46,656 | 1.1 | 1.0 | 1.6 | 2.3 | 32.0 | 95% |
| 24,000 | 45,612 | 1.1 | 1.0 | 1.7 | 2.4 | 25.6 | 93% |
| 27,000 | 44,363 | 1.1 | 1.0 | 1.7 | 2.5 | 26.2 | 90% |
| 30,000 | 42,434 | 1.2 | 1.1 | 1.9 | 2.8 | 113.2 | 86% |
SQLite, release build, Apple Silicon. 50 concurrent workers, 30s per batch. Config updates applied live between batches.
RPS % of Baseline as Config Scales
Why This Matters
Most API gateways struggle under large configuration sets because route lookups, plugin resolution, and consumer auth involve linear scans or tree traversals that slow with scale. Ferrum Edge's lock-free architecture uses pre-computed indexes and O(1) HashMap lookups — so adding 27,000 more proxy configs costs only 14% throughput, not 50–80%.
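The 14% figure follows directly from the table, taking the 3,000-proxy row as the baseline. The snippet below reproduces the arithmetic for two rows; the variable and function names are ours.

```python
baseline_rps = 49_236  # RPS at 3,000 proxies, the first row of the table


def pct_of_baseline(rps: float) -> float:
    return rps / baseline_rps * 100.0


for proxies, rps in [(9_000, 48_892), (30_000, 42_434)]:
    pct = pct_of_baseline(rps)
    print(f"{proxies:,} proxies: {pct:.0f}% of baseline, {100 - pct:.0f}% drop")
```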
Config updates were applied live between each batch — the atomic ArcSwap config swap means zero downtime,
zero failed requests during reload, and no request ever blocks waiting for a new config to land.
In practice, 99.99% of companies will never need more than 9,000 proxies, a scale at which Ferrum Edge operates within 1% of its 3,000-proxy baseline throughput.