Performance · Protocols

Built for Speed

Real benchmark numbers from real hardware. All 10 protocols benchmarked end-to-end: client → gateway → backend, with direct baselines for comparison.

102,183
HTTP/1.1 RPS
101,317
HTTP/1.1 + TLS RPS
108,841
TCP Proxy RPS
~49%
TCP Overhead
0
Mutex Locks (hot path, outside the connection pool)
O(1)
Route Lookup Complexity

How We Measure

Hardware

Apple Silicon M-series MacBook Pro. ARM64 native binary compiled with --release and jemalloc allocator.

Load Generator

proto_bench — a custom multi-protocol load generator. 200 concurrent connections, 10-second duration, 64-byte echo payload. Localhost loopback network.

Backend

Minimal echo server returning a small fixed response. Isolates gateway overhead from backend latency. Direct backend numbers measured bypassing Ferrum Edge.

⚠️
Benchmark context: Localhost benchmarks represent peak theoretical throughput. Production performance depends on network latency, TLS handshake costs, plugin configuration, hardware, and backend response times. Always benchmark against your own workload.

Protocol Performance Comparison

Chart: Gateway RPS vs. Direct RPS per protocol (higher is better). The plotted values are the Gateway RPS and Direct RPS columns of the table that follows.
| Protocol | Gateway RPS | Direct RPS | Overhead % | Notes |
|---|---|---|---|---|
| HTTP/1.1 | 102,183 | 209,910 | ~51% | reqwest connection pool with keep-alive |
| HTTP/1.1 + TLS | 101,317 | 209,361 | ~52% | TLS termination at gateway, plain HTTP to backend |
| HTTP/2 | 108,138 | 355,544 | ~70% | hyper-native H2 pool with two-phase ready() multiplexing |
| HTTP/3 (QUIC) | 53,085 | 83,592 | ~37% | QUIC connection pool via quinn |
| gRPC | 68,352 | 205,927 | ~67% | H2 multiplexing + protobuf passthrough |
| WebSocket | 103,830 | 207,507 | ~50% | Upgrade overhead amortized over many messages |
| TCP Proxy | 108,841 | 214,113 | ~49% | Bidirectional copy, minimal per-byte overhead |
| TCP + TLS | 107,340 | 207,103 | ~48% | TLS termination + bidirectional copy (cached TLS config) |
| UDP | 82,042 | 276,526 | ~70% | Per-datagram session lookup + forwarding |
| UDP + DTLS | 76,107 | 101,839 | ~25% | DTLS termination + plain UDP forwarding |
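
The Overhead % column follows from the two RPS columns: it is the share of direct throughput lost when traffic passes through the gateway, 1 - gateway / direct. A minimal Rust sketch of that arithmetic (the function name is illustrative and not part of Ferrum Edge):

```rust
/// Share of direct throughput lost by routing through the gateway, in percent.
fn overhead_pct(gateway_rps: f64, direct_rps: f64) -> f64 {
    (1.0 - gateway_rps / direct_rps) * 100.0
}
```

For the TCP Proxy row, `overhead_pct(108_841.0, 214_113.0)` comes out at roughly 49.2, which the table rounds to ~49%.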

Lock-Free Data Structure Complexity

Every keyed lookup on the hot path is a hash lookup or an atomic load, O(1) on average: no scans over the full route or consumer set, and no tree traversals under load.

| Component | Data Structure | Lookup | Update | Concurrency |
|---|---|---|---|---|
| Global Config | `ArcSwap<Config>` | Atomic load | Atomic swap | Lock-free reads, single writer |
| Route Table | `DashMap<Host, Routes>` | O(1) | O(1) shard-locked | Concurrent reads, shard-locked writes |
| Plugin Registry | `DashMap<Name, Plugin>` | O(1) | O(1) shard-locked | Concurrent reads |
| Consumer Store | `DashMap<Key, Consumer>` | O(1) | O(1) shard-locked | Concurrent reads, shard-locked writes |
| Rate Limit State | `DashMap<Key, Bucket>` | O(1) | Atomic CAS | Lock-free per-key updates |
| Connection Pool | Per-host Vec + Mutex | O(1) + lock | O(1) + lock | Short-duration lock on pool access only |
| Health State | AtomicBool per upstream | Atomic load | Atomic store | True lock-free |
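
The "Atomic CAS" update listed for rate limit state can be pictured as a compare-and-swap loop on a token counter. The sketch below uses only the Rust standard library; the `Bucket` type, its single-counter layout, and the omission of refill logic are assumptions made for illustration, not Ferrum Edge's actual implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-key bucket: only a token count, refill logic omitted.
struct Bucket {
    tokens: AtomicU64,
}

impl Bucket {
    fn new(capacity: u64) -> Self {
        Bucket { tokens: AtomicU64::new(capacity) }
    }

    /// Try to take one token. Returns false when the bucket is empty.
    fn try_acquire(&self) -> bool {
        let mut current = self.tokens.load(Ordering::Relaxed);
        loop {
            if current == 0 {
                return false;
            }
            // The swap succeeds only if no other thread changed the count
            // since we read it; otherwise we retry with the observed value.
            match self.tokens.compare_exchange_weak(
                current,
                current - 1,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(observed) => current = observed,
            }
        }
    }
}
```

No thread ever blocks on another here: a thread that loses the race re-reads the counter and tries again.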

What Happens on Every Request

1

Config Load — O(1), lock-free

Single atomic pointer load from ArcSwap. Returns a reference-counted Arc to the current config. No mutex, no blocking.
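
The snapshot behavior described here can be sketched with the standard library alone. Note the substitution: this example uses `RwLock<Arc<Config>>` as a stand-in so that it compiles without external crates, whereas ArcSwap performs the same load and swap with an atomic pointer and no lock. The `Config` fields and the `ConfigHandle` name are hypothetical.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical config; the real struct is not shown in this document.
struct Config {
    version: u64,
}

/// Std-only stand-in for ArcSwap<Config> (assumption: same snapshot
/// semantics, but guarded by a lock instead of an atomic pointer).
struct ConfigHandle {
    current: RwLock<Arc<Config>>,
}

impl ConfigHandle {
    fn new(cfg: Config) -> Self {
        ConfigHandle { current: RwLock::new(Arc::new(cfg)) }
    }

    /// Request path: take a reference-counted snapshot of the current config.
    fn load(&self) -> Arc<Config> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Reload path: publish a new config. Snapshots already handed out
    /// stay valid until their holders drop them.
    fn store(&self, cfg: Config) {
        *self.current.write().unwrap() = Arc::new(cfg);
    }
}
```

A request that loaded its snapshot before a reload keeps reading the old config until it finishes, while requests arriving after the reload see the new one.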

2

Route Lookup — O(1)

DashMap lookup by host + path prefix. Consistent hashing or longest-prefix match depending on route config. Hash map lookup is O(1) average.
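
One way to keep a longest-prefix match independent of the number of configured routes is to hash the host first, then probe progressively shorter path prefixes. The sketch below assumes that approach and uses a plain `HashMap` in place of DashMap. The `RouteTable` type and its layout are hypothetical, and Ferrum Edge's real index may differ.

```rust
use std::collections::HashMap;

/// Hypothetical route table: host -> (path prefix -> proxy id).
struct RouteTable {
    by_host: HashMap<String, HashMap<String, String>>,
}

impl RouteTable {
    fn new() -> Self {
        RouteTable { by_host: HashMap::new() }
    }

    fn insert(&mut self, host: &str, prefix: &str, proxy_id: &str) {
        self.by_host
            .entry(host.to_string())
            .or_default()
            .insert(prefix.to_string(), proxy_id.to_string());
    }

    /// Longest-prefix match: try the full path, then drop one trailing
    /// segment at a time. Cost grows with path depth, not route count.
    fn lookup(&self, host: &str, path: &str) -> Option<&str> {
        let prefixes = self.by_host.get(host)?;
        let mut candidate = path;
        loop {
            if let Some(proxy_id) = prefixes.get(candidate) {
                return Some(proxy_id.as_str());
            }
            candidate = match candidate.rfind('/') {
                Some(0) if candidate.len() > 1 => "/",
                Some(i) if i > 0 => &candidate[..i],
                _ => return None,
            };
        }
    }
}
```

With routes for `/` and `/api` on the same host, a request for `/api/v1/users` probes `/api/v1/users`, `/api/v1`, then `/api`, and stops at the first hit.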

3

Plugin Execution — O(P)

Linear scan over the sorted plugin list, where P is the number of enabled plugins (typically 1-5). Each plugin executes its hook in priority order.
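
A rough sketch of this step: the list is sorted once when the config is built, so the per-request work is a single pass. The `Plugin` struct, the function-pointer hook, and the convention that lower priority numbers run first are all assumptions for illustration.

```rust
/// Hypothetical plugin record: a name, a priority, and a request hook.
struct Plugin {
    name: &'static str,
    priority: u32,
    /// Returns false to reject the request and stop the chain.
    on_request: fn(&str) -> bool,
}

/// Sort once at config build time (assumed: lower number runs first).
fn sort_plugins(plugins: &mut [Plugin]) {
    plugins.sort_by_key(|p| p.priority);
}

/// O(P) pass over the sorted list. Returns the name of the first plugin
/// that rejected the request, or None if every plugin allowed it.
fn run_plugins(plugins: &[Plugin], path: &str) -> Option<&'static str> {
    plugins
        .iter()
        .find(|p| !(p.on_request)(path))
        .map(|p| p.name)
}
```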

4

Upstream Selection — O(1) to O(N)

The load balancing algorithm selects an upstream. Round robin is O(1) with an atomic counter. Least connections requires an O(N) scan over the upstreams, but N is typically small (<20).
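
The two strategies named here differ only in how the index is chosen. A minimal sketch with standard-library atomics follows; the `Upstream` and `Balancer` types are hypothetical, and the code assumes a non-empty upstream list.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical upstream record with a live in-flight request counter.
struct Upstream {
    addr: &'static str,
    active: AtomicUsize,
}

struct Balancer {
    upstreams: Vec<Upstream>,
    next: AtomicUsize,
}

impl Balancer {
    fn new(addrs: &[&'static str]) -> Self {
        Balancer {
            upstreams: addrs
                .iter()
                .map(|&addr| Upstream { addr, active: AtomicUsize::new(0) })
                .collect(),
            next: AtomicUsize::new(0),
        }
    }

    /// Round robin: one atomic increment, O(1), no lock.
    fn round_robin(&self) -> &Upstream {
        let i = self.next.fetch_add(1, Ordering::Relaxed);
        &self.upstreams[i % self.upstreams.len()]
    }

    /// Least connections: O(N) scan for the smallest in-flight count.
    fn least_connections(&self) -> &Upstream {
        self.upstreams
            .iter()
            .min_by_key(|u| u.active.load(Ordering::Relaxed))
            .expect("at least one upstream")
    }
}
```

The counter wraps on overflow, which is harmless here because only its value modulo the list length is used.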

5

Connection Pool — O(1) + brief lock

Per-host pool lookup. Brief mutex to pop a connection from the pool (microseconds). HTTP/2 pools are almost always lock-free (multiplex on existing connection).
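
The "brief mutex" amounts to holding a per-host lock only for the duration of a `Vec` pop or push. A sketch under that assumption, where the `Conn` and `Pool` types are hypothetical and a real connection would wrap a socket:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical pooled connection; a real one would own a socket.
struct Conn {
    id: u32,
}

/// One idle list per host, each behind its own mutex.
struct Pool {
    idle: HashMap<String, Mutex<Vec<Conn>>>,
}

impl Pool {
    fn new(hosts: &[&str]) -> Self {
        Pool {
            idle: hosts
                .iter()
                .map(|h| (h.to_string(), Mutex::new(Vec::new())))
                .collect(),
        }
    }

    /// Pop an idle connection for `host`. The lock is held only for the pop.
    fn checkout(&self, host: &str) -> Option<Conn> {
        let slot = self.idle.get(host)?;
        let mut idle = slot.lock().unwrap();
        idle.pop()
    }

    /// Return a connection to the pool once the response has been relayed.
    fn checkin(&self, host: &str, conn: Conn) {
        if let Some(slot) = self.idle.get(host) {
            slot.lock().unwrap().push(conn);
        }
    }
}
```

Because each host has its own lock, requests to different backends never contend with each other in this layout.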

6

Proxy + Response — network-bound

Forward request to upstream, await response. This dominates total latency. All Ferrum Edge overhead is in steps 1-5 and is measured in microseconds.

Run Your Own Benchmarks

Don't take our word for it. Here's how to reproduce the benchmarks on your own hardware.

```bash
# Install wrk (macOS)
brew install wrk

# Start a minimal echo backend (e.g., using Python).
# HTTP/1.1 plus Content-Length lets wrk reuse connections (keep-alive),
# and ThreadingHTTPServer serves the 200 connections concurrently.
python3 -c "
import http.server
class H(http.server.BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.1'
    def do_GET(self):
        body = b'OK'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args): pass
http.server.ThreadingHTTPServer(('', 3000), H).serve_forever()
" &

# Start Ferrum Edge in file mode
cat > bench.yaml << 'EOF'
proxies:
  - id: "bench"
    name: "Bench Proxy"
    listen_path: "/"
    backend_protocol: http
    backend_host: "localhost"
    backend_port: 3000
    strip_listen_path: true
    auth_mode: single
    plugins: []
EOF
ferrum-edge run --spec bench.yaml -v &

# Benchmark direct (baseline)
wrk -t4 -c200 -d10s http://localhost:3000/

# Benchmark through Ferrum Edge
wrk -t4 -c200 -d10s http://localhost:8000/
```
ℹ️
For TLS benchmarks, configure a self-signed certificate in the proxy config and point wrk at an https:// URL (wrk has no separate TLS flag). For HTTP/2 benchmarks, use h2load from nghttp2 instead of wrk, which only speaks HTTP/1.1.

Performance at 30,000 Proxies

Real-world multi-tenant simulation: 30k proxies, each secured with key_auth + access_control, each with a unique consumer — config added live while the gateway serves traffic.

13.8%
Throughput degradation, 3k → 30k proxies
49K
RPS baseline at 3,000 proxies
1.0ms
P50 latency — unchanged through 27k proxies
100%
Success rate — zero failures across all 10 batches
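| Proxies | RPS | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Max (ms) | % Baseline |
|---|---|---|---|---|---|---|---|
| 3,000 | 49,236 | 1.0 | 1.0 | 1.5 | 2.0 | 24.9 | 100% |
| 6,000 | 48,788 | 1.0 | 1.0 | 1.5 | 2.1 | 85.0 | 99% |
| 9,000 | 48,892 | 1.0 | 1.0 | 1.5 | 2.1 | 16.6 | 99% |
| 12,000 | 48,448 | 1.0 | 1.0 | 1.5 | 2.1 | 24.0 | 98% |
| 15,000 | 47,562 | 1.0 | 1.0 | 1.6 | 2.3 | 34.0 | 97% |
| 18,000 | 47,324 | 1.0 | 1.0 | 1.6 | 2.2 | 52.0 | 96% |
| 21,000 | 46,656 | 1.1 | 1.0 | 1.6 | 2.3 | 32.0 | 95% |
| 24,000 | 45,612 | 1.1 | 1.0 | 1.7 | 2.4 | 25.6 | 93% |
| 27,000 | 44,363 | 1.1 | 1.0 | 1.7 | 2.5 | 26.2 | 90% |
| 30,000 | 42,434 | 1.2 | 1.1 | 1.9 | 2.8 | 113.2 | 86% |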

SQLite, release build, Apple Silicon. 50 concurrent workers, 30s per batch. Config updates applied live between batches.

RPS % of Baseline as Config Scales

3k: 100% · 9k: 99% · 15k: 97% · 21k: 95% · 27k: 90% · 30k: 86%

Why This Matters

Most API gateways struggle under large configuration sets because route lookups, plugin resolution, and consumer auth involve linear scans or tree traversals that slow with scale. Ferrum Edge's lock-free architecture uses pre-computed indexes and O(1) HashMap lookups, so growing from 3,000 to 30,000 proxies (27,000 more configs) costs about 14% of throughput (49,236 to 42,434 RPS), not 50–80%.

Config updates were applied live between each batch — the atomic ArcSwap config swap means zero downtime, zero failed requests during reload, and no request ever blocks waiting for a new config to land.

In practice, the vast majority of deployments will never need more than 9,000 proxies, a scale at which Ferrum Edge stays within 1% of its throughput at the 3,000-proxy baseline.