High-Availability Solana RPC — sub-second failover, desync-safe routing

January 5, 2026

Problem I Solved

Solana RPC nodes can be alive but wrong.

A desynced node (one lagging behind the chain tip by many slots) still answers requests, and classic L4/L7 load balancers cannot reliably detect this condition.

My initial setup (HAProxy only) had hard limitations:

  • Desynced nodes kept receiving traffic
  • No RPC caching
  • No per-user or per-IP rate limits
  • Manual intervention required to remove bad nodes
  • User-facing downtime during incidents

What I Built

A desync-aware, high-availability RPC stack with deterministic failover, traffic gating, and dual-protocol support (HTTP JSON-RPC + Yellowstone gRPC).

Architecture

Clients (HTTP/gRPC)
        ↓
    HAProxy
        ↓
Custom Go Proxy (control plane)
   ├─ Redis (response cache)
   └─ Endpoint Manager (health state)
        ↓
Solana RPC nodes (bare metal)

Key design choice: HAProxy stays dumb and fast. All intelligence lives in a custom Go proxy — health decisions, caching, rate limiting, and protocol-aware routing.

Core Engineering Contributions

1) Desync-aware health model (not just crash detection)

The system runs two independent health check loops:

  • Uptime checks — every 200ms (configurable), calling getSlot with 2s timeout
  • Slot sync checks — every 5s (configurable), comparing against a reference endpoint

Each endpoint maintains dual health state:

type Endpoint struct {
    uptimeHealthy       bool  // node reachable and answering getSlot
    slotSyncHealthy     bool  // node within max_slot_lag of the reference endpoint
    consecutiveTimeouts int32 // crash detection: 5 in a row marks the node unhealthy
}

Classification logic:

  • crash → node unreachable (5 consecutive timeouts → marked unhealthy)
  • desync → node reachable but slot lag > max_slot_lag (default: 50 slots)
  • 503 response → immediate unhealthy marking

Result: nodes serving incorrect data receive zero traffic, not "degraded" traffic.
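
A minimal sketch of what that uptime-check loop can look like, with hypothetical type and helper names (the real implementation differs in detail):

package proxy

import (
    "bytes"
    "context"
    "net/http"
    "sync/atomic"
    "time"
)

// probeEndpoint is a simplified stand-in for the real Endpoint struct.
type probeEndpoint struct {
    URL                 string
    uptimeHealthy       atomic.Bool
    consecutiveTimeouts int32
}

// checkUptime sends one getSlot probe and applies the classification rules:
// five consecutive timeouts mean crash, a 503 means unhealthy immediately.
func checkUptime(client *http.Client, ep *probeEndpoint) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    payload := bytes.NewBufferString(`{"jsonrpc":"2.0","id":1,"method":"getSlot"}`)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, ep.URL, payload)
    if err != nil {
        return
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := client.Do(req)
    switch {
    case err != nil: // timeout or connection failure
        if atomic.AddInt32(&ep.consecutiveTimeouts, 1) >= 5 {
            ep.uptimeHealthy.Store(false) // crash: drop from rotation
        }
    case resp.StatusCode == http.StatusServiceUnavailable:
        resp.Body.Close()
        ep.uptimeHealthy.Store(false) // 503: unhealthy immediately
    default:
        resp.Body.Close()
        atomic.StoreInt32(&ep.consecutiveTimeouts, 0)
        ep.uptimeHealthy.Store(true)
    }
}

// runUptimeChecks probes every endpoint on the configured interval (200ms here).
func runUptimeChecks(ctx context.Context, endpoints []*probeEndpoint) {
    client := &http.Client{}
    ticker := time.NewTicker(200 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for _, ep := range endpoints {
                go checkUptime(client, ep)
            }
        }
    }
}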

Hysteresis to prevent flapping

Dual-threshold recovery with configurable lag values:

{
  "max_slot_lag": 50,
  "recovery_slot_lag": 10
}

An endpoint marked unhealthy at 50+ slots lag won't recover until it's within 10 slots — preventing rapid on/off cycling during network instability.
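
The recovery decision itself reduces to a few lines. A sketch of the dual-threshold logic, assuming the lag against the reference endpoint is already computed (the function name is illustrative):

// Hysteresis: demote above max_slot_lag, but only re-admit once the node is
// back within recovery_slot_lag of the reference endpoint.
func nextSlotSyncState(currentlyHealthy bool, slotLag, maxSlotLag, recoverySlotLag uint64) bool {
    if currentlyHealthy {
        return slotLag <= maxSlotLag
    }
    return slotLag <= recoverySlotLag
}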

Smart desync protection

When use_smart_desync_logic: true, the system refuses to mark the last healthy endpoint as unhealthy due to slot lag:

if em.useSmartDesyncLogic && wasSlotSyncHealthy {
    healthyBackups := em.countHealthyEndpointsExcluding(endpoint.URL)
    if healthyBackups == 0 {
        shouldMarkUnhealthy = false // Keep serving with stale data vs. total outage
    }
}

Trade-off accepted: slightly stale data is better than zero availability.

2) Sub-second failover without client retries

  • Proxy-side failover via round-robin across healthy endpoints only
  • Node removed from rotation immediately after health check failure
  • Effective failover: sub-second in the typical case, roughly one second in the worst case (200ms check interval × 5 consecutive failed checks)
  • No client-side complexity, no retry storms

Trade-off accepted: in-flight requests fail on crash. Predictable > magical.
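
A minimal sketch of the healthy-only round-robin selection; the manager type, counter, and error message are assumptions, while the health flags match the Endpoint struct above:

package proxy

import (
    "errors"
    "sync"
    "sync/atomic"
)

// endpointManager is a simplified stand-in for the real endpoint manager;
// Endpoint is the struct shown earlier.
type endpointManager struct {
    mu        sync.RWMutex
    endpoints []*Endpoint
    rrCounter atomic.Uint64
}

// nextHealthyEndpoint walks the ring once, skipping anything the health model
// has flagged. Unhealthy nodes are invisible to selection, so failover is just
// the next request landing on a different node, with no client-side retries.
func (em *endpointManager) nextHealthyEndpoint() (*Endpoint, error) {
    em.mu.RLock()
    defer em.mu.RUnlock()

    n := len(em.endpoints)
    for i := 0; i < n; i++ {
        ep := em.endpoints[int(em.rrCounter.Add(1)%uint64(n))]
        if ep.uptimeHealthy && ep.slotSyncHealthy {
            return ep, nil
        }
    }
    return nil, errors.New("no healthy endpoints available")
}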

3) Custom Go RPC proxy (in-house)

I built the proxy from scratch to handle things HAProxy cannot:

HTTP JSON-RPC Layer

  • RPC request parsing — batch and single request support with proper ID handling
  • Method-level caching via Redis:
    • getTransaction (TTL: 10 min, only non-null results)
    • getProgramAccounts (TTL: 5s)
    • getAccountInfo (TTL: 5s)
    • getMultipleAccounts (TTL: 5s)
  • Cache key: SHA-256 hash of method + params
  • Cache hit indicated via X-Cache-Status: HIT header
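
A minimal sketch of how the cache key and Redis lookup could be wired up, assuming go-redis as the client (helper names are illustrative):

package proxy

import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "time"

    "github.com/redis/go-redis/v9"
)

// cacheKey hashes method + params so identical requests map to the same entry.
func cacheKey(method string, params json.RawMessage) string {
    h := sha256.New()
    h.Write([]byte(method))
    h.Write(params)
    return hex.EncodeToString(h.Sum(nil))
}

// cachedResponse returns the stored JSON-RPC result for a request, or false on miss.
func cachedResponse(ctx context.Context, rdb *redis.Client, method string, params json.RawMessage) ([]byte, bool) {
    val, err := rdb.Get(ctx, cacheKey(method, params)).Bytes()
    if err != nil {
        return nil, false // miss (or Redis error): fall through to the nodes
    }
    return val, true
}

// storeResponse writes a result with the per-method TTL (e.g. 10 min for
// getTransaction, 5s for account lookups).
func storeResponse(ctx context.Context, rdb *redis.Client, method string, params json.RawMessage, result []byte, ttl time.Duration) {
    _ = rdb.Set(ctx, cacheKey(method, params), result, ttl).Err()
}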

Yellowstone gRPC Gateway

  • Transparent bidirectional streaming proxy — raw bytes, no serialization overhead
  • All Geyser methods supported: Subscribe, Ping, GetSlot, GetLatestBlockhash, etc.
  • Per-IP concurrent stream limits (max_grpc_streams)
  • Automatic TLS for port 443 endpoints
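
A minimal sketch of a per-IP concurrent stream cap of the kind max_grpc_streams implies (type and method names are assumptions):

package proxy

import (
    "fmt"
    "sync"
)

// streamLimiter tracks active gRPC streams per client IP.
type streamLimiter struct {
    mu     sync.Mutex
    active map[string]int
    max    int // max_grpc_streams from config
}

func newStreamLimiter(max int) *streamLimiter {
    return &streamLimiter{active: make(map[string]int), max: max}
}

// acquire reserves a stream slot for ip, or rejects if the cap is reached.
// The returned release func must be called when the stream ends.
func (l *streamLimiter) acquire(ip string) (release func(), err error) {
    l.mu.Lock()
    defer l.mu.Unlock()
    if l.active[ip] >= l.max {
        return nil, fmt.Errorf("too many concurrent streams for %s", ip)
    }
    l.active[ip]++
    return func() {
        l.mu.Lock()
        defer l.mu.Unlock()
        if l.active[ip]--; l.active[ip] <= 0 {
            delete(l.active, ip)
        }
    }, nil
}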

Rate Limiting

  • Sharded in-memory rate limiters (16 shards to reduce lock contention; see the sketch below)
  • Per-IP RPS limits with token bucket algorithm
  • IP whitelist with hot-reload (file watcher on ips.json)
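
A minimal sketch of the sharded per-IP token buckets, using golang.org/x/time/rate (the shard count matches the text; the rest of the names are assumptions):

package proxy

import (
    "hash/fnv"
    "sync"

    "golang.org/x/time/rate"
)

const shardCount = 16 // shards reduce lock contention under high RPS

type limiterShard struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
}

type ipRateLimiter struct {
    shards [shardCount]limiterShard
    rps    rate.Limit
    burst  int
}

func newIPRateLimiter(rps float64, burst int) *ipRateLimiter {
    l := &ipRateLimiter{rps: rate.Limit(rps), burst: burst}
    for i := range l.shards {
        l.shards[i].limiters = make(map[string]*rate.Limiter)
    }
    return l
}

// allow reports whether the request from ip fits within its token bucket.
func (l *ipRateLimiter) allow(ip string) bool {
    h := fnv.New32a()
    h.Write([]byte(ip))
    shard := &l.shards[h.Sum32()%shardCount]

    shard.mu.Lock()
    lim, ok := shard.limiters[ip]
    if !ok {
        lim = rate.NewLimiter(l.rps, l.burst)
        shard.limiters[ip] = lim
    }
    shard.mu.Unlock()

    return lim.Allow()
}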

Operational Features

  • Dynamic config reloads without restart (fsnotify on config.json)
  • Debug endpoint (/debug/status) with per-method call statistics
  • Structured logging with Seq integration
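
A minimal sketch of the config hot-reload path with fsnotify; the config shape and apply callback are assumptions:

package proxy

import (
    "encoding/json"
    "log"
    "os"

    "github.com/fsnotify/fsnotify"
)

// watchConfig reloads config.json whenever it changes and hands the parsed
// result to apply. Invalid JSON keeps the previous config in place.
func watchConfig(path string, apply func(cfg map[string]any)) error {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    if err := watcher.Add(path); err != nil {
        return err
    }

    go func() {
        defer watcher.Close()
        for {
            select {
            case ev, ok := <-watcher.Events:
                if !ok {
                    return
                }
                if ev.Op&(fsnotify.Write|fsnotify.Create) == 0 {
                    continue
                }
                raw, err := os.ReadFile(path)
                if err != nil {
                    log.Printf("config reload: read failed: %v", err)
                    continue
                }
                var cfg map[string]any
                if err := json.Unmarshal(raw, &cfg); err != nil {
                    log.Printf("config reload: invalid JSON, keeping old config: %v", err)
                    continue
                }
                apply(cfg)
            case err, ok := <-watcher.Errors:
                if !ok {
                    return
                }
                log.Printf("config watcher error: %v", err)
            }
        }
    }()
    return nil
}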

Performance Optimizations

  • HTTP/2 support enabled (ForceAttemptHTTP2: true)
  • Connection pooling: 1000 idle connections, 500 per host
  • 64MB max message size for gRPC
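
A minimal sketch of the upstream HTTP transport with those settings; the exact field mapping (e.g. MaxIdleConnsPerHost for the per-host figure) is an assumption:

package proxy

import (
    "net/http"
    "time"
)

// newUpstreamClient builds the client used to talk to the RPC nodes.
func newUpstreamClient() *http.Client {
    transport := &http.Transport{
        ForceAttemptHTTP2:   true, // HTTP/2 where the upstream supports it
        MaxIdleConns:        1000, // total idle connections across all nodes
        MaxIdleConnsPerHost: 500,  // idle connections kept per RPC node
        IdleConnTimeout:     90 * time.Second,
    }
    return &http.Client{Transport: transport}
}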

4) Bare-metal RPC cluster

  • Separate physical servers
  • Same region / same DC
  • No shared state
  • Automatic re-entry into rotation when healthy

Production Results (last 6 months)

  • 99.95% uptime
  • 10k RPS sustained for weeks (read-heavy workload)
  • 0 manual traffic intervention
  • 0 desync incidents served to users

This system has been running unchanged in production for 6+ months.

Operability Philosophy

Minimal, intentional observability:

  • Grafana: HAProxy 200/500 stats
  • Telegram: critical alerts only (real on-call)
  • Seq: structured logs for:
    • proxy errors
    • config updates
    • node health transitions (with slot numbers)

Why This Is Non-Trivial

Most "HA RPC" setups solve availability.

This system solves correctness under partial failure.

On Solana: a responsive RPC node can still be a broken RPC node. Detecting and enforcing that distinction at the routing level — with hysteresis, smart failover protection, and dual-protocol support — is what makes this production-grade.

Source Code

The gateway is open source: github.com/jxad/solana-rpc-gateway