High-Availability Solana RPC — sub-second failover, desync-safe routing

January 5, 2026

Problem I Solved

Solana RPC nodes can be alive but wrong.

A desynced node (one lagging behind the chain tip by many slots) still answers requests, and classic L4/L7 load balancers cannot reliably detect this condition.

My initial setup (HAProxy only) had hard limitations:

  • Desynced nodes kept receiving traffic
  • No RPC caching
  • No per-user or per-IP rate limits
  • Manual intervention required to remove bad nodes
  • User-facing downtime during incidents

What I Built

A desync-aware, high-availability RPC stack with deterministic failover, traffic gating, and dual-protocol support (HTTP JSON-RPC + Yellowstone gRPC).

Architecture

Clients (HTTP/gRPC)
        ↓
    HAProxy
        ↓
Custom Go Proxy (control plane)
   ├─ Redis (response cache)
   └─ Endpoint Manager (health state)
        ↓
Solana RPC nodes (bare metal)

Key design choice: HAProxy stays dumb and fast. All intelligence lives in a custom Go proxy — health decisions, caching, rate limiting, and protocol-aware routing.

Core Engineering Contributions

1) Desync-aware health model (not just crash detection)

The system runs two independent health check loops:

  • Uptime checks — every 200ms (configurable), calling getSlot with 2s timeout
  • Slot sync checks — every 5s (configurable), comparing against a reference endpoint

Each endpoint maintains dual health state:

type Endpoint struct {
    uptimeHealthy       bool  // node reachable and answering getSlot
    slotSyncHealthy     bool  // node within max_slot_lag of the reference endpoint
    consecutiveTimeouts int32 // crash detection: 5 in a row marks the node unhealthy
}

Classification logic:

  • crash → node unreachable (5 consecutive timeouts → marked unhealthy)
  • desync → node reachable but slot lag > max_slot_lag (default: 50 slots)
  • 503 response → immediate unhealthy marking

Result: nodes serving incorrect data receive zero traffic, not "degraded" traffic.
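
A minimal sketch of what that uptime-check loop can look like, with hypothetical type and helper names (the real implementation differs in detail):

package proxy

import (
    "bytes"
    "context"
    "net/http"
    "sync/atomic"
    "time"
)

// probeEndpoint is a simplified stand-in for the real Endpoint struct.
type probeEndpoint struct {
    URL                 string
    uptimeHealthy       atomic.Bool
    consecutiveTimeouts int32
}

// checkUptime sends one getSlot probe and applies the classification rules:
// five consecutive timeouts mean crash, a 503 means unhealthy immediately.
func checkUptime(client *http.Client, ep *probeEndpoint) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    payload := bytes.NewBufferString(`{"jsonrpc":"2.0","id":1,"method":"getSlot"}`)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, ep.URL, payload)
    if err != nil {
        return
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := client.Do(req)
    switch {
    case err != nil: // timeout or connection failure
        if atomic.AddInt32(&ep.consecutiveTimeouts, 1) >= 5 {
            ep.uptimeHealthy.Store(false) // crash: drop from rotation
        }
    case resp.StatusCode == http.StatusServiceUnavailable:
        resp.Body.Close()
        ep.uptimeHealthy.Store(false) // 503: unhealthy immediately
    default:
        resp.Body.Close()
        atomic.StoreInt32(&ep.consecutiveTimeouts, 0)
        ep.uptimeHealthy.Store(true)
    }
}

// runUptimeChecks probes every endpoint on the configured interval (200ms here).
func runUptimeChecks(ctx context.Context, endpoints []*probeEndpoint) {
    client := &http.Client{}
    ticker := time.NewTicker(200 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for _, ep := range endpoints {
                go checkUptime(client, ep)
            }
        }
    }
}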

Hysteresis to prevent flapping

Dual-threshold recovery with configurable lag values:

{
  "max_slot_lag": 50,
  "recovery_slot_lag": 10
}

An endpoint marked unhealthy at 50+ slots lag won't recover until it's within 10 slots — preventing rapid on/off cycling during network instability.
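
The recovery decision itself reduces to a few lines. A sketch of the dual-threshold logic, assuming the lag against the reference endpoint is already computed (the function name is illustrative):

// Hysteresis: demote above max_slot_lag, but only re-admit once the node is
// back within recovery_slot_lag of the reference endpoint.
func nextSlotSyncState(currentlyHealthy bool, slotLag, maxSlotLag, recoverySlotLag uint64) bool {
    if currentlyHealthy {
        return slotLag <= maxSlotLag
    }
    return slotLag <= recoverySlotLag
}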

Smart desync protection

When use_smart_desync_logic: true, the system refuses to mark the last healthy endpoint as unhealthy due to slot lag:

if em.useSmartDesyncLogic && wasSlotSyncHealthy {
    healthyBackups := em.countHealthyEndpointsExcluding(endpoint.URL)
    if healthyBackups == 0 {
        shouldMarkUnhealthy = false // Keep serving with stale data vs. total outage
    }
}

Trade-off accepted: slightly stale data is better than zero availability.

2) Sub-second failover without client retries

  • Proxy-side failover via round-robin across healthy endpoints only
  • Node removed from rotation immediately after health check failure
  • Effective failover: sub-second in the typical case, roughly one second in the worst case (200ms check interval × 5 consecutive failed checks)
  • No client-side complexity, no retry storms

Trade-off accepted: in-flight requests fail on crash. Predictable > magical.
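
A minimal sketch of the healthy-only round-robin selection; the manager type, counter, and error message are assumptions, while the health flags match the Endpoint struct above:

package proxy

import (
    "errors"
    "sync"
    "sync/atomic"
)

// endpointManager is a simplified stand-in for the real endpoint manager;
// Endpoint is the struct shown earlier.
type endpointManager struct {
    mu        sync.RWMutex
    endpoints []*Endpoint
    rrCounter atomic.Uint64
}

// nextHealthyEndpoint walks the ring once, skipping anything the health model
// has flagged. Unhealthy nodes are invisible to selection, so failover is just
// the next request landing on a different node, with no client-side retries.
func (em *endpointManager) nextHealthyEndpoint() (*Endpoint, error) {
    em.mu.RLock()
    defer em.mu.RUnlock()

    n := len(em.endpoints)
    for i := 0; i < n; i++ {
        ep := em.endpoints[int(em.rrCounter.Add(1)%uint64(n))]
        if ep.uptimeHealthy && ep.slotSyncHealthy {
            return ep, nil
        }
    }
    return nil, errors.New("no healthy endpoints available")
}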

3) Custom Go RPC proxy (in-house)

I built the proxy from scratch to handle things HAProxy cannot:

HTTP JSON-RPC Layer

  • RPC request parsing — batch and single request support with proper ID handling
  • Method-level caching via Redis:
    • getTransaction (TTL: 10 min, only non-null results)
    • getProgramAccounts (TTL: 5s)
    • getAccountInfo (TTL: 5s)
    • getMultipleAccounts (TTL: 5s)
  • Cache key: SHA-256 hash of method + params
  • Cache hit indicated via X-Cache-Status: HIT header
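
A minimal sketch of how the cache key and Redis lookup could be wired up, assuming go-redis as the client (helper names are illustrative):

package proxy

import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "time"

    "github.com/redis/go-redis/v9"
)

// cacheKey hashes method + params so identical requests map to the same entry.
func cacheKey(method string, params json.RawMessage) string {
    h := sha256.New()
    h.Write([]byte(method))
    h.Write(params)
    return hex.EncodeToString(h.Sum(nil))
}

// cachedResponse returns the stored JSON-RPC result for a request, or false on miss.
func cachedResponse(ctx context.Context, rdb *redis.Client, method string, params json.RawMessage) ([]byte, bool) {
    val, err := rdb.Get(ctx, cacheKey(method, params)).Bytes()
    if err != nil {
        return nil, false // miss (or Redis error): fall through to the nodes
    }
    return val, true
}

// storeResponse writes a result with the per-method TTL (e.g. 10 min for
// getTransaction, 5s for account lookups).
func storeResponse(ctx context.Context, rdb *redis.Client, method string, params json.RawMessage, result []byte, ttl time.Duration) {
    _ = rdb.Set(ctx, cacheKey(method, params), result, ttl).Err()
}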

Yellowstone gRPC Gateway

  • Transparent bidirectional streaming proxy — raw bytes, no serialization overhead
  • All Geyser methods supported: Subscribe, Ping, GetSlot, GetLatestBlockhash, etc.
  • Per-IP concurrent stream limits (max_grpc_streams)
  • Automatic TLS for port 443 endpoints
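
A minimal sketch of a per-IP concurrent stream cap of the kind max_grpc_streams implies (type and method names are assumptions):

package proxy

import (
    "fmt"
    "sync"
)

// streamLimiter tracks active gRPC streams per client IP.
type streamLimiter struct {
    mu     sync.Mutex
    active map[string]int
    max    int // max_grpc_streams from config
}

func newStreamLimiter(max int) *streamLimiter {
    return &streamLimiter{active: make(map[string]int), max: max}
}

// acquire reserves a stream slot for ip, or rejects if the cap is reached.
// The returned release func must be called when the stream ends.
func (l *streamLimiter) acquire(ip string) (release func(), err error) {
    l.mu.Lock()
    defer l.mu.Unlock()
    if l.active[ip] >= l.max {
        return nil, fmt.Errorf("too many concurrent streams for %s", ip)
    }
    l.active[ip]++
    return func() {
        l.mu.Lock()
        defer l.mu.Unlock()
        if l.active[ip]--; l.active[ip] <= 0 {
            delete(l.active, ip)
        }
    }, nil
}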

Rate Limiting

  • Sharded in-memory rate limiters (16 shards to reduce lock contention; see the sketch below)
  • Per-IP RPS limits with token bucket algorithm
  • IP whitelist with hot-reload (file watcher on ips.json)
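
A minimal sketch of the sharded per-IP token buckets, using golang.org/x/time/rate (the shard count matches the text; the rest of the names are assumptions):

package proxy

import (
    "hash/fnv"
    "sync"

    "golang.org/x/time/rate"
)

const shardCount = 16 // shards reduce lock contention under high RPS

type limiterShard struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
}

type ipRateLimiter struct {
    shards [shardCount]limiterShard
    rps    rate.Limit
    burst  int
}

func newIPRateLimiter(rps float64, burst int) *ipRateLimiter {
    l := &ipRateLimiter{rps: rate.Limit(rps), burst: burst}
    for i := range l.shards {
        l.shards[i].limiters = make(map[string]*rate.Limiter)
    }
    return l
}

// allow reports whether the request from ip fits within its token bucket.
func (l *ipRateLimiter) allow(ip string) bool {
    h := fnv.New32a()
    h.Write([]byte(ip))
    shard := &l.shards[h.Sum32()%shardCount]

    shard.mu.Lock()
    lim, ok := shard.limiters[ip]
    if !ok {
        lim = rate.NewLimiter(l.rps, l.burst)
        shard.limiters[ip] = lim
    }
    shard.mu.Unlock()

    return lim.Allow()
}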

Operational Features

  • Dynamic config reloads without restart (fsnotify on config.json)
  • Debug endpoint (/debug/status) with per-method call statistics
  • Structured logging with Seq integration
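
A minimal sketch of the config hot-reload path with fsnotify; the config shape and apply callback are assumptions:

package proxy

import (
    "encoding/json"
    "log"
    "os"

    "github.com/fsnotify/fsnotify"
)

// watchConfig reloads config.json whenever it changes and hands the parsed
// result to apply. Invalid JSON keeps the previous config in place.
func watchConfig(path string, apply func(cfg map[string]any)) error {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    if err := watcher.Add(path); err != nil {
        return err
    }

    go func() {
        defer watcher.Close()
        for {
            select {
            case ev, ok := <-watcher.Events:
                if !ok {
                    return
                }
                if ev.Op&(fsnotify.Write|fsnotify.Create) == 0 {
                    continue
                }
                raw, err := os.ReadFile(path)
                if err != nil {
                    log.Printf("config reload: read failed: %v", err)
                    continue
                }
                var cfg map[string]any
                if err := json.Unmarshal(raw, &cfg); err != nil {
                    log.Printf("config reload: invalid JSON, keeping old config: %v", err)
                    continue
                }
                apply(cfg)
            case err, ok := <-watcher.Errors:
                if !ok {
                    return
                }
                log.Printf("config watcher error: %v", err)
            }
        }
    }()
    return nil
}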

Performance Optimizations

  • HTTP/2 support enabled (ForceAttemptHTTP2: true)
  • Connection pooling: 1000 idle connections, 500 per host
  • 64MB max message size for gRPC
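
A minimal sketch of the upstream HTTP transport with those settings; the exact field mapping (e.g. MaxIdleConnsPerHost for the per-host figure) is an assumption:

package proxy

import (
    "net/http"
    "time"
)

// newUpstreamClient builds the client used to talk to the RPC nodes.
func newUpstreamClient() *http.Client {
    transport := &http.Transport{
        ForceAttemptHTTP2:   true, // HTTP/2 where the upstream supports it
        MaxIdleConns:        1000, // total idle connections across all nodes
        MaxIdleConnsPerHost: 500,  // idle connections kept per RPC node
        IdleConnTimeout:     90 * time.Second,
    }
    return &http.Client{Transport: transport}
}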

4) Bare-metal RPC cluster

  • Separate physical servers
  • Same region / same DC
  • No shared state
  • Automatic re-entry into rotation when healthy

Production Results (last 6 months)

  • 99.95% uptime
  • 10k RPS sustained for weeks (read-heavy workload)
  • 0 manual traffic intervention
  • 0 desync incidents served to users

This system has been running unchanged in production for 6+ months.

Operability Philosophy

Minimal, intentional observability:

  • Grafana: HAProxy 200/500 stats
  • Telegram: critical alerts only (real on-call)
  • Seq: structured logs for:
    • proxy errors
    • config updates
    • node health transitions (with slot numbers)

Why This Is Non-Trivial

Most "HA RPC" setups solve availability.

This system solves correctness under partial failure.

On Solana: a responsive RPC node can still be a broken RPC node. Detecting and enforcing that distinction at the routing level — with hysteresis, smart failover protection, and dual-protocol support — is what makes this production-grade.

Source Code

The gateway is open source: github.com/jxad/solana-rpc-gateway