Problem I Solved
Solana RPC nodes can be alive but wrong.
A desynced node (slot lag) still answers requests, and classic L4/L7 load balancers cannot reliably detect this condition.
My initial setup (HAProxy only) had hard limitations:
- Desynced nodes kept receiving traffic
- No RPC caching
- No per-user or per-IP rate limits
- Manual intervention required to remove bad nodes
- User-facing downtime during incidents
What I Built
A desync-aware, high-availability RPC stack with deterministic failover, traffic gating, and dual-protocol support (HTTP JSON-RPC + Yellowstone gRPC).
Architecture
```
Clients (HTTP/gRPC)
        ↓
     HAProxy
        ↓
Custom Go Proxy (control plane)
  ├─ Redis (response cache)
  └─ Endpoint Manager (health state)
        ↓
Solana RPC nodes (bare metal)
```
Key design choice: HAProxy stays dumb and fast. All intelligence lives in a custom Go proxy — health decisions, caching, rate limiting, and protocol-aware routing.
Core Engineering Contributions
1) Desync-aware health model (not just crash detection)
The system runs two independent health check loops:
- Uptime checks — every 200ms (configurable), calling `getSlot` with a 2s timeout
- Slot sync checks — every 5s (configurable), comparing against a reference endpoint
Each endpoint maintains dual health state:
```go
type Endpoint struct {
    uptimeHealthy       bool  // reachable: getSlot answered within the timeout
    slotSyncHealthy     bool  // in sync: slot lag within max_slot_lag
    consecutiveTimeouts int32 // crash detection counter
}
```
Classification logic:
- crash → node unreachable (5 consecutive timeouts → marked unhealthy)
- desync → node reachable but slot lag > `max_slot_lag` (default: 50 slots)
- 503 response → immediate unhealthy marking
Result: nodes serving incorrect data receive zero traffic, not "degraded" traffic.
Hysteresis to prevent flapping
Dual-threshold recovery with configurable lag values:
```json
{
  "max_slot_lag": 50,
  "recovery_slot_lag": 10
}
```
An endpoint marked unhealthy at 50+ slots lag won't recover until it's within 10 slots — preventing rapid on/off cycling during network instability.
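For illustration, a minimal sketch of that dual-threshold decision (function and parameter names here are mine, not the actual implementation):

```go
// nextSlotSyncHealthy applies dual-threshold hysteresis: an endpoint drops out
// past maxSlotLag and only re-enters once it is back within recoverySlotLag.
func nextSlotSyncHealthy(healthy bool, slotLag, maxSlotLag, recoverySlotLag uint64) bool {
    if healthy && slotLag > maxSlotLag {
        return false // too far behind the reference endpoint
    }
    if !healthy && slotLag <= recoverySlotLag {
        return true // fully caught up, safe to re-enter rotation
    }
    return healthy // between the thresholds: keep the previous state (no flapping)
}
```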
Smart desync protection
When `use_smart_desync_logic: true`, the system refuses to mark the last healthy endpoint as unhealthy due to slot lag:
```go
if em.useSmartDesyncLogic && wasSlotSyncHealthy {
    healthyBackups := em.countHealthyEndpointsExcluding(endpoint.URL)
    if healthyBackups == 0 {
        shouldMarkUnhealthy = false // Keep serving with stale data vs. total outage
    }
}
```
Trade-off accepted: slightly stale data is better than zero availability.
2) Sub-second failover without client retries
- Proxy-side failover via round-robin across healthy endpoints only
- Node removed from rotation immediately after health check failure
- Effective failover: under 1 second (200ms check interval × max 5 consecutive failures = 1s worst case)
- No client-side complexity, no retry storms
Trade-off accepted: in-flight requests fail on crash. Predictable > magical.
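A rough sketch of the selection path, assuming the endpoint manager hands the proxy a snapshot of currently healthy endpoints (names are illustrative):

```go
import (
    "errors"
    "sync/atomic"
)

var rrCounter uint64

// pickEndpoint round-robins over the healthy set only; an endpoint that failed
// its last health check is simply absent from the slice, so clients never retry.
func pickEndpoint(healthy []*Endpoint) (*Endpoint, error) {
    if len(healthy) == 0 {
        return nil, errors.New("no healthy endpoints available")
    }
    n := atomic.AddUint64(&rrCounter, 1)
    return healthy[n%uint64(len(healthy))], nil
}
```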
3) Custom Go RPC proxy (in-house)
I built the proxy from scratch to handle things HAProxy cannot:
HTTP JSON-RPC Layer
- RPC request parsing — batch and single request support with proper ID handling
- Method-level caching via Redis:
  - `getTransaction` (TTL: 10 min, only non-null results)
  - `getProgramAccounts` (TTL: 5s)
  - `getAccountInfo` (TTL: 5s)
  - `getMultipleAccounts` (TTL: 5s)
- Cache key: SHA-256 hash of `method + params`
- Cache hit indicated via `X-Cache-Status: HIT` header
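Roughly how the cache path fits together, assuming a go-redis v9 client (function names and the key prefix are illustrative, not the actual implementation):

```go
import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "time"

    "github.com/redis/go-redis/v9"
)

// cacheKey hashes method + raw params so identical requests share one entry.
func cacheKey(method string, rawParams []byte) string {
    h := sha256.New()
    h.Write([]byte(method))
    h.Write(rawParams)
    return "rpc:" + hex.EncodeToString(h.Sum(nil))
}

// lookupOrFetch serves from Redis on a hit; on a miss it forwards upstream and
// stores the response with the per-method TTL (e.g. 10 min for getTransaction).
func lookupOrFetch(ctx context.Context, rdb *redis.Client, method string, rawParams []byte,
    ttl time.Duration, fetch func() ([]byte, error)) (body []byte, hit bool, err error) {

    key := cacheKey(method, rawParams)
    if cached, err := rdb.Get(ctx, key).Bytes(); err == nil {
        return cached, true, nil // caller sets X-Cache-Status: HIT
    }
    body, err = fetch()
    if err != nil {
        return nil, false, err
    }
    rdb.Set(ctx, key, body, ttl)
    return body, false, nil
}
```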
Yellowstone gRPC Gateway
- Transparent bidirectional streaming proxy — raw bytes, no serialization overhead
- All Geyser methods supported: `Subscribe`, `Ping`, `GetSlot`, `GetLatestBlockhash`, etc.
- Per-IP concurrent stream limits (`max_grpc_streams`)
- Automatic TLS for port 443 endpoints
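A simplified sketch of the per-IP stream gate (the real gateway wires this into its gRPC stream handler; names are mine):

```go
import "sync"

// streamGate caps concurrent gRPC streams per client IP (max_grpc_streams).
type streamGate struct {
    mu     sync.Mutex
    active map[string]int
    limit  int
}

func newStreamGate(limit int) *streamGate {
    return &streamGate{active: make(map[string]int), limit: limit}
}

// acquire returns false when the IP is already at its concurrent-stream limit.
func (g *streamGate) acquire(ip string) bool {
    g.mu.Lock()
    defer g.mu.Unlock()
    if g.active[ip] >= g.limit {
        return false
    }
    g.active[ip]++
    return true
}

// release is deferred by the stream handler when a stream ends.
func (g *streamGate) release(ip string) {
    g.mu.Lock()
    defer g.mu.Unlock()
    if g.active[ip] > 0 {
        g.active[ip]--
    }
}
```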
Rate Limiting
- Sharded in-memory rate limiters (16 shards for lock contention reduction)
- Per-IP RPS limits with token bucket algorithm
- IP whitelist with hot-reload (file watcher on `ips.json`)
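A sketch of the sharded per-IP limiter, assuming `golang.org/x/time/rate` for the token buckets (the actual code may differ):

```go
import (
    "hash/fnv"
    "sync"

    "golang.org/x/time/rate"
)

const shardCount = 16

// shardedLimiter spreads per-IP token buckets across 16 shards so hot request
// paths don't contend on a single lock.
type shardedLimiter struct {
    shards [shardCount]struct {
        mu       sync.Mutex
        limiters map[string]*rate.Limiter
    }
    rps   rate.Limit
    burst int
}

func newShardedLimiter(rps float64, burst int) *shardedLimiter {
    sl := &shardedLimiter{rps: rate.Limit(rps), burst: burst}
    for i := range sl.shards {
        sl.shards[i].limiters = make(map[string]*rate.Limiter)
    }
    return sl
}

// Allow reports whether a request from ip fits within its token bucket.
func (sl *shardedLimiter) Allow(ip string) bool {
    h := fnv.New32a()
    h.Write([]byte(ip))
    shard := &sl.shards[h.Sum32()%shardCount]

    shard.mu.Lock()
    lim, ok := shard.limiters[ip]
    if !ok {
        lim = rate.NewLimiter(sl.rps, sl.burst)
        shard.limiters[ip] = lim
    }
    shard.mu.Unlock()

    return lim.Allow()
}
```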
Operational Features
- Dynamic config reloads without restart (fsnotify on `config.json`)
- Debug endpoint (`/debug/status`) with per-method call statistics
- Structured logging with Seq integration
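A minimal sketch of the fsnotify-based hot reload (error handling and debouncing trimmed; `reload` is a placeholder for the config parser):

```go
import (
    "log"

    "github.com/fsnotify/fsnotify"
)

// watchConfig re-applies the config whenever the file is written, no restart needed.
func watchConfig(path string, reload func(string) error) error {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    if err := watcher.Add(path); err != nil {
        return err
    }
    go func() {
        for {
            select {
            case event, ok := <-watcher.Events:
                if !ok {
                    return
                }
                if event.Op&fsnotify.Write != 0 {
                    if err := reload(path); err != nil {
                        log.Printf("config reload failed: %v", err)
                    }
                }
            case werr, ok := <-watcher.Errors:
                if !ok {
                    return
                }
                log.Printf("config watcher error: %v", werr)
            }
        }
    }()
    return nil
}
```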
Performance Optimizations
- HTTP/2 support enabled (`ForceAttemptHTTP2: true`)
- Connection pooling: 1000 idle connections, 500 per host
- 64MB max message size for gRPC
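Approximately what those transport settings look like in Go; `IdleConnTimeout` here is an assumed value, not something stated above. On the gRPC side, the 64MB cap maps to the max receive/send message size options on the upstream connection.

```go
import (
    "net/http"
    "time"
)

// upstreamTransport mirrors the pooling/HTTP2 settings listed above (sketch only).
var upstreamTransport = &http.Transport{
    ForceAttemptHTTP2:   true,             // HTTP/2 to upstreams that support it
    MaxIdleConns:        1000,             // total idle connections kept warm
    MaxIdleConnsPerHost: 500,              // per RPC node
    IdleConnTimeout:     90 * time.Second, // assumed value, not from the source
}

var upstreamClient = &http.Client{Transport: upstreamTransport}
```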
4) Bare-metal RPC cluster
- Separate physical servers
- Same region / same DC
- No shared state
- Automatic re-entry into rotation when healthy
Production Results (last 6 months)
- 99.95% uptime
- 10k RPS sustained for weeks (read-heavy workload)
- 0 manual traffic intervention
- 0 desync incidents served to users
This system has been running unchanged in production for 6+ months.
Operability Philosophy
Minimal, intentional observability:
- Grafana: HAProxy 200/500 stats
- Telegram: critical alerts only (real on-call)
- Seq: structured logs for:
- proxy errors
- config updates
- node health transitions (with slot numbers)
Why This Is Non-Trivial
Most "HA RPC" setups solve availability.
This system solves correctness under partial failure.
On Solana: a responsive RPC node can still be a broken RPC node. Detecting and enforcing that distinction at routing level — with hysteresis, smart failover protection, and dual-protocol support — is what makes this production-grade.
Source Code
The gateway is open source: github.com/jxad/solana-rpc-gateway