High-Availability Solana RPC — sub-second failover, desync-safe routing

January 5, 2026

Problem I Solved

Solana RPC nodes can be alive but wrong.

A desynced node (slot lag) still answers requests, and classic L4/L7 load balancers cannot reliably detect this condition.

My initial setup had hard limitations:

  • Desynced nodes kept receiving traffic
  • No RPC-level caching
  • No per-user or per-IP rate limits
  • Manual intervention required to remove bad nodes
  • User-facing downtime during incidents

What I Built

A desync-aware, high-availability RPC gateway with deterministic failover, traffic gating, and dual-protocol support (HTTP JSON-RPC + Yellowstone gRPC), written entirely in TypeScript and running on the Bun runtime.

Architecture

Clients (HTTP/gRPC)
        ↓
  Bun RPC Server / gRPC Server
        ↓
  GatewayEngine (plugin pipeline)
   ├─ RateLimitPlugin
   ├─ CachePlugin
   └─ MetricsPlugin
        ↓
  NodeManager (health-aware round-robin)
        ↓
Solana RPC nodes (bare metal)

Supporting services:
  ├─ Redis (cache + rate limits + stream counters)
  ├─ PostgreSQL (node registry, API keys, IP access, runtime config)
  └─ Admin API (JWT-authenticated REST management plane)

Core Engineering Contributions

1) Desync-aware health model (not just crash detection)

The health check job runs on a configurable interval (default: 15s) and does two things per node:

  • Reachability check — calls getHealth and getSlot JSON-RPC methods with a timeout; marks unhealthy after 3 consecutive errors
  • Slot sync check — compares each node's reported slot against a reference value (a configured reference endpoint URL, or the highest slot seen across the pool)
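The probe can be sketched as a small function over an injected JSON-RPC transport, which keeps it testable without a live node. The names here (probeNode, JsonRpcCall) are illustrative, not the project's actual API:

```typescript
// Illustrative sketch of the per-node reachability probe.
// `call` abstracts the JSON-RPC transport so the probe is testable.
type JsonRpcCall = (method: string) => Promise<unknown>;

interface ProbeResult {
  ok: boolean;
  slot: number | null;
  latencyMs: number | null;
}

async function probeNode(call: JsonRpcCall, timeoutMs = 5_000): Promise<ProbeResult> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("probe timeout")), timeoutMs);
  });
  try {
    // getHealth answers "ok" on a healthy node; getSlot returns its current slot
    await Promise.race([call("getHealth"), timeout]);
    const slot = (await Promise.race([call("getSlot"), timeout])) as number;
    return { ok: true, slot, latencyMs: Date.now() - started };
  } catch {
    // Timeouts and transport errors both count toward the consecutive-error budget
    return { ok: false, slot: null, latencyMs: null };
  } finally {
    clearTimeout(timer!);
  }
}
```

The caller would bump errorCount on `ok: false` and mark the node unhealthy at three consecutive failures, per the rule above.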

Each node maintains runtime state:

interface NodeRuntimeState {
  isHealthy: boolean;
  lastSlot: number | null;
  lastLatency: number | null;
  errorCount: number;
  lastHealthCheck: Date | null;
}

Classification logic:

  • crash → node unreachable (3 consecutive errors → marked unhealthy)
  • desync → slot lag exceeds DESYNC_SLOT_THRESHOLD_UNHEALTHY (default: 15 slots)
  • recovery → node re-enters rotation when lag drops below DESYNC_SLOT_THRESHOLD_HEALTHY (default: 5 slots)

Result: desynced nodes receive zero traffic.

Hysteresis to prevent flapping

Dual-threshold recovery prevents rapid on/off cycling during network instability:

DESYNC_SLOT_THRESHOLD_UNHEALTHY=15
DESYNC_SLOT_THRESHOLD_HEALTHY=5

A node marked unhealthy at 15+ slots of lag won't re-enter rotation until it's within 5 slots. The thresholds are configurable at runtime via the Admin API without a restart.

Smart desync protection

When DESYNC_KEEP_ONE_ONLINE=true, the system refuses to mark the last healthy node as unhealthy due to slot lag:

// Before marking a desynced node unhealthy,
// check if it's the last one standing
if (env.DESYNC_KEEP_ONE_ONLINE) {
  const healthyCount = this.nodes.filter(
    (n) => n.id !== node.id && n.state.isHealthy
  ).length;
  if (healthyCount === 0) {
    // Keep serving with stale data rather than causing a total outage
    return; // skip marking unhealthy
  }
}

Trade-off accepted: slightly stale data is better than zero availability.

2) Failover without client retries

  • NodeManager maintains in-memory round-robin state per node pool (separate pools for RPC and gRPC)
  • Unhealthy nodes are removed from the rotation immediately after the health check job updates runtime state
  • In-flight requests fail on crash; subsequent requests route automatically to the next healthy node
  • Detection latency is bounded by HEALTH_CHECK_INTERVAL_MS (default: 15s); once runtime state is updated, rerouting takes effect on the very next request

No database is consulted per request. All routing state is held in memory and refreshed on a background schedule.
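The selection logic can be sketched as a cursor over an in-memory pool that skips unhealthy entries. This is a toy version (the real NodeManager keeps separate pools for RPC and gRPC backends):

```typescript
// Health-aware round-robin over an in-memory node pool.
interface PoolNode {
  url: string;
  isHealthy: boolean;
}

class RoundRobinPool {
  private cursor = 0;
  constructor(private nodes: PoolNode[]) {}

  // Returns the next healthy node, or null if no node is available.
  // Scans at most one full lap, so unhealthy nodes cost O(1) skips.
  next(): PoolNode | null {
    for (let i = 0; i < this.nodes.length; i++) {
      const node = this.nodes[this.cursor];
      this.cursor = (this.cursor + 1) % this.nodes.length;
      if (node.isHealthy) return node;
    }
    return null;
  }
}
```

Because the health-check job mutates isHealthy in place, a node marked unhealthy disappears from rotation on the next call without any per-request database lookup.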

3) Plugin-based gateway engine

Rather than hardcoding cross-cutting concerns into the proxy, all non-routing logic is implemented as plugins:

interface GatewayPlugin {
  name: string;
  onRequest(ctx: GatewayContext): Promise<void>;
  onResponse?(ctx: GatewayContext): Promise<void>;
}

Plugins run sequentially. Any plugin can set ctx.handled = true to short-circuit the pipeline (e.g. a cache hit). The gateway is a pure proxy if instantiated with no plugins.
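The dispatch loop is small enough to sketch in full. This toy version uses an illustrative context (the log field is only there to make the flow observable), not the project's real GatewayContext:

```typescript
// Sequential plugin pipeline with short-circuiting via ctx.handled.
interface GatewayContext {
  handled: boolean;
  log: string[]; // illustrative: records what ran
}

interface GatewayPlugin {
  name: string;
  onRequest(ctx: GatewayContext): Promise<void>;
}

async function runPipeline(plugins: GatewayPlugin[], ctx: GatewayContext): Promise<void> {
  for (const plugin of plugins) {
    await plugin.onRequest(ctx);
    if (ctx.handled) return; // e.g. cache hit: skip remaining plugins and the proxy
  }
  ctx.log.push("proxied"); // stand-in for forwarding to a backend node
}
```

With an empty plugin array the loop is a no-op and every request is proxied, which is the "pure proxy" mode described above.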

RPC plugin pipeline:

  1. RpcRateLimitPlugin — Redis token bucket, API key or IP
  2. RpcCachePlugin — method-level Redis cache, returns X-Cache: HIT on hit
  3. RpcMetricsPlugin — Prometheus counters and histograms

gRPC plugin pipeline:

  1. GrpcRateLimitPlugin — same token bucket, includes gRPC access check
  2. GrpcStreamLimiterPlugin — concurrent stream counter per entity
  3. GrpcMetricsPlugin — Prometheus counters and histograms

4) Redis-based rate limiting (token bucket, atomic)

Rate limiting runs entirely on Redis via a Lua script for atomicity. No in-memory state is involved, making it correct across multiple gateway instances behind a load balancer.

Algorithm: Token Bucket

  • Key: ratelimit:{entityId} stored as a Redis hash (tokens, last_refill)
  • Refill rate: rpsLimit / 1000 tokens per millisecond
  • Capacity: rpsLimit tokens
  • Single Lua call per request (atomic check + consume)
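The arithmetic the Lua script performs atomically can be mirrored locally for clarity. This sketch operates on a plain object standing in for the Redis hash; in production the check-and-consume must stay inside the single Lua call to remain atomic across gateway instances:

```typescript
// Local model of the token-bucket math (Redis hash fields: tokens, last_refill).
interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

// Returns [allowed, retryAfterSeconds].
function takeToken(bucket: Bucket, rpsLimit: number, nowMs: number): [boolean, number] {
  // Refill: rpsLimit / 1000 tokens per elapsed millisecond, capped at capacity
  const elapsed = nowMs - bucket.lastRefillMs;
  bucket.tokens = Math.min(rpsLimit, bucket.tokens + elapsed * (rpsLimit / 1000));
  bucket.lastRefillMs = nowMs;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return [true, 0];
  }
  // Seconds until one full token accrues — the basis of the Retry-After header
  return [false, (1 - bucket.tokens) / rpsLimit];
}
```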

Entity resolution:

  • API key present → rate limit by API key (x-api-key header or api-key query param)
  • No API key → rate limit by IP
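The resolution order above is a two-line decision; a hedged sketch over a minimal request shape (RequestLike and resolveEntity are illustrative names):

```typescript
// Entity resolution: prefer API key (header, then query param), else client IP.
interface RequestLike {
  headers: Record<string, string | undefined>;
  query: Record<string, string | undefined>;
  ip: string;
}

function resolveEntity(req: RequestLike): { type: "api-key" | "ip"; id: string } {
  const key = req.headers["x-api-key"] ?? req.query["api-key"];
  return key ? { type: "api-key", id: key } : { type: "ip", id: req.ip };
}
```

The resulting `{type, id}` pair becomes the `ratelimit:{entityId}` key, so a keyed user and an anonymous IP never share a bucket.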

Fail-open policy:

  • On Redis failure: allows request if RATE_LIMIT_FAIL_OPEN=true (default), denies otherwise
  • Logs error with context; does not crash the gateway

Response headers on rejection:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: 0.43

5) Method-level RPC caching

The cache plugin uses Redis with deterministic SHA-256 cache keys computed via Bun's native CryptoHasher:

rpc_cache:{method}:{sha256(params)}
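The key derivation looks roughly like this. The project uses Bun's native CryptoHasher; node:crypto is shown here as a portable equivalent:

```typescript
import { createHash } from "node:crypto";

// Deterministic cache key: rpc_cache:{method}:{sha256(params)}.
function cacheKey(method: string, params: unknown): string {
  const digest = createHash("sha256")
    .update(JSON.stringify(params ?? []))
    .digest("hex");
  return `rpc_cache:${method}:${digest}`;
}
```

One caveat with JSON.stringify-based keys: object params must be serialized with stable field order (or canonicalized first), otherwise logically identical requests can produce different keys and miss the cache.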

Default TTLs:

| Method                | TTL  |
| --------------------- | ---- |
| getTransaction        | 600s |
| getProgramAccounts    | 5s   |
| getAccountInfo        | 5s   |
| getMultipleAccounts   | 5s   |

These are configurable at runtime via CACHE_TTL_MAP (JSON env var) or through the Admin API. Cache can be enabled/disabled without a restart.

6) Yellowstone gRPC gateway

The gRPC server uses @grpc/grpc-js, loading the Geyser proto at startup.

Supported methods:

  • Bidirectional streaming: Subscribe
  • Unary: Ping, GetSlot, GetLatestBlockhash, GetBlockHeight, GetVersion, IsBlockhashValid, SubscribeReplayInfo

Stream proxying:

  • Transparent byte-level bidirectional proxy — no serialization/deserialization in the hot path
  • Client → Backend: forwards subscription requests
  • Backend → Client: forwards updates
  • Handles stream errors, cancellation, and cleanup
  • Marks backend node unhealthy on UNAVAILABLE gRPC status

Per-entity concurrent stream limiting:

  • Redis counter per entity: grpc_streams:{entityType}:{entityId}
  • Increments on stream open, decrements on close, auto-expires (TTL: 1h)
  • Rejected streams return gRPC RESOURCE_EXHAUSTED
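The limiter's increment-check-rollback pattern can be sketched with the counter store injected, so the logic is independent of Redis (which the gateway drives with INCR/DECR plus a 1h TTL). Names here are illustrative:

```typescript
// Per-entity concurrent stream limiter with a pluggable counter store.
interface CounterStore {
  incr(key: string): number;
  decr(key: string): number;
}

class StreamLimiter {
  constructor(private store: CounterStore, private maxStreams: number) {}

  // Returns true if the stream may open; the caller must call release() on close.
  tryOpen(entityKey: string): boolean {
    if (this.store.incr(entityKey) > this.maxStreams) {
      this.store.decr(entityKey); // over limit: roll back the increment
      return false;               // gateway maps this to RESOURCE_EXHAUSTED
    }
    return true;
  }

  release(entityKey: string): void {
    this.store.decr(entityKey);
  }
}
```

Incrementing first and rolling back on rejection keeps the check race-free when many streams open concurrently against the shared counter.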

Connection pooling:

  • GrpcClientPool reuses clients per backend address
  • Supports TLS (grpcs://) and plaintext (grpc://)
  • Keepalive: 10s interval, 5s timeout
  • Max message size: 64MB

7) Admin API and runtime reconfiguration

All operational parameters are configurable at runtime without a restart via a JWT-authenticated REST API:

  • /nodes — add/remove/update backend nodes; view live runtime health state per node
  • /api-keys — create/update/delete API keys with per-key RPS and stream limits
  • /ip-access — IP allowlist/denylist management
  • /config — update desync thresholds, cache settings, rate limit behavior
  • /metrics — Prometheus scrape endpoint
  • /health — liveness probe (no auth), returns 503 when draining or no healthy nodes

Changes propagate via in-memory refresh jobs (default: 30s; brief staleness is acceptable for non-critical config).
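For example, tightening the desync thresholds might look like a request body along these lines (the exact field names accepted by /config are illustrative here; the values correspond to the env vars described above):

```json
{
  "DESYNC_SLOT_THRESHOLD_UNHEALTHY": 20,
  "DESYNC_SLOT_THRESHOLD_HEALTHY": 8,
  "RATE_LIMIT_FAIL_OPEN": true
}
```

The next refresh job picks the values up and the health-check loop applies them on its following pass — no restart, no dropped connections.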

8) Graceful shutdown and draining

The gateway supports a draining state for zero-downtime deploys:

  • Sets is_draining Prometheus gauge
  • Stops accepting new requests
  • Waits for active in-flight requests and open gRPC streams to complete
  • Configurable drain timeout prevents hanging indefinitely
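The drain bookkeeping reduces to a small state machine: stop admitting work, count in-flight work down to zero, and let a timeout cap the wait. A hedged sketch (class and method names are illustrative):

```typescript
// Drain bookkeeping: reject new work while draining, report drained at zero.
class DrainController {
  private active = 0;
  private draining = false;

  begin(): void {
    this.draining = true; // also the point where is_draining gauge flips to 1
  }

  // Call at request/stream start. Returns false once draining has begun.
  tryStart(): boolean {
    if (this.draining) return false;
    this.active++;
    return true;
  }

  // Call at request/stream completion.
  end(): void {
    this.active = Math.max(0, this.active - 1);
  }

  // Shutdown polls this (bounded by the configurable drain timeout).
  get isDrained(): boolean {
    return this.draining && this.active === 0;
  }
}
```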

Observability

Prometheus metrics (exposed at /metrics):

| Metric                       | Type      | Description                        |
| ---------------------------- | --------- | ---------------------------------- |
| rpc_requests_total           | Counter   | Total RPC requests                 |
| rpc_request_duration_ms      | Histogram | End-to-end latency                 |
| rpc_method_requests_total    | Counter   | Requests per JSON-RPC method       |
| rpc_errors_total             | Counter   | Errors by method and status        |
| grpc_requests_total          | Counter   | Total gRPC requests                |
| grpc_stream_open_total       | Counter   | Total streams opened               |
| grpc_stream_rejected_total   | Counter   | Streams rejected by limiter        |
| active_grpc_streams          | Gauge     | Currently open streams             |
| cache_hits_total             | Counter   | Cache hits                         |
| cache_misses_total           | Counter   | Cache misses                       |
| rate_limit_triggered_total   | Counter   | Rate limit triggers by entity type |
| healthy_nodes                | Gauge     | Healthy nodes by type (rpc/grpc)   |
| node_latency_ms              | Histogram | Backend node latency               |
| is_draining                  | Gauge     | Draining state                     |
| active_requests              | Gauge     | In-flight requests                 |

Structured logging via Winston with request IDs for tracing across log lines.

Source Code

The gateway is open source: github.com/jxad/solana-rpc-gateway