Problem I Solved
Solana RPC nodes can be alive but wrong.
A desynced node (slot lag) still answers requests, and classic L4/L7 load balancers cannot reliably detect this condition.
My initial setup had serious limitations:
- Desynced nodes kept receiving traffic
- No RPC-level caching
- No per-user or per-IP rate limits
- Manual intervention required to remove bad nodes
- User-facing downtime during incidents
What I Built
A desync-aware, high-availability RPC gateway with deterministic failover, traffic gating, and dual-protocol support (HTTP JSON-RPC + Yellowstone gRPC), written entirely in TypeScript and running on the Bun runtime.
Architecture
```
Clients (HTTP/gRPC)
        ↓
Bun RPC Server / gRPC Server
        ↓
GatewayEngine (plugin pipeline)
 ├─ RateLimitPlugin
 ├─ CachePlugin
 └─ MetricsPlugin
        ↓
NodeManager (health-aware round-robin)
        ↓
Solana RPC nodes (bare metal)
```

Supporting services:

```
├─ Redis (cache + rate limits + stream counters)
├─ PostgreSQL (node registry, API keys, IP access, runtime config)
└─ Admin API (JWT-authenticated REST management plane)
```
Core Engineering Contributions
1) Desync-aware health model (not just crash detection)
The health check job runs on a configurable interval (default: 15s) and does two things per node:
- Reachability check — calls the `getHealth` and `getSlot` JSON-RPC methods with a timeout; marks the node unhealthy after 3 consecutive errors
- Slot sync check — compares each node's reported slot against a reference value (a configured reference endpoint URL, or the highest slot seen across the pool)
Each node maintains runtime state:
```typescript
interface NodeRuntimeState {
  isHealthy: boolean;
  lastSlot: number | null;
  lastLatency: number | null;
  errorCount: number;
  lastHealthCheck: Date | null;
}
```
Classification logic:
- crash → node unreachable (3 consecutive errors → marked unhealthy)
- desync → slot lag exceeds `DESYNC_SLOT_THRESHOLD_UNHEALTHY` (default: 15 slots)
- recovery → node re-enters rotation when lag drops below `DESYNC_SLOT_THRESHOLD_HEALTHY` (default: 5 slots)
Result: desynced nodes receive zero traffic.
Hysteresis to prevent flapping
Dual-threshold recovery prevents rapid on/off cycling during network instability:
```
DESYNC_SLOT_THRESHOLD_UNHEALTHY=15
DESYNC_SLOT_THRESHOLD_HEALTHY=5
```
A node marked unhealthy at 15+ slots of lag won't re-enter rotation until it's within 5 slots. The thresholds are configurable at runtime via the Admin API without a restart.
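The dual-threshold logic boils down to a small state transition. A minimal sketch, assuming the default thresholds above (function and constant names are illustrative, not the gateway's actual code):

```typescript
// Dual-threshold (hysteresis) desync classifier — illustrative sketch.
// Thresholds mirror the documented defaults; names are hypothetical.
const THRESHOLD_UNHEALTHY = 15; // mark unhealthy at >= 15 slots of lag
const THRESHOLD_HEALTHY = 5;    // re-enter rotation only when lag < 5 slots

function nextHealthState(wasHealthy: boolean, slotLag: number): boolean {
  if (wasHealthy) {
    // A healthy node stays healthy until lag crosses the upper threshold.
    return slotLag < THRESHOLD_UNHEALTHY;
  }
  // An unhealthy node recovers only below the lower threshold, so lag
  // hovering around 15 slots cannot cause rapid on/off flapping.
  return slotLag < THRESHOLD_HEALTHY;
}
```

A node lagging by, say, 10 slots keeps whatever state it already has — that dead band between the two thresholds is what suppresses flapping.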
Smart desync protection
When `DESYNC_KEEP_ONE_ONLINE=true`, the system refuses to mark the last healthy node as unhealthy due to slot lag:
```typescript
// Before marking a desynced node unhealthy,
// check if it's the last one standing
if (env.DESYNC_KEEP_ONE_ONLINE) {
  const healthyCount = /* count of currently healthy nodes excluding this one */;
  if (healthyCount === 0) {
    // Keep serving with stale data rather than causing total outage
    return; // skip marking unhealthy
  }
}
```
Trade-off accepted: slightly stale data is better than zero availability.
2) Failover without client retries
- `NodeManager` maintains in-memory round-robin state per node pool (separate pools for RPC and gRPC)
- Unhealthy nodes are removed from the rotation immediately after the health check job updates runtime state
- In-flight requests fail on crash; subsequent requests route automatically to the next healthy node
- Effective failover time is bounded by `HEALTH_CHECK_INTERVAL_MS` (default: 15s)
No database is consulted per request. All routing state is held in memory and refreshed on a background schedule.
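The selection logic can be sketched as a cursor over an in-memory pool that skips unhealthy entries — a simplified illustration, not the real `NodeManager` (which tracks the full `NodeRuntimeState` shown earlier):

```typescript
// Health-aware round-robin over an in-memory pool — illustrative sketch.
interface PoolNode {
  url: string;
  isHealthy: boolean;
}

class RoundRobinPool {
  private cursor = 0;
  constructor(private nodes: PoolNode[]) {}

  // Returns the next healthy node, or null if the whole pool is down
  // (in which case the gateway surfaces an error / 503).
  next(): PoolNode | null {
    for (let i = 0; i < this.nodes.length; i++) {
      const node = this.nodes[this.cursor];
      this.cursor = (this.cursor + 1) % this.nodes.length;
      if (node.isHealthy) return node;
    }
    return null;
  }
}
```

Because `isHealthy` is flipped by the background health job, a node drops out of rotation on the very next request after it is marked unhealthy — no per-request database lookup, no client-side retry.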
3) Plugin-based gateway engine
Rather than hardcoding cross-cutting concerns into the proxy, all non-routing logic is implemented as plugins:
```typescript
interface GatewayPlugin {
  name: string;
  onRequest(ctx: GatewayContext): Promise<void>;
  onResponse?(ctx: GatewayContext): Promise<void>;
}
```
Plugins run sequentially. Any plugin can set `ctx.handled = true` to short-circuit the pipeline (e.g. a cache hit). The gateway is a pure proxy if instantiated with no plugins.
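The execution model is a simple sequential loop with a short-circuit check. A minimal sketch, assuming a pared-down `GatewayContext` (the real context carries more fields):

```typescript
// Sequential plugin pipeline with short-circuit — illustrative sketch.
interface GatewayContext {
  handled: boolean;   // set by a plugin to stop the pipeline (e.g. cache hit)
  response?: unknown; // the short-circuiting plugin supplies the response
}

interface GatewayPlugin {
  name: string;
  onRequest(ctx: GatewayContext): Promise<void>;
  onResponse?(ctx: GatewayContext): Promise<void>;
}

async function runPipeline(plugins: GatewayPlugin[], ctx: GatewayContext): Promise<void> {
  for (const plugin of plugins) {
    await plugin.onRequest(ctx);
    if (ctx.handled) break; // skip remaining plugins and the backend proxy
  }
}
```

With an empty `plugins` array the loop body never runs, which is exactly the "pure proxy" degenerate case.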
RPC plugin pipeline:
- `RpcRateLimitPlugin` — Redis token bucket, keyed by API key or IP
- `RpcCachePlugin` — method-level Redis cache, returns `X-Cache: HIT` on a hit
- `RpcMetricsPlugin` — Prometheus counters and histograms
gRPC plugin pipeline:
- `GrpcRateLimitPlugin` — same token bucket, includes a gRPC access check
- `GrpcStreamLimiterPlugin` — concurrent stream counter per entity
- `GrpcMetricsPlugin` — Prometheus counters and histograms
4) Redis-based rate limiting (token bucket, atomic)
Rate limiting runs entirely on Redis via a Lua script for atomicity. No in-memory state is involved, making it correct across multiple gateway instances behind a load balancer.
Algorithm: Token Bucket
- Key: `ratelimit:{entityId}`, stored as a Redis hash (`tokens`, `last_refill`)
- Refill rate: `rpsLimit / 1000` tokens per millisecond
- Capacity: `rpsLimit` tokens
- Single Lua call per request (atomic check + consume)
Entity resolution:
- API key present → rate limit by API key (`x-api-key` header or `api-key` query param)
- No API key → rate limit by IP
Fail-open policy:
- On Redis failure: allows the request if `RATE_LIMIT_FAIL_OPEN=true` (the default), denies otherwise
- Logs the error with context; does not crash the gateway
Response headers on rejection:
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: 0.43
```
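The refill arithmetic the Lua script performs atomically can be shown in plain TypeScript. This standalone version exists only to illustrate the math — in the gateway the same steps run inside a single Redis `EVAL`, and the names here are hypothetical:

```typescript
// Token-bucket check-and-consume — the arithmetic behind the Lua script.
interface Bucket {
  tokens: number;     // current token count
  lastRefill: number; // ms timestamp of the last refill
}

// rpsLimit is both the bucket capacity and the per-second refill rate.
function tryConsume(bucket: Bucket, rpsLimit: number, nowMs: number): boolean {
  const elapsedMs = nowMs - bucket.lastRefill;
  // Refill at rpsLimit / 1000 tokens per millisecond, capped at capacity.
  bucket.tokens = Math.min(rpsLimit, bucket.tokens + elapsedMs * (rpsLimit / 1000));
  bucket.lastRefill = nowMs;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true;  // allowed
  }
  return false;   // rejected — caller emits the X-RateLimit-* headers
}
```

Running it in Redis as one Lua call makes the read-modify-write atomic, so concurrent gateway instances behind a load balancer cannot double-spend the same token.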
5) Method-level RPC caching
The cache plugin uses Redis with deterministic SHA-256 cache keys computed via Bun's native CryptoHasher:
```
rpc_cache:{method}:{sha256(params)}
```
Default TTLs:
| Method | TTL |
| --------------------- | ---- |
| getTransaction | 600s |
| getProgramAccounts | 5s |
| getAccountInfo | 5s |
| getMultipleAccounts | 5s |
These are configurable at runtime via `CACHE_TTL_MAP` (a JSON env var) or through the Admin API. Caching can be enabled or disabled without a restart.
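The key derivation is small enough to show in full. A sketch using `node:crypto` for portability — the gateway itself uses Bun's native `CryptoHasher`, which exposes an equivalent update/digest API:

```typescript
import { createHash } from "node:crypto";

// Deterministic cache key: method name plus SHA-256 of the serialized params.
// JSON.stringify is order-sensitive, so two requests produce the same key
// only when their params serialize identically — which is the desired behavior.
function cacheKey(method: string, params: unknown[]): string {
  const digest = createHash("sha256")
    .update(JSON.stringify(params))
    .digest("hex");
  return `rpc_cache:${method}:${digest}`;
}
```

Hashing the params keeps keys bounded in size even for large `getProgramAccounts` filter payloads.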
6) Yellowstone gRPC gateway
The gRPC server uses @grpc/grpc-js, loading the Geyser proto at startup.
Supported methods:
- Bidirectional streaming: `Subscribe`
- Unary: `Ping`, `GetSlot`, `GetLatestBlockhash`, `GetBlockHeight`, `GetVersion`, `IsBlockhashValid`, `SubscribeReplayInfo`
Stream proxying:
- Transparent byte-level bidirectional proxy — no serialization/deserialization in the hot path
- Client → Backend: forwards subscription requests
- Backend → Client: forwards updates
- Handles stream errors, cancellation, and cleanup
- Marks the backend node unhealthy on a gRPC `UNAVAILABLE` status
Per-entity concurrent stream limiting:
- Redis counter per entity: `grpc_streams:{entityType}:{entityId}`
- Increments on stream open, decrements on close, auto-expires (TTL: 1h)
- Rejected streams return gRPC `RESOURCE_EXHAUSTED`
Connection pooling:
- `GrpcClientPool` reuses clients per backend address
- Supports TLS (`grpcs://`) and plaintext (`grpc://`)
- Keepalive: 10s interval, 5s timeout
- Max message size: 64 MB
7) Admin API and runtime reconfiguration
All operational parameters are configurable at runtime without a restart via a JWT-authenticated REST API:
- `/nodes` — add/remove/update backend nodes; view live runtime health state per node
- `/api-keys` — create/update/delete API keys with per-key RPS and stream limits
- `/ip-access` — IP allowlist/denylist management
- `/config` — update desync thresholds, cache settings, rate limit behavior
- `/metrics` — Prometheus scrape endpoint
- `/health` — liveness probe (no auth); returns 503 when draining or when no healthy nodes remain
Changes propagate via in-memory refresh jobs (default: 30s staleness acceptable for non-critical config).
8) Graceful shutdown and draining
The gateway supports a draining state for zero-downtime deploys:
- Sets the `is_draining` Prometheus gauge
- Stops accepting new requests
- Waits for active in-flight requests and open gRPC streams to complete
- A configurable drain timeout prevents hanging indefinitely
Observability
Prometheus metrics (exposed at /metrics):
| Metric | Type | Description |
| ---------------------------- | --------- | ---------------------------------- |
| rpc_requests_total | Counter | Total RPC requests |
| rpc_request_duration_ms | Histogram | End-to-end latency |
| rpc_method_requests_total | Counter | Requests per JSON-RPC method |
| rpc_errors_total | Counter | Errors by method and status |
| grpc_requests_total | Counter | Total gRPC requests |
| grpc_stream_open_total | Counter | Total streams opened |
| grpc_stream_rejected_total | Counter | Streams rejected by limiter |
| active_grpc_streams | Gauge | Currently open streams |
| cache_hits_total | Counter | Cache hits |
| cache_misses_total | Counter | Cache misses |
| rate_limit_triggered_total | Counter | Rate limit triggers by entity type |
| healthy_nodes | Gauge | Healthy nodes by type (rpc/grpc) |
| node_latency_ms | Histogram | Backend node latency |
| is_draining | Gauge | Draining state |
| active_requests | Gauge | In-flight requests |
Structured logging via Winston with request IDs for tracing across log lines.
Source Code
The gateway is open source: github.com/jxad/solana-rpc-gateway