feat: gateway inter-pod session forwarding for HA multi-replica deployments #1994

Description

@ajtran

Problem Statement

When the OpenShell gateway runs with multiple replicas, a sandbox supervisor establishes its relay session (a persistent bidirectional gRPC stream) to exactly one gateway pod. Client requests (relay opens, sandbox connect, loopback service HTTP) are load-balanced across all pods. A request that lands on a pod that does not hold the supervisor session waits briefly for the session to appear locally and then fails with supervisor session not connected. For multi-agent systems where inter-component traffic flows continuously through the gateway, this makes HA multi-replica deployments operationally unusable.

Technical Context

The SupervisorSessionRegistry in the gateway server is a pure in-memory data structure — a Mutex<HashMap<String, LiveSession>> per pod process. There is no cross-pod session discovery, no shared state for session ownership, and no forwarding path. When a client request arrives at a pod that does not own the session, the pod waits up to SESSION_WAIT_MAX_BACKOFF (2 seconds) and then returns Unavailable. This is by design for simplicity, but it is incompatible with HA deployments where clients may reach any replica.
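
To make the limitation concrete, here is a minimal sketch of a per-process registry of this shape. It is not the gateway's actual code: the LiveSession stand-in carries only a session_id (the real struct also holds the mpsc sender), and the register and lookup method names are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the gateway's LiveSession. The real struct also
/// holds the channel used to send messages to the supervisor.
#[derive(Clone, Debug, PartialEq)]
pub struct LiveSession {
    pub session_id: String,
}

/// One registry per gateway process. Nothing in it is visible to other pods.
#[derive(Default)]
pub struct SupervisorSessionRegistry {
    sessions: Mutex<HashMap<String, LiveSession>>,
}

impl SupervisorSessionRegistry {
    /// Registers a session for a sandbox, returning any session it replaced.
    pub fn register(&self, sandbox_id: &str, session: LiveSession) -> Option<LiveSession> {
        self.sessions
            .lock()
            .unwrap()
            .insert(sandbox_id.to_string(), session)
    }

    /// Looks up the session in this process only.
    pub fn lookup(&self, sandbox_id: &str) -> Option<LiveSession> {
        self.sessions.lock().unwrap().get(sandbox_id).cloned()
    }
}
```

Two instances of this type model two pods: a session registered on one is simply absent from the other, which is the failure described in the problem statement.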

Affected Components

| Component | Key Files | Role |
|---|---|---|
| Session registry | crates/openshell-server/src/supervisor_session.rs | In-memory session map; relay open/claim lifecycle |
| Persistence layer | crates/openshell-server/src/persistence/mod.rs | Where session ownership CRUD would be added |
| DB migrations | crates/openshell-server/migrations/ | New session ownership table |
| Gateway config | crates/openshell-server/src/config_file.rs | New gateway_internal_address field |
| Helm chart | deploy/helm/ | Inject pod IP via downward API |
| Proto | proto/openshell.proto | New internal ForwardRelay RPC |

Technical Investigation

Architecture Overview

The supervisor relay lifecycle (all in supervisor_session.rs):

  1. Supervisor sends SupervisorHello → gateway registers LiveSession { tx: mpsc::Sender<GatewayMessage>, session_id, ... } in sessions: Mutex<HashMap<String, LiveSession>> (line 77).
  2. Client calls open_relay_with_target → gateway calls wait_for_session(sandbox_id, timeout), looks up local sessions map, sends RelayOpen over the supervisor's tx channel.
  3. Supervisor dials back with RelayStream RPC → claim_relay resolves a oneshot::Sender<DuplexStream>, creating a tokio::io::duplex pair bridging the two sides.
  4. Bytes flow bidirectionally until either side closes.

All of this state — the tx sender, the oneshot resolvers, the in-flight channels — lives exclusively in the process that accepted the ConnectSupervisor stream. It cannot be serialized, migrated, or shared.
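
Steps 2 and 3 of the lifecycle can be sketched as a pending-relay map. This is an illustration under stated assumptions, not the gateway's implementation: std::sync::mpsc stands in for the tokio oneshot channel, SupervisorHalf stands in for the DuplexStream half, and the PendingRelays type and channel_id key are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical stand-in for the supervisor's half of the relay pipe.
#[derive(Debug, PartialEq)]
pub struct SupervisorHalf(pub String);

/// Relays that have been opened but not yet claimed, local to one process.
#[derive(Default)]
pub struct PendingRelays {
    pending: Mutex<HashMap<String, Sender<SupervisorHalf>>>,
}

impl PendingRelays {
    /// Step 2: the client side registers a pending relay and receives a
    /// handle that resolves once the supervisor dials back.
    pub fn open_relay(&self, channel_id: &str) -> Receiver<SupervisorHalf> {
        let (tx, rx) = channel();
        self.pending
            .lock()
            .unwrap()
            .insert(channel_id.to_string(), tx);
        rx
    }

    /// Step 3: the supervisor's dial-back claims the pending relay.
    /// Returns false when this process has no pending relay with that id.
    pub fn claim_relay(&self, channel_id: &str, half: SupervisorHalf) -> bool {
        let tx = self.pending.lock().unwrap().remove(channel_id);
        match tx {
            Some(tx) => tx.send(half).is_ok(),
            None => false,
        }
    }
}
```

A claim only succeeds in the process that opened the relay, because the sender lives in that process's map. A second PendingRelays instance, standing in for another pod, rejects the same claim.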

The reconciler lease (compute/lease.rs) is unrelated — it controls which pod runs reconcile/watch loops, not which pod serves relay requests. All pods accept gRPC connections regardless of lease ownership.

Code References

| Location | Description |
|---|---|
| crates/openshell-server/src/supervisor_session.rs:77 | sessions HashMap: pure in-memory, no persistence |
| crates/openshell-server/src/supervisor_session.rs:172 | wait_for_session: polls local map, returns Unavailable on timeout |
| crates/openshell-server/src/supervisor_session.rs:236 | open_relay_with_target: where forwarding logic would be added |
| crates/openshell-server/src/supervisor_session.rs:573 | handle_connect_supervisor: where session ownership DB write would be added |
| crates/openshell-server/src/supervisor_session.rs:614 | Session registration (line where LiveSession is inserted into map) |
| crates/openshell-server/src/supervisor_session.rs:679 | Session cleanup: where ownership DB delete would be added |
| crates/openshell-server/src/compute/mod.rs:238 | replica_id(): per-process gateway identity already exists |
| crates/openshell-server/tests/supervisor_relay_integration.rs | Relay wire-protocol tests (single-gateway only today) |

Current Behavior

wait_for_session polls the local in-memory map with exponential backoff up to 2 seconds. If the session is on another pod, it never appears in the local map and the call returns Status::unavailable("supervisor session not connected").
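
The polling behaviour can be sketched as follows. This is a simplified, blocking approximation: the real function is described as taking (sandbox_id, timeout) and presumably runs on tokio, whereas this version takes a lookup closure, and the initial_backoff and max_backoff parameters are assumptions introduced for illustration.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `lookup` until it yields a session or `timeout` elapses, doubling
/// the sleep between polls up to `max_backoff`.
pub fn wait_for_session<T>(
    mut lookup: impl FnMut() -> Option<T>,
    timeout: Duration,
    initial_backoff: Duration,
    max_backoff: Duration,
) -> Result<T, &'static str> {
    let deadline = Instant::now() + timeout;
    let mut backoff = initial_backoff;
    loop {
        if let Some(session) = lookup() {
            return Ok(session);
        }
        let now = Instant::now();
        if now >= deadline {
            // The session lives on another pod (or nowhere), so it never
            // shows up in the local map.
            return Err("supervisor session not connected");
        }
        // Never sleep past the deadline.
        sleep(backoff.min(deadline - now));
        backoff = (backoff * 2).min(max_backoff);
    }
}
```

When the session is held by another pod, the closure returns None on every poll and the caller pays the full timeout before seeing the error.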

What Would Need to Change

A. Session ownership registry (DB-backed)
New persistence table recording (sandbox_id, gateway_instance_id, gateway_internal_address, session_id, connected_at_ms). Written on handle_connect_supervisor success (line 614); deleted on session end (line 679) using remove_if_current semantics to handle supersede races. Requires a new DB migration for both SQLite and Postgres.
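
A minimal in-memory sketch of the ownership record and its supersede-safe delete is shown below. The field names come from the tuple above; the OwnershipStore type and its method names are hypothetical, and a real implementation would sit behind the persistence layer rather than a HashMap.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// One row of the proposed ownership table.
#[derive(Clone, Debug, PartialEq)]
pub struct SessionOwnership {
    pub sandbox_id: String,
    pub gateway_instance_id: String,
    pub gateway_internal_address: String,
    pub session_id: String,
    pub connected_at_ms: u64,
}

/// Hypothetical in-memory stand-in for the DB-backed table.
#[derive(Default)]
pub struct OwnershipStore {
    rows: Mutex<HashMap<String, SessionOwnership>>,
}

impl OwnershipStore {
    /// Upsert keyed by sandbox_id: a newer session supersedes the old row.
    pub fn record(&self, row: SessionOwnership) {
        self.rows
            .lock()
            .unwrap()
            .insert(row.sandbox_id.clone(), row);
    }

    pub fn lookup(&self, sandbox_id: &str) -> Option<SessionOwnership> {
        self.rows.lock().unwrap().get(sandbox_id).cloned()
    }

    /// Deletes the row only if it still belongs to `session_id`, so the late
    /// cleanup of a superseded session cannot erase its successor's record.
    pub fn remove_if_current(&self, sandbox_id: &str, session_id: &str) -> bool {
        let mut rows = self.rows.lock().unwrap();
        let is_current = rows
            .get(sandbox_id)
            .map_or(false, |row| row.session_id == session_id);
        if is_current {
            rows.remove(sandbox_id);
        }
        is_current
    }
}
```

In SQL the same check would likely collapse into a single conditional DELETE that matches on both sandbox_id and session_id, which keeps the compare and the delete in one statement.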

B. Forwarding path in open_relay_with_target (line 236)
After wait_for_session times out locally, query the ownership table. If another pod owns the session, make an internal gRPC call to that pod's ForwardRelay RPC, bridge the resulting stream, and return it to the caller. If no pod owns the session, return the original Unavailable.
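
The decision described here can be sketched as a pure function. The RelayRoute enum, the Owner struct, and choose_route are hypothetical names, and treating an ownership record that points back at the current pod as Unavailable is an assumption made to avoid a pod forwarding to itself.

```rust
/// Where a relay open should be served from.
#[derive(Debug, PartialEq)]
pub enum RelayRoute {
    /// The session is in this pod's in-memory map.
    Local,
    /// Another pod owns the session; dial its ForwardRelay RPC at `address`.
    Forward { address: String },
    /// No usable owner was found; return the original Unavailable.
    Unavailable,
}

/// The subset of an ownership record needed to route a request.
pub struct Owner<'a> {
    pub gateway_instance_id: &'a str,
    pub gateway_internal_address: &'a str,
}

pub fn choose_route(
    local_session_found: bool,
    owner: Option<Owner<'_>>,
    self_instance_id: &str,
) -> RelayRoute {
    if local_session_found {
        return RelayRoute::Local;
    }
    match owner {
        // A record naming this pod without a local session is stale, so it is
        // not forwarded (assumption: this avoids a self-forwarding loop).
        Some(o) if o.gateway_instance_id != self_instance_id => RelayRoute::Forward {
            address: o.gateway_internal_address.to_string(),
        },
        _ => RelayRoute::Unavailable,
    }
}
```

Keeping the choice separate from the gRPC call makes the fallback branch testable without a second gateway process.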

C. New internal ForwardRelay RPC
A new RPC on the gateway's gRPC service (or a separate internal-only service) that accepts a relay open request, looks up the local session, and streams relay frames back. Must be authenticated (pod-to-pod mTLS or a gateway service token) to prevent external abuse.
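
As a rough sketch of the checks such a handler might run before opening a relay, consider the function below. Everything in it is hypothetical: the ForwardRelayError type, the function name, and the idea that the peer identity arrives as a string (for example, taken from a verified client certificate) are assumptions rather than the gateway's existing API.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum ForwardRelayError {
    /// The caller presented no identity, or one that is not a known peer.
    Unauthenticated,
    /// This pod does not hold the session. The handler must not forward
    /// again, otherwise two pods with stale records could loop.
    SessionNotLocal,
}

/// Validates an internal ForwardRelay request before any relay is opened.
pub fn check_forward_relay(
    peer_identity: Option<&str>,
    trusted_peers: &HashSet<String>,
    local_sandboxes: &HashSet<String>,
    sandbox_id: &str,
) -> Result<(), ForwardRelayError> {
    match peer_identity {
        Some(peer) if trusted_peers.contains(peer) => {}
        _ => return Err(ForwardRelayError::Unauthenticated),
    }
    if local_sandboxes.contains(sandbox_id) {
        Ok(())
    } else {
        Err(ForwardRelayError::SessionNotLocal)
    }
}
```

Authentication is checked first so that an unauthenticated caller cannot probe which sandboxes a pod currently serves.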

D. Gateway internal address configuration
Each pod needs to know its own cluster-internal address so it can be written to the ownership table and dialed by peers. Options: Kubernetes downward API status.podIP injected as an env var, or a headless service providing per-pod DNS (pod-name.service.namespace.svc.cluster.local). New gateway_internal_address config field with Helm chart support.
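
One way the address could be resolved at startup is sketched below. The function name and the precedence rule (an explicit gateway_internal_address wins over the injected pod IP) are assumptions, as is the environment variable that would carry status.podIP.

```rust
/// Resolves the address that peers should dial for this pod.
/// `configured` is the gateway_internal_address config value, if set.
/// `pod_ip` is the pod IP injected by the downward API, if present.
pub fn internal_address(
    configured: Option<&str>,
    pod_ip: Option<&str>,
    port: u16,
) -> Option<String> {
    // An explicit, non-empty config value takes precedence (assumption).
    if let Some(addr) = configured.map(str::trim).filter(|a| !a.is_empty()) {
        return Some(addr.to_string());
    }
    let ip = pod_ip.map(str::trim).filter(|ip| !ip.is_empty())?;
    if ip.contains(':') {
        // IPv6 literals need brackets when combined with a port.
        Some(format!("[{}]:{}", ip, port))
    } else {
        Some(format!("{}:{}", ip, port))
    }
}
```

A caller could pass std::env::var("POD_IP").ok().as_deref() as the second argument, where POD_IP is a hypothetical variable name. Returning None when neither source is set lets the gateway disable forwarding instead of advertising an unusable address.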

Alternative Approaches Considered

  • Session-aware load balancing at the ingress layer: Use consistent hashing on sandbox_id at the Kubernetes Service or Envoy level to always route a given sandbox's requests to the pod holding its session. Avoids new distributed state in the gateway entirely. Tradeoff: requires infrastructure cooperation (custom Envoy filter or ingress annotation), does not handle pod-crash-and-reconnect gracefully (session moves to new pod, LB must invalidate its mapping), and puts routing logic outside the gateway's control.
  • Shared message broker (Redis/NATS): Supervisors connect to a broker instead of directly to a gateway pod; any gateway pod can publish to any supervisor. Eliminates session affinity entirely. Tradeoff: adds a new required infrastructure dependency and significantly changes the relay architecture.
  • Scale to replicaCount=1: Eliminates the problem at the cost of gateway HA. Appropriate for dev clusters and single-region deployments where gateway restart recovery time is acceptable.
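
For the session-aware load-balancing alternative, the routing idea can be illustrated with rendezvous (highest-random-weight) hashing, a close relative of consistent hashing that is swapped in here because it fits in a few lines. This is only an illustration of the mapping, not an Envoy or Kubernetes configuration, and the pod names are made up.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Scores a (sandbox, pod) pair. The pod with the highest score wins.
fn score(sandbox_id: &str, pod: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    sandbox_id.hash(&mut hasher);
    pod.hash(&mut hasher);
    hasher.finish()
}

/// Picks the pod that should serve every request for `sandbox_id`.
/// Returns None when there are no pods.
pub fn pick_pod<'a>(sandbox_id: &str, pods: &[&'a str]) -> Option<&'a str> {
    pods.iter()
        .copied()
        .max_by_key(|pod| score(sandbox_id, pod))
}
```

Removing a pod only remaps the sandboxes that were assigned to it. The mapping helps only if the supervisor's own session is routed by the same key, which is part of the infrastructure cooperation this alternative requires.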

Patterns to Follow

  • remove_if_current session cleanup semantics already exist in the session registry — the ownership table delete must follow the same pattern.
  • replica_id() in compute/mod.rs:238 is the right identity to use as gateway_instance_id in the ownership record.
  • Pod-to-pod auth should follow the same mTLS model used for supervisor-to-gateway auth.

Proposed Approach

Introduce a DB-backed session ownership table written on supervisor connect and cleared on disconnect. When a relay request arrives at a pod that doesn't hold the session locally, query the ownership table and forward the relay to the owning pod via an internal gRPC RPC. Pods discover each other via a new gateway_internal_address config field injected from the Kubernetes downward API. The forwarding path adds one RTT for cross-pod relays but is transparent to clients.

The session-aware LB alternative is worth evaluating as a lower-complexity option if the DB-backed approach introduces unacceptable operational complexity for operators using SQLite.

Scope Assessment

  • Complexity: High
  • Confidence: Medium — direction is clear, but distributed state correctness (supersede races, crash recovery, TTL expiry) has significant unknowns
  • Estimated files to change: ~10–14
  • Issue type: feat

Risks & Open Questions

  • Pod crash without clean teardown: If a pod crashes, its ownership records remain in the DB. Forwarding attempts to the dead pod will fail. Ownership records need a TTL or heartbeat mechanism, and the forwarding path needs a fallback for unreachable pods.
  • Supersede races: If a supervisor reconnects to pod B while pod A still holds a stale ownership record, pod B must atomically supersede pod A's record. remove_if_current semantics (matching on session_id) handle this — verify the DB operation is truly atomic.
  • Session-aware LB vs. gateway forwarding: Which approach to prioritize is a design decision that needs maintainer input. The LB approach is simpler to implement but harder to operate; the gateway forwarding approach is self-contained but adds distributed state.
  • SQLite multi-process write contention: With multiple gateway pods each writing session ownership records, SQLite WAL mode must be confirmed as sufficient or Postgres required for HA deployments. (Postgres is already required for replicaCount > 1 via server.externalDbSecret.)
  • Internal RPC auth model: Pod-to-pod forwarding RPCs must not be reachable from outside the cluster. Options: mTLS with a gateway-internal CA, network policy restriction, or a shared secret injected via Helm. Each has operational tradeoffs.
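  • Latency impact: Cross-pod forwarding adds one network RTT per relay open, and every byte on a forwarded relay then crosses an extra pod-to-pod hop. If the ownership lookup only runs after the local wait_for_session timeout, that wait (up to 2 seconds) is added as well. For latency-sensitive agent reasoning loops, this may be significant. Measure before committing to this approach.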
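
The crash-recovery risk in the first bullet can be sketched as a heartbeat check on the ownership record. Note that last_heartbeat_ms is a hypothetical column (the proposed table only lists connected_at_ms), and the TTL value is left to the caller because the issue does not specify one.

```rust
/// The subset of an ownership record needed to judge liveness.
/// `last_heartbeat_ms` is a hypothetical column refreshed by the owning pod.
pub struct OwnershipHeartbeat {
    pub gateway_internal_address: String,
    pub last_heartbeat_ms: u64,
}

/// A record is live if its owner refreshed it within the last `ttl_ms`.
pub fn is_live(record: &OwnershipHeartbeat, now_ms: u64, ttl_ms: u64) -> bool {
    // saturating_sub tolerates a heartbeat stamped slightly in the future
    // because of clock skew between pods.
    now_ms.saturating_sub(record.last_heartbeat_ms) <= ttl_ms
}

/// Returns the address to forward to, or None when there is no record or the
/// record is stale and the caller should fall back to Unavailable.
pub fn forward_target(
    record: Option<&OwnershipHeartbeat>,
    now_ms: u64,
    ttl_ms: u64,
) -> Option<&str> {
    record
        .filter(|r| is_live(r, now_ms, ttl_ms))
        .map(|r| r.gateway_internal_address.as_str())
}
```

A stale record is skipped rather than dialed, so a crashed pod costs one failed lookup instead of a connection timeout. The forwarding path still needs a fallback for pods that die between heartbeats.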

Test Considerations

  • Unit tests for the open_relay_with_target fallback branch taken after wait_for_session times out (ownership lookup + forward attempt)
  • Unit tests for ownership table CRUD with supersede race scenarios
  • Unit tests for handle_connect_supervisor ownership write and session-end cleanup
  • Integration test: two gateway instances, supervisor connects to instance A, client connects to instance B, relay succeeds via forwarding
  • E2e test (Kubernetes): replicaCount=2, repeated relay requests confirm no session-affinity failures (test:e2e-kubernetes)
  • Chaos test: kill the owning gateway pod mid-relay, verify supervisor reconnects and subsequent relay requests succeed on the new pod

Created by spike investigation. Use build-from-issue to plan and implement.

Metadata

Assignees

No one assigned

    Labels

    state:triage-needed (Opened without agent diagnostics and needs triage)
