Problem Statement
When the OpenShell gateway runs with multiple replicas, a sandbox supervisor establishes its relay session (a persistent bidirectional gRPC stream) to exactly one gateway pod. Client requests (relay opens, sandbox connect, loopback service HTTP) are load-balanced across all pods. Requests that land on a pod that does not hold the supervisor session fail with "supervisor session not connected" once the short local wait expires. For multi-agent systems where inter-component traffic flows continuously through the gateway, this makes HA multi-replica deployments operationally unusable.
Technical Context
The SupervisorSessionRegistry in the gateway server is a pure in-memory data structure — a Mutex<HashMap<String, LiveSession>> per pod process. There is no cross-pod session discovery, no shared state for session ownership, and no forwarding path. When a client request arrives at a pod that does not own the session, the pod waits up to SESSION_WAIT_MAX_BACKOFF (2 seconds) and then returns Unavailable. This is by design for simplicity, but it is incompatible with HA deployments where clients may reach any replica.
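To make the per-process limitation concrete, here is a minimal sketch of a registry with the same shape. The field set of LiveSession and the method names register and lookup are assumptions for illustration; the real type in supervisor_session.rs carries more state.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for the real LiveSession; only the session id is modelled here.
#[derive(Clone, Debug, PartialEq)]
pub struct LiveSession {
    pub session_id: String,
}

/// One registry per gateway process: nothing here is visible to other pods.
#[derive(Default)]
pub struct SupervisorSessionRegistry {
    sessions: Mutex<HashMap<String, LiveSession>>,
}

impl SupervisorSessionRegistry {
    /// Hypothetical helper: record the session for a sandbox on this process.
    pub fn register(&self, sandbox_id: &str, session: LiveSession) {
        self.sessions
            .lock()
            .unwrap()
            .insert(sandbox_id.to_string(), session);
    }

    /// Hypothetical helper: look the session up in this process only.
    pub fn lookup(&self, sandbox_id: &str) -> Option<LiveSession> {
        self.sessions.lock().unwrap().get(sandbox_id).cloned()
    }
}
```

Two registries created side by side behave like two pods: a session registered in one is simply absent from the other, which is the failure described above.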
Affected Components
| Component | Key Files | Role |
| --- | --- | --- |
| Session registry | crates/openshell-server/src/supervisor_session.rs | In-memory session map; relay open/claim lifecycle |
| Persistence layer | crates/openshell-server/src/persistence/mod.rs | Where session ownership CRUD would be added |
| DB migrations | crates/openshell-server/migrations/ | New session ownership table |
| Gateway config | crates/openshell-server/src/config_file.rs | New gateway_internal_address field |
| Helm chart | deploy/helm/ | Inject pod IP via downward API |
| Proto | proto/openshell.proto | New internal ForwardRelay RPC |
Technical Investigation
Architecture Overview
The supervisor relay lifecycle (all in supervisor_session.rs):
- Supervisor sends SupervisorHello → gateway registers LiveSession { tx: mpsc::Sender<GatewayMessage>, session_id, ... } in sessions: Mutex<HashMap<String, LiveSession>> (line 77).
- Client calls open_relay_with_target → gateway calls wait_for_session(sandbox_id, timeout), looks up local sessions map, sends RelayOpen over the supervisor's tx channel.
- Supervisor dials back with RelayStream RPC → claim_relay resolves a oneshot::Sender<DuplexStream>, creating a tokio::io::duplex pair bridging the two sides.
- Bytes flow bidirectionally until either side closes.
All of this state — the tx sender, the oneshot resolvers, the in-flight channels — lives exclusively in the process that accepted the ConnectSupervisor stream. It cannot be serialized, migrated, or shared.
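The lifecycle above can be sketched with standard-library channels. This is a simplified model, not the gateway's code: std::sync::mpsc stands in for the tokio mpsc, oneshot, and duplex primitives, bytes flow in one direction only, and the names Registry, hello, open_relay, and channel_id are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Message the gateway pushes to the supervisor (stand-in for GatewayMessage).
#[derive(Debug, PartialEq)]
pub enum GatewayMessage {
    RelayOpen { channel_id: u64 },
}

pub struct LiveSession {
    pub tx: Sender<GatewayMessage>,
}

#[derive(Default)]
pub struct Registry {
    sessions: Mutex<HashMap<String, LiveSession>>,
    // channel_id -> resolver waiting for the supervisor's dial-back
    pending: Mutex<HashMap<u64, Sender<Vec<u8>>>>,
    next_id: Mutex<u64>,
}

impl Registry {
    /// Step 1: SupervisorHello registers the session and returns the supervisor's inbox.
    pub fn hello(&self, sandbox_id: &str) -> Receiver<GatewayMessage> {
        let (tx, rx) = channel();
        self.sessions
            .lock()
            .unwrap()
            .insert(sandbox_id.to_string(), LiveSession { tx });
        rx
    }

    /// Step 2: a client opens a relay. Returns None if this process has no session.
    pub fn open_relay(&self, sandbox_id: &str) -> Option<(u64, Receiver<Vec<u8>>)> {
        let sessions = self.sessions.lock().unwrap();
        let session = sessions.get(sandbox_id)?;
        let channel_id = {
            let mut next = self.next_id.lock().unwrap();
            *next += 1;
            *next
        };
        let (resolver, relay_rx) = channel();
        // Ask the supervisor to dial back for this channel.
        session.tx.send(GatewayMessage::RelayOpen { channel_id }).ok()?;
        self.pending.lock().unwrap().insert(channel_id, resolver);
        Some((channel_id, relay_rx))
    }

    /// Step 3: the supervisor dials back and claims the pending relay exactly once.
    pub fn claim_relay(&self, channel_id: u64) -> Option<Sender<Vec<u8>>> {
        self.pending.lock().unwrap().remove(&channel_id)
    }
}
```

Every value held by Registry is a live channel endpoint owned by one process, which is why none of it can be written to a database and picked up by another pod.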
The reconciler lease (compute/lease.rs) is unrelated — it controls which pod runs reconcile/watch loops, not which pod serves relay requests. All pods accept gRPC connections regardless of lease ownership.
Code References
| Location | Description |
| --- | --- |
| crates/openshell-server/src/supervisor_session.rs:77 | sessions HashMap: pure in-memory, no persistence |
| crates/openshell-server/src/supervisor_session.rs:172 | wait_for_session: polls local map, returns Unavailable on timeout |
| crates/openshell-server/src/supervisor_session.rs:236 | open_relay_with_target: where forwarding logic would be added |
| crates/openshell-server/src/supervisor_session.rs:573 | handle_connect_supervisor: where session ownership DB write would be added |
| crates/openshell-server/src/supervisor_session.rs:614 | Session registration (line where LiveSession is inserted into map) |
| crates/openshell-server/src/supervisor_session.rs:679 | Session cleanup: where ownership DB delete would be added |
| crates/openshell-server/src/compute/mod.rs:238 | replica_id(): per-process gateway identity already exists |
| crates/openshell-server/tests/supervisor_relay_integration.rs | Relay wire-protocol tests (single-gateway only today) |
Current Behavior
wait_for_session polls the local in-memory map with exponential backoff up to 2 seconds. If the session is on another pod, it never appears in the local map and the call returns Status::unavailable("supervisor session not connected").
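A rough, synchronous sketch of that polling loop follows. The real function is async and its signature is not shown in this issue, so the parameters here are assumptions: the overall timeout and the backoff cap are passed separately, the map stores a plain session id string, and the 10 ms initial delay is made up.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll the local map until the session appears or `timeout` elapses.
/// The delay between polls doubles each round, capped at `max_backoff`.
pub fn wait_for_session(
    sessions: &Mutex<HashMap<String, String>>,
    sandbox_id: &str,
    timeout: Duration,
    max_backoff: Duration,
) -> Result<String, &'static str> {
    let deadline = Instant::now() + timeout;
    let mut backoff = Duration::from_millis(10); // assumed initial delay
    loop {
        if let Some(session_id) = sessions.lock().unwrap().get(sandbox_id) {
            return Ok(session_id.clone());
        }
        let now = Instant::now();
        if now >= deadline {
            // A session held by another pod never shows up here, so we end up on this path.
            return Err("supervisor session not connected");
        }
        // Never sleep past the deadline.
        sleep(backoff.min(deadline - now));
        backoff = (backoff * 2).min(max_backoff);
    }
}
```

The loop only ever consults the map it is handed, so waiting longer cannot help when the session lives on a different replica.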
What Would Need to Change
A. Session ownership registry (DB-backed)
New persistence table recording (sandbox_id, gateway_instance_id, gateway_internal_address, session_id, connected_at_ms). Written on handle_connect_supervisor success (line 614); deleted on session end (line 679) using remove_if_current semantics to handle supersede races. Requires a new DB migration for both SQLite and Postgres.
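As an illustration of the intended semantics, the sketch below models the table as an in-memory map keyed by sandbox_id. The record fields come from the tuple above; OwnershipTable, upsert, and owner are hypothetical names, and a real implementation would express the conditional delete as a single SQL statement rather than a read followed by a write.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct OwnershipRecord {
    pub sandbox_id: String,
    pub gateway_instance_id: String,
    pub gateway_internal_address: String,
    pub session_id: String,
    pub connected_at_ms: u64,
}

/// In-memory stand-in for the ownership table, one row per sandbox_id.
#[derive(Default)]
pub struct OwnershipTable {
    rows: HashMap<String, OwnershipRecord>,
}

impl OwnershipTable {
    /// On supervisor connect: the newest session replaces any previous owner.
    pub fn upsert(&mut self, record: OwnershipRecord) {
        self.rows.insert(record.sandbox_id.clone(), record);
    }

    /// On session end: delete only if the row still belongs to `session_id`.
    /// Returns true when a row was removed.
    pub fn remove_if_current(&mut self, sandbox_id: &str, session_id: &str) -> bool {
        let is_current = self
            .rows
            .get(sandbox_id)
            .map_or(false, |row| row.session_id == session_id);
        if is_current {
            self.rows.remove(sandbox_id);
        }
        is_current
    }

    pub fn owner(&self, sandbox_id: &str) -> Option<&OwnershipRecord> {
        self.rows.get(sandbox_id)
    }
}
```

If a supervisor reconnects to pod B before pod A finishes cleaning up, pod A's late remove_if_current call carries the old session_id and leaves pod B's row in place.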
B. Forwarding path in open_relay_with_target (line 236)
After wait_for_session times out locally, query the ownership table. If another pod owns the session, make an internal gRPC call to that pod's ForwardRelay RPC, bridge the resulting stream, and return it to the caller. If no pod owns the session, return the original Unavailable.
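The decision itself is small and can be sketched as a pure function. RelayRoute and route_relay are hypothetical names, and the ownership table is reduced to a map from sandbox_id to an (instance id, address) pair. One assumption is added here: a row that names the current pod while no local session exists is treated as stale.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum RelayRoute {
    /// The session is on this pod; use the existing local path.
    Local,
    /// Another pod owns the session; call its ForwardRelay RPC at `address`.
    Forward { address: String },
    /// Nobody is known to own the session; return the original Unavailable.
    Unavailable,
}

pub fn route_relay(
    local_has_session: bool,
    self_instance_id: &str,
    owners: &HashMap<String, (String, String)>, // sandbox_id -> (instance_id, address)
    sandbox_id: &str,
) -> RelayRoute {
    if local_has_session {
        return RelayRoute::Local;
    }
    match owners.get(sandbox_id) {
        // Only forward to a different pod; a row pointing at ourselves is stale.
        Some((instance_id, address)) if instance_id.as_str() != self_instance_id => {
            RelayRoute::Forward { address: address.clone() }
        }
        _ => RelayRoute::Unavailable,
    }
}
```

Keeping the routing decision separate from the gRPC plumbing would also make the fallback branch straightforward to unit test.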
C. New internal ForwardRelay RPC
A new RPC on the gateway's gRPC service (or a separate internal-only service) that accepts a relay open request, looks up the local session, and streams relay frames back. Must be authenticated (pod-to-pod mTLS or a gateway service token) to prevent external abuse.
D. Gateway internal address configuration
Each pod needs to know its own cluster-internal address so it can be written to the ownership table and dialed by peers. Options: Kubernetes downward API status.podIP injected as an env var, or a headless service providing per-pod DNS (pod-name.service.namespace.svc.cluster.local). New gateway_internal_address config field with Helm chart support.
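A small sketch of how the two options could be combined when computing the value to store. The function signature and the preference order (pod IP first, per-pod DNS as fallback) are assumptions, and the formatting assumes an IPv4 pod IP; an IPv6 address would need brackets.

```rust
/// Build the address peers should dial.
/// Prefers the pod IP when present, otherwise falls back to per-pod DNS.
pub fn gateway_internal_address(
    pod_ip: Option<&str>,
    pod_name: &str,
    service: &str,
    namespace: &str,
    port: u16,
) -> String {
    match pod_ip {
        Some(ip) if !ip.is_empty() => format!("{}:{}", ip, port),
        _ => format!(
            "{}.{}.{}.svc.cluster.local:{}",
            pod_name, service, namespace, port
        ),
    }
}
```

In a pod, the pod_ip argument would typically be read from whichever environment variable the Helm chart populates from status.podIP; the variable name is left open here.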
Alternative Approaches Considered
- Session-aware load balancing at the ingress layer: Use consistent hashing on sandbox_id at the Kubernetes Service or Envoy level to always route a given sandbox's requests to the pod holding its session. Avoids new distributed state in the gateway entirely. Tradeoff: requires infrastructure cooperation (custom Envoy filter or ingress annotation), does not handle pod-crash-and-reconnect gracefully (session moves to new pod, LB must invalidate its mapping), and puts routing logic outside the gateway's control.
- Shared message broker (Redis/NATS): Supervisors connect to a broker instead of directly to a gateway pod; any gateway pod can publish to any supervisor. Eliminates session affinity entirely. Tradeoff: adds a new required infrastructure dependency and significantly changes the relay architecture.
- Scale to replicaCount=1: Eliminates the problem at the cost of gateway HA. Appropriate for dev clusters and single-region deployments where gateway restart recovery time is acceptable.
Patterns to Follow
- remove_if_current session cleanup semantics already exist in the session registry; the ownership table delete must follow the same pattern.
- replica_id() in compute/mod.rs:238 is the right identity to use as gateway_instance_id in the ownership record.
- Pod-to-pod auth should follow the same mTLS model used for supervisor-to-gateway auth.
Proposed Approach
Introduce a DB-backed session ownership table written on supervisor connect and cleared on disconnect. When a relay request arrives at a pod that doesn't hold the session locally, query the ownership table and forward the relay to the owning pod via an internal gRPC RPC. Pods discover each other via a new gateway_internal_address config field injected from the Kubernetes downward API. The forwarding path adds one RTT for cross-pod relays but is transparent to clients.
The session-aware LB alternative is worth evaluating as a lower-complexity option if the DB-backed approach introduces unacceptable operational complexity for operators using SQLite.
Scope Assessment
- Complexity: High
- Confidence: Medium — direction is clear, but distributed state correctness (supersede races, crash recovery, TTL expiry) has significant unknowns
- Estimated files to change: ~10–14
- Issue type: feat
Risks & Open Questions
- Pod crash without clean teardown: If a pod crashes, its ownership records remain in the DB. Forwarding attempts to the dead pod will fail. Ownership records need a TTL or heartbeat mechanism, and the forwarding path needs a fallback for unreachable pods.
- Supersede races: If a supervisor reconnects to pod B while pod A still holds a stale ownership record, pod B must atomically supersede pod A's record. remove_if_current semantics (matching on session_id) handle this; verify the DB operation is truly atomic.
- Session-aware LB vs. gateway forwarding: Which approach to prioritize is a design decision that needs maintainer input. The LB approach is simpler to implement but harder to operate; the gateway forwarding approach is self-contained but adds distributed state.
- SQLite multi-process write contention: With multiple gateway pods each writing session ownership records, SQLite WAL mode must be confirmed as sufficient or Postgres required for HA deployments. (Postgres is already required for replicaCount > 1 via server.externalDbSecret.)
- Internal RPC auth model: Pod-to-pod forwarding RPCs must not be reachable from outside the cluster. Options: mTLS with a gateway-internal CA, network policy restriction, or a shared secret injected via Helm. Each has operational tradeoffs.
- Latency impact: Cross-pod forwarding adds one network RTT per relay open. For latency-sensitive agent reasoning loops, this may be significant. Measure before committing to this approach.
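For the crash-without-teardown risk, one possible shape of a heartbeat check is sketched below. The last-heartbeat timestamp is a hypothetical extra column that is not part of the record proposed earlier, and is_stale and forward_target are illustrative names only.

```rust
/// A row is stale when its last heartbeat is older than the TTL.
pub fn is_stale(last_heartbeat_ms: u64, now_ms: u64, ttl_ms: u64) -> bool {
    // saturating_sub guards against clock skew producing a "future" heartbeat.
    now_ms.saturating_sub(last_heartbeat_ms) > ttl_ms
}

/// Return the owner's address only if its row is still fresh.
/// `owner` is (address, last_heartbeat_ms) as read from the ownership table.
pub fn forward_target<'a>(
    owner: Option<(&'a str, u64)>,
    now_ms: u64,
    ttl_ms: u64,
) -> Option<&'a str> {
    owner
        .filter(|&(_, heartbeat_ms)| !is_stale(heartbeat_ms, now_ms, ttl_ms))
        .map(|(address, _)| address)
}
```

A stale row would then be handled like a missing one, so the caller falls through to the original Unavailable instead of dialing a dead pod.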
Test Considerations
- Unit tests for wait_for_session fallback branch (ownership lookup + forward attempt)
- Unit tests for ownership table CRUD with supersede race scenarios
- Unit tests for handle_connect_supervisor ownership write and session-end cleanup
- Integration test: two gateway instances, supervisor connects to instance A, client connects to instance B, relay succeeds via forwarding
- E2e test (Kubernetes): replicaCount=2, repeated relay requests confirm no session-affinity failures (test:e2e-kubernetes)
- Chaos test: kill the owning gateway pod mid-relay, verify supervisor reconnects and subsequent relay requests succeed on the new pod
Created by spike investigation. Use build-from-issue to plan and implement.