feat: gateway inter-pod session forwarding for HA multi-replica deployments

## Problem Statement

When the OpenShell gateway runs with multiple replicas, a sandbox supervisor establishes its relay session (a persistent bidirectional gRPC stream) to exactly one gateway pod. Client requests — relay opens, `sandbox connect`, loopback service HTTP — are load-balanced across all pods. Requests that land on a pod that does not hold the supervisor session fail immediately with `supervisor session not connected`. For multi-agent systems where inter-component traffic flows continuously through the gateway, this makes HA multi-replica deployments operationally unusable.

## Technical Context

The `SupervisorSessionRegistry` in the gateway server is a pure in-memory data structure — a `Mutex<HashMap<String, LiveSession>>` per pod process. There is no cross-pod session discovery, no shared state for session ownership, and no forwarding path. When a client request arrives at a pod that does not own the session, the pod waits up to `SESSION_WAIT_MAX_BACKOFF` (2 seconds) and then returns `Unavailable`. This is by design for simplicity, but it is incompatible with HA deployments where clients may reach any replica.

## Affected Components

| Component | Key Files | Role |
|-----------|-----------|------|
| Session registry | `crates/openshell-server/src/supervisor_session.rs` | In-memory session map; relay open/claim lifecycle |
| Persistence layer | `crates/openshell-server/src/persistence/mod.rs` | Where session ownership CRUD would be added |
| DB migrations | `crates/openshell-server/migrations/` | New session ownership table |
| Gateway config | `crates/openshell-server/src/config_file.rs` | New `gateway_internal_address` field |
| Helm chart | `deploy/helm/` | Inject pod IP via downward API |
| Proto | `proto/openshell.proto` | New internal `ForwardRelay` RPC |

## Technical Investigation

### Architecture Overview

The supervisor relay lifecycle (all in `supervisor_session.rs`):

1. Supervisor sends `SupervisorHello` → gateway registers `LiveSession { tx: mpsc::Sender<GatewayMessage>, session_id, ... }` in `sessions: Mutex<HashMap<String, LiveSession>>` (line 77).
2. Client calls `open_relay_with_target` → gateway calls `wait_for_session(sandbox_id, timeout)`, looks up local `sessions` map, sends `RelayOpen` over the supervisor's `tx` channel.
3. Supervisor dials back with `RelayStream` RPC → `claim_relay` resolves a `oneshot::Sender<DuplexStream>`, creating a `tokio::io::duplex` pair bridging the two sides.
4. Bytes flow bidirectionally until either side closes.

All of this state — the `tx` sender, the oneshot resolvers, the in-flight channels — lives exclusively in the process that accepted the `ConnectSupervisor` stream. It cannot be serialized, migrated, or shared.

The reconciler lease (`compute/lease.rs`) is unrelated — it controls which pod runs reconcile/watch loops, not which pod serves relay requests. All pods accept gRPC connections regardless of lease ownership.

### Code References

| Location | Description |
|----------|-------------|
| `crates/openshell-server/src/supervisor_session.rs:77` | `sessions` HashMap — pure in-memory, no persistence |
| `crates/openshell-server/src/supervisor_session.rs:172` | `wait_for_session` — polls local map, returns Unavailable on timeout |
| `crates/openshell-server/src/supervisor_session.rs:236` | `open_relay_with_target` — where forwarding logic would be added |
| `crates/openshell-server/src/supervisor_session.rs:573` | `handle_connect_supervisor` — where session ownership DB write would be added |
| `crates/openshell-server/src/supervisor_session.rs:614` | Session registration (line where `LiveSession` is inserted into map) |
| `crates/openshell-server/src/supervisor_session.rs:679` | Session cleanup — where ownership DB delete would be added |
| `crates/openshell-server/src/compute/mod.rs:238` | `replica_id()` — per-process gateway identity already exists |
| `crates/openshell-server/tests/supervisor_relay_integration.rs` | Relay wire-protocol tests (single-gateway only today) |

### Current Behavior

`wait_for_session` polls the local in-memory map with exponential backoff up to 2 seconds. If the session is on another pod, it never appears in the local map and the call returns `Status::unavailable("supervisor session not connected")`.
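
The failure mode can be reproduced with a minimal, std-only sketch of a per-process registry. This is not the gateway's actual code: the real implementation is async, and the `Registry` type, the `register` helper, the 5 ms initial backoff, and the split between `timeout` and `max_backoff` are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real `LiveSession`; only the ID is modeled.
#[derive(Clone, Debug, PartialEq)]
pub struct LiveSession {
    pub session_id: String,
}

/// Simplified per-process registry: sandbox_id -> live session.
#[derive(Default)]
pub struct Registry {
    sessions: Mutex<HashMap<String, LiveSession>>,
}

impl Registry {
    /// Records a session in this process's map only.
    pub fn register(&self, sandbox_id: &str, session_id: &str) {
        let session = LiveSession { session_id: session_id.to_string() };
        self.sessions.lock().unwrap().insert(sandbox_id.to_string(), session);
    }

    /// Polls the local map until the session appears or `timeout` elapses.
    /// The sleep between polls doubles each round, capped at `max_backoff`.
    pub fn wait_for_session(
        &self,
        sandbox_id: &str,
        timeout: Duration,
        max_backoff: Duration,
    ) -> Result<LiveSession, &'static str> {
        let deadline = Instant::now() + timeout;
        let mut backoff = Duration::from_millis(5); // assumed starting interval
        loop {
            if let Some(s) = self.sessions.lock().unwrap().get(sandbox_id) {
                return Ok(s.clone());
            }
            let now = Instant::now();
            if now >= deadline {
                return Err("supervisor session not connected");
            }
            // Never sleep past the deadline.
            thread::sleep(backoff.min(deadline - now));
            backoff = (backoff * 2).min(max_backoff);
        }
    }
}
```

With two `Registry` values standing in for two pods, a session registered on one is never visible to the other, so the second pod's wait always ends in the error regardless of how long it polls.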

### What Would Need to Change

**A. Session ownership registry (DB-backed)**
New persistence table recording `(sandbox_id, gateway_instance_id, gateway_internal_address, session_id, connected_at_ms)`. Written on `handle_connect_supervisor` success (line 614); deleted on session end (line 679) using `remove_if_current` semantics to handle supersede races. Requires a new DB migration for both SQLite and Postgres.
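
To make the supersede handling concrete, here is a small in-memory model of the ownership table. It is a sketch only: `OwnershipStore`, `record_owner`, and `lookup` are hypothetical names, and a real implementation would go through the persistence layer, where the conditional delete would presumably become a single `DELETE ... WHERE sandbox_id = ? AND session_id = ?` statement.

```rust
use std::collections::HashMap;

/// One row of the proposed ownership table (columns as listed above).
#[derive(Clone, Debug, PartialEq)]
pub struct OwnershipRecord {
    pub sandbox_id: String,
    pub gateway_instance_id: String,
    pub gateway_internal_address: String,
    pub session_id: String,
    pub connected_at_ms: u64,
}

/// In-memory stand-in for the DB table, keyed by sandbox_id.
#[derive(Default)]
pub struct OwnershipStore {
    rows: HashMap<String, OwnershipRecord>,
}

impl OwnershipStore {
    /// Upsert: a newer session for the same sandbox replaces the old row.
    pub fn record_owner(&mut self, rec: OwnershipRecord) {
        self.rows.insert(rec.sandbox_id.clone(), rec);
    }

    pub fn lookup(&self, sandbox_id: &str) -> Option<&OwnershipRecord> {
        self.rows.get(sandbox_id)
    }

    /// Deletes the row only if it still belongs to `session_id`.
    /// Returns true when a row was removed.
    pub fn remove_if_current(&mut self, sandbox_id: &str, session_id: &str) -> bool {
        let is_current = self
            .rows
            .get(sandbox_id)
            .map_or(false, |r| r.session_id == session_id);
        if is_current {
            self.rows.remove(sandbox_id);
        }
        is_current
    }
}
```

The property that matters is that a late cleanup from the old pod is a no-op once a newer session has overwritten the row, so the new owner's record survives.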

**B. Forwarding path in `open_relay_with_target` (line 236)**
After `wait_for_session` times out locally, query the ownership table. If another pod owns the session, make an internal gRPC call to that pod's `ForwardRelay` RPC, bridge the resulting stream, and return it to the caller. If no pod owns the session, return the original `Unavailable`.
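
The decision itself is small enough to express as a pure function, which also makes it easy to unit test. The sketch below is an assumption about how the branch could be shaped: `RelayRoute` and `route_relay` are hypothetical names, and the owner is passed as a plain `(gateway_instance_id, gateway_internal_address)` pair rather than a real DB row.

```rust
/// Where a relay-open request should be served (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum RelayRoute {
    /// This pod holds the supervisor session.
    Local,
    /// Another pod holds it; dial its internal address.
    Forward { peer_address: String },
    /// Nobody is known to hold it.
    Unavailable,
}

/// `owner` is the (gateway_instance_id, gateway_internal_address) pair read
/// from the ownership table, or `None` when no row exists.
pub fn route_relay(
    has_local_session: bool,
    self_instance_id: &str,
    owner: Option<(&str, &str)>,
) -> RelayRoute {
    if has_local_session {
        return RelayRoute::Local;
    }
    match owner {
        // A row naming a different pod: forward to that pod.
        Some((instance_id, address)) if instance_id != self_instance_id => {
            RelayRoute::Forward { peer_address: address.to_string() }
        }
        // No row, or a stale row that names this pod without a live session.
        _ => RelayRoute::Unavailable,
    }
}
```

One case worth noting is a row that names the current pod while the local map has no session. Treating that as `Unavailable` avoids a pod forwarding to itself in a loop.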

**C. New internal `ForwardRelay` RPC**
A new RPC on the gateway's gRPC service (or a separate internal-only service) that accepts a relay open request, looks up the local session, and streams relay frames back. Must be authenticated (pod-to-pod mTLS or a gateway service token) to prevent external abuse.

**D. Gateway internal address configuration**
Each pod needs to know its own cluster-internal address so it can be written to the ownership table and dialed by peers. Options: Kubernetes downward API `status.podIP` injected as an env var, or a headless service providing per-pod DNS (`pod-name.service.namespace.svc.cluster.local`), which requires a StatefulSet or an explicit `hostname`/`subdomain` on each pod. New `gateway_internal_address` config field with Helm chart support.
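
For the downward API option, the gateway would need to turn the injected pod IP into a dialable address at startup. A minimal sketch, assuming the IP arrives as a string and the gateway already knows its own listen port; the function name, the port value, and any env var name used to feed it are hypothetical.

```rust
use std::net::{IpAddr, SocketAddr};

/// Builds the address this pod advertises to its peers.
/// `pod_ip` is the raw value injected from `status.podIP`, if any.
pub fn internal_address(pod_ip: Option<&str>, port: u16) -> Result<SocketAddr, String> {
    let raw = pod_ip.ok_or_else(|| "pod IP not set".to_string())?;
    let ip: IpAddr = raw
        .trim()
        .parse()
        .map_err(|e| format!("invalid pod IP {raw:?}: {e}"))?;
    // A loopback or unspecified address would be useless to other pods.
    if ip.is_loopback() || ip.is_unspecified() {
        return Err(format!("pod IP {ip} is not dialable by peers"));
    }
    Ok(SocketAddr::new(ip, port))
}
```

A caller could pass `std::env::var("POD_IP").ok().as_deref()` as the first argument, with `POD_IP` standing in for whatever variable name the Helm chart ends up using. Rejecting loopback and unspecified addresses early keeps an unusable value out of the ownership table.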

### Alternative Approaches Considered

- **Session-aware load balancing at the ingress layer**: Use consistent hashing on `sandbox_id` at the ingress or Envoy level to always route a given sandbox's requests to the pod holding its session. (A plain Kubernetes Service offers only client-IP session affinity, so it cannot hash on `sandbox_id`.) Avoids new distributed state in the gateway entirely. Tradeoff: requires infrastructure cooperation (custom Envoy filter or ingress annotation), does not handle pod-crash-and-reconnect gracefully (session moves to new pod, LB must invalidate its mapping), and puts routing logic outside the gateway's control.
- **Shared message broker (Redis/NATS)**: Supervisors connect to a broker instead of directly to a gateway pod; any gateway pod can publish to any supervisor. Eliminates session affinity entirely. Tradeoff: adds a new required infrastructure dependency and significantly changes the relay architecture.
- **Scale to replicaCount=1**: Eliminates the problem at the cost of gateway HA. Appropriate for dev clusters and single-region deployments where gateway restart recovery time is acceptable.
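
To illustrate what the load-balancing alternative relies on, the sketch below picks a pod for a sandbox using rendezvous (highest-random-weight) hashing, one simple member of the consistent-hashing family. It is not Envoy's algorithm or any existing OpenShell code; `pick_pod` and the pod names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Combined hash of (sandbox_id, pod); higher score wins.
fn score(sandbox_id: &str, pod: &str) -> u64 {
    let mut h = DefaultHasher::new();
    sandbox_id.hash(&mut h);
    pod.hash(&mut h);
    h.finish()
}

/// Picks the pod with the highest score for this sandbox.
/// Returns `None` when the pod list is empty.
pub fn pick_pod<'a>(sandbox_id: &str, pods: &[&'a str]) -> Option<&'a str> {
    pods.iter().copied().max_by_key(|pod| score(sandbox_id, pod))
}
```

Removing a pod that was not chosen leaves the mapping unchanged, while removing the chosen pod moves the sandbox elsewhere. The scheme only helps if the supervisor's own connection is routed by the same function, which is part of the infrastructure cooperation noted in the tradeoff.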

### Patterns to Follow

- `remove_if_current` session cleanup semantics already exist in the session registry — the ownership table delete must follow the same pattern.
- `replica_id()` in `compute/mod.rs:238` is the right identity to use as `gateway_instance_id` in the ownership record.
- Pod-to-pod auth should follow the same mTLS model used for supervisor-to-gateway auth.

## Proposed Approach

Introduce a DB-backed session ownership table written on supervisor connect and cleared on disconnect. When a relay request arrives at a pod that doesn't hold the session locally, query the ownership table and forward the relay to the owning pod via an internal gRPC RPC. Each pod advertises its own address through a new `gateway_internal_address` config field injected from the Kubernetes downward API, and peers read that address from the ownership record. The forwarding path adds an extra network hop for cross-pod relays but is transparent to clients.

The session-aware LB alternative is worth evaluating as a lower-complexity option if the DB-backed approach introduces unacceptable operational complexity for operators using SQLite.

## Scope Assessment

- **Complexity:** High
- **Confidence:** Medium — direction is clear, but distributed state correctness (supersede races, crash recovery, TTL expiry) has significant unknowns
- **Estimated files to change:** ~10–14
- **Issue type:** `feat`

## Risks & Open Questions

- **Pod crash without clean teardown**: If a pod crashes, its ownership records remain in the DB. Forwarding attempts to the dead pod will fail. Ownership records need a TTL or heartbeat mechanism, and the forwarding path needs a fallback for unreachable pods.
- **Supersede races**: If a supervisor reconnects to pod B while pod A still holds a stale ownership record, pod B must atomically supersede pod A's record. `remove_if_current` semantics (matching on `session_id`) handle this — verify the DB operation is truly atomic.
- **Session-aware LB vs. gateway forwarding**: Which approach to prioritize is a design decision that needs maintainer input. The LB approach is simpler to implement but harder to operate; the gateway forwarding approach is self-contained but adds distributed state.
- **SQLite multi-process write contention**: Multiple gateway pods would each write session ownership records. Postgres is already required for `replicaCount > 1` (via `server.externalDbSecret`), so multi-replica deployments should never write these records to SQLite. Confirm that this holds, and that the SQLite migration is only needed to keep single-replica deployments on the same schema.
- **Internal RPC auth model**: Pod-to-pod forwarding RPCs must not be reachable from outside the cluster. Options: mTLS with a gateway-internal CA, network policy restriction, or a shared secret injected via Helm. Each has operational tradeoffs.
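- **Latency impact**: Cross-pod forwarding adds at least one network RTT per relay open, and relayed bytes traverse an extra pod hop for the lifetime of the stream. If the ownership lookup only runs after the local `wait_for_session` timeout, each cross-pod relay open also pays that wait (up to 2 seconds). For latency-sensitive agent reasoning loops, this may be significant. Measure before committing to this approach.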
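
For the crash-without-teardown risk, one possible shape of the TTL check is sketched below. It assumes a `last_heartbeat_ms` column that is not part of the proposed table today, and `OwnerLiveness`, `is_stale`, and `forward_target` are hypothetical names.

```rust
/// Hypothetical liveness view of an ownership row.
pub struct OwnerLiveness<'a> {
    pub gateway_internal_address: &'a str,
    /// Assumed extra column, refreshed periodically by the owning pod.
    pub last_heartbeat_ms: u64,
}

/// A row is stale once its heartbeat is older than `ttl_ms`.
/// `saturating_sub` keeps a heartbeat slightly in the future (clock skew)
/// from underflowing; such a row is treated as fresh.
pub fn is_stale(last_heartbeat_ms: u64, now_ms: u64, ttl_ms: u64) -> bool {
    now_ms.saturating_sub(last_heartbeat_ms) > ttl_ms
}

/// Returns the address to dial, or `None` if the owner looks dead and the
/// request should fall back to the original `Unavailable` response.
pub fn forward_target<'a>(owner: &OwnerLiveness<'a>, now_ms: u64, ttl_ms: u64) -> Option<&'a str> {
    if is_stale(owner.last_heartbeat_ms, now_ms, ttl_ms) {
        None
    } else {
        Some(owner.gateway_internal_address)
    }
}
```

A check like this only bounds how long a dead owner keeps attracting forwards. The forwarding path would still need to handle a dial failure inside the TTL window.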

## Test Considerations

- Unit tests for the `open_relay_with_target` fallback branch taken after `wait_for_session` times out (ownership lookup + forward attempt)
- Unit tests for ownership table CRUD with supersede race scenarios
- Unit tests for `handle_connect_supervisor` ownership write and session-end cleanup
- Integration test: two gateway instances, supervisor connects to instance A, client connects to instance B, relay succeeds via forwarding
- E2e test (Kubernetes): `replicaCount=2`, repeated relay requests confirm no session-affinity failures (`test:e2e-kubernetes`)
- Chaos test: kill the owning gateway pod mid-relay, verify supervisor reconnects and subsequent relay requests succeed on the new pod

---
*Created by spike investigation. Use `build-from-issue` to plan and implement.*
