fix(server): prevent exec relays from hanging on idle connections #1992
Open
Gal-Zaidman wants to merge 1 commit into
Conversation
Add HTTP/2 keepalive on supervisor multiplex connections so half-dead sessions cannot leave in-flight exec relays parked indefinitely.
Configure SSH keepalive on exec relay clients so long silent commands are not timed out on stdout idle alone; wedged or orphaned relays fail after missed keepalives instead.
After a command reports exit status, bound how long the gateway waits for the trailing channel close.
Return UNAVAILABLE when a relay closes before reporting exit status rather than defaulting to exit code 1.
Signed-off-by: Gal Zaidman <gzaidman@nvidia.com>
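The bounded post-exit wait and the UNAVAILABLE mapping described in the commit message can be illustrated with a minimal, std-only sketch. It is not the gateway's code: `RelayEvent`, `ExecOutcome`, `run_exec`, and `close_grace` are hypothetical names, and an `mpsc` channel stands in for the SSH relay channel.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical relay events; the real gateway types are not shown here.
#[derive(Debug)]
enum RelayEvent {
    ExitStatus(u32),
    Closed,
}

/// Hypothetical outcome type standing in for the RPC result.
#[derive(Debug, PartialEq)]
enum ExecOutcome {
    Exited(u32),
    Unavailable(&'static str),
}

/// Drain relay events until the exec is finished.
/// Before an exit status arrives we block without an output-idle timeout
/// (liveness is assumed to be handled by keepalives elsewhere). Once the
/// exit status is known, we wait at most `close_grace` for the trailing close.
fn run_exec(events: mpsc::Receiver<RelayEvent>, close_grace: Duration) -> ExecOutcome {
    let mut exit: Option<u32> = None;
    loop {
        let next = match exit {
            None => events.recv().map_err(|_| RecvTimeoutError::Disconnected),
            Some(_) => events.recv_timeout(close_grace),
        };
        match (next, exit) {
            // Remember the exit status and keep waiting for the close.
            (Ok(RelayEvent::ExitStatus(code)), _) => exit = Some(code),
            // Close, disconnect, or grace timeout after an exit status: done.
            (_, Some(code)) => return ExecOutcome::Exited(code),
            // Relay ended with no exit status: report unavailability
            // instead of inventing exit code 1.
            (_, None) => return ExecOutcome::Unavailable("relay closed before exit status"),
        }
    }
}
```

In this sketch a relay that reports exit status 0 but never closes still returns `Exited(0)` once `close_grace` elapses, while a relay that closes without any exit status yields `Unavailable`.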
All contributors have signed the DCO ✍️ ✅
Author
I have read the DCO document and I hereby sign the DCO.
Author
recheck
Collaborator
/ok to test b4878be
Collaborator
@Gal-Zaidman have you been able to verify this resolves the issue in your environment?
Author
Yes, I ran a job with 80 concurrent agents, each running an SWE-bench task with long execs (that is how harbor works): zero hangs.
Summary
Gateway `ExecSandbox` calls could hang indefinitely after the command finished when a supervisor session reset mid-exec orphaned the relay channel. The exec loop blocked on `channel.wait()` with no liveness backstop, so callers hung until their own deadline. This adds SSH and HTTP/2 keepalives and bounds the post-exit wait so a wedged/orphaned relay fails fast instead of hanging.
Related Issue
Closes #1990
Changes
- … `channel.wait()` forever. Channel-silent execs (e.g. an agent that redirects stdout to a file) stay alive while the relay is healthy: liveness is probed via keepalive, not output-idle.
- Return `UNAVAILABLE` when a relay closes before reporting an exit status, instead of a misleading exit code 1.
- Add HTTP/2 keepalive (… `Timer`) on supervisor multiplex connections, to reduce the session resets that orphan relays.
- … `architecture/gateway.md`.
Testing
- `mise run pre-commit`: `rust:format:check`, `cargo clippy -D warnings`, and markdownlint are clean for this change (ran individually). Note: the local `mise run pre-commit` aborts on its `python:proto` step due to a missing `grpc_tools` dev dependency in the venv, unrelated to this change; CI runs the full suite.
- Manually validated on a Kubernetes deployment: rebuilt and deployed the gateway image, confirmed the gateway is healthy, that long channel-silent execs are not killed by the keepalive, and that the previously-observed multi-sandbox hang no longer reproduces.
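The property checked manually here (a silent exec survives, a wedged relay does not) can be sketched as a small missed-keepalive counter. This is an illustration under assumptions, not the gateway's implementation: the `Keepalive` type, its `interval` and `max_missed` fields, and the values used are hypothetical, and the PR text does not state the actual settings.

```rust
use std::time::{Duration, Instant};

/// Hypothetical liveness tracker: counts consecutive unanswered keepalive
/// probes. Command output is never consulted, so a silent exec stays alive.
struct Keepalive {
    interval: Duration,
    max_missed: u32,
    missed: u32,
    last_probe: Instant,
}

#[derive(Debug, PartialEq)]
enum Liveness {
    Healthy,
    Dead,
}

impl Keepalive {
    fn new(interval: Duration, max_missed: u32, now: Instant) -> Self {
        Keepalive { interval, max_missed, missed: 0, last_probe: now }
    }

    /// A keepalive reply arrived: the peer is alive even with no output.
    fn on_reply(&mut self) {
        self.missed = 0;
    }

    /// Called periodically. Each elapsed interval counts the outstanding
    /// probe as missed (a reply resets the count) and starts a new probe.
    fn tick(&mut self, now: Instant) -> Liveness {
        if now.duration_since(self.last_probe) >= self.interval {
            self.missed += 1;
            self.last_probe = now;
        }
        if self.missed >= self.max_missed {
            Liveness::Dead
        } else {
            Liveness::Healthy
        }
    }
}
```

With these assumed semantics, a relay that answers every probe is reported healthy however long the command stays quiet, and a relay that stops answering is declared dead after `max_missed` intervals.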
Checklist