Add external drain sentinel for graceful Listener shutdown #4461
Draft
leshikus wants to merge 1 commit into
Introduces a sentinel-file check (`<runner-root>/.drain`) at the top of the message-queue loop, decoupled from `RunnerShutdownToken`. When the file is present, the Listener exits after the current iteration completes -- in contrast with SIGTERM/SIGINT, which cancel the in-flight `GetAgentMessageAsync` HTTP call and any pending `/acknowledge` or `/acquirejob` calls via the linked `messageQueueLoopTokenSource`.

This addresses a race that affects external supervisors (autoscalers, custom AMIs, ephemeral-runner orchestrators) that need to terminate an idle runner: between deciding the runner is idle and killing it, the broker can dispatch a job onto the open long-poll. SIGTERM mid-dispatch leaves the server with a committed but unservable job.

The sentinel lets the supervisor say "stop after the current iteration completes" without forcing mid-HTTP cancellation, eliminating the ack-sent-no-Worker window.

No behavior change for runners whose supervisor does not create the file.
Problem
External supervisors that manage GitHub Actions runners — autoscalers terminating idle hosts, ephemeral-runner controllers, CI scripts that recycle hosts — need a way to stop an idle Listener without losing jobs the broker is in the middle of dispatching to it.
The Listener today exposes only two stop primitives, and both can leave the broker with a dispatched-but-never-served job. The supervisor observes a clean shutdown; the user observes a job that took 10-15 minutes longer than expected (or, in pathological cases, was abandoned outright). This PR adds a third primitive — an opt-in sentinel file — that closes the race.
How the race happens
`config.sh remove` cannot see in-flight dispatches

`config.sh remove` (`DELETE /runners/{id}`) returns "is currently running a job and cannot be deleted" only after the runner has acquired a job via `POST /acquirejob`. The server-side `busy` flag flips on commitment, not on dispatch. So this sequence is possible:

The supervisor sees an idle runner and a clean unregister. The broker sees a dispatched job that times out. Wall-clock delay until another runner picks it up is 10-15 minutes.
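The ordering problem can be illustrated with a small Python model of the server-side state. This is a sketch under stated assumptions, not the service's implementation: `BrokerModel`, `simulate`, and the three-step order (dispatch, acknowledge, acquire) are hypothetical names chosen for illustration. The only behavior taken from the description above is that removal is refused once `busy` is set, and that `busy` is set on acquire rather than on dispatch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BrokerModel:
    """Toy model of the broker's view of one runner (hypothetical)."""

    dispatched: bool = False  # a job message was handed to the open long-poll
    busy: bool = False        # flips only when the runner acquires the job
    registered: bool = True

    def dispatch(self) -> None:
        self.dispatched = True

    def acknowledge(self) -> None:
        # Delivery is recorded, but the busy flag is still unchanged.
        pass

    def acquire_job(self) -> None:
        self.busy = True

    def remove_runner(self) -> bool:
        """Model of the removal request: refused only when busy is set."""
        if self.busy:
            return False
        self.registered = False
        return True


STEPS = ("dispatch", "acknowledge", "acquire_job")


def simulate(steps_before_remove: int) -> tuple[bool, bool]:
    """Run the first N dispatch steps, then attempt removal.

    Returns (removed, stranded), where stranded means the runner was
    removed although a job had already been dispatched to it.
    """
    broker = BrokerModel()
    for name in STEPS[:steps_before_remove]:
        getattr(broker, name)()
    removed = broker.remove_runner()
    stranded = removed and broker.dispatched
    return removed, stranded


if __name__ == "__main__":
    for n in range(len(STEPS) + 1):
        print(n, simulate(n))
```

In this model, removal succeeds and strands the job whenever it lands after dispatch but before acquire; only after acquire is it refused.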
The window between T2 and T5 is small in absolute terms but covers a full network round trip and is observably non-zero in practice (the `/acknowledge` POST and the `/acquirejob` POST both incur HTTPS round trips and can each take ~100-300 ms even on a healthy connection).

SIGTERM tears down in-flight HTTP
`SIGTERM` invokes `Runner_Unloading` (`Runner.cs:350`), which calls `HostContext.ShutdownRunner` (`HostContext.cs:607`). That cancels `RunnerShutdownToken` (`HostContext.cs:612`), which is linked into `messageQueueLoopTokenSource` at `Runner.cs:494`. The linked token is plumbed into every HTTP call in the dispatch path: `GetAgentMessageAsync` at `Runner.cs:529`, `AcknowledgeMessageAsync` at `Runner.cs:696`, and `GetJobMessageAsync` at `Runner.cs:715`/`:728`, all the way down to `HttpClient.SendAsync`.

When the token cancels, the BCL tears down the underlying socket. If the cancel lands between `/acknowledge` (T3) and `/acquirejob` (T5), the broker has recorded the message as delivered but the runner never completes the acquire: the same observable outcome as the `config.sh remove` race above.

The fix
Adds an opt-in `<runner-root>/.drain` sentinel file. When present, the message-queue loop in `Runner.cs` exits at the next iteration boundary without aborting in-flight HTTP calls. A supervisor `touch`es the file; the Listener finishes any in-flight long-poll, ack, and acquire normally; only the next iteration's drain check fires and breaks the loop.

Because the check is a `File.Exists` read between iterations rather than a cancellation token plumbed into the HTTP chain, no in-flight long-poll, ack, or acquire is interrupted. Either:

- `DeleteSessionAsync` runs in the existing `finally` at L850, broker stops dispatching to this session.
- `jobDispatcher.Run` complete normally; the Worker is fully spawned; only the next iteration's drain check fires.

Behavior
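A minimal Python model of the per-iteration check may help make the ordering concrete. It is not the runner's C# code: `run_message_loop`, `poll`, `handle`, and `demo_touch_during_poll` are hypothetical names, and only the `.drain` file name comes from this PR. The sketch assumes the check sits at the top of the loop, so a file created while a poll is in flight takes effect only after that iteration's message has been handled.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def run_message_loop(
    runner_root: Path,
    poll: Callable[[], Optional[str]],
    handle: Callable[[str], None],
) -> None:
    """Toy message loop: the sentinel is read only between iterations."""
    sentinel = runner_root / ".drain"
    while True:
        # Drain check at the iteration boundary. Nothing in flight is
        # interrupted because nothing is in flight at this point.
        if sentinel.exists():
            break
        message = poll()      # stands in for the long-poll
        if message is not None:
            handle(message)   # stands in for ack, acquire, and dispatch


def demo_touch_during_poll() -> List[str]:
    """The supervisor creates the file while the long-poll is open."""
    handled: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        def poll() -> Optional[str]:
            (root / ".drain").touch()  # sentinel appears mid-poll
            return "job-1"             # and a job arrives on the same poll

        run_message_loop(root, poll, handled.append)
    return handled


if __name__ == "__main__":
    print(demo_touch_during_poll())
```

In this model the job that arrives on the open poll is still handled, and the loop exits on the following iteration.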
- One `File.Exists` per long-poll iteration (about once every ~50 seconds in steady state), so the overhead is negligible.
- The sentinel path is resolved under `WellKnownDirectory.Root`, so each runner installation has its own sentinel; this is safe with multiple ephemeral runners on one host.

Known limitation (still a draft)
The existing `finally` block at `Runner.cs:842` calls `await jobDispatcher.ShutdownAsync()`, which internally calls `EnsureDispatchFinished(currentDispatch, cancelRunningJob: true)` (`JobDispatcher.cs:218`). That cancels the in-flight Worker rather than waiting for it. So as currently implemented this patch achieves "exit between iterations" but not "wait for the taken job to complete."

For a true drain semantic (taken job runs to completion, no new jobs accepted), the loop should route through the existing `runOnceJobReceived` wait-for-completion branch at `Runner.cs:567-602` when `jobDispatcher.Busy` is true at drain time. Happy to refine in this PR before review.
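The difference between the two shutdown semantics can be sketched with a small `asyncio` model. This is an illustration only, written in Python rather than the runner's C#: `fake_job`, `shut_down`, and `shut_down_with` are hypothetical names, and the `wait_for_job` flag is an assumed stand-in for choosing between cancelling the running job and waiting for it to finish.

```python
import asyncio
from typing import List


async def fake_job(log: List[str], duration: float = 0.05) -> None:
    """Stands in for a running Worker; records how it ended."""
    try:
        await asyncio.sleep(duration)
        log.append("job completed")
    except asyncio.CancelledError:
        log.append("job cancelled")
        raise


async def shut_down(job: "asyncio.Task[None]", wait_for_job: bool) -> None:
    """Either cancel the in-flight job or let it run to completion."""
    if wait_for_job:
        await job                 # true drain: the taken job finishes
        return
    job.cancel()                  # current behavior as described above
    try:
        await job
    except asyncio.CancelledError:
        pass                      # expected: the job was cancelled by us


async def shut_down_with(wait_for_job: bool) -> List[str]:
    log: List[str] = []
    job = asyncio.create_task(fake_job(log))
    await asyncio.sleep(0)        # yield once so the job actually starts
    await shut_down(job, wait_for_job)
    return log


if __name__ == "__main__":
    print(asyncio.run(shut_down_with(wait_for_job=False)))
    print(asyncio.run(shut_down_with(wait_for_job=True)))
```

With `wait_for_job=False` the model records a cancelled job, which corresponds to the limitation described above; with `wait_for_job=True` the job runs to completion before shutdown returns.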