Skip to content

Add external drain sentinel for graceful Listener shutdown#4461

Draft
leshikus wants to merge 1 commit into
actions:mainfrom
leshikus:drain-sentinel
Draft

Add external drain sentinel for graceful Listener shutdown#4461
leshikus wants to merge 1 commit into
actions:mainfrom
leshikus:drain-sentinel

Conversation

@leshikus
Copy link
Copy Markdown

@leshikus leshikus commented May 28, 2026

Problem

External supervisors that manage GitHub Actions runners — autoscalers terminating idle hosts, ephemeral-runner controllers, CI scripts that recycle hosts — need a way to stop an idle Listener without losing jobs the broker is in the middle of dispatching to it.

The Listener today exposes only two stop primitives, and both can leave the broker with a dispatched-but-never-served job. The supervisor observes a clean shutdown; the user observes a job that took 10-15 minutes longer than expected (or, in pathological cases, was abandoned outright). This PR adds a third primitive — an opt-in sentinel file — that closes the race.

How the race happens

config.sh remove cannot see in-flight dispatches

config.sh remove (DELETE /runners/{id}) returns "is currently running a job and cannot be deleted" only after the runner has acquired a job via POST /acquirejob. The server-side busy flag flips on commitment, not on dispatch. So this sequence is possible:

T0  Supervisor: pgrep Runner.Worker         => no Worker process, runner looks idle
T1  Broker:     picks this runner for a job, writes RunnerJobRequestRef
                onto the open long-poll
T2  Listener:   GetAgentMessageAsync returns the message
T3  Listener:   POST /acknowledge           informational, gated by
                                            messageRef.ShouldAcknowledge;
                                            server marks delivered but
                                            not yet committed (busy = false)
T4  Supervisor: DELETE /runners/{id}        200 OK, because busy is still
                                            false (no /acquirejob has happened)
T5  Listener:   POST /acquirejob            401 / 404, runner no longer exists
T6  No Worker ever runs. Broker waits for the lock to expire and eventually
    redispatches via LockedUntil + 5 min lock-expiry.

The supervisor sees an idle runner and a clean unregister. The broker sees a dispatched job that times out. Wall-clock delay until another runner picks it up is 10-15 minutes.

The window between T2 and T5 is small in absolute terms but covers a full network round trip and is observably non-zero in practice (the /acknowledge POST and the /acquirejob POST both incur HTTPS round trips and can each take ~100-300 ms even on a healthy connection).

SIGTERM tears down in-flight HTTP

SIGTERM invokes Runner_Unloading (Runner.cs:350), which calls HostContext.ShutdownRunner (HostContext.cs:607). That cancels RunnerShutdownToken (HostContext.cs:612), which is linked into messageQueueLoopTokenSource at Runner.cs:494. The linked token is plumbed into every HTTP call in the dispatch path: GetAgentMessageAsync at Runner.cs:529, AcknowledgeMessageAsync at Runner.cs:696, and GetJobMessageAsync at Runner.cs:715 / :728 — all the way down to HttpClient.SendAsync.

When the token cancels, the BCL tears down the underlying socket. If the cancel lands between /acknowledge (T3) and /acquirejob (T5), the broker has recorded the message as delivered but the runner never completes the acquire — same observable outcome as the config.sh remove race above.

The fix

Adds an opt-in <runner-root>/.drain sentinel file. When present, the message-queue loop in Runner.cs exits at the next iteration boundary without aborting in-flight HTTP calls. A supervisor touches the file; the Listener finishes any in-flight long-poll, ack, and acquire normally; only the next iteration's drain check fires and breaks the loop.

Because the check is a File.Exists read between iterations rather than a cancellation token plumbed into the HTTP chain, no in-flight long-poll, ack, or acquire is interrupted. Either:

  • The current iteration was idle (long-poll returns empty) — clean exit on next iteration's check; DeleteSessionAsync runs in the existing finally at L850, broker stops dispatching to this session.
  • The current iteration received a message — ack, acquire, and jobDispatcher.Run complete normally; the Worker is fully spawned; only the next iteration's drain check fires.

Behavior

  • No change when the sentinel file is absent (the default).
  • Stale sentinels from a prior process are deleted at startup.
  • The check is one File.Exists per long-poll iteration (~once per ~50 seconds in steady state) — negligible overhead.
  • Path uses WellKnownDirectory.Root so each runner installation has its own sentinel — safe with multiple ephemeral runners on one host.

Known limitation (still a draft)

The existing finally block at Runner.cs:842 calls await jobDispatcher.ShutdownAsync(), which internally calls EnsureDispatchFinished(currentDispatch, cancelRunningJob: true) (JobDispatcher.cs:218). That cancels the in-flight Worker rather than waiting for it. So as currently implemented this patch achieves "exit between iterations" but not "wait for the taken job to complete."

For a true drain semantic (taken job runs to completion, no new jobs accepted), the loop should route through the existing runOnceJobReceived wait-for-completion branch at Runner.cs:567-602 when jobDispatcher.Busy is true at drain time. Happy to refine in this PR before review.

Introduces a sentinel-file check (`<runner-root>/.drain`) at the top of
the message-queue loop, decoupled from `RunnerShutdownToken`. When the
file is present, the Listener exits after the current iteration
completes -- in contrast with SIGTERM/SIGINT, which cancel the in-flight
`GetAgentMessageAsync` HTTP call and any pending `/acknowledge` or
`/acquirejob` calls via the linked `messageQueueLoopTokenSource`.

This addresses a race that affects external supervisors (autoscalers,
custom AMIs, ephemeral-runner orchestrators) that need to terminate an
idle runner: between deciding the runner is idle and killing it, the
broker can dispatch a job onto the open long-poll. SIGTERM mid-dispatch
leaves the server with a committed but unservable job. The sentinel
lets the supervisor say "stop after the current iteration completes"
without forcing mid-HTTP cancellation, eliminating the ack-sent-no-
Worker window.

No behavior change for runners whose supervisor does not create the
file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant