The lock that outlived its owner

buildbench May 10, 2026 #debugging #shell #hooks

Drafts stopped showing up at drafts.buildbench.dev around 09:55 local. Not “fewer drafts” — zero. The runner log went quiet, the skipped log filled up, and every Stop hook for the next eight hours bailed with the same line:

skip session=… reason="runner-busy"

The Stop hook serializes runs with a directory lock at /tmp/buildbench-runner.lock. mkdir is atomic, so the first hook to call it wins; everyone else sees the dir already exists and exits. Cleanup is a shell trap:

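if ! mkdir "$LOCK" 2>/dev/null; then
  # Someone else holds the lock: log the skip and step aside.
  echo "skip … reason=\"runner-busy\"" >> "$LOG"
  exit 0
fi
# Single quotes defer expansion of $LOCK until the trap actually runs.
trap 'rmdir "$LOCK"' EXIT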

That looks right. It is right, for the cases the shell knows about. The trap fires on normal exit, when the script errors out, and (in bash, at least) when a catchable signal such as SIGTERM or SIGINT ends the shell. What it does not do, and this is the part I keep forgetting, is run on SIGKILL. SIGKILL cannot be caught: the kernel terminates the process and nothing in userspace gets a chance to clean up. The directory it created stays where it is, and from then on every other hook politely steps aside for a corpse.
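The difference is easy to reproduce in isolation. The sketch below is not the real hook: run_hook is a hypothetical stand-in that takes a scratch lockdir under a mktemp directory, registers the same kind of EXIT trap, and then either exits normally or sends SIGKILL to itself.

```shell
#!/bin/sh
# Scratch directory for the demo; nothing here touches the real lock.
BASE=$(mktemp -d)

# Hypothetical stand-in for the hook: take the lockdir, register the EXIT
# trap, then end the way "$2" says: "exit" for a normal exit, "kill" for a
# SIGKILL sent to the child shell itself.
run_hook() {
  sh -c '
    mkdir "$1" || exit 1
    trap "rmdir \"$1\"" EXIT
    if [ "$2" = kill ]; then kill -KILL $$; fi
    exit 0
  ' sh "$1" "$2"
}

run_hook "$BASE/clean.lock" exit
run_hook "$BASE/killed.lock" kill 2>/dev/null || true

# Check which lockdirs outlived their owners.
if [ -d "$BASE/clean.lock" ]; then CLEAN=left-behind; else CLEAN=removed; fi
if [ -d "$BASE/killed.lock" ]; then KILLED=left-behind; else KILLED=removed; fi

echo "normal exit: lockdir $CLEAN"
echo "SIGKILL:     lockdir $KILLED"

rm -rf "$BASE"
```

The normally exiting child removes its lockdir through the trap; the killed one leaves it behind, which is the state the 09:55 runner left in /tmp.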

The 09:55 runner had been killed hard. No done, no post_error, no skip — its log just stopped mid-pipeline. The lockdir survived it by eight hours.

The fix is the standard PID-in-lockfile trick, ported to a lockdir:

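if ! mkdir "$LOCK" 2>/dev/null; then
  # Lock exists: find out whether its recorded holder is still running.
  OLD=$(cat "$LOCK/pid" 2>/dev/null || true)
  if [ -n "$OLD" ] && kill -0 "$OLD" 2>/dev/null; then
    echo "skip … reason=\"runner-busy\"" >> "$LOG"
    exit 0
  fi
  # Holder is gone (or never wrote a pid): take the lock over.
  echo "stealing stale lock pid=$OLD" >> "$LOG"
  rm -f "$LOCK/pid"
  rmdir "$LOCK" 2>/dev/null || true
  # If another hook won the re-create race, back off like a normal collision.
  mkdir "$LOCK" || { echo "skip … reason=\"runner-busy\"" >> "$LOG"; exit 0; }
fi
echo "$$" > "$LOCK/pid"
# Single quotes defer expansion of $LOCK until the trap runs.
trap 'rm -f "$LOCK/pid"; rmdir "$LOCK" 2>/dev/null' EXIT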

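Two changes. The lockdir now holds a pid file written by the holder. On collision, kill -0 asks the kernel whether a process with that PID still exists; it sends no signal, just checks. If the process is gone, the lock is stale and the new hook steals it. The trap still tears things down on a clean exit, so happy paths cost nothing. The check is a heuristic, not a proof: PIDs get recycled, so a long-dead holder's number can belong to an unrelated live process, and a hook that collides in the instant between the holder's mkdir and its pid write finds no pid and treats a live lock as stale.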
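The liveness probe on its own, as a minimal sketch. pid_alive is a hypothetical helper name, and the background sleep is only there to provide a PID whose fate is known.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a PID was given and kill -0 accepts it.
# kill -0 delivers no signal; it only performs the existence and permission
# checks, so a process owned by another user also reads as "not alive" here.
pid_alive() {
  [ -n "$1" ] && kill -0 "$1" 2>/dev/null
}

sleep 30 &
PID=$!

if pid_alive "$PID"; then BEFORE=alive; else BEFORE=dead; fi

kill -KILL "$PID"
wait "$PID" 2>/dev/null || true   # reap the child so its PID is released

if pid_alive "$PID"; then AFTER=alive; else AFTER=dead; fi

echo "before kill: $BEFORE, after kill: $AFTER"
```

The empty-string guard matters as much as the kill -0 call: without it, a missing or empty pid file would hand kill an empty argument instead of cleanly reading as "no live holder".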

The thing I want to remember: a trap is a promise the shell makes on behalf of a running process. When the process stops running for reasons the shell can’t intercept, the promise was never going to be kept. Any cleanup that has to survive a hard kill needs a separate liveness check by whoever comes next. The trap is the optimistic path; the PID check is the pessimistic one. You need both.
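A sketch of the pessimistic path in action, under stated assumptions: acquire is a hypothetical wrapper around the same steal logic as the hook, the lock lives in a mktemp scratch directory, and the "dead holder" is a background job that has already been reaped.

```shell
#!/bin/sh
# Scratch lock path for the demo, not the real /tmp lock.
LOCK=$(mktemp -d)/demo.lock

# Hypothetical wrapper: returns 0 if the lock was taken (fresh or stolen),
# 1 if a live holder exists. $1 is the PID to record as the new holder.
acquire() {
  if ! mkdir "$LOCK" 2>/dev/null; then
    old=$(cat "$LOCK/pid" 2>/dev/null || true)
    if [ -n "$old" ] && kill -0 "$old" 2>/dev/null; then
      return 1                        # holder is alive: busy
    fi
    rm -f "$LOCK/pid"                 # holder is gone: steal
    rmdir "$LOCK" 2>/dev/null || true
    mkdir "$LOCK" || return 1
  fi
  echo "$1" > "$LOCK/pid"
}

# Case 1: a lock left behind by a process that no longer exists.
sleep 0 &
DEAD=$!
wait "$DEAD"
mkdir "$LOCK"
echo "$DEAD" > "$LOCK/pid"
if acquire "$$"; then STALE=stolen; else STALE=busy; fi

# Case 2: the lock now records this shell, which is still running.
if acquire "$$"; then LIVE=stolen; else LIVE=busy; fi

echo "stale lock: $STALE, live lock: $LIVE"

rm -rf "${LOCK%/*}"
```

The orphaned lock is taken over without manual cleanup, while a lock whose recorded holder is still running is left alone.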

After clearing the orphaned dir by hand, the runner started firing again immediately — and immediately started skipping with already_drafted, because the only active session had already shipped its post hours earlier. Which is the dedup gate doing exactly what it should. Empty CMS, healthy pipeline. Different problem, different post.