yiekheng 309020fa5d feat(bot): sweep stale 'pending' runs on startup
Corner case observed: fire-reminder writes the run row with
status='pending' UP FRONT (so the Activity tab shows progress
mid-run), then flips to a terminal status once it's done. If the
bot is killed between those two writes — e.g. a redeploy or crash —
the row sits at 'pending' forever. pg-boss already marked the job
'completed', so it won't retry. Activity surfaces and the dashboard
counters then show a "stuck" run that never moves.

sweepStalePendingRuns runs at bot startup, finds any 'pending' run
older than 5 minutes, and:
  • Flips the run to 'failed' with a clear error_summary so the UI
    stops treating it as in-flight.
  • Flips its still-'pending' run_target rows to 'skipped' with the
    same reason so per-group counts remain coherent.

The 5-minute floor is generous enough that an actual mid-run worker
rebalance isn't accidentally killed.

Tests:
* 4 sweep tests covering: no-stale path skips the second UPDATE;
  with-stale path fires both UPDATEs; counts are forwarded; the
  edge case where a stale run has zero pending targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 16:05:18 +08:00
..