7 Commits

Author SHA1 Message Date
2e6fbfa7a5 fix(bot): reschedule was silently dropped under stately policy
Scheduled reminder for May 10 8:20 PM never fired. Bot logs showed
"reminder.fire: scheduled" with jobId: null at 12:18 UTC — pg-boss
returned null because the queue was on policy=stately, which dedupes
sends across the (created/active/retry) state cone by singletonKey.
A previous schedule for the same reminder (next recurring fire,
created earlier) was still in 'created' state, so the new send for
today 8:20 PM hit the dedupe and was silently rejected.

Two fixes:

  1. Switch the queue policy back to 'standard' (the default) and
     force-flip any existing 'stately' queue row on boot. Standard
     lets us enqueue across reschedules.
  2. scheduleReminderFire now does a pre-send cancel: any 'created'
     job for this singletonKey is moved to 'cancelled' before the new
     boss.send. The new schedule wins; old stale jobs are tombstoned
     so the recurring/edit path produces exactly-one upcoming fire.

Duplicate-fire safety (the 'qwerd msg three times' bug) is already
covered at the handler level by the inner-mutex recent-run check
inside fireReminderInner — that's what stately was guarding against,
and the inner check works under standard too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:27:52 +08:00
a789b61e1f fix(bot): triple-fire reminder bug — force pg-boss policy + close TOCTOU dedupe
Repro: fire a reminder, message lands 2-3 times in WhatsApp (logs
showed three 'fire-reminder: done' entries within 1.5 s for the same
reminderId).

Two interlocking root causes:

  1. The queue was created at 'standard' policy (pre-dating the
     stately rollout). pg-boss's createQueue is idempotent and DOES
     NOT update the policy on an existing queue row, so re-deploying
     the code that requested policy=stately silently kept the
     standard policy. Standard accepts duplicate enqueues with the
     same singletonKey — three reminder.fire jobs for the same
     reminderId could all land at once.

  2. The handler-level recent-run dedupe was TOCTOU. The check ran
     OUTSIDE the per-account mutex, so three concurrent invocations
     all read 'no recent run', then queued up on the mutex one at a
     time and each INSERTed a fresh run + sent the message.

Fixes:

  - registerReminderJobs now forces the queue policy via direct SQL
    (UPDATE pgboss.queue SET policy = 'stately' WHERE name = ...
    AND policy <> 'stately') on every boot. Idempotent + survives
    pre-existing standard-policy rows.
  - fireReminderInner re-checks for a recent run AFTER the mutex is
    held but BEFORE the INSERT. By that point any concurrent winner
    has already inserted, so the duplicate sees the row and bails
    cleanly.

New test in fire-reminder.test.ts (the TOCTOU repro): outer check
returns no recent run, inner check returns a freshly-inserted one,
asserts the mutex was acquired but the second findFirst was hit
(i.e. we got past the outer check and the inner check stopped us).

Verified live: pgboss.queue.policy is now 'stately' for reminder.fire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:43:41 +08:00
4cb4015666 fix(bot): dedupe duplicate reminder.fire jobs (msg sent twice)
Observed: reminder fired twice within ~2s. The bot logs showed two
distinct pg-boss jobIds for the same reminder enqueued at the same
scheduledAt — both ran fire-reminder, both sent the message.

Root cause: pg-boss's `singletonKey` only deduplicates on queues with
a 'singleton' / 'stately' / 'short' policy. Our queue was created
without specifying a policy, defaulting to 'standard', which IGNORES
the singletonKey. Two sends with the same key produced two jobs.

Fix lives at two layers:

* Layer 1 — queue policy. createQueue(REMINDER_FIRE_QUEUE) now
  passes `{ policy: 'stately' }`. With this, future fresh deploys
  fold a duplicate send (same singletonKey) into the existing
  'created' job rather than producing a second one. This doesn't
  retroactively change an existing queue's policy (pg-boss doesn't
  support that), but new queues are correct from creation.

* Layer 2 — defense-in-depth check inside fireReminder. Before
  acquiring the per-account mutex, query reminderRuns for any row
  with the same reminderId fired in the last 30s. If found, log
  + bail. This guards against:
    - Existing queues stuck on policy='standard'.
    - Race windows even within 'stately' policy.
    - The operator double-clicking Save in the wizard.
    - A jittery pg_notify('bot.command') replay.
  Resume jobs (payload.runId set) skip this check — they're meant
  to attach to an existing run.

Tests:
* New "BAILS OUT when a fresh fire collides with a recent run" case
  in fire-reminder.test.ts.
* beforeEach now resets findExistingRunMock too, since both the
  resume and dedupe paths share that mock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 16:41:11 +08:00
376bbe595b feat(web,bot): resumeReminderRunAction + cancelReminderRunAction
Web actions:

* resumeReminderRunAction({ runId }) → validates ownership and that
  the run is in 'paused' state, then publishes a reminder.resume
  command via pg_notify('bot.command'). The bot's command-consumer
  picks it up and enqueues a fresh pg-boss job at REMINDER_FIRE_QUEUE
  carrying { reminderId, runId }; fire-reminder's existing resume
  branch attaches to the row.
* cancelReminderRunAction({ runId }) → flips remaining 'pending'
  targets to 'skipped' with error="canceled by operator", marks the
  run 'partial' with a clear errorSummary, and lifts the parent
  reminder out of 'paused' (recurring → active so the next
  occurrence fires; one-off → ended).

Bot:

* New BotCommand variant { type: "reminder.resume"; reminderId; runId }
* command-consumer registers handleResumeReminder which calls
  enqueueReminderResume(boss, reminderId, runId) — a sibling of
  scheduleReminderFire that posts the job at REMINDER_FIRE_QUEUE
  with { reminderId, runId } and singletonKey "reminder:resume:<runId>"
  so the resume doesn't conflict with a future-occurrence schedule.

Tests:
* reminders.run-actions.test.ts (11 tests) — every guard rail
  (invalid uuid, missing run, missing reminder, foreign operator,
  wrong status) and the recurring/one-off lifecycle branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 15:54:21 +08:00
c9a7e6f089 feat(bot): cross-account parallel + same-account serial fan-out
Replaces the single-threaded, 1.5s-sleep-per-part loop with a
concurrency model that:

* Wraps inner work in PerKeyMutex(accountId) so two reminders on the
  SAME account take turns (running them concurrently would double the
  effective send rate and risk a WhatsApp ban). Different accounts run
  in parallel.
* Bumps pg-boss localConcurrency to BOT_FIRE_CONCURRENCY (default 8),
  so up to 8 different-account reminders can fire simultaneously.
* Bulk-loads groups + media in 2 queries (drops ~3000 round-trips to
  ~3 for a 1000-group run) and pre-creates run_target rows so the
  Activity tab shows progress mid-run.
* Pre-uploads each unique media via MediaUploadCache (one
  generateWAMessageContent call per mediaId, then relayMessage to
  every group). For 1000 groups × 5 MB image, this turns 5 GB of
  upload into 5 MB.
* Runs BOT_GROUP_CONCURRENCY (default 3) groups in parallel within
  one account; parts within a group stay serial so chat order is
  preserved.
* Gates every send on a per-account TokenBucket
  (BOT_MAX_SEND_PER_MINUTE, default 40).
* Replaces the rigid 1.5s inter-part sleep with 200..499 ms jitter.

Adds a unit test verifying accountMutex.run is called keyed by
accountId for active reminders, and skipped for inactive / missing.

Window enforcement, paused/resume, and ETA preview are deferred to
later phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 14:44:23 +08:00
01eb5752ee feat(scheduler): add fire-reminder handler + job registration
Also fix rrule default-import workaround so the shared package loads
correctly under NodeNext ESM resolution (rrule@2.8.1 has no exports field).
2026-05-09 17:29:21 +08:00
113adc7edf feat(scheduler): add pg-boss client + lifecycle
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 17:19:01 +08:00