Scheduled reminder for May 10 8:20 PM never fired. Bot logs showed
"reminder.fire: scheduled" with jobId: null at 12:18 UTC — pg-boss
returned null because the queue was on policy=stately, which dedupes
sends by singletonKey across the created, active, and retry states.
A previous schedule for the same reminder (next recurring fire,
created earlier) was still in 'created' state, so the new send for
today 8:20 PM hit the dedupe and was silently rejected.
Two fixes:
1. Switch the queue policy back to 'standard' (the default) and
force-flip any existing 'stately' queue row on boot. Standard
lets us enqueue across reschedules.
2. scheduleReminderFire now does a pre-send cancel: any 'created'
job for this singletonKey is moved to 'cancelled' before the new
boss.send. The new schedule wins; old stale jobs are tombstoned
so the recurring/edit path produces exactly-one upcoming fire.
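
The pre-send cancel in fix 2 can be sketched against an in-memory
stand-in for the queue. This is an illustration only, not the pg-boss
API: FakeQueue, scheduleWithPreCancel, upcomingFor, and the
"reminder:<id>" key format are all hypothetical names.

```typescript
// In-memory model of the queue. NOT the pg-boss API; every name here is hypothetical.
type JobState = "created" | "active" | "retry" | "completed" | "cancelled";

interface Job {
  id: string;
  singletonKey: string;
  state: JobState;
  startAfter: Date;
}

class FakeQueue {
  readonly jobs: Job[] = [];
  private nextId = 1;

  // Step 1: tombstone every not-yet-started job for this key.
  cancelCreated(singletonKey: string): number {
    let cancelled = 0;
    for (const job of this.jobs) {
      if (job.singletonKey === singletonKey && job.state === "created") {
        job.state = "cancelled";
        cancelled += 1;
      }
    }
    return cancelled;
  }

  // Step 2: enqueue the replacement job.
  send(singletonKey: string, startAfter: Date): string {
    const id = `job-${this.nextId++}`;
    this.jobs.push({ id, singletonKey, state: "created", startAfter });
    return id;
  }
}

// Cancel first, then send, so the newest schedule always wins.
function scheduleWithPreCancel(queue: FakeQueue, reminderId: string, fireAt: Date): string {
  const key = `reminder:${reminderId}`; // assumed key format
  queue.cancelCreated(key);
  return queue.send(key, fireAt);
}

// Jobs that are still waiting to fire for this reminder.
function upcomingFor(queue: FakeQueue, reminderId: string): Job[] {
  return queue.jobs.filter(
    (j) => j.singletonKey === `reminder:${reminderId}` && j.state === "created",
  );
}
```

Calling scheduleWithPreCancel twice for the same reminder leaves one
'created' job (the second) and one 'cancelled' job (the first).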
Duplicate-fire safety (the 'qwerd msg three times' bug) is already
covered at the handler level by the inner-mutex recent-run check
inside fireReminderInner — that's what stately was guarding against,
and the inner check works under standard too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repro: fire a reminder, message lands 2-3 times in WhatsApp (logs
showed three 'fire-reminder: done' entries within 1.5 s for the same
reminderId).
Two interlocking root causes:
1. The queue was created at 'standard' policy (pre-dating the
stately rollout). pg-boss's createQueue is idempotent and DOES
NOT update the policy on an existing queue row, so re-deploying
the code that requested policy=stately silently kept the
standard policy. Standard accepts duplicate enqueues with the
same singletonKey — three reminder.fire jobs for the same
reminderId could all land at once.
2. The handler-level recent-run dedupe had a TOCTOU (time-of-check to
time-of-use) race. The check ran OUTSIDE the per-account mutex, so
three concurrent invocations all read 'no recent run', then queued
up on the mutex one at a time, and each INSERTed a fresh run and
sent the message.
Fixes:
- registerReminderJobs now forces the queue policy via direct SQL
(UPDATE pgboss.queue SET policy = 'stately' WHERE name = ...
AND policy <> 'stately') on every boot. Idempotent + survives
pre-existing standard-policy rows.
- fireReminderInner re-checks for a recent run AFTER the mutex is
held but BEFORE the INSERT. By that point any concurrent winner
has already inserted, so the duplicate sees the row and bails
cleanly.
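
A minimal sketch of the check-again-under-the-lock pattern described
above. The PerKeyMutex here is a simplified stand-in written for this
example, and the Set stands in for the reminderRuns query and INSERT;
neither is the real implementation.

```typescript
type FireResult = "sent" | "duplicate";

// Simplified stand-in: serializes tasks that share a key by chaining promises.
class PerKeyMutex {
  private tails = new Map<string, Promise<void>>();

  run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const prev = this.tails.get(key) ?? Promise.resolve();
    const result = prev.then(fn);
    // Keep the chain usable even when `fn` rejects.
    this.tails.set(key, result.then(() => undefined, () => undefined));
    return result;
  }
}

const recentRuns = new Set<string>(); // reminderIds with a run inside the dedupe window
const sent: string[] = [];
const mutex = new PerKeyMutex();

async function fireReminderSketch(accountId: string, reminderId: string): Promise<FireResult> {
  // Outer check: cheap, but racy because it runs before the mutex is held.
  if (recentRuns.has(reminderId)) return "duplicate";

  return mutex.run(accountId, async (): Promise<FireResult> => {
    // Inner check: nothing else for this account runs between this read and the insert.
    if (recentRuns.has(reminderId)) return "duplicate";
    recentRuns.add(reminderId); // stands in for the INSERT
    await Promise.resolve();    // stands in for the network send
    sent.push(reminderId);
    return "sent";
  });
}
```

Three concurrent calls for the same reminder all pass the outer check,
but only the first one to hold the mutex inserts and sends; the other
two see the row in the inner check and return "duplicate".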
New test in fire-reminder.test.ts (the TOCTOU repro): the outer check
returns no recent run, the inner check returns a freshly-inserted one,
and the test asserts that the mutex was acquired and the second
findFirst was hit (i.e. we got past the outer check and the inner
check stopped us).
Verified live: pgboss.queue.policy is now 'stately' for reminder.fire.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed: reminder fired twice within ~2s. The bot logs showed two
distinct pg-boss jobIds for the same reminder enqueued at the same
scheduledAt — both ran fire-reminder, both sent the message.
Root cause: pg-boss's `singletonKey` only deduplicates on queues with
a 'singleton' / 'stately' / 'short' policy. Our queue was created
without specifying a policy, defaulting to 'standard', which IGNORES
the singletonKey. Two sends with the same key produced two jobs.
Fix lives at two layers:
* Layer 1 — queue policy. createQueue(REMINDER_FIRE_QUEUE) now
passes `{ policy: 'stately' }`. With this, future fresh deploys
fold a duplicate send (same singletonKey) into the existing
'created' job rather than producing a second one. This doesn't
retroactively change an existing queue's policy (pg-boss doesn't
support that), but new queues are correct from creation.
* Layer 2 — defense-in-depth check inside fireReminder. Before
acquiring the per-account mutex, query reminderRuns for any row
with the same reminderId fired in the last 30s. If found, log
+ bail. This guards against:
- Existing queues stuck on policy='standard'.
- Race windows even within 'stately' policy.
- The operator double-clicking Save in the wizard.
- A jittery pg_notify('bot.command') replay.
Resume jobs (payload.runId set) skip this check — they're meant
to attach to an existing run.
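
The Layer 2 decision can be sketched as a pure function. The row and
payload shapes below (RunRow, FirePayload, firedAt) are assumptions
made for the example; only the 30s window and the "resume jobs skip
the check" rule come from this commit.

```typescript
const DEDUPE_WINDOW_MS = 30_000; // the 30s window described above

interface FirePayload {
  reminderId: string;
  runId?: string; // set only for resume jobs
}

// Assumed shape of a reminderRuns row; the real schema is not shown here.
interface RunRow {
  reminderId: string;
  firedAt: Date;
}

// Returns true when a fresh fire should log and bail as a duplicate.
function shouldBail(payload: FirePayload, runs: readonly RunRow[], now: Date): boolean {
  // Resume jobs attach to an existing run, so a recent run is expected.
  if (payload.runId !== undefined) return false;
  const cutoff = now.getTime() - DEDUPE_WINDOW_MS;
  return runs.some(
    (r) => r.reminderId === payload.reminderId && r.firedAt.getTime() >= cutoff,
  );
}
```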
Tests:
* New "BAILS OUT when a fresh fire collides with a recent run" case
in fire-reminder.test.ts.
* beforeEach now resets findExistingRunMock too, since both the
resume and dedupe paths share that mock.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Web actions:
* resumeReminderRunAction({ runId }) → validates ownership and that
the run is in 'paused' state, then publishes a reminder.resume
command via pg_notify('bot.command'). The bot's command-consumer
picks it up and enqueues a fresh pg-boss job at REMINDER_FIRE_QUEUE
carrying { reminderId, runId }; fire-reminder's existing resume
branch attaches to the row.
* cancelReminderRunAction({ runId }) → flips remaining 'pending'
targets to 'skipped' with error="canceled by operator", marks the
run 'partial' with a clear errorSummary, and lifts the parent
reminder out of 'paused' (recurring → active so the next
occurrence fires; one-off → ended).
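
The state changes made by cancelReminderRunAction can be sketched as
two pure functions. The Target shape and the function names are
hypothetical; the status strings and the error text are the ones
listed above.

```typescript
// Assumed shape of a run target; the real table has more columns.
interface Target {
  groupId: string;
  status: string;
  error?: string;
}

type ReminderStatus = "paused" | "active" | "ended";

// Flip every still-pending target to 'skipped'; leave finished targets untouched.
function cancelPendingTargets(targets: readonly Target[]): Target[] {
  return targets.map((t) =>
    t.status === "pending"
      ? { ...t, status: "skipped", error: "canceled by operator" }
      : t,
  );
}

// Recurring reminders go back to 'active' so the next occurrence fires;
// one-off reminders have nothing left to do and become 'ended'.
function reminderStatusAfterCancel(isRecurring: boolean): ReminderStatus {
  return isRecurring ? "active" : "ended";
}
```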
Bot:
* New BotCommand variant { type: "reminder.resume"; reminderId; runId }
* command-consumer registers handleResumeReminder which calls
enqueueReminderResume(boss, reminderId, runId) — a sibling of
scheduleReminderFire that posts the job at REMINDER_FIRE_QUEUE
with { reminderId, runId } and singletonKey "reminder:resume:<runId>"
so the resume doesn't conflict with a future-occurrence schedule.
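
A sketch of why the resume key cannot collide with a schedule key. Only
the "reminder.resume" command shape and the "reminder:resume:<runId>"
key come from this commit; the queue name value, the EnqueueRequest
type, and the "reminder:fire:<id>" schedule key are assumptions for
illustration.

```typescript
// Only the "reminder.resume" variant is from this commit; the second
// variant is a placeholder for commands not shown here.
type BotCommand =
  | { type: "reminder.resume"; reminderId: string; runId: string }
  | { type: "placeholder.other" };

const REMINDER_FIRE_QUEUE = "reminder.fire"; // assumed value

// Hypothetical description of what gets handed to the queue.
interface EnqueueRequest {
  queue: string;
  data: { reminderId: string; runId?: string };
  singletonKey: string;
}

function resumeRequest(
  cmd: Extract<BotCommand, { type: "reminder.resume" }>,
): EnqueueRequest {
  return {
    queue: REMINDER_FIRE_QUEUE,
    data: { reminderId: cmd.reminderId, runId: cmd.runId },
    // Keyed by run, not by reminder, so it lives in its own dedupe namespace.
    singletonKey: `reminder:resume:${cmd.runId}`,
  };
}

// Assumed key format for a normal future-occurrence schedule.
function scheduleRequest(reminderId: string): EnqueueRequest {
  return {
    queue: REMINDER_FIRE_QUEUE,
    data: { reminderId },
    singletonKey: `reminder:fire:${reminderId}`,
  };
}
```

Both requests target the same queue, but because the keys differ, a
pending schedule for the reminder does not dedupe the resume job.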
Tests:
* reminders.run-actions.test.ts (11 tests) — every guard rail
(invalid uuid, missing run, missing reminder, foreign operator,
wrong status) and the recurring/one-off lifecycle branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the single-threaded, 1.5s-sleep-per-part loop with a
concurrency model that:
* Wraps inner work in PerKeyMutex(accountId) so two reminders on the
SAME account take turns (running them concurrently would double the
effective send rate and risk a WhatsApp ban). Different accounts run
in parallel.
* Bumps pg-boss localConcurrency to BOT_FIRE_CONCURRENCY (default 8),
so up to 8 different-account reminders can fire simultaneously.
* Bulk-loads groups + media in 2 queries (drops ~3000 round-trips to
~3 for a 1000-group run) and pre-creates run_target rows so the
Activity tab shows progress mid-run.
* Pre-uploads each unique media via MediaUploadCache (one
generateWAMessageContent call per mediaId, then relayMessage to
every group). For 1000 groups × 5 MB image, this turns 5 GB of
upload into 5 MB.
* Runs BOT_GROUP_CONCURRENCY (default 3) groups in parallel within
one account; parts within a group stay serial so chat order is
preserved.
* Gates every send on a per-account TokenBucket
(BOT_MAX_SEND_PER_MINUTE, default 40).
* Replaces the rigid 1.5s inter-part sleep with 200..499 ms jitter.
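
A rough sketch of the rate gate and the jitter from the last two
bullets. This is a generic token bucket written for illustration, not
the TokenBucket class from this change; the constructor arguments and
method names are assumptions. Time is passed in explicitly so the
behaviour is easy to test.

```typescript
// Generic token bucket; names and signature are hypothetical.
class TokenBucketSketch {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerMinute: number;

  constructor(capacity: number, refillPerMinute: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerMinute = refillPerMinute;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  // Takes one token if available; returns false when the caller must wait.
  tryTake(nowMs: number): boolean {
    const elapsedMs = nowMs - this.lastRefillMs;
    const refill = (elapsedMs / 60_000) * this.refillPerMinute;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Uniform integer delay in the 200..499 ms range (both ends inclusive).
function interPartJitterMs(random: () => number = Math.random): number {
  return 200 + Math.floor(random() * 300);
}
```

With a capacity and refill rate of 40 (the BOT_MAX_SEND_PER_MINUTE
default), 40 sends pass immediately, the next one is refused, and
tokens come back at one every 1.5 s.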
Adds a unit test verifying accountMutex.run is called keyed by
accountId for active reminders, and skipped for inactive / missing.
Window enforcement, paused/resume, and ETA preview are deferred to
later phases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>