Folds in three rounds of requirement evolution: * Pause/resume on window close (was stop-and-report-partial). * ETA preview pill at compose / edit time so the operator sees whether their chosen window will fit before scheduling. * Interactive paused-run banner with Resume / Cancel buttons on the detail page; pause notification deep-links to it. Helper relocations: * windowEndAt() moves to packages/shared so both bot fire-reminder and the web ETA pill can import the same calculator. Plan grows from 8 to 10 tasks: adds Task 9 (run-eta + RunEtaPill, TDD) and Task 10 (resume/cancel actions + PausedRunBanner). Acceptance gains two paused-flow smoke tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
351 lines
17 KiB
Markdown
351 lines
17 KiB
Markdown
# Windowed, Pacing-Safe Reminder Fan-Out
|
||
|
||
> Design spec for a faster, ban-safe, multi-account-friendly reminder
|
||
> delivery loop. Written 2026-05-10. Implementation tracked in a
|
||
> follow-up plan doc.
|
||
|
||
## Goal
|
||
|
||
Deliver a reminder to many groups (target: 1000+) safely within a
|
||
per-reminder delivery window. If we cannot finish in the window,
|
||
**pause** the run at window-close, persist progress, and let the
|
||
operator **resume** later from the Activity / detail view. The
|
||
paused-status message tells the operator what's blocking throughput
|
||
(account at capacity, media size eating the budget) so they can
|
||
decide whether to offload to another paired account, shrink the
|
||
attachment, or just resume the next morning at 6am.
|
||
|
||
## Constraints
|
||
|
||
- WhatsApp's anti-spam is the dominant ceiling. For an established
|
||
account (years of legit history), 30–60 sends/minute is the
|
||
sustainable safe band; tighter for newer accounts.
|
||
- The system runs on a single bot process talking to multiple paired
|
||
WhatsApp accounts. Each account's Baileys socket is independent.
|
||
- Two simultaneous fan-outs on the **same** WhatsApp account would
|
||
double its effective send rate and risk a ban.
|
||
- The operator dropped multi-account fan-out (one reminder splitting
|
||
across N accounts) earlier this week. We respect that decision —
|
||
this design does not automatically split work across accounts.
|
||
|
||
## Approach (selected: B)
|
||
|
||
**A. Minimal pacing fix.** Drop the rigid 1.5s sleep, add a
|
||
token-bucket rate limiter, add window-end check, cache DB lookups.
|
||
Wins ≈30% on text-only reminders; very little on media-heavy ones.
|
||
|
||
**B. Pacing + media-upload cache + bounded concurrency.** Everything
|
||
in A, plus upload each unique media file ONCE per run via Baileys'
|
||
`prepareWAMessageMedia` and reuse the resulting `WAMediaUpload`
|
||
payload for every group send. Run up to N groups in parallel within
|
||
one account (parts within a group stay serial so order is preserved).
|
||
Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of
|
||
upload turns into 5 MB. **Recommended.**
|
||
|
||
**C. Multi-account fan-out** — dropped per operator decision.
|
||
|
||
## Per-account isolation (cross-account parallelism)
|
||
|
||
Today `boss.work()` is called with default `teamSize=1`, so a single
|
||
fan-out monopolises the whole bot. Two reminders on **different**
|
||
accounts queue serially, which surprises the operator.
|
||
|
||
The new model is **per-account serialization, cross-account
|
||
parallelism**:
|
||
|
||
- `teamSize` raised so multiple reminders on different accounts run
|
||
simultaneously.
|
||
- A per-key async mutex keyed by `accountId` wraps the inner work, so
|
||
two reminders on the **same** account take turns.
|
||
- The token-bucket rate limiter is per-account too, so one account's
|
||
pacing budget never throttles another.
|
||
|
||
```
|
||
pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8)
|
||
├─ R1 (account A) ──┐
|
||
│ ├─ per-account-A mutex ──→ serialised within A
|
||
├─ R3 (account A) ──┘
|
||
│
|
||
├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's
|
||
│
|
||
└─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B
|
||
```
|
||
|
||
## Delivery window
|
||
|
||
Each reminder gets a window in its operator timezone. If the run
|
||
cannot finish inside the window, send what we can and stop.
|
||
|
||
- New columns on `reminders`:
|
||
- `delivery_window_start_hour int default 6`
|
||
- `delivery_window_end_hour int default 18`
|
||
- Both interpreted in the row's existing `timezone` column.
|
||
- Validation: `0 ≤ start < end ≤ 24`. Cross-midnight windows (e.g.
|
||
22 → 06) are rejected in v1 to keep the math obvious; can be added
|
||
later if anyone needs them.
|
||
- UI uses two number inputs in the When step (and edit-when page).
|
||
- `delivery-window.ts` exports a pure helper:
|
||
`windowEndAt(timezone, endHour, fireAt) → Date`. Returns the
|
||
end-of-window timestamp for the calendar day `fireAt` falls on,
|
||
in the given timezone. If `fireAt` is already past that day's
|
||
end-hour, the returned timestamp is in the past — the run loop's
|
||
first iteration sees `now() >= windowEndAt`, marks every target
|
||
`skipped`, and the run resolves to `failed` (zero sent). That's
|
||
the right behaviour: "we can't send after window close, even one
|
||
message".
|
||
- **Only the end hour is enforced at runtime in v1.** The start
|
||
hour is documented on the row but not gated — operators schedule
|
||
fire times that fall in their band naturally (cron + the picker's
|
||
default 09:00 time fields land inside 06–18). Enforcing the start
|
||
too would mean holding messages from a 4am cron miss-fire until
|
||
6am, which is a v2 conversation.
|
||
|
||
## Estimated finish time (ETA preview)
|
||
|
||
The operator sets the delivery window, but they need a feel for
|
||
whether the window is *enough*. We surface an ETA at compose and
|
||
edit time so they can widen the window (or shrink the run) before
|
||
hitting Schedule.
|
||
|
||
- A pure helper `estimateRunDuration({ targetCount, ratePerMinute })`
|
||
returns `{ durationMinutes, estimatedFinishAt }` given a fire time.
|
||
Calculation: `ceil(targetCount / ratePerMinute)` minutes plus a
|
||
15% buffer for per-group setup latency, with a 1-minute floor.
|
||
- Default `ratePerMinute` reads `BOT_MAX_SEND_PER_MINUTE` (40); the
|
||
number is hard-coded into the web bundle as a constant — operators
|
||
who tune the bot env are responsible for redeploying web. The web
|
||
side does NOT read bot env directly.
|
||
- Displayed in two places:
|
||
- **Wizard Review step**, between the recipients summary and the
|
||
Schedule button:
|
||
`"~28 minutes · finishes ~10:32 (Asia/Kuala_Lumpur)"`
|
||
- **Edit Groups** and **Edit When** pages, near the save button.
|
||
- Style:
|
||
- Green pill `"Fits in window"` when `estimatedFinishAt <= windowEndAt`.
|
||
- Amber pill `"Likely to pause"` when it doesn't, with a one-line
|
||
suggestion: *"Widen the window or split into smaller runs."*
|
||
- The ETA is advisory, not a hard gate — the operator can still
|
||
schedule a run that's likely to pause; pause-and-resume covers
|
||
that case. The ETA just removes the surprise.
|
||
|
||
## Run loop changes (`fire-reminder.ts`)
|
||
|
||
Up-front, once per run:
|
||
|
||
1. Load all `reminder.targets`, `reminder.messages`, and referenced
|
||
`media_files` rows into in-memory Maps. Drops ~3000 round-trips to
|
||
~3 round-trips for a 1000-group run.
|
||
2. Pre-create every `reminder_run_targets` row with
|
||
`status = "pending"` so progress is observable from the Activity
|
||
tab while the fan-out is mid-flight.
|
||
3. **Pre-upload each unique media** via Baileys'
|
||
`prepareWAMessageMedia`. Cache the resulting `WAMediaUpload`
|
||
payload keyed by `mediaId` for the duration of the run.
|
||
4. Compute `windowEndAt` and stash it.
|
||
|
||
Per-target (limited to `BOT_GROUP_CONCURRENCY` parallel groups,
|
||
default 3):
|
||
|
||
1. **Window-end gate:** if `Date.now() >= windowEndAt`, mark the
|
||
target `skipped` with `error="delivery window closed"` and skip.
|
||
2. **Already-sent gate:** if the run-target row is already `sent`
|
||
(i.e. a retry is replaying), skip.
|
||
3. Acquire a token from the per-account rate limiter (default 40
|
||
msg/min, configurable `BOT_MAX_SEND_PER_MINUTE`).
|
||
4. `assertSessions(group)` — call once per group, cache for the run.
|
||
5. For each part in `reminder.messages`:
|
||
- text → `socket.sendMessage(jid, { text })`
|
||
- media → `socket.sendMessage(jid, uploadedMediaCache[mediaId])`
|
||
- sleep `jitter(200..500 ms)` between parts (replaces the rigid
|
||
1.5 s wait — preserves per-chat ordering at WA's natural pace).
|
||
6. Update the run-target row to `sent` with latency.
|
||
|
||
Final status:
|
||
|
||
- **success** — every target sent.
|
||
- **paused** — window closed mid-run with at least one target still in
|
||
`pending`. Run carries a resumable state: sent rows stay `sent`,
|
||
unstarted rows stay `pending` (NOT skipped), failed rows stay
|
||
`failed`. `error_summary` reads:
|
||
`"Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000
|
||
groups delivered, 588 still pending. Resume from the Activity tab.
|
||
If this happens repeatedly, consider offloading to another paired
|
||
account, or shrinking the message body / media size to fit more
|
||
groups in your daily window."`
|
||
- **partial** — every target was attempted; some sent and some
|
||
failed/skipped (group missing from DB, account offline, send error
|
||
inside the window). Not resumable; the failures are real failures.
|
||
- **failed** — zero sent. Either every send errored, or the run hit
|
||
the window close BEFORE the first send (run fired too late to do
|
||
any work; nothing to resume).
|
||
|
||
## Resume action
|
||
|
||
A paused run can be resumed by the operator. Mechanism:
|
||
|
||
- New server action `resumeReminderRunAction(runId)` validates
|
||
ownership, then enqueues a pg-boss job:
|
||
`boss.send("reminder.fire", { reminderId, runId })` with NO
|
||
singletonKey (resumes don't conflict with the reminder's normal
|
||
cron firing).
|
||
- The fire-reminder handler accepts an optional `runId` in its
|
||
payload. When present, it ATTACHES to that run instead of creating
|
||
a new one:
|
||
- Skips creating a new `reminder_runs` row.
|
||
- Loads the existing run's `reminder_run_targets` rows.
|
||
- Iterates only those with `status = 'pending'`.
|
||
- Re-uses the same windowEnd / rate limiter / media cache logic as
|
||
a fresh fire.
|
||
- On window close again, status flips back to `paused` with an
|
||
updated count.
|
||
- On success this round, status becomes `success` (if no failures
|
||
accumulated) or `partial` (if some failed).
|
||
- `failed` targets from the previous run are NOT retried on resume.
|
||
They're real errors — surfacing them as actionable in the UI is a
|
||
v2 concern (manual "retry failed" button).
|
||
|
||
UI surfaces of paused runs:
|
||
|
||
- Activity tab gets an amber "Paused" pill alongside the existing
|
||
Success/Partial/Failed/Skipped/Archived filters. Resume button
|
||
inline on each paused row.
|
||
- Reminder detail page's run history shows the same Resume button on
|
||
paused rows. A prominent banner at the top of the detail page
|
||
surfaces the latest paused run with two buttons side-by-side:
|
||
**Resume** (re-enqueues the run via the action above) and
|
||
**Cancel run** (marks the run `partial` so it stops appearing as
|
||
paused; pending targets flip to `skipped` with `error="canceled by
|
||
operator"`). The banner is the operator's "interactive"
|
||
resume/cancel choice referenced from the pause notification.
|
||
- The `reminder.fired` SSE event for status=paused triggers a
|
||
notification with title "Reminder paused" and body
|
||
`"X of Y groups delivered. Tap to resume or cancel."` Clicking the
|
||
notification deep-links to the detail page where the banner lives.
|
||
|
||
Note on the Notifications API: page-side `new Notification()` does
|
||
not support inline action buttons (only service-worker push
|
||
notifications do). The "interactive" choice is therefore one
|
||
click into the detail page — fewer surfaces to keep in sync, no
|
||
service worker required.
|
||
|
||
## Notification body
|
||
|
||
The existing `reminder.fired` SSE event already carries `{ status }`.
|
||
We extend it to carry `sent` and `total` counts so the notification
|
||
can be specific. The notification mapper:
|
||
|
||
- `success` → unchanged.
|
||
- `partial` → body mentions delivered/total counts when present.
|
||
- `paused` → headline `"Reminder paused"`, body
|
||
`"X of Y groups delivered. Tap to resume or cancel."` Click
|
||
takes the operator to the reminder's detail page where the
|
||
Resume / Cancel banner lives.
|
||
- `failed` → unchanged.
|
||
- `skipped` → still filtered (bookkeeping noise).
|
||
|
||
## Components
|
||
|
||
| File | Role | LOC est. |
|
||
| --- | --- | --- |
|
||
| `migrations/0008_*.sql` | add 2 int columns to `reminders` | <20 |
|
||
| `packages/db/src/schema.ts` | drizzle alignment | <10 |
|
||
| `apps/bot/src/scheduler/per-key-mutex.ts` (new) | accountId-keyed async mutex | ~40 |
|
||
| `apps/bot/src/scheduler/rate-limiter.ts` (new) | per-account token bucket | ~60 |
|
||
| `apps/bot/src/scheduler/media-upload-cache.ts` (new) | `prepareWAMessageMedia` results, keyed by mediaId | ~50 |
|
||
| `apps/bot/src/scheduler/delivery-window.ts` (new) | pure window-end calculator | ~30 |
|
||
| `apps/bot/src/scheduler/fire-reminder.ts` (rewrite) | new loop using all of the above | ~220 |
|
||
| `apps/bot/src/scheduler/reminder-jobs.ts` | `teamSize` config | <10 |
|
||
| `apps/bot/src/env.ts` | `BOT_FIRE_CONCURRENCY`, `BOT_MAX_SEND_PER_MINUTE`, `BOT_GROUP_CONCURRENCY` | <20 |
|
||
| `apps/web/src/actions/reminders.ts` | accept the two new fields + `resumeReminderRunAction` + `cancelReminderRunAction` | ~80 |
|
||
| `apps/web/src/components/reminder-wizard/when-form-client.tsx` | "Delivery hours" inputs | <40 |
|
||
| `apps/web/src/components/reminder-edit/edit-when-form.tsx` | same | <30 |
|
||
| `apps/web/src/lib/run-eta.ts` (new) | pure ETA calculator | ~40 |
|
||
| `apps/web/src/components/reminder-wizard/run-eta-pill.tsx` (new) | shared green/amber pill component | ~50 |
|
||
| `apps/web/src/components/reminder-detail/paused-run-banner.tsx` (new) | "Resume / Cancel run" banner | ~70 |
|
||
| `apps/web/src/lib/notifications.ts` | paused + partial body extension | <30 |
|
||
|
||
## Tests
|
||
|
||
- `delivery-window.test.ts` — pure function. Window in past →
|
||
next-day end; window crosses midnight (start > end) — explicitly
|
||
reject in the schema; timezone offsets handled correctly.
|
||
- `rate-limiter.test.ts` — fake-clock token bucket. N tokens drained,
|
||
then refill rate; backpressure via `acquire()` returning a Promise.
|
||
- `per-key-mutex.test.ts` — different keys do NOT block each other
|
||
(parallelism); same key DOES (serialisation); a throwing handler
|
||
releases the lock; cleanup removes empty entries.
|
||
- `media-upload-cache.test.ts` — mock socket: `prepare` called once
|
||
per unique mediaId regardless of how many groups consume it.
|
||
- `fire-reminder.test.ts` (extend) — window-end gate marks remaining
|
||
targets `skipped` (failed-from-the-start path) or leaves them
|
||
`pending` and resolves the run `paused` when at least one send
|
||
succeeded; resume re-attaches and only re-attempts `pending` rows.
|
||
- `run-eta.test.ts` — pure ETA helper: 1000 groups @ 40/min returns
|
||
~29 minutes (with the 15% buffer), edge cases (0 groups → 0,
|
||
rate=0 → throws, fractional minutes → rounded up).
|
||
- `notifications.test.ts` (extend) — `paused` body reads
|
||
`"X of Y groups delivered. Tap to resume or cancel."`; `partial`
|
||
body uses sent/total when present.
|
||
- `paused-run-banner.test.tsx` — banner only renders when the latest
|
||
run's status is `paused`; Resume click triggers the action;
|
||
Cancel click triggers the cancel action.
|
||
|
||
## Tuning knobs (env)
|
||
|
||
| Var | Default | Effect |
|
||
| --- | --- | --- |
|
||
| `BOT_FIRE_CONCURRENCY` | 8 | pg-boss worker pool size; max accounts running simultaneously |
|
||
| `BOT_GROUP_CONCURRENCY` | 3 | per-account parallel group sends |
|
||
| `BOT_MAX_SEND_PER_MINUTE` | 40 | per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response |
|
||
|
||
Per-reminder `delivery_window_start_hour` / `delivery_window_end_hour`
|
||
default to 6/18 and can be widened (e.g. 0/24) for a specific big run.
|
||
|
||
## Out of scope (v2 candidates)
|
||
|
||
- **Crash resumability across bot restarts.** If the bot dies
|
||
mid-fan-out (mid-window), pg-boss will retry the job; the loop's
|
||
pre-loaded `pending` rows still pick up correctly, but the
|
||
in-memory rate-limiter and upload-cache state are gone — the
|
||
retry re-uploads media and starts pacing from a full bucket. The
|
||
paused-state resumability covered above is a different mechanism:
|
||
it handles the "window closed cleanly" case end-to-end. The
|
||
"bot crashed mid-window" case is degraded but not broken.
|
||
- **Auto-resume next morning** when window opens again (today the
|
||
operator clicks Resume manually).
|
||
- **Pause-by-operator** (only window-close pauses; user-triggered
|
||
pause mid-fan-out isn't wired).
|
||
- **Retry-failed-targets** action (paused-resume only re-attempts
|
||
`pending` rows; `failed` rows stay failed).
|
||
- **Native push action buttons** (would require a service worker +
|
||
push endpoint; v1 keeps the resume/cancel choice on the detail
|
||
page, one click away from the notification).
|
||
- **Adaptive ETA from observed rate** (today the ETA uses the
|
||
configured `BOT_MAX_SEND_PER_MINUTE`; a v2 could feed back the
|
||
actual sustained rate from prior runs).
|
||
- **Multi-account auto-split** of a single reminder.
|
||
- **Adaptive rate limiting** (auto-back-off on WA rate-limit response
|
||
codes; today the operator tunes the env var).
|
||
|
||
## Acceptance
|
||
|
||
- 1000-group reminder with one image, established account: completes
|
||
in roughly 30–50 minutes, comfortably inside a 6am–6pm window.
|
||
- Wizard Review shows ETA pill before submit. Setting an end hour
|
||
that won't fit flips the pill amber and surfaces the "Likely to
|
||
pause" hint; the operator can still proceed.
|
||
- Two reminders on different accounts firing within seconds of each
|
||
other: both progress simultaneously, neither blocks the other.
|
||
- A run that hits the window end mid-fan-out: stops cleanly, marks
|
||
the run `paused`, leaves un-started targets as `pending`, surfaces
|
||
the paused-status notification with delivered/total counts.
|
||
- The detail page surfaces a Resume / Cancel banner for the paused
|
||
run. **Resume** re-enqueues; if it pauses again, the banner
|
||
re-appears with an updated count. **Cancel run** flips remaining
|
||
targets to `skipped` and resolves the run `partial`; banner
|
||
disappears.
|
||
- A run that hits the window end BEFORE any send (fired too late):
|
||
resolves `failed`, no resume offered.
|
||
- 355 existing tests still pass; ≈40 new tests cover the new helpers,
|
||
the paused/resume flow, the ETA preview, and the banner.
|