cm_whatsapp_bot_v1/docs/superpowers/specs/2026-05-10-windowed-fanout-design.md
yiekheng 082a70db06 docs: consolidate windowed-fanout spec/plan with ETA + paused/resume
Folds in three rounds of requirement evolution:

* Pause/resume on window close (was stop-and-report-partial).
* ETA preview pill at compose / edit time so the operator sees
  whether their chosen window will fit before scheduling.
* Interactive paused-run banner with Resume / Cancel buttons on the
  detail page; pause notification deep-links to it.

Helper relocations:

* windowEndAt() moves to packages/shared so both bot fire-reminder
  and the web ETA pill can import the same calculator.

Plan grows from 8 to 10 tasks: adds Task 9 (run-eta + RunEtaPill,
TDD) and Task 10 (resume/cancel actions + PausedRunBanner).
Acceptance gains two paused-flow smoke tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 14:33:51 +08:00

351 lines
17 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Windowed, Pacing-Safe Reminder Fan-Out
> Design spec for a faster, ban-safe, multi-account-friendly reminder
> delivery loop. Written 2026-05-10. Implementation tracked in a
> follow-up plan doc.
## Goal
Deliver a reminder to many groups (target: 1000+) safely within a
per-reminder delivery window. If we cannot finish in the window,
**pause** the run at window-close, persist progress, and let the
operator **resume** later from the Activity / detail view. The
paused-status message tells the operator what's blocking throughput
(account at capacity, media size eating the budget) so they can
decide whether to offload to another paired account, shrink the
attachment, or just resume the next morning at 6am.
## Constraints
- WhatsApp's anti-spam is the dominant ceiling. For an established
account (years of legit history), 3060 sends/minute is the
sustainable safe band; tighter for newer accounts.
- The system runs on a single bot process talking to multiple paired
WhatsApp accounts. Each account's Baileys socket is independent.
- Two simultaneous fan-outs on the **same** WhatsApp account would
double its effective send rate and risk a ban.
- The operator dropped multi-account fan-out (one reminder splitting
across N accounts) earlier this week. We respect that decision —
this design does not automatically split work across accounts.
## Approach (selected: B)
**A. Minimal pacing fix.** Drop the rigid 1.5s sleep, add a
token-bucket rate limiter, add window-end check, cache DB lookups.
Wins ≈30% on text-only reminders; very little on media-heavy ones.
**B. Pacing + media-upload cache + bounded concurrency.** Everything
in A, plus upload each unique media file ONCE per run via Baileys'
`prepareWAMessageMedia` and reuse the resulting `WAMediaUpload`
payload for every group send. Run up to N groups in parallel within
one account (parts within a group stay serial so order is preserved).
Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of
upload turns into 5 MB. **Recommended.**
**C. Multi-account fan-out** — dropped per operator decision.
## Per-account isolation (cross-account parallelism)
Today `boss.work()` is called with default `teamSize=1`, so a single
fan-out monopolises the whole bot. Two reminders on **different**
accounts queue serially, which surprises the operator.
The new model is **per-account serialization, cross-account
parallelism**:
- `teamSize` raised so multiple reminders on different accounts run
simultaneously.
- A per-key async mutex keyed by `accountId` wraps the inner work, so
two reminders on the **same** account take turns.
- The token-bucket rate limiter is per-account too, so one account's
pacing budget never throttles another.
```
pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8)
├─ R1 (account A) ──┐
│ ├─ per-account-A mutex ──→ serialised within A
├─ R3 (account A) ──┘
├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's
└─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B
```
## Delivery window
Each reminder gets a window in its operator timezone. If the run
cannot finish inside the window, send what we can and stop.
- New columns on `reminders`:
- `delivery_window_start_hour int default 6`
- `delivery_window_end_hour int default 18`
- Both interpreted in the row's existing `timezone` column.
- Validation: `0 ≤ start < end ≤ 24`. Cross-midnight windows (e.g.
22 → 06) are rejected in v1 to keep the math obvious; can be added
later if anyone needs them.
- UI uses two number inputs in the When step (and edit-when page).
- `delivery-window.ts` exports a pure helper:
`windowEndAt(timezone, endHour, fireAt) → Date`. Returns the
end-of-window timestamp for the calendar day `fireAt` falls on,
in the given timezone. If `fireAt` is already past that day's
end-hour, the returned timestamp is in the past — the run loop's
first iteration sees `now() >= windowEndAt`, marks every target
`skipped`, and the run resolves to `failed` (zero sent). That's
the right behaviour: "we can't send after window close, even one
message".
- **Only the end hour is enforced at runtime in v1.** The start
hour is documented on the row but not gated — operators schedule
fire times that fall in their band naturally (cron + the picker's
default 09:00 time fields land inside 0618). Enforcing the start
too would mean holding messages from a 4am cron miss-fire until
6am, which is a v2 conversation.
## Estimated finish time (ETA preview)
The operator sets the delivery window, but they need a feel for
whether the window is *enough*. We surface an ETA at compose and
edit time so they can widen the window (or shrink the run) before
hitting Schedule.
- A pure helper `estimateRunDuration({ targetCount, ratePerMinute })`
returns `{ durationMinutes, estimatedFinishAt }` given a fire time.
Calculation: `ceil(targetCount / ratePerMinute)` minutes plus a
15% buffer for per-group setup latency, with a 1-minute floor.
- Default `ratePerMinute` reads `BOT_MAX_SEND_PER_MINUTE` (40); the
number is hard-coded into the web bundle as a constant — operators
who tune the bot env are responsible for redeploying web. The web
side does NOT read bot env directly.
- Displayed in two places:
- **Wizard Review step**, between the recipients summary and the
Schedule button:
`"~28 minutes · finishes ~10:32 (Asia/Kuala_Lumpur)"`
- **Edit Groups** and **Edit When** pages, near the save button.
- Style:
- Green pill `"Fits in window"` when `estimatedFinishAt <= windowEndAt`.
- Amber pill `"Likely to pause"` when it doesn't, with a one-line
suggestion: *"Widen the window or split into smaller runs."*
- The ETA is advisory, not a hard gate — the operator can still
schedule a run that's likely to pause; pause-and-resume covers
that case. The ETA just removes the surprise.
## Run loop changes (`fire-reminder.ts`)
Up-front, once per run:
1. Load all `reminder.targets`, `reminder.messages`, and referenced
`media_files` rows into in-memory Maps. Drops ~3000 round-trips to
~3 round-trips for a 1000-group run.
2. Pre-create every `reminder_run_targets` row with
`status = "pending"` so progress is observable from the Activity
tab while the fan-out is mid-flight.
3. **Pre-upload each unique media** via Baileys'
`prepareWAMessageMedia`. Cache the resulting `WAMediaUpload`
payload keyed by `mediaId` for the duration of the run.
4. Compute `windowEndAt` and stash it.
Per-target (limited to `BOT_GROUP_CONCURRENCY` parallel groups,
default 3):
1. **Window-end gate:** if `Date.now() >= windowEndAt`, mark the
target `skipped` with `error="delivery window closed"` and skip.
2. **Already-sent gate:** if the run-target row is already `sent`
(i.e. a retry is replaying), skip.
3. Acquire a token from the per-account rate limiter (default 40
msg/min, configurable `BOT_MAX_SEND_PER_MINUTE`).
4. `assertSessions(group)` — call once per group, cache for the run.
5. For each part in `reminder.messages`:
- text → `socket.sendMessage(jid, { text })`
- media → `socket.sendMessage(jid, uploadedMediaCache[mediaId])`
- sleep `jitter(200..500 ms)` between parts (replaces the rigid
1.5 s wait — preserves per-chat ordering at WA's natural pace).
6. Update the run-target row to `sent` with latency.
Final status:
- **success** — every target sent.
- **paused** — window closed mid-run with at least one target still in
`pending`. Run carries a resumable state: sent rows stay `sent`,
unstarted rows stay `pending` (NOT skipped), failed rows stay
`failed`. `error_summary` reads:
`"Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000
groups delivered, 588 still pending. Resume from the Activity tab.
If this happens repeatedly, consider offloading to another paired
account, or shrinking the message body / media size to fit more
groups in your daily window."`
- **partial** — every target was attempted; some sent and some
failed/skipped (group missing from DB, account offline, send error
inside the window). Not resumable; the failures are real failures.
- **failed** — zero sent. Either every send errored, or the run hit
the window close BEFORE the first send (run fired too late to do
any work; nothing to resume).
## Resume action
A paused run can be resumed by the operator. Mechanism:
- New server action `resumeReminderRunAction(runId)` validates
ownership, then enqueues a pg-boss job:
`boss.send("reminder.fire", { reminderId, runId })` with NO
singletonKey (resumes don't conflict with the reminder's normal
cron firing).
- The fire-reminder handler accepts an optional `runId` in its
payload. When present, it ATTACHES to that run instead of creating
a new one:
- Skips creating a new `reminder_runs` row.
- Loads the existing run's `reminder_run_targets` rows.
- Iterates only those with `status = 'pending'`.
- Re-uses the same windowEnd / rate limiter / media cache logic as
a fresh fire.
- On window close again, status flips back to `paused` with an
updated count.
- On success this round, status becomes `success` (if no failures
accumulated) or `partial` (if some failed).
- `failed` targets from the previous run are NOT retried on resume.
They're real errors — surfacing them as actionable in the UI is a
v2 concern (manual "retry failed" button).
UI surfaces of paused runs:
- Activity tab gets an amber "Paused" pill alongside the existing
Success/Partial/Failed/Skipped/Archived filters. Resume button
inline on each paused row.
- Reminder detail page's run history shows the same Resume button on
paused rows. A prominent banner at the top of the detail page
surfaces the latest paused run with two buttons side-by-side:
**Resume** (re-enqueues the run via the action above) and
**Cancel run** (marks the run `partial` so it stops appearing as
paused; pending targets flip to `skipped` with `error="canceled by
operator"`). The banner is the operator's "interactive"
resume/cancel choice referenced from the pause notification.
- The `reminder.fired` SSE event for status=paused triggers a
notification with title "Reminder paused" and body
`"X of Y groups delivered. Tap to resume or cancel."` Clicking the
notification deep-links to the detail page where the banner lives.
Note on the Notifications API: page-side `new Notification()` does
not support inline action buttons (only service-worker push
notifications do). The "interactive" choice is therefore one
click into the detail page — fewer surfaces to keep in sync, no
service worker required.
## Notification body
The existing `reminder.fired` SSE event already carries `{ status }`.
We extend it to carry `sent` and `total` counts so the notification
can be specific. The notification mapper:
- `success` → unchanged.
- `partial` → body mentions delivered/total counts when present.
- `paused` → headline `"Reminder paused"`, body
`"X of Y groups delivered. Tap to resume or cancel."` Click
takes the operator to the reminder's detail page where the
Resume / Cancel banner lives.
- `failed` → unchanged.
- `skipped` → still filtered (bookkeeping noise).
## Components
| File | Role | LOC est. |
| --- | --- | --- |
| `migrations/0008_*.sql` | add 2 int columns to `reminders` | <20 |
| `packages/db/src/schema.ts` | drizzle alignment | <10 |
| `apps/bot/src/scheduler/per-key-mutex.ts` (new) | accountId-keyed async mutex | ~40 |
| `apps/bot/src/scheduler/rate-limiter.ts` (new) | per-account token bucket | ~60 |
| `apps/bot/src/scheduler/media-upload-cache.ts` (new) | `prepareWAMessageMedia` results, keyed by mediaId | ~50 |
| `apps/bot/src/scheduler/delivery-window.ts` (new) | pure window-end calculator | ~30 |
| `apps/bot/src/scheduler/fire-reminder.ts` (rewrite) | new loop using all of the above | ~220 |
| `apps/bot/src/scheduler/reminder-jobs.ts` | `teamSize` config | <10 |
| `apps/bot/src/env.ts` | `BOT_FIRE_CONCURRENCY`, `BOT_MAX_SEND_PER_MINUTE`, `BOT_GROUP_CONCURRENCY` | <20 |
| `apps/web/src/actions/reminders.ts` | accept the two new fields + `resumeReminderRunAction` + `cancelReminderRunAction` | ~80 |
| `apps/web/src/components/reminder-wizard/when-form-client.tsx` | "Delivery hours" inputs | <40 |
| `apps/web/src/components/reminder-edit/edit-when-form.tsx` | same | <30 |
| `apps/web/src/lib/run-eta.ts` (new) | pure ETA calculator | ~40 |
| `apps/web/src/components/reminder-wizard/run-eta-pill.tsx` (new) | shared green/amber pill component | ~50 |
| `apps/web/src/components/reminder-detail/paused-run-banner.tsx` (new) | "Resume / Cancel run" banner | ~70 |
| `apps/web/src/lib/notifications.ts` | paused + partial body extension | <30 |
## Tests
- `delivery-window.test.ts` pure function. Window in past
next-day end; window crosses midnight (start > end) — explicitly
reject in the schema; timezone offsets handled correctly.
- `rate-limiter.test.ts` — fake-clock token bucket. N tokens drained,
then refill rate; backpressure via `acquire()` returning a Promise.
- `per-key-mutex.test.ts` — different keys do NOT block each other
(parallelism); same key DOES (serialisation); a throwing handler
releases the lock; cleanup removes empty entries.
- `media-upload-cache.test.ts` — mock socket: `prepare` called once
per unique mediaId regardless of how many groups consume it.
- `fire-reminder.test.ts` (extend) — window-end gate marks remaining
targets `skipped` (failed-from-the-start path) or leaves them
`pending` and resolves the run `paused` when at least one send
succeeded; resume re-attaches and only re-attempts `pending` rows.
- `run-eta.test.ts` — pure ETA helper: 1000 groups @ 40/min returns
~29 minutes (with the 15% buffer), edge cases (0 groups → 0,
rate=0 → throws, fractional minutes → rounded up).
- `notifications.test.ts` (extend) — `paused` body reads
`"X of Y groups delivered. Tap to resume or cancel."`; `partial`
body uses sent/total when present.
- `paused-run-banner.test.tsx` — banner only renders when the latest
run's status is `paused`; Resume click triggers the action;
Cancel click triggers the cancel action.
## Tuning knobs (env)
| Var | Default | Effect |
| --- | --- | --- |
| `BOT_FIRE_CONCURRENCY` | 8 | pg-boss worker pool size; max accounts running simultaneously |
| `BOT_GROUP_CONCURRENCY` | 3 | per-account parallel group sends |
| `BOT_MAX_SEND_PER_MINUTE` | 40 | per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response |
Per-reminder `delivery_window_start_hour` / `delivery_window_end_hour`
default to 6/18 and can be widened (e.g. 0/24) for a specific big run.
## Out of scope (v2 candidates)
- **Crash resumability across bot restarts.** If the bot dies
mid-fan-out (mid-window), pg-boss will retry the job; the loop's
pre-loaded `pending` rows still pick up correctly, but the
in-memory rate-limiter and upload-cache state are gone — the
retry re-uploads media and starts pacing from a full bucket. The
paused-state resumability covered above is a different mechanism:
it handles the "window closed cleanly" case end-to-end. The
"bot crashed mid-window" case is degraded but not broken.
- **Auto-resume next morning** when window opens again (today the
operator clicks Resume manually).
- **Pause-by-operator** (only window-close pauses; user-triggered
pause mid-fan-out isn't wired).
- **Retry-failed-targets** action (paused-resume only re-attempts
`pending` rows; `failed` rows stay failed).
- **Native push action buttons** (would require a service worker +
push endpoint; v1 keeps the resume/cancel choice on the detail
page, one click away from the notification).
- **Adaptive ETA from observed rate** (today the ETA uses the
configured `BOT_MAX_SEND_PER_MINUTE`; a v2 could feed back the
actual sustained rate from prior runs).
- **Multi-account auto-split** of a single reminder.
- **Adaptive rate limiting** (auto-back-off on WA rate-limit response
codes; today the operator tunes the env var).
## Acceptance
- 1000-group reminder with one image, established account: completes
in roughly 3050 minutes, comfortably inside a 6am6pm window.
- Wizard Review shows ETA pill before submit. Setting an end hour
that won't fit flips the pill amber and surfaces the "Likely to
pause" hint; the operator can still proceed.
- Two reminders on different accounts firing within seconds of each
other: both progress simultaneously, neither blocks the other.
- A run that hits the window end mid-fan-out: stops cleanly, marks
the run `paused`, leaves un-started targets as `pending`, surfaces
the paused-status notification with delivered/total counts.
- The detail page surfaces a Resume / Cancel banner for the paused
run. **Resume** re-enqueues; if it pauses again, the banner
re-appears with an updated count. **Cancel run** flips remaining
targets to `skipped` and resolves the run `partial`; banner
disappears.
- A run that hits the window end BEFORE any send (fired too late):
resolves `failed`, no resume offered.
- 355 existing tests still pass; ≈40 new tests cover the new helpers,
the paused/resume flow, the ETA preview, and the banner.