cm_whatsapp_bot_v1/docs/superpowers/specs/2026-05-10-windowed-fanout-design.md
yiekheng 50187a86e1 docs: design spec — windowed, pacing-safe reminder fan-out
Records the design decisions for the next planned work:

  - Per-reminder delivery window (default 6am–6pm, operator timezone).
    Window-close hard-stops the run; remaining targets become
    skipped; status reports as partial with a clear "this account is
    at capacity, consider another paired account" message.
  - Per-account isolation via pg-boss teamSize ≥ N + an in-process
    PerKeyMutex keyed by accountId. Different accounts run in
    parallel; the same account serialises (no double-rate sends
    that would risk a ban).
  - Per-account token-bucket rate limiter (default 40 msg/min,
    BOT_MAX_SEND_PER_MINUTE).
  - Up-front media-upload cache via prepareWAMessageMedia: 1000
    groups × 5 MB upload turns into 5 MB. Biggest single win for
    text+picture reminders.
  - Bounded group concurrency (default 3 in-flight per account);
    parts-within-a-group stay serial for visible message order.
  - Pre-fetched DB Maps (groups / messages / media), no inner-loop
    round-trips.
  - Replaces the rigid 1.5 s inter-part sleep with 200–500 ms
    jitter; the per-account rate-limiter is the real gate.

Out of scope for v1 (documented under "v2 candidates"): cross-day
window resume, mid-restart resumability, multi-account auto-split,
adaptive rate-limit back-off, pause/resume mid-run.

Acceptance: 1000-group reminder + one image, established account
finishes in ~30–50 minutes, well inside a 6am–6pm window. Two
reminders on different accounts at the same wall-clock minute
both progress in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 14:01:24 +08:00

217 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Windowed, Pacing-Safe Reminder Fan-Out
> Design spec for a faster, ban-safe, multi-account-friendly reminder
> delivery loop. Written 2026-05-10. Implementation tracked in a
> follow-up plan doc.
## Goal
Deliver a reminder to many groups (target: 1000+) safely within a
per-reminder delivery window. If we cannot finish in the window, stop,
mark the run `partial`, and tell the operator the account is at
capacity for this fan-out.
## Constraints
- WhatsApp's anti-spam is the dominant ceiling. For an established
account (years of legit history), 3060 sends/minute is the
sustainable safe band; tighter for newer accounts.
- The system runs on a single bot process talking to multiple paired
WhatsApp accounts. Each account's Baileys socket is independent.
- Two simultaneous fan-outs on the **same** WhatsApp account would
double its effective send rate and risk a ban.
- The operator dropped multi-account fan-out (one reminder splitting
across N accounts) earlier this week. We respect that decision —
this design does not automatically split work across accounts.
## Approach (selected: B)
**A. Minimal pacing fix.** Drop the rigid 1.5s sleep, add a
token-bucket rate limiter, add window-end check, cache DB lookups.
Wins ≈30% on text-only reminders; very little on media-heavy ones.
**B. Pacing + media-upload cache + bounded concurrency.** Everything
in A, plus upload each unique media file ONCE per run via Baileys'
`prepareWAMessageMedia` and reuse the resulting `WAMediaUpload`
payload for every group send. Run up to N groups in parallel within
one account (parts within a group stay serial so order is preserved).
Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of
upload turns into 5 MB. **Recommended.**
**C. Multi-account fan-out** — dropped per operator decision.
## Per-account isolation (cross-account parallelism)
Today `boss.work()` is called with default `teamSize=1`, so a single
fan-out monopolises the whole bot. Two reminders on **different**
accounts queue serially, which surprises the operator.
The new model is **per-account serialization, cross-account
parallelism**:
- `teamSize` raised so multiple reminders on different accounts run
simultaneously.
- A per-key async mutex keyed by `accountId` wraps the inner work, so
two reminders on the **same** account take turns.
- The token-bucket rate limiter is per-account too, so one account's
pacing budget never throttles another.
```
pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8)
├─ R1 (account A) ──┐
│ ├─ per-account-A mutex ──→ serialised within A
├─ R3 (account A) ──┘
├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's
└─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B
```
## Delivery window
Each reminder gets a window in its operator timezone. If the run
cannot finish inside the window, send what we can and stop.
- New columns on `reminders`:
- `delivery_window_start_hour int default 6`
- `delivery_window_end_hour int default 18`
- Both interpreted in the row's existing `timezone` column.
- Validation: `0 ≤ start < end ≤ 24`. Cross-midnight windows (e.g.
22 → 06) are rejected in v1 to keep the math obvious; can be added
later if anyone needs them.
- UI uses two number inputs in the When step (and edit-when page).
- `delivery-window.ts` exports a pure helper:
`windowEndAt(timezone, endHour, fireAt) → Date`. Returns the
end-of-window timestamp for the calendar day `fireAt` falls on,
in the given timezone. If `fireAt` is already past that day's
end-hour, the returned timestamp is in the past — the run loop's
first iteration sees `now() >= windowEndAt`, marks every target
`skipped`, and the run resolves to `failed` (zero sent). That's
the right behaviour: "we can't send after window close, even one
message".
- **Only the end hour is enforced at runtime in v1.** The start
hour is documented on the row but not gated — operators schedule
fire times that fall in their band naturally (cron + the picker's
default 09:00 time fields land inside 0618). Enforcing the start
too would mean holding messages from a 4am cron miss-fire until
6am, which is a v2 conversation.
## Run loop changes (`fire-reminder.ts`)
Up-front, once per run:
1. Load all `reminder.targets`, `reminder.messages`, and referenced
`media_files` rows into in-memory Maps. Drops ~3000 round-trips to
~3 round-trips for a 1000-group run.
2. Pre-create every `reminder_run_targets` row with
`status = "pending"` so progress is observable from the Activity
tab while the fan-out is mid-flight.
3. **Pre-upload each unique media** via Baileys'
`prepareWAMessageMedia`. Cache the resulting `WAMediaUpload`
payload keyed by `mediaId` for the duration of the run.
4. Compute `windowEndAt` and stash it.
Per-target (limited to `BOT_GROUP_CONCURRENCY` parallel groups,
default 3):
1. **Window-end gate:** if `Date.now() >= windowEndAt`, mark the
target `skipped` with `error="delivery window closed"` and skip.
2. **Already-sent gate:** if the run-target row is already `sent`
(i.e. a retry is replaying), skip.
3. Acquire a token from the per-account rate limiter (default 40
msg/min, configurable `BOT_MAX_SEND_PER_MINUTE`).
4. `assertSessions(group)` — call once per group, cache for the run.
5. For each part in `reminder.messages`:
- text → `socket.sendMessage(jid, { text })`
- media → `socket.sendMessage(jid, uploadedMediaCache[mediaId])`
- sleep `jitter(200..500 ms)` between parts (replaces the rigid
1.5 s wait — preserves per-chat ordering at WA's natural pace).
6. Update the run-target row to `sent` with latency.
Final status:
- **success** — every target sent.
- **partial** — at least one sent, at least one not (window-close,
failed, missing group). `error_summary` reads:
`"Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000
groups delivered. This account is at capacity for this fan-out —
consider sending the remainder from another paired account."`
- **failed** — zero sent.
## Notification body
The existing `reminder.fired` SSE event already carries
`{ status }`. The web's notification mapper already handles
`partial` with a "see activity" hint. The body extends to mention
"X of Y delivered" when status === "partial".
## Components
| File | Role | LOC est. |
| --- | --- | --- |
| `migrations/0008_*.sql` | add 2 int columns to `reminders` | <20 |
| `packages/db/src/schema.ts` | drizzle alignment | <10 |
| `apps/bot/src/scheduler/per-key-mutex.ts` (new) | accountId-keyed async mutex | ~40 |
| `apps/bot/src/scheduler/rate-limiter.ts` (new) | per-account token bucket | ~60 |
| `apps/bot/src/scheduler/media-upload-cache.ts` (new) | `prepareWAMessageMedia` results, keyed by mediaId | ~50 |
| `apps/bot/src/scheduler/delivery-window.ts` (new) | pure window-end calculator | ~30 |
| `apps/bot/src/scheduler/fire-reminder.ts` (rewrite) | new loop using all of the above | ~200 |
| `apps/bot/src/scheduler/reminder-jobs.ts` | `teamSize` config | <10 |
| `apps/bot/src/env.ts` | `BOT_FIRE_CONCURRENCY`, `BOT_MAX_SEND_PER_MINUTE`, `BOT_GROUP_CONCURRENCY` | <20 |
| `apps/web/src/actions/reminders.ts` | accept the two new fields | <30 |
| `apps/web/src/components/reminder-wizard/when-form-client.tsx` | "Delivery hours" inputs | <40 |
| `apps/web/src/components/reminder-edit/edit-when-form.tsx` | same | <30 |
| `apps/web/src/lib/notifications.ts` | partial-status body extension | <15 |
## Tests
- `delivery-window.test.ts` pure function. Window in past
next-day end; window crosses midnight (start > end) — explicitly
reject in the schema; timezone offsets handled correctly.
- `rate-limiter.test.ts` — fake-clock token bucket. N tokens drained,
then refill rate; backpressure via `acquire()` returning a Promise.
- `per-key-mutex.test.ts` — different keys do NOT block each other
(parallelism); same key DOES (serialisation); a throwing handler
releases the lock; cleanup removes empty entries.
- `media-upload-cache.test.ts` — mock socket: `prepare` called once
per unique mediaId regardless of how many groups consume it.
- `fire-reminder.test.ts` (extend) — window-end gate marks remaining
targets `skipped`; partial-status error_summary includes account /
delivered / total context.
## Tuning knobs (env)
| Var | Default | Effect |
| --- | --- | --- |
| `BOT_FIRE_CONCURRENCY` | 8 | pg-boss worker pool size; max accounts running simultaneously |
| `BOT_GROUP_CONCURRENCY` | 3 | per-account parallel group sends |
| `BOT_MAX_SEND_PER_MINUTE` | 40 | per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response |
Per-reminder `delivery_window_start_hour` / `delivery_window_end_hour`
default to 6/18 and can be widened (e.g. 0/24) for a specific big run.
## Out of scope (v2 candidates)
- **Crash resumability across bot restarts.** Today, if the bot dies
mid-fan-out, pg-boss will retry the job; the loop will skip any
rows already marked `sent`, but the in-memory rate-limiter and
upload-cache state are gone — meaning the retry uploads media
again and starts pacing from a full bucket. Acceptable for v1.
- **Pause / resume mid-run** controls.
- **Cross-day window resume** (current design hard-stops at window
end and reports partial; doesn't queue the remainder for tomorrow).
- **Multi-account auto-split** of a single reminder.
- **Adaptive rate limiting** (auto-back-off on WA rate-limit response
codes; today the operator tunes the env var).
## Acceptance
- 1000-group reminder with one image, established account: completes
in roughly 3050 minutes, comfortably inside a 6am6pm window.
- Two reminders on different accounts firing within seconds of each
other: both progress simultaneously, neither blocks the other.
- A run that hits the window end: stops cleanly, marks remaining as
skipped, surfaces the partial-status message in the Activity tab
and via the browser notification.
- 355 existing tests still pass; ≈25 new tests cover the new helpers.