docs: design spec — windowed, pacing-safe reminder fan-out

Records the design decisions for the next planned work:

  - Per-reminder delivery window (default 6am–6pm, operator timezone).
    Window-close hard-stops the run; remaining targets become
    skipped; status reports as partial with a clear "this account is
    at capacity, consider another paired account" message.
  - Per-account isolation via pg-boss teamSize ≥ N + an in-process
    PerKeyMutex keyed by accountId. Different accounts run in
    parallel; the same account serialises (no double-rate sends
    that would risk a ban).
  - Per-account token-bucket rate limiter (default 40 msg/min,
    BOT_MAX_SEND_PER_MINUTE).
  - Up-front media-upload cache via prepareWAMessageMedia: 1000
    groups × 5 MB upload turns into 5 MB. Biggest single win for
    text+picture reminders.
  - Bounded group concurrency (default 3 in-flight per account);
    parts-within-a-group stay serial for visible message order.
  - Pre-fetched DB Maps (groups / messages / media), no inner-loop
    round-trips.
  - Replaces the rigid 1.5 s inter-part sleep with 200–500 ms
    jitter; the per-account rate-limiter is the real gate.

Out of scope for v1 (documented under "v2 candidates"): cross-day
window resume, mid-restart resumability, multi-account auto-split,
adaptive rate-limit back-off, pause/resume mid-run.

Acceptance: 1000-group reminder + one image, established account
finishes in ~30–50 minutes, well inside a 6am–6pm window. Two
reminders on different accounts at the same wall-clock minute
both progress in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
yiekheng 2026-05-10 14:01:24 +08:00
parent 5d91b904f2
commit 50187a86e1

View File

@ -0,0 +1,216 @@
# Windowed, Pacing-Safe Reminder Fan-Out
> Design spec for a faster, ban-safe, multi-account-friendly reminder
> delivery loop. Written 2026-05-10. Implementation tracked in a
> follow-up plan doc.
## Goal
Deliver a reminder to many groups (target: 1000+) safely within a
per-reminder delivery window. If we cannot finish in the window, stop,
mark the run `partial`, and tell the operator the account is at
capacity for this fan-out.
## Constraints
- WhatsApp's anti-spam is the dominant ceiling. For an established
account (years of legit history), 3060 sends/minute is the
sustainable safe band; tighter for newer accounts.
- The system runs on a single bot process talking to multiple paired
WhatsApp accounts. Each account's Baileys socket is independent.
- Two simultaneous fan-outs on the **same** WhatsApp account would
double its effective send rate and risk a ban.
- The operator dropped multi-account fan-out (one reminder splitting
across N accounts) earlier this week. We respect that decision —
this design does not automatically split work across accounts.
## Approach (selected: B)
**A. Minimal pacing fix.** Drop the rigid 1.5s sleep, add a
token-bucket rate limiter, add window-end check, cache DB lookups.
Wins ≈30% on text-only reminders; very little on media-heavy ones.
**B. Pacing + media-upload cache + bounded concurrency.** Everything
in A, plus upload each unique media file ONCE per run via Baileys'
`prepareWAMessageMedia` and reuse the resulting `WAMediaUpload`
payload for every group send. Run up to N groups in parallel within
one account (parts within a group stay serial so order is preserved).
Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of
upload turns into 5 MB. **Recommended.**
**C. Multi-account fan-out** — dropped per operator decision.
## Per-account isolation (cross-account parallelism)
Today `boss.work()` is called with default `teamSize=1`, so a single
fan-out monopolises the whole bot. Two reminders on **different**
accounts queue serially, which surprises the operator.
The new model is **per-account serialization, cross-account
parallelism**:
- `teamSize` raised so multiple reminders on different accounts run
simultaneously.
- A per-key async mutex keyed by `accountId` wraps the inner work, so
two reminders on the **same** account take turns.
- The token-bucket rate limiter is per-account too, so one account's
pacing budget never throttles another.
```
pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8)
├─ R1 (account A) ──┐
│ ├─ per-account-A mutex ──→ serialised within A
├─ R3 (account A) ──┘
├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's
└─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B
```
## Delivery window
Each reminder gets a window in its operator timezone. If the run
cannot finish inside the window, send what we can and stop.
- New columns on `reminders`:
- `delivery_window_start_hour int default 6`
- `delivery_window_end_hour int default 18`
- Both interpreted in the row's existing `timezone` column.
- Validation: `0 ≤ start < end ≤ 24`. Cross-midnight windows (e.g.
22 → 06) are rejected in v1 to keep the math obvious; can be added
later if anyone needs them.
- UI uses two number inputs in the When step (and edit-when page).
- `delivery-window.ts` exports a pure helper:
`windowEndAt(timezone, endHour, fireAt) → Date`. Returns the
end-of-window timestamp for the calendar day `fireAt` falls on,
in the given timezone. If `fireAt` is already past that day's
end-hour, the returned timestamp is in the past — the run loop's
first iteration sees `now() >= windowEndAt`, marks every target
`skipped`, and the run resolves to `failed` (zero sent). That's
the right behaviour: "we can't send after window close, even one
message".
- **Only the end hour is enforced at runtime in v1.** The start
hour is documented on the row but not gated — operators schedule
fire times that fall in their band naturally (cron + the picker's
default 09:00 time fields land inside 0618). Enforcing the start
too would mean holding messages from a 4am cron miss-fire until
6am, which is a v2 conversation.
## Run loop changes (`fire-reminder.ts`)
Up-front, once per run:
1. Load all `reminder.targets`, `reminder.messages`, and referenced
`media_files` rows into in-memory Maps. Drops ~3000 round-trips to
~3 round-trips for a 1000-group run.
2. Pre-create every `reminder_run_targets` row with
`status = "pending"` so progress is observable from the Activity
tab while the fan-out is mid-flight.
3. **Pre-upload each unique media** via Baileys'
`prepareWAMessageMedia`. Cache the resulting `WAMediaUpload`
payload keyed by `mediaId` for the duration of the run.
4. Compute `windowEndAt` and stash it.
Per-target (limited to `BOT_GROUP_CONCURRENCY` parallel groups,
default 3):
1. **Window-end gate:** if `Date.now() >= windowEndAt`, mark the
target `skipped` with `error="delivery window closed"` and skip.
2. **Already-sent gate:** if the run-target row is already `sent`
(i.e. a retry is replaying), skip.
3. Acquire a token from the per-account rate limiter (default 40
msg/min, configurable `BOT_MAX_SEND_PER_MINUTE`).
4. `assertSessions(group)` — call once per group, cache for the run.
5. For each part in `reminder.messages`:
- text → `socket.sendMessage(jid, { text })`
- media → `socket.sendMessage(jid, uploadedMediaCache[mediaId])`
- sleep `jitter(200..500 ms)` between parts (replaces the rigid
1.5 s wait — preserves per-chat ordering at WA's natural pace).
6. Update the run-target row to `sent` with latency.
Final status:
- **success** — every target sent.
- **partial** — at least one sent, at least one not (window-close,
failed, missing group). `error_summary` reads:
`"Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000
groups delivered. This account is at capacity for this fan-out —
consider sending the remainder from another paired account."`
- **failed** — zero sent.
## Notification body
The existing `reminder.fired` SSE event already carries
`{ status }`. The web's notification mapper already handles
`partial` with a "see activity" hint. The body extends to mention
"X of Y delivered" when status === "partial".
## Components
| File | Role | LOC est. |
| --- | --- | --- |
| `migrations/0008_*.sql` | add 2 int columns to `reminders` | <20 |
| `packages/db/src/schema.ts` | drizzle alignment | <10 |
| `apps/bot/src/scheduler/per-key-mutex.ts` (new) | accountId-keyed async mutex | ~40 |
| `apps/bot/src/scheduler/rate-limiter.ts` (new) | per-account token bucket | ~60 |
| `apps/bot/src/scheduler/media-upload-cache.ts` (new) | `prepareWAMessageMedia` results, keyed by mediaId | ~50 |
| `apps/bot/src/scheduler/delivery-window.ts` (new) | pure window-end calculator | ~30 |
| `apps/bot/src/scheduler/fire-reminder.ts` (rewrite) | new loop using all of the above | ~200 |
| `apps/bot/src/scheduler/reminder-jobs.ts` | `teamSize` config | <10 |
| `apps/bot/src/env.ts` | `BOT_FIRE_CONCURRENCY`, `BOT_MAX_SEND_PER_MINUTE`, `BOT_GROUP_CONCURRENCY` | <20 |
| `apps/web/src/actions/reminders.ts` | accept the two new fields | <30 |
| `apps/web/src/components/reminder-wizard/when-form-client.tsx` | "Delivery hours" inputs | <40 |
| `apps/web/src/components/reminder-edit/edit-when-form.tsx` | same | <30 |
| `apps/web/src/lib/notifications.ts` | partial-status body extension | <15 |
## Tests
- `delivery-window.test.ts` — pure function. Window in past →
next-day end; window crosses midnight (start > end) — explicitly
reject in the schema; timezone offsets handled correctly.
- `rate-limiter.test.ts` — fake-clock token bucket. N tokens drained,
then refill rate; backpressure via `acquire()` returning a Promise.
- `per-key-mutex.test.ts` — different keys do NOT block each other
(parallelism); same key DOES (serialisation); a throwing handler
releases the lock; cleanup removes empty entries.
- `media-upload-cache.test.ts` — mock socket: `prepare` called once
per unique mediaId regardless of how many groups consume it.
- `fire-reminder.test.ts` (extend) — window-end gate marks remaining
targets `skipped`; partial-status error_summary includes account /
delivered / total context.
## Tuning knobs (env)
| Var | Default | Effect |
| --- | --- | --- |
| `BOT_FIRE_CONCURRENCY` | 8 | pg-boss worker pool size; max accounts running simultaneously |
| `BOT_GROUP_CONCURRENCY` | 3 | per-account parallel group sends |
| `BOT_MAX_SEND_PER_MINUTE` | 40 | per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response |
Per-reminder `delivery_window_start_hour` / `delivery_window_end_hour`
default to 6/18 and can be widened (e.g. 0/24) for a specific big run.
## Out of scope (v2 candidates)
- **Crash resumability across bot restarts.** Today, if the bot dies
mid-fan-out, pg-boss will retry the job; the loop will skip any
rows already marked `sent`, but the in-memory rate-limiter and
upload-cache state are gone — meaning the retry uploads media
again and starts pacing from a full bucket. Acceptable for v1.
- **Pause / resume mid-run** controls.
- **Cross-day window resume** (current design hard-stops at window
end and reports partial; doesn't queue the remainder for tomorrow).
- **Multi-account auto-split** of a single reminder.
- **Adaptive rate limiting** (auto-back-off on WA rate-limit response
codes; today the operator tunes the env var).
## Acceptance
- 1000-group reminder with one image, established account: completes
in roughly 3050 minutes, comfortably inside a 6am6pm window.
- Two reminders on different accounts firing within seconds of each
other: both progress simultaneously, neither blocks the other.
- A run that hits the window end: stops cleanly, marks remaining as
skipped, surfaces the partial-status message in the Activity tab
and via the browser notification.
- 355 existing tests still pass; ≈25 new tests cover the new helpers.