From 50187a86e1ff3adc6643ec852da2e9a3c27e19a6 Mon Sep 17 00:00:00 2001 From: yiekheng Date: Sun, 10 May 2026 14:01:24 +0800 Subject: [PATCH] =?UTF-8?q?docs:=20design=20spec=20=E2=80=94=20windowed,?= =?UTF-8?q?=20pacing-safe=20reminder=20fan-out?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Records the design decisions for the next planned work: - Per-reminder delivery window (default 6am–6pm, operator timezone). Window-close hard-stops the run; remaining targets become skipped; status reports as partial with a clear "this account is at capacity, consider another paired account" message. - Per-account isolation via pg-boss teamSize ≥ N + an in-process PerKeyMutex keyed by accountId. Different accounts run in parallel; the same account serialises (no double-rate sends that would risk a ban). - Per-account token-bucket rate limiter (default 40 msg/min, BOT_MAX_SEND_PER_MINUTE). - Up-front media-upload cache via prepareWAMessageMedia: 1000 groups × 5 MB upload turns into 5 MB. Biggest single win for text+picture reminders. - Bounded group concurrency (default 3 in-flight per account); parts-within-a-group stay serial for visible message order. - Pre-fetched DB Maps (groups / messages / media), no inner-loop round-trips. - Replaces the rigid 1.5 s inter-part sleep with 200–500 ms jitter; the per-account rate-limiter is the real gate. Out of scope for v1 (documented under "v2 candidates"): cross-day window resume, mid-restart resumability, multi-account auto-split, adaptive rate-limit back-off, pause/resume mid-run. Acceptance: 1000-group reminder + one image, established account finishes in ~30–50 minutes, well inside a 6am–6pm window. Two reminders on different accounts at the same wall-clock minute both progress in parallel. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-05-10-windowed-fanout-design.md | 216 ++++++++++++++++++ 1 file changed, 216 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-10-windowed-fanout-design.md diff --git a/docs/superpowers/specs/2026-05-10-windowed-fanout-design.md b/docs/superpowers/specs/2026-05-10-windowed-fanout-design.md new file mode 100644 index 0000000..18982b6 --- /dev/null +++ b/docs/superpowers/specs/2026-05-10-windowed-fanout-design.md @@ -0,0 +1,216 @@ +# Windowed, Pacing-Safe Reminder Fan-Out + +> Design spec for a faster, ban-safe, multi-account-friendly reminder +> delivery loop. Written 2026-05-10. Implementation tracked in a +> follow-up plan doc. + +## Goal + +Deliver a reminder to many groups (target: 1000+) safely within a +per-reminder delivery window. If we cannot finish in the window, stop, +mark the run `partial`, and tell the operator the account is at +capacity for this fan-out. + +## Constraints + +- WhatsApp's anti-spam is the dominant ceiling. For an established + account (years of legit history), 30–60 sends/minute is the + sustainable safe band; tighter for newer accounts. +- The system runs on a single bot process talking to multiple paired + WhatsApp accounts. Each account's Baileys socket is independent. +- Two simultaneous fan-outs on the **same** WhatsApp account would + double its effective send rate and risk a ban. +- The operator dropped multi-account fan-out (one reminder splitting + across N accounts) earlier this week. We respect that decision — + this design does not automatically split work across accounts. + +## Approach (selected: B) + +**A. Minimal pacing fix.** Drop the rigid 1.5s sleep, add a +token-bucket rate limiter, add window-end check, cache DB lookups. +Wins ≈30% on text-only reminders; very little on media-heavy ones. + +**B. Pacing + media-upload cache + bounded concurrency.** Everything +in A, plus upload each unique media file ONCE per run via Baileys' +`prepareWAMessageMedia` and reuse the resulting `WAMediaUpload` +payload for every group send. Run up to N groups in parallel within +one account (parts within a group stay serial so order is preserved). +Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of +upload turns into 5 MB. **Recommended.** + +**C. Multi-account fan-out** — dropped per operator decision. + +## Per-account isolation (cross-account parallelism) + +Today `boss.work()` is called with default `teamSize=1`, so a single +fan-out monopolises the whole bot. Two reminders on **different** +accounts queue serially, which surprises the operator. + +The new model is **per-account serialization, cross-account +parallelism**: + +- `teamSize` raised so multiple reminders on different accounts run + simultaneously. +- A per-key async mutex keyed by `accountId` wraps the inner work, so + two reminders on the **same** account take turns. +- The token-bucket rate limiter is per-account too, so one account's + pacing budget never throttles another. + +``` +pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8) + ├─ R1 (account A) ──┐ + │ ├─ per-account-A mutex ──→ serialised within A + ├─ R3 (account A) ──┘ + │ + ├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's + │ + └─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B +``` + +## Delivery window + +Each reminder gets a window in its operator timezone. If the run +cannot finish inside the window, send what we can and stop. + +- New columns on `reminders`: + - `delivery_window_start_hour int default 6` + - `delivery_window_end_hour int default 18` + - Both interpreted in the row's existing `timezone` column. +- Validation: `0 ≤ start < end ≤ 24`. Cross-midnight windows (e.g. + 22 → 06) are rejected in v1 to keep the math obvious; can be added + later if anyone needs them. +- UI uses two number inputs in the When step (and edit-when page). +- `delivery-window.ts` exports a pure helper: + `windowEndAt(timezone, endHour, fireAt) → Date`. Returns the + end-of-window timestamp for the calendar day `fireAt` falls on, + in the given timezone. If `fireAt` is already past that day's + end-hour, the returned timestamp is in the past — the run loop's + first iteration sees `now() >= windowEndAt`, marks every target + `skipped`, and the run resolves to `failed` (zero sent). That's + the right behaviour: "we can't send after window close, even one + message". +- **Only the end hour is enforced at runtime in v1.** The start + hour is documented on the row but not gated — operators schedule + fire times that fall in their band naturally (cron + the picker's + default 09:00 time fields land inside 06–18). Enforcing the start + too would mean holding messages from a 4am cron miss-fire until + 6am, which is a v2 conversation. + +## Run loop changes (`fire-reminder.ts`) + +Up-front, once per run: + +1. Load all `reminder.targets`, `reminder.messages`, and referenced + `media_files` rows into in-memory Maps. Drops ~3000 round-trips to + ~3 round-trips for a 1000-group run. +2. Pre-create every `reminder_run_targets` row with + `status = "pending"` so progress is observable from the Activity + tab while the fan-out is mid-flight. +3. **Pre-upload each unique media** via Baileys' + `prepareWAMessageMedia`. Cache the resulting `WAMediaUpload` + payload keyed by `mediaId` for the duration of the run. +4. Compute `windowEndAt` and stash it. + +Per-target (limited to `BOT_GROUP_CONCURRENCY` parallel groups, +default 3): + +1. **Window-end gate:** if `Date.now() >= windowEndAt`, mark the + target `skipped` with `error="delivery window closed"` and skip. +2. **Already-sent gate:** if the run-target row is already `sent` + (i.e. a retry is replaying), skip. +3. Acquire a token from the per-account rate limiter (default 40 + msg/min, configurable `BOT_MAX_SEND_PER_MINUTE`). +4. `assertSessions(group)` — call once per group, cache for the run. +5. For each part in `reminder.messages`: + - text → `socket.sendMessage(jid, { text })` + - media → `socket.sendMessage(jid, uploadedMediaCache[mediaId])` + - sleep `jitter(200..500 ms)` between parts (replaces the rigid + 1.5 s wait — preserves per-chat ordering at WA's natural pace). +6. Update the run-target row to `sent` with latency. + +Final status: + +- **success** — every target sent. +- **partial** — at least one sent, at least one not (window-close, + failed, missing group). `error_summary` reads: + `"Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000 + groups delivered. This account is at capacity for this fan-out — + consider sending the remainder from another paired account."` +- **failed** — zero sent. + +## Notification body + +The existing `reminder.fired` SSE event already carries +`{ status }`. The web's notification mapper already handles +`partial` with a "see activity" hint. The body extends to mention +"X of Y delivered" when status === "partial". + +## Components + +| File | Role | LOC est. | +| --- | --- | --- | +| `migrations/0008_*.sql` | add 2 int columns to `reminders` | <20 | +| `packages/db/src/schema.ts` | drizzle alignment | <10 | +| `apps/bot/src/scheduler/per-key-mutex.ts` (new) | accountId-keyed async mutex | ~40 | +| `apps/bot/src/scheduler/rate-limiter.ts` (new) | per-account token bucket | ~60 | +| `apps/bot/src/scheduler/media-upload-cache.ts` (new) | `prepareWAMessageMedia` results, keyed by mediaId | ~50 | +| `apps/bot/src/scheduler/delivery-window.ts` (new) | pure window-end calculator | ~30 | +| `apps/bot/src/scheduler/fire-reminder.ts` (rewrite) | new loop using all of the above | ~200 | +| `apps/bot/src/scheduler/reminder-jobs.ts` | `teamSize` config | <10 | +| `apps/bot/src/env.ts` | `BOT_FIRE_CONCURRENCY`, `BOT_MAX_SEND_PER_MINUTE`, `BOT_GROUP_CONCURRENCY` | <20 | +| `apps/web/src/actions/reminders.ts` | accept the two new fields | <30 | +| `apps/web/src/components/reminder-wizard/when-form-client.tsx` | "Delivery hours" inputs | <40 | +| `apps/web/src/components/reminder-edit/edit-when-form.tsx` | same | <30 | +| `apps/web/src/lib/notifications.ts` | partial-status body extension | <15 | + +## Tests + +- `delivery-window.test.ts` — pure function. Window in past → + next-day end; window crosses midnight (start > end) — explicitly + reject in the schema; timezone offsets handled correctly. +- `rate-limiter.test.ts` — fake-clock token bucket. N tokens drained, + then refill rate; backpressure via `acquire()` returning a Promise. +- `per-key-mutex.test.ts` — different keys do NOT block each other + (parallelism); same key DOES (serialisation); a throwing handler + releases the lock; cleanup removes empty entries. +- `media-upload-cache.test.ts` — mock socket: `prepare` called once + per unique mediaId regardless of how many groups consume it. +- `fire-reminder.test.ts` (extend) — window-end gate marks remaining + targets `skipped`; partial-status error_summary includes account / + delivered / total context. + +## Tuning knobs (env) + +| Var | Default | Effect | +| --- | --- | --- | +| `BOT_FIRE_CONCURRENCY` | 8 | pg-boss worker pool size; max accounts running simultaneously | +| `BOT_GROUP_CONCURRENCY` | 3 | per-account parallel group sends | +| `BOT_MAX_SEND_PER_MINUTE` | 40 | per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response | + +Per-reminder `delivery_window_start_hour` / `delivery_window_end_hour` +default to 6/18 and can be widened (e.g. 0/24) for a specific big run. + +## Out of scope (v2 candidates) + +- **Crash resumability across bot restarts.** Today, if the bot dies + mid-fan-out, pg-boss will retry the job; the loop will skip any + rows already marked `sent`, but the in-memory rate-limiter and + upload-cache state are gone — meaning the retry uploads media + again and starts pacing from a full bucket. Acceptable for v1. +- **Pause / resume mid-run** controls. +- **Cross-day window resume** (current design hard-stops at window + end and reports partial; doesn't queue the remainder for tomorrow). +- **Multi-account auto-split** of a single reminder. +- **Adaptive rate limiting** (auto-back-off on WA rate-limit response + codes; today the operator tunes the env var). + +## Acceptance + +- 1000-group reminder with one image, established account: completes + in roughly 30–50 minutes, comfortably inside a 6am–6pm window. +- Two reminders on different accounts firing within seconds of each + other: both progress simultaneously, neither blocks the other. +- A run that hits the window end: stops cleanly, marks remaining as + skipped, surfaces the partial-status message in the Activity tab + and via the browser notification. +- 355 existing tests still pass; ≈25 new tests cover the new helpers.