cm_whatsapp_bot_v1/docs/superpowers/specs/2026-05-10-windowed-fanout-design.md
yiekheng 50187a86e1 docs: design spec — windowed, pacing-safe reminder fan-out
Records the design decisions for the next planned work:

  - Per-reminder delivery window (default 6am–6pm, operator timezone).
    Window-close hard-stops the run; remaining targets become
    skipped; status reports as partial with a clear "this account is
    at capacity, consider another paired account" message.
  - Per-account isolation via pg-boss teamSize ≥ N + an in-process
    PerKeyMutex keyed by accountId. Different accounts run in
    parallel; the same account serialises (no double-rate sends
    that would risk a ban).
  - Per-account token-bucket rate limiter (default 40 msg/min,
    BOT_MAX_SEND_PER_MINUTE).
  - Up-front media-upload cache via prepareWAMessageMedia: 1000
    groups × 5 MB upload turns into 5 MB. Biggest single win for
    text+picture reminders.
  - Bounded group concurrency (default 3 in-flight per account);
    parts-within-a-group stay serial for visible message order.
  - Pre-fetched DB Maps (groups / messages / media), no inner-loop
    round-trips.
  - Replaces the rigid 1.5 s inter-part sleep with 200–500 ms
    jitter; the per-account rate-limiter is the real gate.

Out of scope for v1 (documented under "v2 candidates"): cross-day
window resume, mid-restart resumability, multi-account auto-split,
adaptive rate-limit back-off, pause/resume mid-run.

Acceptance: 1000-group reminder + one image, established account
finishes in ~30–50 minutes, well inside a 6am–6pm window. Two
reminders on different accounts at the same wall-clock minute
both progress in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 14:01:24 +08:00

10 KiB
Raw Blame History

Windowed, Pacing-Safe Reminder Fan-Out

Design spec for a faster, ban-safe, multi-account-friendly reminder delivery loop. Written 2026-05-10. Implementation tracked in a follow-up plan doc.

Goal

Deliver a reminder to many groups (target: 1000+) safely within a per-reminder delivery window. If we cannot finish in the window, stop, mark the run partial, and tell the operator the account is at capacity for this fan-out.

Constraints

  • WhatsApp's anti-spam is the dominant ceiling. For an established account (years of legit history), 3060 sends/minute is the sustainable safe band; tighter for newer accounts.
  • The system runs on a single bot process talking to multiple paired WhatsApp accounts. Each account's Baileys socket is independent.
  • Two simultaneous fan-outs on the same WhatsApp account would double its effective send rate and risk a ban.
  • The operator dropped multi-account fan-out (one reminder splitting across N accounts) earlier this week. We respect that decision — this design does not automatically split work across accounts.

Approach (selected: B)

A. Minimal pacing fix. Drop the rigid 1.5s sleep, add a token-bucket rate limiter, add window-end check, cache DB lookups. Wins ≈30% on text-only reminders; very little on media-heavy ones.

B. Pacing + media-upload cache + bounded concurrency. Everything in A, plus upload each unique media file ONCE per run via Baileys' prepareWAMessageMedia and reuse the resulting WAMediaUpload payload for every group send. Run up to N groups in parallel within one account (parts within a group stay serial so order is preserved). Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of upload turns into 5 MB. Recommended.

C. Multi-account fan-out — dropped per operator decision.

Per-account isolation (cross-account parallelism)

Today boss.work() is called with default teamSize=1, so a single fan-out monopolises the whole bot. Two reminders on different accounts queue serially, which surprises the operator.

The new model is per-account serialization, cross-account parallelism:

  • teamSize raised so multiple reminders on different accounts run simultaneously.
  • A per-key async mutex keyed by accountId wraps the inner work, so two reminders on the same account take turns.
  • The token-bucket rate limiter is per-account too, so one account's pacing budget never throttles another.
pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8)
  ├─ R1 (account A) ──┐
  │                    ├─ per-account-A mutex ──→ serialised within A
  ├─ R3 (account A) ──┘
  │
  ├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's
  │
  └─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B

Delivery window

Each reminder gets a window in its operator timezone. If the run cannot finish inside the window, send what we can and stop.

  • New columns on reminders:
    • delivery_window_start_hour int default 6
    • delivery_window_end_hour int default 18
    • Both interpreted in the row's existing timezone column.
  • Validation: 0 ≤ start < end ≤ 24. Cross-midnight windows (e.g. 22 → 06) are rejected in v1 to keep the math obvious; can be added later if anyone needs them.
  • UI uses two number inputs in the When step (and edit-when page).
  • delivery-window.ts exports a pure helper: windowEndAt(timezone, endHour, fireAt) → Date. Returns the end-of-window timestamp for the calendar day fireAt falls on, in the given timezone. If fireAt is already past that day's end-hour, the returned timestamp is in the past — the run loop's first iteration sees now() >= windowEndAt, marks every target skipped, and the run resolves to failed (zero sent). That's the right behaviour: "we can't send after window close, even one message".
  • Only the end hour is enforced at runtime in v1. The start hour is documented on the row but not gated — operators schedule fire times that fall in their band naturally (cron + the picker's default 09:00 time fields land inside 0618). Enforcing the start too would mean holding messages from a 4am cron miss-fire until 6am, which is a v2 conversation.

Run loop changes (fire-reminder.ts)

Up-front, once per run:

  1. Load all reminder.targets, reminder.messages, and referenced media_files rows into in-memory Maps. Drops ~3000 round-trips to ~3 round-trips for a 1000-group run.
  2. Pre-create every reminder_run_targets row with status = "pending" so progress is observable from the Activity tab while the fan-out is mid-flight.
  3. Pre-upload each unique media via Baileys' prepareWAMessageMedia. Cache the resulting WAMediaUpload payload keyed by mediaId for the duration of the run.
  4. Compute windowEndAt and stash it.

Per-target (limited to BOT_GROUP_CONCURRENCY parallel groups, default 3):

  1. Window-end gate: if Date.now() >= windowEndAt, mark the target skipped with error="delivery window closed" and skip.
  2. Already-sent gate: if the run-target row is already sent (i.e. a retry is replaying), skip.
  3. Acquire a token from the per-account rate limiter (default 40 msg/min, configurable BOT_MAX_SEND_PER_MINUTE).
  4. assertSessions(group) — call once per group, cache for the run.
  5. For each part in reminder.messages:
    • text → socket.sendMessage(jid, { text })
    • media → socket.sendMessage(jid, uploadedMediaCache[mediaId])
    • sleep jitter(200..500 ms) between parts (replaces the rigid 1.5 s wait — preserves per-chat ordering at WA's natural pace).
  6. Update the run-target row to sent with latency.

Final status:

  • success — every target sent.
  • partial — at least one sent, at least one not (window-close, failed, missing group). error_summary reads: "Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000 groups delivered. This account is at capacity for this fan-out — consider sending the remainder from another paired account."
  • failed — zero sent.

Notification body

The existing reminder.fired SSE event already carries { status }. The web's notification mapper already handles partial with a "see activity" hint. The body extends to mention "X of Y delivered" when status === "partial".

Components

File Role LOC est.
migrations/0008_*.sql add 2 int columns to reminders <20
packages/db/src/schema.ts drizzle alignment <10
apps/bot/src/scheduler/per-key-mutex.ts (new) accountId-keyed async mutex ~40
apps/bot/src/scheduler/rate-limiter.ts (new) per-account token bucket ~60
apps/bot/src/scheduler/media-upload-cache.ts (new) prepareWAMessageMedia results, keyed by mediaId ~50
apps/bot/src/scheduler/delivery-window.ts (new) pure window-end calculator ~30
apps/bot/src/scheduler/fire-reminder.ts (rewrite) new loop using all of the above ~200
apps/bot/src/scheduler/reminder-jobs.ts teamSize config <10
apps/bot/src/env.ts BOT_FIRE_CONCURRENCY, BOT_MAX_SEND_PER_MINUTE, BOT_GROUP_CONCURRENCY <20
apps/web/src/actions/reminders.ts accept the two new fields <30
apps/web/src/components/reminder-wizard/when-form-client.tsx "Delivery hours" inputs <40
apps/web/src/components/reminder-edit/edit-when-form.tsx same <30
apps/web/src/lib/notifications.ts partial-status body extension <15

Tests

  • delivery-window.test.ts — pure function. Window in past → next-day end; window crosses midnight (start > end) — explicitly reject in the schema; timezone offsets handled correctly.
  • rate-limiter.test.ts — fake-clock token bucket. N tokens drained, then refill rate; backpressure via acquire() returning a Promise.
  • per-key-mutex.test.ts — different keys do NOT block each other (parallelism); same key DOES (serialisation); a throwing handler releases the lock; cleanup removes empty entries.
  • media-upload-cache.test.ts — mock socket: prepare called once per unique mediaId regardless of how many groups consume it.
  • fire-reminder.test.ts (extend) — window-end gate marks remaining targets skipped; partial-status error_summary includes account / delivered / total context.

Tuning knobs (env)

Var Default Effect
BOT_FIRE_CONCURRENCY 8 pg-boss worker pool size; max accounts running simultaneously
BOT_GROUP_CONCURRENCY 3 per-account parallel group sends
BOT_MAX_SEND_PER_MINUTE 40 per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response

Per-reminder delivery_window_start_hour / delivery_window_end_hour default to 6/18 and can be widened (e.g. 0/24) for a specific big run.

Out of scope (v2 candidates)

  • Crash resumability across bot restarts. Today, if the bot dies mid-fan-out, pg-boss will retry the job; the loop will skip any rows already marked sent, but the in-memory rate-limiter and upload-cache state are gone — meaning the retry uploads media again and starts pacing from a full bucket. Acceptable for v1.
  • Pause / resume mid-run controls.
  • Cross-day window resume (current design hard-stops at window end and reports partial; doesn't queue the remainder for tomorrow).
  • Multi-account auto-split of a single reminder.
  • Adaptive rate limiting (auto-back-off on WA rate-limit response codes; today the operator tunes the env var).

Acceptance

  • 1000-group reminder with one image, established account: completes in roughly 3050 minutes, comfortably inside a 6am6pm window.
  • Two reminders on different accounts firing within seconds of each other: both progress simultaneously, neither blocks the other.
  • A run that hits the window end: stops cleanly, marks remaining as skipped, surfaces the partial-status message in the Activity tab and via the browser notification.
  • 355 existing tests still pass; ≈25 new tests cover the new helpers.