yiekheng 50187a86e1 docs: design spec — windowed, pacing-safe reminder fan-out

Records the design decisions for the next planned work:

  - Per-reminder delivery window (default 6am–6pm, operator timezone).
    Window-close hard-stops the run; remaining targets become
    skipped; status reports as partial with a clear "this account is
    at capacity, consider another paired account" message.
  - Per-account isolation via pg-boss teamSize ≥ N + an in-process
    PerKeyMutex keyed by accountId. Different accounts run in
    parallel; the same account serialises (no double-rate sends
    that would risk a ban).
  - Per-account token-bucket rate limiter (default 40 msg/min,
    BOT_MAX_SEND_PER_MINUTE).
  - Up-front media-upload cache via prepareWAMessageMedia: 1000
    groups × 5 MB upload turns into 5 MB. Biggest single win for
    text+picture reminders.
  - Bounded group concurrency (default 3 in-flight per account);
    parts-within-a-group stay serial for visible message order.
  - Pre-fetched DB Maps (groups / messages / media), no inner-loop
    round-trips.
  - Replaces the rigid 1.5 s inter-part sleep with 200–500 ms
    jitter; the per-account rate-limiter is the real gate.

Out of scope for v1 (documented under "v2 candidates"): cross-day
window resume, mid-restart resumability, multi-account auto-split,
adaptive rate-limit back-off, pause/resume mid-run.

Acceptance: 1000-group reminder + one image, established account
finishes in ~30–50 minutes, well inside a 6am–6pm window. Two
reminders on different accounts at the same wall-clock minute
both progress in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-10 14:01:24 +08:00

10 KiB

Raw Blame History

Windowed, Pacing-Safe Reminder Fan-Out

Design spec for a faster, ban-safe, multi-account-friendly reminder delivery loop. Written 2026-05-10. Implementation tracked in a follow-up plan doc.

Goal

Deliver a reminder to many groups (target: 1000+) safely within a per-reminder delivery window. If we cannot finish in the window, stop, mark the run partial, and tell the operator the account is at capacity for this fan-out.

Constraints

WhatsApp's anti-spam is the dominant ceiling. For an established account (years of legit history), 30–60 sends/minute is the sustainable safe band; tighter for newer accounts.
The system runs on a single bot process talking to multiple paired WhatsApp accounts. Each account's Baileys socket is independent.
Two simultaneous fan-outs on the same WhatsApp account would double its effective send rate and risk a ban.
The operator dropped multi-account fan-out (one reminder splitting across N accounts) earlier this week. We respect that decision — this design does not automatically split work across accounts.

Approach (selected: B)

A. Minimal pacing fix. Drop the rigid 1.5s sleep, add a token-bucket rate limiter, add window-end check, cache DB lookups. Wins ≈30% on text-only reminders; very little on media-heavy ones.

B. Pacing + media-upload cache + bounded concurrency. Everything in A, plus upload each unique media file ONCE per run via Baileys' prepareWAMessageMedia and reuse the resulting WAMediaUpload payload for every group send. Run up to N groups in parallel within one account (parts within a group stay serial so order is preserved). Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of upload turns into 5 MB. Recommended.

C. Multi-account fan-out — dropped per operator decision.

Per-account isolation (cross-account parallelism)

Today boss.work() is called with default teamSize=1, so a single fan-out monopolises the whole bot. Two reminders on different accounts queue serially, which surprises the operator.

The new model is per-account serialization, cross-account parallelism:

teamSize raised so multiple reminders on different accounts run simultaneously.
A per-key async mutex keyed by accountId wraps the inner work, so two reminders on the same account take turns.
The token-bucket rate limiter is per-account too, so one account's pacing budget never throttles another.

pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8)
  ├─ R1 (account A) ──┐
  │                    ├─ per-account-A mutex ──→ serialised within A
  ├─ R3 (account A) ──┘
  │
  ├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's
  │
  └─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B

Delivery window

Each reminder gets a window in its operator timezone. If the run cannot finish inside the window, send what we can and stop.

New columns on reminders:
- delivery_window_start_hour int default 6
- delivery_window_end_hour int default 18
- Both interpreted in the row's existing timezone column.
Validation: 0 ≤ start < end ≤ 24. Cross-midnight windows (e.g. 22 → 06) are rejected in v1 to keep the math obvious; can be added later if anyone needs them.
UI uses two number inputs in the When step (and edit-when page).
delivery-window.ts exports a pure helper: windowEndAt(timezone, endHour, fireAt) → Date. Returns the end-of-window timestamp for the calendar day fireAt falls on, in the given timezone. If fireAt is already past that day's end-hour, the returned timestamp is in the past — the run loop's first iteration sees now() >= windowEndAt, marks every target skipped, and the run resolves to failed (zero sent). That's the right behaviour: "we can't send after window close, even one message".
Only the end hour is enforced at runtime in v1. The start hour is documented on the row but not gated — operators schedule fire times that fall in their band naturally (cron + the picker's default 09:00 time fields land inside 06–18). Enforcing the start too would mean holding messages from a 4am cron miss-fire until 6am, which is a v2 conversation.

Run loop changes (`fire-reminder.ts`)

Up-front, once per run:

Load all reminder.targets, reminder.messages, and referenced media_files rows into in-memory Maps. Drops ~3000 round-trips to ~3 round-trips for a 1000-group run.
Pre-create every reminder_run_targets row with status = "pending" so progress is observable from the Activity tab while the fan-out is mid-flight.
Pre-upload each unique media via Baileys' prepareWAMessageMedia. Cache the resulting WAMediaUpload payload keyed by mediaId for the duration of the run.
Compute windowEndAt and stash it.

Per-target (limited to BOT_GROUP_CONCURRENCY parallel groups, default 3):

Window-end gate: if Date.now() >= windowEndAt, mark the target skipped with error="delivery window closed" and skip.
Already-sent gate: if the run-target row is already sent (i.e. a retry is replaying), skip.
Acquire a token from the per-account rate limiter (default 40 msg/min, configurable BOT_MAX_SEND_PER_MINUTE).
assertSessions(group) — call once per group, cache for the run.
For each part in reminder.messages:
- text → socket.sendMessage(jid, { text })
- media → socket.sendMessage(jid, uploadedMediaCache[mediaId])
- sleep jitter(200..500 ms) between parts (replaces the rigid 1.5 s wait — preserves per-chat ordering at WA's natural pace).
Update the run-target row to sent with latency.

Final status:

success — every target sent.
partial — at least one sent, at least one not (window-close, failed, missing group). error_summary reads: "Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000 groups delivered. This account is at capacity for this fan-out — consider sending the remainder from another paired account."
failed — zero sent.

Notification body

The existing reminder.fired SSE event already carries { status }. The web's notification mapper already handles partial with a "see activity" hint. The body extends to mention "X of Y delivered" when status === "partial".

Components

File	Role	LOC est.
`migrations/0008_*.sql`	add 2 int columns to `reminders`	<20
`packages/db/src/schema.ts`	drizzle alignment	<10
`apps/bot/src/scheduler/per-key-mutex.ts` (new)	accountId-keyed async mutex	~40
`apps/bot/src/scheduler/rate-limiter.ts` (new)	per-account token bucket	~60
`apps/bot/src/scheduler/media-upload-cache.ts` (new)	`prepareWAMessageMedia` results, keyed by mediaId	~50
`apps/bot/src/scheduler/delivery-window.ts` (new)	pure window-end calculator	~30
`apps/bot/src/scheduler/fire-reminder.ts` (rewrite)	new loop using all of the above	~200
`apps/bot/src/scheduler/reminder-jobs.ts`	`teamSize` config	<10
`apps/bot/src/env.ts`	`BOT_FIRE_CONCURRENCY`, `BOT_MAX_SEND_PER_MINUTE`, `BOT_GROUP_CONCURRENCY`	<20
`apps/web/src/actions/reminders.ts`	accept the two new fields	<30
`apps/web/src/components/reminder-wizard/when-form-client.tsx`	"Delivery hours" inputs	<40
`apps/web/src/components/reminder-edit/edit-when-form.tsx`	same	<30
`apps/web/src/lib/notifications.ts`	partial-status body extension	<15

Tests

delivery-window.test.ts — pure function. Window in past → next-day end; window crosses midnight (start > end) — explicitly reject in the schema; timezone offsets handled correctly.
rate-limiter.test.ts — fake-clock token bucket. N tokens drained, then refill rate; backpressure via acquire() returning a Promise.
per-key-mutex.test.ts — different keys do NOT block each other (parallelism); same key DOES (serialisation); a throwing handler releases the lock; cleanup removes empty entries.
media-upload-cache.test.ts — mock socket: prepare called once per unique mediaId regardless of how many groups consume it.
fire-reminder.test.ts (extend) — window-end gate marks remaining targets skipped; partial-status error_summary includes account / delivered / total context.

Tuning knobs (env)

Var	Default	Effect
`BOT_FIRE_CONCURRENCY`	8	pg-boss worker pool size; max accounts running simultaneously
`BOT_GROUP_CONCURRENCY`	3	per-account parallel group sends
`BOT_MAX_SEND_PER_MINUTE`	40	per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response

Per-reminder delivery_window_start_hour / delivery_window_end_hour default to 6/18 and can be widened (e.g. 0/24) for a specific big run.

Out of scope (v2 candidates)

Crash resumability across bot restarts. Today, if the bot dies mid-fan-out, pg-boss will retry the job; the loop will skip any rows already marked sent, but the in-memory rate-limiter and upload-cache state are gone — meaning the retry uploads media again and starts pacing from a full bucket. Acceptable for v1.
Pause / resume mid-run controls.
Cross-day window resume (current design hard-stops at window end and reports partial; doesn't queue the remainder for tomorrow).
Multi-account auto-split of a single reminder.
Adaptive rate limiting (auto-back-off on WA rate-limit response codes; today the operator tunes the env var).

Acceptance

1000-group reminder with one image, established account: completes in roughly 30–50 minutes, comfortably inside a 6am–6pm window.
Two reminders on different accounts firing within seconds of each other: both progress simultaneously, neither blocks the other.
A run that hits the window end: stops cleanly, marks remaining as skipped, surfaces the partial-status message in the Activity tab and via the browser notification.
355 existing tests still pass; ≈25 new tests cover the new helpers.

10 KiB Raw Blame History Unescape Escape