Records the design decisions for the next planned work:
- Per-reminder delivery window (default 6am–6pm, operator timezone).
Window-close hard-stops the run; remaining targets become
skipped; status reports as partial with a clear "this account is
at capacity, consider another paired account" message.
- Per-account isolation via pg-boss teamSize ≥ N + an in-process
PerKeyMutex keyed by accountId. Different accounts run in
parallel; the same account serialises (no double-rate sends
that would risk a ban).
- Per-account token-bucket rate limiter (default 40 msg/min,
BOT_MAX_SEND_PER_MINUTE).
- Up-front media-upload cache via prepareWAMessageMedia: 1000
groups × 5 MB upload turns into 5 MB. Biggest single win for
text+picture reminders.
- Bounded group concurrency (default 3 in-flight per account);
parts-within-a-group stay serial for visible message order.
- Pre-fetched DB Maps (groups / messages / media), no inner-loop
round-trips.
- Replaces the rigid 1.5 s inter-part sleep with 200–500 ms
jitter; the per-account rate-limiter is the real gate.
Out of scope for v1 (documented under "v2 candidates"): cross-day
window resume, mid-restart resumability, multi-account auto-split,
adaptive rate-limit back-off, pause/resume mid-run.
Acceptance: 1000-group reminder + one image, established account
finishes in ~30–50 minutes, well inside a 6am–6pm window. Two
reminders on different accounts at the same wall-clock minute
both progress in parallel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
217 lines
10 KiB
Markdown
217 lines
10 KiB
Markdown
# Windowed, Pacing-Safe Reminder Fan-Out
|
||
|
||
> Design spec for a faster, ban-safe, multi-account-friendly reminder
|
||
> delivery loop. Written 2026-05-10. Implementation tracked in a
|
||
> follow-up plan doc.
|
||
|
||
## Goal
|
||
|
||
Deliver a reminder to many groups (target: 1000+) safely within a
|
||
per-reminder delivery window. If we cannot finish in the window, stop,
|
||
mark the run `partial`, and tell the operator the account is at
|
||
capacity for this fan-out.
|
||
|
||
## Constraints
|
||
|
||
- WhatsApp's anti-spam is the dominant ceiling. For an established
|
||
account (years of legit history), 30–60 sends/minute is the
|
||
sustainable safe band; tighter for newer accounts.
|
||
- The system runs on a single bot process talking to multiple paired
|
||
WhatsApp accounts. Each account's Baileys socket is independent.
|
||
- Two simultaneous fan-outs on the **same** WhatsApp account would
|
||
double its effective send rate and risk a ban.
|
||
- The operator dropped multi-account fan-out (one reminder splitting
|
||
across N accounts) earlier this week. We respect that decision —
|
||
this design does not automatically split work across accounts.
|
||
|
||
## Approach (selected: B)
|
||
|
||
**A. Minimal pacing fix.** Drop the rigid 1.5s sleep, add a
|
||
token-bucket rate limiter, add window-end check, cache DB lookups.
|
||
Wins ≈30% on text-only reminders; very little on media-heavy ones.
|
||
|
||
**B. Pacing + media-upload cache + bounded concurrency.** Everything
|
||
in A, plus upload each unique media file ONCE per run via Baileys'
|
||
`prepareWAMessageMedia` and reuse the resulting `WAMediaUpload`
|
||
payload for every group send. Run up to N groups in parallel within
|
||
one account (parts within a group stay serial so order is preserved).
|
||
Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of
|
||
upload turns into 5 MB. **Recommended.**
|
||
|
||
**C. Multi-account fan-out** — dropped per operator decision.
|
||
|
||
## Per-account isolation (cross-account parallelism)
|
||
|
||
Today `boss.work()` is called with default `teamSize=1`, so a single
|
||
fan-out monopolises the whole bot. Two reminders on **different**
|
||
accounts queue serially, which surprises the operator.
|
||
|
||
The new model is **per-account serialization, cross-account
|
||
parallelism**:
|
||
|
||
- `teamSize` raised so multiple reminders on different accounts run
|
||
simultaneously.
|
||
- A per-key async mutex keyed by `accountId` wraps the inner work, so
|
||
two reminders on the **same** account take turns.
|
||
- The token-bucket rate limiter is per-account too, so one account's
|
||
pacing budget never throttles another.
|
||
|
||
```
|
||
pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8)
|
||
├─ R1 (account A) ──┐
|
||
│ ├─ per-account-A mutex ──→ serialised within A
|
||
├─ R3 (account A) ──┘
|
||
│
|
||
├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's
|
||
│
|
||
└─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B
|
||
```
|
||
|
||
## Delivery window
|
||
|
||
Each reminder gets a window in its operator timezone. If the run
|
||
cannot finish inside the window, send what we can and stop.
|
||
|
||
- New columns on `reminders`:
|
||
- `delivery_window_start_hour int default 6`
|
||
- `delivery_window_end_hour int default 18`
|
||
- Both interpreted in the row's existing `timezone` column.
|
||
- Validation: `0 ≤ start < end ≤ 24`. Cross-midnight windows (e.g.
|
||
22 → 06) are rejected in v1 to keep the math obvious; can be added
|
||
later if anyone needs them.
|
||
- UI uses two number inputs in the When step (and edit-when page).
|
||
- `delivery-window.ts` exports a pure helper:
|
||
`windowEndAt(timezone, endHour, fireAt) → Date`. Returns the
|
||
end-of-window timestamp for the calendar day `fireAt` falls on,
|
||
in the given timezone. If `fireAt` is already past that day's
|
||
end-hour, the returned timestamp is in the past — the run loop's
|
||
first iteration sees `now() >= windowEndAt`, marks every target
|
||
`skipped`, and the run resolves to `failed` (zero sent). That's
|
||
the right behaviour: "we can't send after window close, even one
|
||
message".
|
||
- **Only the end hour is enforced at runtime in v1.** The start
|
||
hour is documented on the row but not gated — operators schedule
|
||
fire times that fall in their band naturally (cron + the picker's
|
||
default 09:00 time fields land inside 06–18). Enforcing the start
|
||
too would mean holding messages from a 4am cron miss-fire until
|
||
6am, which is a v2 conversation.
|
||
|
||
## Run loop changes (`fire-reminder.ts`)
|
||
|
||
Up-front, once per run:
|
||
|
||
1. Load all `reminder.targets`, `reminder.messages`, and referenced
|
||
`media_files` rows into in-memory Maps. Drops ~3000 round-trips to
|
||
~3 round-trips for a 1000-group run.
|
||
2. Pre-create every `reminder_run_targets` row with
|
||
`status = "pending"` so progress is observable from the Activity
|
||
tab while the fan-out is mid-flight.
|
||
3. **Pre-upload each unique media** via Baileys'
|
||
`prepareWAMessageMedia`. Cache the resulting `WAMediaUpload`
|
||
payload keyed by `mediaId` for the duration of the run.
|
||
4. Compute `windowEndAt` and stash it.
|
||
|
||
Per-target (limited to `BOT_GROUP_CONCURRENCY` parallel groups,
|
||
default 3):
|
||
|
||
1. **Window-end gate:** if `Date.now() >= windowEndAt`, mark the
|
||
target `skipped` with `error="delivery window closed"` and skip.
|
||
2. **Already-sent gate:** if the run-target row is already `sent`
|
||
(i.e. a retry is replaying), skip.
|
||
3. Acquire a token from the per-account rate limiter (default 40
|
||
msg/min, configurable `BOT_MAX_SEND_PER_MINUTE`).
|
||
4. `assertSessions(group)` — call once per group, cache for the run.
|
||
5. For each part in `reminder.messages`:
|
||
- text → `socket.sendMessage(jid, { text })`
|
||
- media → `socket.sendMessage(jid, uploadedMediaCache[mediaId])`
|
||
- sleep `jitter(200..500 ms)` between parts (replaces the rigid
|
||
1.5 s wait — preserves per-chat ordering at WA's natural pace).
|
||
6. Update the run-target row to `sent` with latency.
|
||
|
||
Final status:
|
||
|
||
- **success** — every target sent.
|
||
- **partial** — at least one sent, at least one not (window-close,
|
||
failed, missing group). `error_summary` reads:
|
||
`"Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000
|
||
groups delivered. This account is at capacity for this fan-out —
|
||
consider sending the remainder from another paired account."`
|
||
- **failed** — zero sent.
|
||
|
||
## Notification body
|
||
|
||
The existing `reminder.fired` SSE event already carries
|
||
`{ status }`. The web's notification mapper already handles
|
||
`partial` with a "see activity" hint. The body extends to mention
|
||
"X of Y delivered" when status === "partial".
|
||
|
||
## Components
|
||
|
||
| File | Role | LOC est. |
|
||
| --- | --- | --- |
|
||
| `migrations/0008_*.sql` | add 2 int columns to `reminders` | <20 |
|
||
| `packages/db/src/schema.ts` | drizzle alignment | <10 |
|
||
| `apps/bot/src/scheduler/per-key-mutex.ts` (new) | accountId-keyed async mutex | ~40 |
|
||
| `apps/bot/src/scheduler/rate-limiter.ts` (new) | per-account token bucket | ~60 |
|
||
| `apps/bot/src/scheduler/media-upload-cache.ts` (new) | `prepareWAMessageMedia` results, keyed by mediaId | ~50 |
|
||
| `apps/bot/src/scheduler/delivery-window.ts` (new) | pure window-end calculator | ~30 |
|
||
| `apps/bot/src/scheduler/fire-reminder.ts` (rewrite) | new loop using all of the above | ~200 |
|
||
| `apps/bot/src/scheduler/reminder-jobs.ts` | `teamSize` config | <10 |
|
||
| `apps/bot/src/env.ts` | `BOT_FIRE_CONCURRENCY`, `BOT_MAX_SEND_PER_MINUTE`, `BOT_GROUP_CONCURRENCY` | <20 |
|
||
| `apps/web/src/actions/reminders.ts` | accept the two new fields | <30 |
|
||
| `apps/web/src/components/reminder-wizard/when-form-client.tsx` | "Delivery hours" inputs | <40 |
|
||
| `apps/web/src/components/reminder-edit/edit-when-form.tsx` | same | <30 |
|
||
| `apps/web/src/lib/notifications.ts` | partial-status body extension | <15 |
|
||
|
||
## Tests
|
||
|
||
- `delivery-window.test.ts` — pure function. Window in past →
|
||
next-day end; window crosses midnight (start > end) — explicitly
|
||
reject in the schema; timezone offsets handled correctly.
|
||
- `rate-limiter.test.ts` — fake-clock token bucket. N tokens drained,
|
||
then refill rate; backpressure via `acquire()` returning a Promise.
|
||
- `per-key-mutex.test.ts` — different keys do NOT block each other
|
||
(parallelism); same key DOES (serialisation); a throwing handler
|
||
releases the lock; cleanup removes empty entries.
|
||
- `media-upload-cache.test.ts` — mock socket: `prepare` called once
|
||
per unique mediaId regardless of how many groups consume it.
|
||
- `fire-reminder.test.ts` (extend) — window-end gate marks remaining
|
||
targets `skipped`; partial-status error_summary includes account /
|
||
delivered / total context.
|
||
|
||
## Tuning knobs (env)
|
||
|
||
| Var | Default | Effect |
|
||
| --- | --- | --- |
|
||
| `BOT_FIRE_CONCURRENCY` | 8 | pg-boss worker pool size; max accounts running simultaneously |
|
||
| `BOT_GROUP_CONCURRENCY` | 3 | per-account parallel group sends |
|
||
| `BOT_MAX_SEND_PER_MINUTE` | 40 | per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response |
|
||
|
||
Per-reminder `delivery_window_start_hour` / `delivery_window_end_hour`
|
||
default to 6/18 and can be widened (e.g. 0/24) for a specific big run.
|
||
|
||
## Out of scope (v2 candidates)
|
||
|
||
- **Crash resumability across bot restarts.** Today, if the bot dies
|
||
mid-fan-out, pg-boss will retry the job; the loop will skip any
|
||
rows already marked `sent`, but the in-memory rate-limiter and
|
||
upload-cache state are gone — meaning the retry uploads media
|
||
again and starts pacing from a full bucket. Acceptable for v1.
|
||
- **Pause / resume mid-run** controls.
|
||
- **Cross-day window resume** (current design hard-stops at window
|
||
end and reports partial; doesn't queue the remainder for tomorrow).
|
||
- **Multi-account auto-split** of a single reminder.
|
||
- **Adaptive rate limiting** (auto-back-off on WA rate-limit response
|
||
codes; today the operator tunes the env var).
|
||
|
||
## Acceptance
|
||
|
||
- 1000-group reminder with one image, established account: completes
|
||
in roughly 30–50 minutes, comfortably inside a 6am–6pm window.
|
||
- Two reminders on different accounts firing within seconds of each
|
||
other: both progress simultaneously, neither blocks the other.
|
||
- A run that hits the window end: stops cleanly, marks remaining as
|
||
skipped, surfaces the partial-status message in the Activity tab
|
||
and via the browser notification.
|
||
- 355 existing tests still pass; ≈25 new tests cover the new helpers.
|