Previously the name field auto-derived from the first text part when
the operator left it blank. That's brittle once reminders carry
multiple parts of varying provenance, and confusing in lists where
"Reminder" or partial sentences crowd in.
Now: every reminder must carry a non-empty name, capped at 60 chars.
- Zod schema on createReminder/updateReminder: name moves from
`z.string().nullable().optional()` to
`z.string().trim().min(1, "Give the reminder a name").max(60)`.
Stale-URL legacy callers that omit it now get a clear server error.
- Wizard compose step: input has `required` + `aria-required`,
placeholder + label simplified ("(optional)" tag and the helper
paragraph removed), Continue blocks on empty.
- Edit-message form: same — required, aria-required, save blocked
on empty, the "leave blank and we'll auto-derive" hint dropped.
- Review-submit client: defensive fail-fast for stale-bookmark URLs
that arrive at step 5 without a name — bounces back with
"Give the reminder a name (back on the Message step)" instead of
letting the server reject.
The resolveReminderName helper stays put — duplicateReminderAction
and any future caller still benefit from the trim+clamp+fallback
chain. Helper unit tests unaffected (they test the resolver in
isolation, the policy-tightening lives at the schema layer above).
298 web tests still passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
286 lines
13 KiB
Markdown
286 lines
13 KiB
Markdown
# Windowed, Pacing-Safe Reminder Fan-Out
|
||
|
||
> Design spec for a faster, ban-safe, multi-account-friendly reminder
|
||
> delivery loop. Written 2026-05-10. Implementation tracked in a
|
||
> follow-up plan doc.
|
||
|
||
## Goal
|
||
|
||
Deliver a reminder to many groups (target: 1000+) safely within a
|
||
per-reminder delivery window. If we cannot finish in the window,
|
||
**pause** the run at window-close, persist progress, and let the
|
||
operator **resume** later from the Activity / detail view. The
|
||
paused-status message tells the operator what's blocking throughput
|
||
(account at capacity, media size eating the budget) so they can
|
||
decide whether to offload to another paired account, shrink the
|
||
attachment, or just resume the next morning at 6am.
|
||
|
||
## Constraints
|
||
|
||
- WhatsApp's anti-spam is the dominant ceiling. For an established
|
||
account (years of legit history), 30–60 sends/minute is the
|
||
sustainable safe band; tighter for newer accounts.
|
||
- The system runs on a single bot process talking to multiple paired
|
||
WhatsApp accounts. Each account's Baileys socket is independent.
|
||
- Two simultaneous fan-outs on the **same** WhatsApp account would
|
||
double its effective send rate and risk a ban.
|
||
- The operator dropped multi-account fan-out (one reminder splitting
|
||
across N accounts) earlier this week. We respect that decision —
|
||
this design does not automatically split work across accounts.
|
||
|
||
## Approach (selected: B)
|
||
|
||
**A. Minimal pacing fix.** Drop the rigid 1.5s sleep, add a
|
||
token-bucket rate limiter, add window-end check, cache DB lookups.
|
||
Wins ≈30% on text-only reminders; very little on media-heavy ones.
|
||
|
||
**B. Pacing + media-upload cache + bounded concurrency.** Everything
|
||
in A, plus upload each unique media file ONCE per run via Baileys'
|
||
`prepareWAMessageMedia` and reuse the resulting `WAMediaUpload`
|
||
payload for every group send. Run up to N groups in parallel within
|
||
one account (parts within a group stay serial so order is preserved).
|
||
Wins are massive on text + picture: 1000 groups × 5 MB = 5 GB of
|
||
upload turns into 5 MB. **Recommended.**
|
||
|
||
**C. Multi-account fan-out** — dropped per operator decision.
|
||
|
||
## Per-account isolation (cross-account parallelism)
|
||
|
||
Today `boss.work()` is called with default `teamSize=1`, so a single
|
||
fan-out monopolises the whole bot. Two reminders on **different**
|
||
accounts queue serially, which surprises the operator.
|
||
|
||
The new model is **per-account serialization, cross-account
|
||
parallelism**:
|
||
|
||
- `teamSize` raised so multiple reminders on different accounts run
|
||
simultaneously.
|
||
- A per-key async mutex keyed by `accountId` wraps the inner work, so
|
||
two reminders on the **same** account take turns.
|
||
- The token-bucket rate limiter is per-account too, so one account's
|
||
pacing budget never throttles another.
|
||
|
||
```
|
||
pg-boss worker pool (teamSize = BOT_FIRE_CONCURRENCY, default 8)
|
||
├─ R1 (account A) ──┐
|
||
│ ├─ per-account-A mutex ──→ serialised within A
|
||
├─ R3 (account A) ──┘
|
||
│
|
||
├─ R2 (account B) ──── per-account-B mutex ──→ parallel with A's
|
||
│
|
||
└─ R4 (account C) ──── per-account-C mutex ──→ parallel with A and B
|
||
```
|
||
|
||
## Delivery window
|
||
|
||
Each reminder gets a window in its operator timezone. If the run
|
||
cannot finish inside the window, send what we can and stop.
|
||
|
||
- New columns on `reminders`:
|
||
- `delivery_window_start_hour int default 6`
|
||
- `delivery_window_end_hour int default 18`
|
||
- Both interpreted in the row's existing `timezone` column.
|
||
- Validation: `0 ≤ start < end ≤ 24`. Cross-midnight windows (e.g.
|
||
22 → 06) are rejected in v1 to keep the math obvious; can be added
|
||
later if anyone needs them.
|
||
- UI uses two number inputs in the When step (and edit-when page).
|
||
- `delivery-window.ts` exports a pure helper:
|
||
`windowEndAt(timezone, endHour, fireAt) → Date`. Returns the
|
||
end-of-window timestamp for the calendar day `fireAt` falls on,
|
||
in the given timezone. If `fireAt` is already past that day's
|
||
end-hour, the returned timestamp is in the past — the run loop's
|
||
first iteration sees `now() >= windowEndAt`, marks every target
|
||
`skipped`, and the run resolves to `failed` (zero sent). That's
|
||
the right behaviour: "we can't send after window close, even one
|
||
message".
|
||
- **Only the end hour is enforced at runtime in v1.** The start
|
||
hour is documented on the row but not gated — operators schedule
|
||
fire times that fall in their band naturally (cron + the picker's
|
||
default 09:00 time fields land inside 06–18). Enforcing the start
|
||
too would mean holding messages from a 4am cron miss-fire until
|
||
6am, which is a v2 conversation.
|
||
|
||
## Run loop changes (`fire-reminder.ts`)
|
||
|
||
Up-front, once per run:
|
||
|
||
1. Load all `reminder.targets`, `reminder.messages`, and referenced
|
||
`media_files` rows into in-memory Maps. Drops ~3000 round-trips to
|
||
~3 round-trips for a 1000-group run.
|
||
2. Pre-create every `reminder_run_targets` row with
|
||
`status = "pending"` so progress is observable from the Activity
|
||
tab while the fan-out is mid-flight.
|
||
3. **Pre-upload each unique media** via Baileys'
|
||
`prepareWAMessageMedia`. Cache the resulting `WAMediaUpload`
|
||
payload keyed by `mediaId` for the duration of the run.
|
||
4. Compute `windowEndAt` and stash it.
|
||
|
||
Per-target (limited to `BOT_GROUP_CONCURRENCY` parallel groups,
|
||
default 3):
|
||
|
||
1. **Window-end gate:** if `Date.now() >= windowEndAt`, mark the
|
||
target `skipped` with `error="delivery window closed"` and skip.
|
||
2. **Already-sent gate:** if the run-target row is already `sent`
|
||
(i.e. a retry is replaying), skip.
|
||
3. Acquire a token from the per-account rate limiter (default 40
|
||
msg/min, configurable `BOT_MAX_SEND_PER_MINUTE`).
|
||
4. `assertSessions(group)` — call once per group, cache for the run.
|
||
5. For each part in `reminder.messages`:
|
||
- text → `socket.sendMessage(jid, { text })`
|
||
- media → `socket.sendMessage(jid, uploadedMediaCache[mediaId])`
|
||
- sleep `jitter(200..500 ms)` between parts (replaces the rigid
|
||
1.5 s wait — preserves per-chat ordering at WA's natural pace).
|
||
6. Update the run-target row to `sent` with latency.
|
||
|
||
Final status:
|
||
|
||
- **success** — every target sent.
|
||
- **paused** — window closed mid-run with at least one target still in
|
||
`pending`. Run carries a resumable state: sent rows stay `sent`,
|
||
unstarted rows stay `pending` (NOT skipped), failed rows stay
|
||
`failed`. `error_summary` reads:
|
||
`"Delivery window closed at 18:00 (Asia/Kuala_Lumpur). 412 of 1000
|
||
groups delivered, 588 still pending. Resume from the Activity tab.
|
||
If this happens repeatedly, consider offloading to another paired
|
||
account, or shrinking the message body / media size to fit more
|
||
groups in your daily window."`
|
||
- **partial** — every target was attempted; some sent and some
|
||
failed/skipped (group missing from DB, account offline, send error
|
||
inside the window). Not resumable; the failures are real failures.
|
||
- **failed** — zero sent. Either every send errored, or the run hit
|
||
the window close BEFORE the first send (run fired too late to do
|
||
any work; nothing to resume).
|
||
|
||
## Resume action
|
||
|
||
A paused run can be resumed by the operator. Mechanism:
|
||
|
||
- New server action `resumeReminderRunAction(runId)` validates
|
||
ownership, then enqueues a pg-boss job:
|
||
`boss.send("reminder.fire", { reminderId, runId })` with NO
|
||
singletonKey (resumes don't conflict with the reminder's normal
|
||
cron firing).
|
||
- The fire-reminder handler accepts an optional `runId` in its
|
||
payload. When present, it ATTACHES to that run instead of creating
|
||
a new one:
|
||
- Skips creating a new `reminder_runs` row.
|
||
- Loads the existing run's `reminder_run_targets` rows.
|
||
- Iterates only those with `status = 'pending'`.
|
||
- Re-uses the same windowEnd / rate limiter / media cache logic as
|
||
a fresh fire.
|
||
- On window close again, status flips back to `paused` with an
|
||
updated count.
|
||
- On success this round, status becomes `success` (if no failures
|
||
accumulated) or `partial` (if some failed).
|
||
- `failed` targets from the previous run are NOT retried on resume.
|
||
They're real errors — surfacing them as actionable in the UI is a
|
||
v2 concern (manual "retry failed" button).
|
||
|
||
UI surfaces of paused runs:
|
||
|
||
- Activity tab gets an amber "Paused" pill alongside the existing
|
||
Success/Partial/Failed/Skipped/Archived filters. Resume button
|
||
inline on each paused row.
|
||
- Reminder detail page's run history shows the same Resume button on
|
||
paused rows.
|
||
- The `reminder.fired` SSE event for status=paused triggers a
|
||
notification with title "Reminder paused" and body
|
||
`"X of Y groups delivered. Resume from the Activity tab."`
|
||
|
||
## Notification body
|
||
|
||
The existing `reminder.fired` SSE event already carries `{ status }`.
|
||
The notification mapper extends:
|
||
|
||
- `success` → unchanged.
|
||
- `partial` → body mentions delivered/total counts when present.
|
||
- `paused` → headline `"Reminder paused"`, body
|
||
`"X of Y groups delivered. Resume from the Activity tab."` Click
|
||
takes the operator to the reminder's detail page where the Resume
|
||
button lives.
|
||
- `failed` → unchanged.
|
||
- `skipped` → still filtered (bookkeeping noise).
|
||
|
||
## Components
|
||
|
||
| File | Role | LOC est. |
|
||
| --- | --- | --- |
|
||
| `migrations/0008_*.sql` | add 2 int columns to `reminders` | <20 |
|
||
| `packages/db/src/schema.ts` | drizzle alignment | <10 |
|
||
| `apps/bot/src/scheduler/per-key-mutex.ts` (new) | accountId-keyed async mutex | ~40 |
|
||
| `apps/bot/src/scheduler/rate-limiter.ts` (new) | per-account token bucket | ~60 |
|
||
| `apps/bot/src/scheduler/media-upload-cache.ts` (new) | `prepareWAMessageMedia` results, keyed by mediaId | ~50 |
|
||
| `apps/bot/src/scheduler/delivery-window.ts` (new) | pure window-end calculator | ~30 |
|
||
| `apps/bot/src/scheduler/fire-reminder.ts` (rewrite) | new loop using all of the above | ~200 |
|
||
| `apps/bot/src/scheduler/reminder-jobs.ts` | `teamSize` config | <10 |
|
||
| `apps/bot/src/env.ts` | `BOT_FIRE_CONCURRENCY`, `BOT_MAX_SEND_PER_MINUTE`, `BOT_GROUP_CONCURRENCY` | <20 |
|
||
| `apps/web/src/actions/reminders.ts` | accept the two new fields | <30 |
|
||
| `apps/web/src/components/reminder-wizard/when-form-client.tsx` | "Delivery hours" inputs | <40 |
|
||
| `apps/web/src/components/reminder-edit/edit-when-form.tsx` | same | <30 |
|
||
| `apps/web/src/lib/notifications.ts` | partial-status body extension | <15 |
|
||
|
||
## Tests
|
||
|
||
- `delivery-window.test.ts` — pure function. Window in past →
|
||
next-day end; window crosses midnight (start > end) — explicitly
|
||
reject in the schema; timezone offsets handled correctly.
|
||
- `rate-limiter.test.ts` — fake-clock token bucket. N tokens drained,
|
||
then refill rate; backpressure via `acquire()` returning a Promise.
|
||
- `per-key-mutex.test.ts` — different keys do NOT block each other
|
||
(parallelism); same key DOES (serialisation); a throwing handler
|
||
releases the lock; cleanup removes empty entries.
|
||
- `media-upload-cache.test.ts` — mock socket: `prepare` called once
|
||
per unique mediaId regardless of how many groups consume it.
|
||
- `fire-reminder.test.ts` (extend) — window-end gate marks remaining
|
||
targets `skipped`; partial-status error_summary includes account /
|
||
delivered / total context.
|
||
|
||
## Tuning knobs (env)
|
||
|
||
| Var | Default | Effect |
|
||
| --- | --- | --- |
|
||
| `BOT_FIRE_CONCURRENCY` | 8 | pg-boss worker pool size; max accounts running simultaneously |
|
||
| `BOT_GROUP_CONCURRENCY` | 3 | per-account parallel group sends |
|
||
| `BOT_MAX_SEND_PER_MINUTE` | 40 | per-account token-bucket rate; loosen to 60 if no flags after weeks of running, tighten to 20 if any rate-limit response |
|
||
|
||
Per-reminder `delivery_window_start_hour` / `delivery_window_end_hour`
|
||
default to 6/18 and can be widened (e.g. 0/24) for a specific big run.
|
||
|
||
## Out of scope (v2 candidates)
|
||
|
||
- **Crash resumability across bot restarts.** If the bot dies
|
||
mid-fan-out (mid-window), pg-boss will retry the job; the loop's
|
||
pre-loaded `pending` rows still pick up correctly, but the
|
||
in-memory rate-limiter and upload-cache state are gone — the
|
||
retry re-uploads media and starts pacing from a full bucket. The
|
||
paused-state resumability covered above is a different mechanism:
|
||
it handles the "window closed cleanly" case end-to-end. The
|
||
"bot crashed mid-window" case is degraded but not broken.
|
||
- **Auto-resume next morning** when window opens again (today the
|
||
operator clicks Resume manually).
|
||
- **Pause-by-operator** (only window-close pauses; user-triggered
|
||
pause mid-fan-out isn't wired).
|
||
- **Retry-failed-targets** action (paused-resume only re-attempts
|
||
`pending` rows; `failed` rows stay failed).
|
||
- **Multi-account auto-split** of a single reminder.
|
||
- **Adaptive rate limiting** (auto-back-off on WA rate-limit response
|
||
codes; today the operator tunes the env var).
|
||
|
||
## Acceptance
|
||
|
||
- 1000-group reminder with one image, established account: completes
|
||
in roughly 30–50 minutes, comfortably inside a 6am–6pm window.
|
||
- Two reminders on different accounts firing within seconds of each
|
||
other: both progress simultaneously, neither blocks the other.
|
||
- A run that hits the window end mid-fan-out: stops cleanly, marks
|
||
the run `paused`, leaves un-started targets as `pending`, surfaces
|
||
the paused-status notification with delivered/total counts.
|
||
- The operator clicks **Resume** on a paused run — fan-out continues
|
||
from the unsent targets, respecting the same per-account rate
|
||
limit + window. If it again can't finish, it pauses again with an
|
||
updated count.
|
||
- A run that hits the window end BEFORE any send (fired too late):
|
||
resolves `failed`, no resume offered.
|
||
- 355 existing tests still pass; ≈30 new tests cover the new helpers
|
||
and the paused/resume flow.
|