docs: initial design spec for WhatsApp reminder bot

Captures the validated design from the brainstorming session: two-service
topology (Next.js web + Node bot) communicating via Postgres LISTEN/NOTIFY,
Baileys for WhatsApp, grammy for Telegram, pg-boss for scheduling, Drizzle
for the data model, and Docker/Gitea-registry deploy flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yiekheng 2026-05-03 14:36:39 +08:00
commit 42caa0d37a

# WhatsApp Reminder Bot — Design
**Status:** Draft
**Date:** 2026-05-03
**Author:** yiekheng (developer); operator: brother (single end-user)
## 1. Purpose
Self-hosted WhatsApp reminder bot. The operator manages 10+ WhatsApp accounts (each tied to a different business responsibility), schedules one-off and recurring reminder messages (text, photos, videos) to specific WhatsApp groups, and receives login QR codes through a private Telegram bot. The system runs 24/7 on the operator's home Docker server, behind a reverse proxy, with images pulled from a self-hosted Gitea registry.
## 2. Stakeholders & access
- **Developer (you):** builds and maintains. Has full access to dev environment (mock WA account, dev Telegram bot).
- **Operator (brother):** the single end-user in production. Pairs all real WA accounts, creates and manages reminders, receives QR codes via Telegram. Holds all production credentials.
- **Customers (in WA groups):** unaware of the bot — they just receive messages from the WA accounts the operator owns.
Access to the bot is gated by **Telegram user ID whitelist** (configured in env). Web UI access requires a Telegram-issued magic link, so only Telegram-trusted operators can sign in to the dashboard.
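A minimal sketch of the whitelist check, assuming the IDs arrive as a comma-separated string in an env variable. That format and both function names are illustrative; the spec only states that the whitelist is configured in env. Telegram user IDs fit in a JavaScript safe integer, so plain `number` is used here.

```typescript
// Hypothetical helpers: the comma-separated format is an assumption of this sketch.
function parseWhitelist(raw: string): Set<number> {
  const ids = new Set<number>();
  for (const part of raw.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue; // tolerate trailing commas and blank entries
    if (!/^\d+$/.test(trimmed)) {
      throw new Error(`Invalid Telegram user ID: "${trimmed}"`);
    }
    const id = Number(trimmed);
    if (!Number.isSafeInteger(id)) {
      throw new Error(`Telegram user ID out of range: "${trimmed}"`);
    }
    ids.add(id);
  }
  return ids;
}

// Updates with no sender at all are rejected along with unknown senders.
function isWhitelisted(whitelist: ReadonlySet<number>, userId: number | undefined): boolean {
  return userId !== undefined && whitelist.has(userId);
}
```

In a grammy middleware the second argument would typically be `ctx.from?.id`; the same check can guard the code path that issues magic links.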
## 3. Constraints accepted up front
- **Unofficial WhatsApp protocol.** Built on Baileys (`@whiskeysockets/baileys`). Violates WhatsApp ToS. Account ban risk is non-zero, especially for spam-pattern usage. Acceptable for this customer-reminder use case where messages go to known groups.
- **Self-hosted infrastructure.** Postgres at `192.168.0.210` (already running). Home Docker server runs Portainer; reverse proxy is aaPanel. Domain `04080616.xyz` is available for the web UI subdomain.
- **Self-hosted Gitea.** Git remote at `http://192.168.0.215:3000/yiekheng/cm_whatsapp_bot_v1.git`. Container registry at `gitea.04080616.xyz/yiekheng`.
- **Single-operator threat model.** No tenant isolation. Both developer and operator are effectively admins. The repo is private to the developer. `.env` files **may** be committed to the private Gitea (operator's choice — documented trade-off below).
## 4. High-level architecture
Two app containers + one external dependency. Communication between apps goes through Postgres only.
```
┌─────────────────────────────────────────────────────────────────────┐
│                         Home Docker server                          │
│                                                                     │
│  ┌─────────────┐         ┌─────────────┐                            │
│  │     web     │         │     bot     │                            │
│  │  (Next.js)  │◄───────►│  (Node.js)  │                            │
│  │             │   via   │             │                            │
│  │  PWA        │ Postgres│  Baileys    │                            │
│  │  Dashboard  │ LISTEN/ │  Telegram   │                            │
│  │  API routes │ NOTIFY  │  pg-boss    │                            │
│  └──────┬──────┘         └──────┬──────┘                            │
│         │                       │                                   │
│         │   shared volume:      │                                   │
│         │   /data/media         │  /data/sessions/<account_id>/     │
│         │                       │  (Baileys auth state)             │
│         │                       │                                   │
│         └───────────┬───────────┘                                   │
│                     │                                               │
│                     ▼                                               │
│           aaPanel reverse proxy ─► bot.04080616.xyz                 │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
               │                                   │
               ▼                                   ▼
   Postgres at 192.168.0.210            Telegram Bot API (cloud)
   (whatsapp_bot_dev /                  via grammy long-polling
    whatsapp_bot_prod)                  (or webhook later)
```
### Service responsibilities
| Service | Stateless | Owns |
|---|---|---|
| `web` | Yes (can restart freely) | UI, auth, server actions, media upload, SSE for live updates |
| `bot` | No (long-lived sessions) | All Baileys WhatsApp sessions, Telegram bot, pg-boss scheduler, reminder firing, group sync |
**Why split:** Next.js is built for stateless request/response; WhatsApp sessions are long-lived stateful WebSockets that must survive web deploys. Splitting lets us redeploy `web` (frontend changes) without dropping any active WA sessions.
### Why Postgres-as-bus instead of Redis or HTTP
- One less service to run, one less dependency to monitor.
- All bot↔web communication shares the same transactional boundary as data writes — if a write commits, downstream listeners see it.
- pg-boss provides BullMQ-equivalent functionality (delayed jobs, recurring jobs, retries, dead-letter) at a scale (10+ accounts, hundreds of reminders/day max) where Redis throughput advantages are irrelevant.
- LISTEN/NOTIFY covers live UI updates (e.g., "session connected" toast).
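To make the bus concrete, here is a hedged sketch of how payloads on the `web.event` channel might be encoded and decoded. The two event shapes are the ones this spec mentions; the helper names and the decision to drop malformed payloads are assumptions. The size guard reflects PostgreSQL's documented default limit that a NOTIFY payload must be shorter than 8000 bytes, which is why events carry IDs rather than full rows.

```typescript
// Event shapes taken from the flows in this spec; helper names are hypothetical.
type WebEvent =
  | { type: "session.connected"; account_id: string }
  | { type: "reminder.fired"; run_id: string };

// NOTIFY payloads must be shorter than 8000 bytes in the default configuration.
const MAX_NOTIFY_BYTES = 7999;

function encodeEvent(event: WebEvent): string {
  const payload = JSON.stringify(event);
  const bytes = new TextEncoder().encode(payload).length;
  if (bytes > MAX_NOTIFY_BYTES) {
    throw new Error(`NOTIFY payload too large: ${bytes} bytes`);
  }
  return payload; // would be passed as the second argument of pg_notify('web.event', ...)
}

// Returns null for anything that is not a well-formed, known event.
function decodeEvent(payload: string): WebEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(payload);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const obj = parsed as Record<string, unknown>;
  if (obj.type === "session.connected" && typeof obj.account_id === "string") {
    return { type: "session.connected", account_id: obj.account_id };
  }
  if (obj.type === "reminder.fired" && typeof obj.run_id === "string") {
    return { type: "reminder.fired", run_id: obj.run_id };
  }
  return null;
}
```

Listeners only receive notifications while connected, so a consumer built on this would still re-read state from the tables after a reconnect.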
## 5. Tech stack
- **Language:** TypeScript everywhere.
- **Frontend:** Next.js 16 (App Router), Server Components + Server Actions, PWA-installable (manifest + browser service worker for offline app shell). Visual design will be built using the `frontend-design:frontend-design` skill during implementation.
- **Backend:** Node.js 22 + TypeScript for the `bot` service.
- **WhatsApp:** `@whiskeysockets/baileys` (no browser; pure WebSocket).
- **Telegram:** `grammy` framework (long-polling in dev, can switch to webhook in prod).
- **Database:** PostgreSQL (external at `192.168.0.210`). Drizzle ORM. Migrations in `packages/db/migrations`.
- **Job queue:** `pg-boss` (Postgres-native).
- **QR rendering:** `qrcode` library (string → PNG buffer).
- **Recurrence:** `rrule` library (RFC 5545).
- **Logging:** `pino` (JSON to stdout).
- **Validation:** `zod` (env, request bodies, server actions).
- **Build orchestration:** pnpm workspaces + Turborepo.
## 6. Repository layout
```
cm_whatsapp_bot_v1/
├── apps/
│ ├── web/ Next.js — UI + API routes + PWA
│ └── bot/ Node — Baileys + Telegram + scheduler
├── packages/
│ ├── db/ Drizzle schema, migrations, queries
│ └── shared/ cross-app types, rrule helpers, paths
├── docker/
│ ├── web.Dockerfile
│ └── bot.Dockerfile
├── docker-compose.base.yml service definitions, networks
├── docker-compose.dev.yml dev overrides: hot reload, exposed ports
├── docker-compose.prod.yml prod: registry images, named volumes
├── scripts/
│ ├── dev.sh up | down | logs | status | reset-db
│ ├── publish.sh build & push to Gitea registry
│ ├── gen_auth_secret.sh generate AUTH_SECRET
│ ├── db.sh migrate | rollback | seed | studio | reset
│ └── link-account.sh CLI helper to pair the dev WA mock account
├── envs/
│ ├── .env.example documented template
│ ├── .env.development dev TG bot, mock WA account, dev DB
│ └── .env.production prod TG bot, real accounts, prod DB
└── docs/
└── superpowers/specs/
└── 2026-05-03-whatsapp-bot-design.md
```
## 7. Environment separation
| Concern | Local dev | Production |
|---|---|---|
| Compose file | `base + dev.yml` | `base + prod.yml` |
| Image source | local build | `gitea.04080616.xyz/yiekheng/cm-whatsapp-{web,bot}:${IMAGE_TAG}` |
| Postgres database | `whatsapp_bot_dev` on `192.168.0.210` | `whatsapp_bot_prod` on `192.168.0.210` |
| Postgres role | dev role with limited grants | prod role |
| Telegram bot | separate dev bot (`@..._dev_bot`) — operator's QR codes never go to prod chat | production bot |
| WhatsApp accounts | mock/test phone | operator's real 10+ accounts |
| Web URL | `http://localhost:3000` | `https://bot.04080616.xyz` (subdomain to be confirmed) |
| Hot reload | yes (Next.js HMR + tsx watch) | no |
| Volumes | `./dev-data/{media,sessions}` bind mounts | named volumes |
The `bot` service runs on its own internal port (8081) for health checks; not exposed externally in either env.
Env validation runs at startup via zod. Missing or malformed values cause an immediate exit with a clear message, so neither service comes up half-configured.
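The fail-fast behavior can be illustrated with a dependency-free stand-in for the zod schema. The variable names (`DATABASE_URL`, `TELEGRAM_BOT_TOKEN`, `HEALTH_PORT`) and the `BotEnv` shape are assumptions made for this sketch; the real schema would be written with zod and may differ.

```typescript
// Hypothetical env shape; the actual variable names are not fixed by this spec.
interface BotEnv {
  databaseUrl: string;
  telegramBotToken: string;
  healthPort: number;
}

// Collects every problem before failing, so one run reports all misconfigurations.
function loadEnv(source: Record<string, string | undefined>): BotEnv {
  const problems: string[] = [];
  const required = (key: string): string => {
    const value = source[key]?.trim();
    if (!value) {
      problems.push(`${key} is required`);
      return "";
    }
    return value;
  };

  const databaseUrl = required("DATABASE_URL");
  if (databaseUrl !== "" && !/^postgres(ql)?:\/\//.test(databaseUrl)) {
    problems.push("DATABASE_URL must be a postgres:// URL");
  }
  const telegramBotToken = required("TELEGRAM_BOT_TOKEN");

  // 8081 is the internal health-check port named in this section.
  const healthPort = Number(source["HEALTH_PORT"] ?? "8081");
  if (!Number.isInteger(healthPort) || healthPort < 1 || healthPort > 65535) {
    problems.push("HEALTH_PORT must be an integer between 1 and 65535");
  }

  if (problems.length > 0) {
    throw new Error(`Invalid environment:\n- ${problems.join("\n- ")}`);
  }
  return { databaseUrl, telegramBotToken, healthPort };
}
```

At startup the caller would catch this error, log the message, and exit non-zero before opening any WhatsApp or Telegram connection.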
## 8. Deploy flow
```
dev machine                     Gitea (192.168.0.215)       home server (Portainer)
────────────                    ──────────────────────      ──────────────────────
git push                   ───► cm_whatsapp_bot_v1.git
scripts/publish.sh v1.0.0  ───► gitea.04080616.xyz/
                                  yiekheng/
                                    cm-whatsapp-web:v1.0.0
                                    cm-whatsapp-bot:v1.0.0
                                                            Portainer stack →
                                                              docker-compose.prod.yml
                                                              IMAGE_TAG=v1.0.0
                                                            pulls images, runs containers
                                                            aaPanel proxy →
                                                              bot.04080616.xyz → web:3000
```
Image tags:
- `latest` — current main HEAD (manual publish for now; CI later if needed).
- `vX.Y.Z` — release tags for production rollouts; pin in `.env.production`.
- `dev-<short-sha>` — ad-hoc images for testing on the home server before cutting a release.
Rollback = change `IMAGE_TAG` in `.env.production` and recreate the stack in Portainer.
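As an illustration of the tagging scheme, a small guard like the following could classify an `IMAGE_TAG` and refuse anything other than a release tag for production. This is a sketch, not part of the planned scripts: the function names are hypothetical, and the accepted short-SHA length (7 to 40 hex characters) is an assumption.

```typescript
type TagKind = "latest" | "release" | "dev";

// Maps a tag onto the three kinds listed above, or null if it matches none.
function classifyImageTag(tag: string): TagKind | null {
  if (tag === "latest") return "latest";
  if (/^v\d+\.\d+\.\d+$/.test(tag)) return "release";
  if (/^dev-[0-9a-f]{7,40}$/.test(tag)) return "dev"; // SHA length is an assumption
  return null;
}

// Only pinned vX.Y.Z tags are meant for production rollouts.
function isProductionTag(tag: string): boolean {
  return classifyImageTag(tag) === "release";
}
```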
## 9. Data model
ORM: Drizzle. Migrations versioned in `packages/db/migrations/`.
### Tables
```
operators — people who can use the bot
─────────────────────────────────
id uuid pk
telegram_user_id bigint unique — primary identity (whitelist key)
display_name text
role text — 'admin' (only role for v1)
default_timezone text — IANA, e.g. 'Asia/Kuala_Lumpur'
created_at timestamptz
whatsapp_accounts — each WA account the operator manages
─────────────────────────────────
id uuid pk
operator_id uuid fk → operators
label text — operator-defined, e.g. "Sales 1"
phone_number text nullable — populated after pairing
status text — pending | connecting | connected
| disconnected | logged_out | banned
last_connected_at timestamptz nullable
last_qr_at timestamptz nullable
created_at timestamptz
unique(operator_id, label)
whatsapp_groups — groups discovered per account
─────────────────────────────────
id uuid pk
account_id uuid fk → whatsapp_accounts
wa_group_jid text — WhatsApp's group JID
name text
participant_count int
is_archived bool default false
last_synced_at timestamptz
unique(account_id, wa_group_jid)
media_files — uploaded photos/videos/documents
─────────────────────────────────
id uuid pk
operator_id uuid fk → operators
filename_original text
mime_type text
size_bytes bigint
sha256 text
storage_path text — relative to /data/media/
created_at timestamptz
reminders — scheduled sends
─────────────────────────────────
id uuid pk
account_id uuid fk → whatsapp_accounts
name text
schedule_kind text — 'one_off' | 'recurring'
scheduled_at timestamptz nullable — for one_off
rrule text nullable — RFC 5545 rrule string
timezone text — IANA
ends_at timestamptz nullable
max_runs int nullable
status text — 'active' | 'paused' | 'ended'
created_by uuid fk → operators
created_at timestamptz
updated_at timestamptz
reminder_targets — groups a reminder fires into
─────────────────────────────────
reminder_id uuid fk → reminders
group_id uuid fk → whatsapp_groups
position int
pk(reminder_id, group_id)
reminder_messages — message parts in send order
─────────────────────────────────
id uuid pk
reminder_id uuid fk → reminders
position int
kind text — 'text' | 'image' | 'video' | 'document'
text_content text nullable — text body or media caption
media_id uuid fk → media_files nullable
reminder_runs — execution records
─────────────────────────────────
id uuid pk
reminder_id uuid fk → reminders
fired_at timestamptz
status text — 'pending' | 'success' | 'partial' | 'failed' | 'skipped'
error_summary text nullable
reminder_run_targets — per-target outcomes
─────────────────────────────────
run_id uuid fk → reminder_runs
group_id uuid fk → whatsapp_groups
status text — 'sent' | 'failed' | 'skipped'
wa_message_id text nullable
error text nullable
latency_ms int nullable
pk(run_id, group_id)
audit_log — append-only action history
─────────────────────────────────
id uuid pk
operator_id uuid fk → operators nullable
source text — 'web' | 'telegram' | 'system'
action text — 'reminder.create' | 'account.pair' | ...
target_type text nullable
target_id uuid nullable
payload jsonb
created_at timestamptz
auth_sessions — web UI cookies
─────────────────────────────────
id uuid pk
operator_id uuid fk → operators
token_hash text unique — SHA-256 of cookie value
created_at timestamptz
expires_at timestamptz
last_used_at timestamptz
ip_address inet nullable
user_agent text nullable
```
`pg-boss` creates and owns its own `pgboss.*` schema in the same database — namespace-isolated, no manual setup required beyond initial migration.
### Key model decisions
- **Recurring schedules use RRULE (RFC 5545), not cron.** RRULE expresses "every Monday and Wednesday at 9am, 20 occurrences" naturally; cron cannot. Library: `rrule` on Node.
- **Timezone is per-reminder, not per-account.** Operator may run accounts spanning markets in different time zones. Default fills in from operator's `default_timezone`.
- **Baileys auth state on disk, not in Postgres.** Path `/data/sessions/<whatsapp_account_id>/`, using Baileys `useMultiFileAuthState`, the helper Baileys ships for this. The state is many small, frequently mutating files (Signal protocol keys), which is a poor fit for Postgres rows. The volume is part of the host backup strategy.
- **Audit log is append-only.** Never updated, only inserted. Powers "who created this", "when did this account get paired", etc.
- **Media in object-store-like layout on disk.** Path `/data/media/{yyyy/mm}/{uuid}.{ext}`. Postgres holds metadata only. Sweeper deletes media unreferenced by any reminder after configurable retention (default 90 days). Migration path to MinIO/S3 later: only the storage adapter changes.
- **Web auth via Telegram magic link.** Operator types `/login` to the Telegram bot → bot replies with a one-time URL → click sets a session cookie via `auth_sessions`. No passwords. The operator pool is exactly the Telegram-whitelisted set.
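The `auth_sessions.token_hash` column stores a SHA-256 of the cookie value, so a database leak does not expose usable cookies. A minimal sketch of that hashing with Node's built-in `crypto` module follows; the function names, the 32-byte token length, and the base64url encoding are assumptions of this example.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// SHA-256 of the cookie value, hex-encoded, as stored in auth_sessions.token_hash.
function hashToken(cookieValue: string): string {
  return createHash("sha256").update(cookieValue, "utf8").digest("hex");
}

// Issues a fresh random cookie value together with the hash to persist.
function newSessionToken(): { cookieValue: string; tokenHash: string } {
  const cookieValue = randomBytes(32).toString("base64url");
  return { cookieValue, tokenHash: hashToken(cookieValue) };
}

// Checks a presented cookie against a stored row: hash must match and the row must not be expired.
function isSessionValid(
  row: { token_hash: string; expires_at: Date },
  cookieValue: string,
  now: Date,
): boolean {
  const presented = Buffer.from(hashToken(cookieValue), "hex");
  const stored = Buffer.from(row.token_hash, "hex");
  if (presented.length !== stored.length) return false;
  return timingSafeEqual(presented, stored) && row.expires_at.getTime() > now.getTime();
}
```

In practice the row would be looked up by `token_hash` through its unique index; the comparison here just makes the expiry and match rules explicit.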
### Out of v1 (YAGNI; easy to add later)
- Templates with variable substitution (`{customer_name}`, `{day}`).
- Multi-tenant operator isolation beyond the existing whitelist.
- Per-customer message personalization.
- Conversation threads beyond the reminder firing log.
- A/B testing of reminder content.
- Web push notifications (Telegram already pushes alerts).
## 10. QR pairing flow (headline UX)
```
1. Operator (Telegram): /pair "Sales Account 3"
2. Bot inserts whatsapp_accounts row { status: 'pending', label: 'Sales Account 3' }
3. Bot starts Baileys session for that account_id
├─ session dir: /data/sessions/<account_id>/
└─ uses useMultiFileAuthState (auto-persists creds + signal keys)
4. Baileys emits connection.update { qr: '...' }
5. Bot renders QR string → PNG, sends to operator's TG chat
"📱 Scan with WhatsApp on Sales Account 3. Expires in 30s."
(Baileys re-emits QR every ~20s; bot edits the same TG message via editMessageMedia)
6. Operator scans → Baileys emits connection.update { connection: 'open' }
7. Bot updates row { status: 'connected', phone_number: '+60xxx', last_connected_at: now }
Bot sends TG: "✅ Sales Account 3 connected as +60xxxxxxx"
Bot pgNotify('web.event', { type: 'session.connected', account_id })
8. Bot triggers group-sync → upserts whatsapp_groups
Bot sends TG: "Synced 12 groups. Ready to send."
```
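The pairing steps above amount to a small state machine over `whatsapp_accounts.status`. The sketch below models it as a pure reducer; the `PairingEvent` type is a simplified, hypothetical projection of Baileys' `connection.update` payloads, and the five-cycle limit is the timeout proposed in the edge-case table of this section.

```typescript
type AccountStatus =
  | "pending" | "connecting" | "connected"
  | "disconnected" | "logged_out" | "banned";

// Simplified view of connection.update; the real payload carries more fields.
type PairingEvent =
  | { kind: "qr" }
  | { kind: "open" }
  | { kind: "close"; loggedOut: boolean };

interface PairingState {
  status: AccountStatus;
  qrCycles: number;  // how many QR codes have been shown so far
  timedOut: boolean; // true once the QR limit is exceeded without a scan
}

const MAX_QR_CYCLES = 5;

function reducePairing(state: PairingState, event: PairingEvent): PairingState {
  switch (event.kind) {
    case "qr": {
      // A sixth QR means five have expired unscanned, so the pairing gives up.
      const qrCycles = state.qrCycles + 1;
      return { status: "pending", qrCycles, timedOut: qrCycles > MAX_QR_CYCLES };
    }
    case "open":
      return { status: "connected", qrCycles: 0, timedOut: false };
    case "close":
      return { ...state, status: event.loggedOut ? "logged_out" : "disconnected" };
  }
}
```

Keeping the transitions pure makes them easy to unit-test with mocked Baileys events; the Telegram messages and database writes would hang off the state changes.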
### Pairing edge cases
| Situation | Behavior |
|---|---|
| QR expires (no scan in ~30s) | Baileys re-emits; bot edits same TG message with new QR. After 5 cycles (~2.5 min) the bot times out the pairing, leaves the account `pending`, and sends TG: "Pairing timed out — try `/pair` again." |
| Bot container restart mid-pairing | Startup sweeper drops any `pending` accounts with stale `last_qr_at`; operator re-runs `/pair`. |
| `/pair` on already-connected label | Reject: "Account 'X' already connected. Use `/unpair X` first." |
| WA logout from phone (linked-device removed) | Baileys emits `connection.update` with `connection: 'close'` and disconnect reason `loggedOut`. Bot marks `logged_out`, sends TG alert with re-pair instruction. Reminders for that account skip with reason `account_logged_out`. |
| Network drop on connected session | Baileys auto-reconnects (built-in). Alert only if downtime >5 min. |
| Web-initiated pair | Same flow; QR PNG also streamed to the open web modal via SSE so operator can scan from web instead of phone-Telegram. |
## 11. Reminder execution flow
```
On reminder create/edit (from web or Telegram):
→ DB row inserted/updated (transaction with reminder_targets, reminder_messages)
→ pgNotify('bot.command', { type: 'reminder.upsert', id })
→ bot.scheduler upserts the reminder into pg-boss:
one_off → schedule single delayed job at scheduled_at
recurring → compute next occurrence from rrule, schedule delayed job;
on completion, fire-reminder schedules the next occurrence
When pg-boss fires the job:
fire-reminder.handler:
1. Load reminder + targets + messages from DB
2. Insert reminder_runs { status: 'pending', fired_at: now }
3. Acquire account session from session-manager
- If not connected: mark all targets 'skipped', update run status, exit
4. For each target group:
a. For each message part in position order:
- text → sendTextMessage
- media → load /data/media/<path>, sendMedia with optional caption
b. Insert reminder_run_targets { status, wa_message_id, latency_ms }
c. Throttle: jitter between targets to stay under WA rate limits
5. Roll up reminder_runs.status:
all sent → 'success'; all failed → 'failed'; mix → 'partial'
6. pgNotify('web.event', { type: 'reminder.fired', run_id })
7. If recurring and not at ends_at / max_runs:
schedule next occurrence in pg-boss
Else if at end:
update reminder.status = 'ended'
```
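Steps 5 and 7 of the handler are pure decisions and can be sketched without any I/O. The function names are hypothetical. The handling of an empty target list, and of runs that mix skipped targets with sent or failed ones, is an assumption; the flow above only defines the all-sent, all-failed, and mixed cases.

```typescript
type TargetStatus = "sent" | "failed" | "skipped";
type RunStatus = "success" | "partial" | "failed" | "skipped";

// Step 5: roll per-target outcomes up into one run status.
function rollUpRunStatus(targets: readonly TargetStatus[]): RunStatus {
  if (targets.length === 0) return "skipped"; // assumption: nothing to send counts as skipped
  if (targets.every((s) => s === "sent")) return "success";
  if (targets.every((s) => s === "failed")) return "failed";
  if (targets.every((s) => s === "skipped")) return "skipped";
  return "partial";
}

interface RecurrenceLimits {
  endsAt: Date | null;    // reminders.ends_at
  maxRuns: number | null; // reminders.max_runs
}

// Step 7: decide whether another occurrence should be queued.
// nextOccurrence would come from the rrule library; null means the rule is exhausted.
function shouldScheduleNext(
  limits: RecurrenceLimits,
  completedRuns: number,
  nextOccurrence: Date | null,
): boolean {
  if (nextOccurrence === null) return false;
  if (limits.maxRuns !== null && completedRuns >= limits.maxRuns) return false;
  if (limits.endsAt !== null && nextOccurrence.getTime() > limits.endsAt.getTime()) return false;
  return true;
}
```

When `shouldScheduleNext` returns false, the handler would set `reminders.status` to `'ended'` instead of queuing another job.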
## 12. Error handling
| Failure | Detection | Response |
|---|---|---|
| WA send transient (timeout, network) | Baileys throws / promise rejects | Retry via pg-boss with exponential backoff (3 tries: 30s/2m/10m). Final failure → mark the `reminder_run_targets` row `failed`, dashboard + TG alert. |
| WA send permanent (group not found, banned account) | Specific error codes | No retry. Mark target failed with reason. If account banned → mark `whatsapp_accounts.status='banned'`, urgent TG alert. |
| WA session disconnect | `connection.update` event | Auto-reconnect. Downtime >5 min → TG alert. Reminders during downtime → `skipped`. |
| WA logout | reason `loggedOut` | `status='logged_out'`. Stop reconnect attempts. TG: "Account X logged out — re-pair." |
| Telegram delivery failure | grammy throws | Retry once. Then log to `audit_log` only — don't recurse via TG (TG itself might be down). |
| Postgres connection lost | Drizzle errors | Both services exit non-zero (Docker restarts them). Health checks fail loudly during outage. |
| Media file missing on disk | `fs.stat` fails before send | Mark target `failed` with error `media_missing`; don't send a placeholder. TG alert. |
| pg-boss job lost / corrupted | pg-boss own retry → dead-letter | Surface in admin "failed jobs" view; manual retry button. |
| WA rate limit | Specific error | Throttle sender to 1 send per 3 s per account, with jitter between sends. On a rate-limit error, back off for longer before resuming. |
| Unauthorized Telegram user | Whitelist middleware | Reply: "Sorry, this bot is private." Log to `audit_log`. No state change. |
| Web session expired | Cookie validation fails | Redirect to `/login`. |
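The transient-versus-permanent split in the first two rows can be sketched as a classifier plus a delay lookup. The failure labels are hypothetical names for this example; real classification would inspect the errors Baileys surfaces. How the 30s/2m/10m schedule maps onto pg-boss retry options is left open here.

```typescript
// Hypothetical failure labels, not Baileys error codes.
type SendFailure = "timeout" | "network" | "group_not_found" | "account_banned";

// Delays before retry 1, 2 and 3, as listed in the table: 30s, 2m, 10m.
const RETRY_DELAYS_MS: readonly number[] = [30_000, 120_000, 600_000];

function isTransient(failure: SendFailure): boolean {
  return failure === "timeout" || failure === "network";
}

// Returns the wait before the next retry, or null when the target should be
// marked failed (permanent error, or all retries used up).
function nextRetryDelayMs(failure: SendFailure, retriesSoFar: number): number | null {
  if (!isTransient(failure)) return null;
  return RETRY_DELAYS_MS[retriesSoFar] ?? null;
}
```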
### Observability
- **Logs:** `pino` JSON to stdout, captured by Docker.
- **Health endpoints:**
- `web`: `GET /api/health` — DB ping + uptime + commit SHA.
- `bot`: internal port 8081, `GET /health` — DB ping + per-session counts (`{ connected: 8, disconnected: 1, pending: 0 }`).
- **Per-reminder audit trail:** `reminder_runs` + `reminder_run_targets` history, exposed in dashboard. Every fire is fully reconstructable.
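A possible shape for the `bot` health payload is sketched below. It counts sessions by `whatsapp_accounts.status` and reports the DB ping result; the `ok` and `sessions` field names and both function names are assumptions, since the spec only fixes what information the endpoint exposes.

```typescript
type AccountStatus =
  | "pending" | "connecting" | "connected"
  | "disconnected" | "logged_out" | "banned";

interface BotHealth {
  ok: boolean; // result of the DB ping
  sessions: Record<AccountStatus, number>;
}

// Every status appears in the output, including those with a zero count.
function countSessions(statuses: readonly AccountStatus[]): Record<AccountStatus, number> {
  const counts: Record<AccountStatus, number> = {
    pending: 0, connecting: 0, connected: 0,
    disconnected: 0, logged_out: 0, banned: 0,
  };
  for (const status of statuses) counts[status] += 1;
  return counts;
}

function buildBotHealth(dbOk: boolean, statuses: readonly AccountStatus[]): BotHealth {
  return { ok: dbOk, sessions: countSessions(statuses) };
}
```

The HTTP handler on port 8081 would serialize this object as JSON and could return a non-200 status when `ok` is false, so Docker health checks fail during a database outage.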
## 13. Testing strategy
| Layer | Tool | Scope |
|---|---|---|
| Unit | Vitest | rrule helpers, message-part assembly, audit log builders, env validation, error classifiers. No I/O. |
| Integration (DB) | Vitest + local dev Postgres (or Testcontainers) | Drizzle queries, pg-boss schedule sync, LISTEN/NOTIFY round-trip. Per-test schema with teardown. |
| Bot session logic | Vitest with mocked Baileys | Session-manager state transitions, QR rendering, group-sync upsert. No real WA connection. |
| Telegram | Vitest with mocked grammy | Command parsing, whitelist middleware, error responses. |
| Web E2E | Playwright (deferred) | Login (stubbed magic link), reminder create wizard, dashboard. Add when CI exists. |
| Pairing flow | Manual checklist | Real WA pairing requires a real phone — documented in `docs/superpowers/specs/manual-test-pairing.md`. Run before each release. |
### CI
Out of scope for v1. `pnpm test` and `pnpm lint` will run via husky + lint-staged on `git push`. Gitea Actions can be wired later.
## 14. Scripts
All scripts live in `scripts/`. Patterned on `cm_bot_v2`.
| Script | Purpose |
|---|---|
| `dev.sh` | `up \| down \| logs \| status \| reset-db` against `docker-compose.dev.yml`. Pre-flight checks for `.env.development`. Honors `NO_SUDO=1`. `reset-db` truncates only `whatsapp_bot_dev` with a confirmation prompt. |
| `publish.sh` | Build + push images to `gitea.04080616.xyz/yiekheng/cm-whatsapp-{web,bot}:<tag>`. Default tag `latest`. Same auth-error guidance as the cm_bot_v2 reference. |
| `gen_auth_secret.sh` | Generate `AUTH_SECRET` (32 hex bytes). `--write [path]` mode appends/replaces in env file. |
| `db.sh` | Drizzle migration wrapper: `migrate \| rollback \| seed \| studio \| reset`. `reset` is dev-only, refuses if env points at prod DB. |
| `link-account.sh` | CLI helper to start a WA pairing flow without going through Telegram. Emits QR straight to the terminal. Useful for the dev mock account. |
| `local_build.sh` | One-liner foreground compose up. Convenience. |
## 15. Open questions for implementation phase
- Confirm subdomain choice: `bot.04080616.xyz` vs `whatsapp.04080616.xyz` vs other.
- Confirm Postgres connectivity from Docker bridge (`172.16.0.0/12`) is allowed in the existing `pg_hba.conf` on `192.168.0.210`. If not, add the entry before first deploy.
- Confirm operator's IANA timezone for `default_timezone` seed value.
- Decide media retention default (proposing 90 days; sweeper job runs daily).
- Decide whether to enforce a minimum interval between recurring fires (proposing 5 minutes).
These don't block design approval — they're settled during the writing-plans phase or first implementation step.