commit 42caa0d37ae86a612ac29869342ea9bdac823fe7
Author: yiekheng
Date:   Sun May 3 14:36:39 2026 +0800

    docs: initial design spec for WhatsApp reminder bot

    Captures the validated design from the brainstorming session:
    two-service topology (Next.js web + Node bot) communicating via
    Postgres LISTEN/NOTIFY, Baileys for WhatsApp, grammy for Telegram,
    pg-boss for scheduling, Drizzle for the data model, and
    Docker/Gitea-registry deploy flow.

    Co-Authored-By: Claude Opus 4.7 (1M context)

diff --git a/docs/superpowers/specs/2026-05-03-whatsapp-bot-design.md b/docs/superpowers/specs/2026-05-03-whatsapp-bot-design.md
new file mode 100644
index 0000000..d438ba7
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-03-whatsapp-bot-design.md
@@ -0,0 +1,433 @@
+# WhatsApp Reminder Bot — Design
+
+**Status:** Draft
+**Date:** 2026-05-03
+**Author:** yiekheng (developer); operator: brother (single end-user)
+
+## 1. Purpose
+
+Self-hosted WhatsApp reminder bot. The operator manages 10+ WhatsApp accounts (each tied to a different business responsibility), schedules one-off and recurring reminder messages — text, photos, videos — to specific WhatsApp groups, and receives login QR codes through a private Telegram bot. The system runs 24/7 on the operator's home Docker server, behind a reverse proxy, with images pulled from a self-hosted Gitea registry.
+
+## 2. Stakeholders & access
+
+- **Developer (you):** builds and maintains. Has full access to dev environment (mock WA account, dev Telegram bot).
+- **Operator (brother):** the single end-user in production. Pairs all real WA accounts, creates and manages reminders, receives QR codes via Telegram. Holds all production credentials.
+- **Customers (in WA groups):** unaware of the bot — they just receive messages from the WA accounts the operator owns.
+
+Access to the bot is gated by **Telegram user ID whitelist** (configured in env). Web UI access requires a Telegram-issued magic link, so only Telegram-trusted operators can sign in to the dashboard.
+
+## 3. Constraints accepted up front
+
+- **Unofficial WhatsApp protocol.** Built on Baileys (`@whiskeysockets/baileys`). Violates WhatsApp ToS. Account ban risk is non-zero, especially for spam-pattern usage. Acceptable for this customer-reminder use case where messages go to known groups.
+- **Self-hosted infrastructure.** Postgres at `192.168.0.210` (already running). Home Docker server runs Portainer; reverse proxy is aaPanel. Domain `04080616.xyz` is available for the web UI subdomain.
+- **Self-hosted Gitea.** Git remote at `http://192.168.0.215:3000/yiekheng/cm_whatsapp_bot_v1.git`. Container registry at `gitea.04080616.xyz/yiekheng`.
+- **Single-operator threat model.** No tenant isolation. Both developer and operator are effectively admins. The repo is private to the developer. `.env` files **may** be committed to the private Gitea (operator's choice — documented trade-off below).
+
+## 4. High-level architecture
+
+Two app containers + one external dependency. Communication between apps goes through Postgres only.
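As a rough illustration of what "through Postgres only" could mean in code, the sketch below models the two NOTIFY channels this spec names in sections 10 and 11 (`web.event` and `bot.command`) and the three event types it mentions. The type names, helper names, and the payload-size guard are assumptions made for this example, not part of the validated design. No database driver is involved: the helpers only build the SQL text and parameters a client would send.

```typescript
// Channel names and event types come from the spec; every type name and
// helper name below is hypothetical and exists only for illustration.
type Channel = "web.event" | "bot.command";

type BusEvent =
  | { type: "session.connected"; account_id: string }
  | { type: "reminder.fired"; run_id: string }
  | { type: "reminder.upsert"; id: string };

// In the default Postgres configuration a NOTIFY payload must be shorter
// than 8000 bytes, so events carry row IDs rather than whole rows.
const MAX_PAYLOAD_BYTES = 7999;

// UTF-8 byte length without relying on Node- or DOM-specific globals.
function utf8Length(s: string): number {
  let n = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0x80) n += 1;
    else if (c < 0x800) n += 2;
    else if (c >= 0xd800 && c <= 0xdbff) {
      n += 4; // surrogate pair encodes one 4-byte code point
      i++;
    } else n += 3;
  }
  return n;
}

interface NotifyQuery {
  text: string;
  values: [string, string];
}

// Builds a parameterised pg_notify call for any Postgres client to execute.
function buildNotify(channel: Channel, event: BusEvent): NotifyQuery {
  const payload = JSON.stringify(event);
  if (utf8Length(payload) > MAX_PAYLOAD_BYTES) {
    throw new Error(`NOTIFY payload too large for channel ${channel}`);
  }
  return { text: "SELECT pg_notify($1, $2)", values: [channel, payload] };
}

// LISTEN takes an identifier, so a dotted channel name must be double-quoted.
function buildListen(channel: Channel): string {
  return `LISTEN "${channel}"`;
}

// Parses a received payload; returns null for anything unrecognised.
function parseEvent(payload: string): BusEvent | null {
  let value: unknown;
  try {
    value = JSON.parse(payload);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const v = value as Record<string, unknown>;
  const type = v["type"];
  const accountId = v["account_id"];
  const runId = v["run_id"];
  const id = v["id"];
  if (type === "session.connected" && typeof accountId === "string") {
    return { type: "session.connected", account_id: accountId };
  }
  if (type === "reminder.fired" && typeof runId === "string") {
    return { type: "reminder.fired", run_id: runId };
  }
  if (type === "reminder.upsert" && typeof id === "string") {
    return { type: "reminder.upsert", id };
  }
  return null;
}
```

Because `LISTEN` takes an identifier rather than a string, a dotted channel name has to be double-quoted, while `pg_notify` accepts it as an ordinary text argument. Keeping payloads down to row IDs also stays well under the default NOTIFY payload limit.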
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                         Home Docker server                          │
+│                                                                     │
+│  ┌─────────────┐         ┌─────────────┐                            │
+│  │     web     │         │     bot     │                            │
+│  │  (Next.js)  │◄───────►│  (Node.js)  │                            │
+│  │             │   via   │             │                            │
+│  │  PWA        │ Postgres│  Baileys    │                            │
+│  │  Dashboard  │ LISTEN/ │  Telegram   │                            │
+│  │  API routes │ NOTIFY  │  pg-boss    │                            │
+│  └──────┬──────┘         └──────┬──────┘                            │
+│         │                       │                                   │
+│         │  shared volume:       │                                   │
+│         │  /data/media          │  /data/sessions//                 │
+│         │                       │  (Baileys auth state)             │
+│         │                       │                                   │
+│         └───────────┬───────────┘                                   │
+│                     │                                               │
+│                     ▼                                               │
+│      aaPanel reverse proxy ─► bot.04080616.xyz                      │
+│                                                                     │
+└─────────────────────────────────────────────────────────────────────┘
+              │                                 │
+              ▼                                 ▼
+   Postgres at 192.168.0.210         Telegram Bot API (cloud)
+   (whatsapp_bot_dev /               via grammy long-polling
+    whatsapp_bot_prod)               (or webhook later)
+```
+
+### Service responsibilities
+
+| Service | Stateless | Owns |
+|---|---|---|
+| `web` | Yes (can restart freely) | UI, auth, server actions, media upload, SSE for live updates |
+| `bot` | No (long-lived sessions) | All Baileys WhatsApp sessions, Telegram bot, pg-boss scheduler, reminder firing, group sync |
+
+**Why split:** Next.js is built for stateless request/response; WhatsApp sessions are long-lived stateful WebSockets that must survive web deploys. Splitting lets us redeploy `web` (frontend changes) without dropping any active WA sessions.
+
+### Why Postgres-as-bus instead of Redis or HTTP
+
+- One less service to run, one less dependency to monitor.
+- All bot↔web communication shares the same transactional boundary as data writes — if a write commits, downstream listeners see it.
+- pg-boss provides BullMQ-equivalent functionality (delayed jobs, recurring jobs, retries, dead-letter) at a scale (10+ accounts, hundreds of reminders/day max) where Redis throughput advantages are irrelevant.
+- LISTEN/NOTIFY covers live UI updates (e.g., "session connected" toast).
+
+## 5. Tech stack
+
+- **Language:** TypeScript everywhere.
+- **Frontend:** Next.js 16 (App Router), Server Components + Server Actions, PWA-installable (manifest + browser service worker for offline app shell). Visual design will be built using the `frontend-design:frontend-design` skill during implementation.
+- **Backend:** Node.js 22 + TypeScript for the `bot` service.
+- **WhatsApp:** `@whiskeysockets/baileys` (no browser; pure WebSocket).
+- **Telegram:** `grammy` framework (long-polling in dev, can switch to webhook in prod).
+- **Database:** PostgreSQL (external at `192.168.0.210`). Drizzle ORM. Migrations in `packages/db/migrations`.
+- **Job queue:** `pg-boss` (Postgres-native).
+- **QR rendering:** `qrcode` library (string → PNG buffer).
+- **Recurrence:** `rrule` library (RFC 5545).
+- **Logging:** `pino` (JSON to stdout).
+- **Validation:** `zod` (env, request bodies, server actions).
+- **Build orchestration:** pnpm workspaces + Turborepo.
+
+## 6. Repository layout
+
+```
+cm_whatsapp_bot_v1/
+├── apps/
+│   ├── web/                    Next.js — UI + API routes + PWA
+│   └── bot/                    Node — Baileys + Telegram + scheduler
+├── packages/
+│   ├── db/                     Drizzle schema, migrations, queries
+│   └── shared/                 cross-app types, rrule helpers, paths
+├── docker/
+│   ├── web.Dockerfile
+│   └── bot.Dockerfile
+├── docker-compose.base.yml     service definitions, networks
+├── docker-compose.dev.yml      dev overrides: hot reload, exposed ports
+├── docker-compose.prod.yml     prod: registry images, named volumes
+├── scripts/
+│   ├── dev.sh                  up | down | logs | status | reset-db
+│   ├── publish.sh              build & push to Gitea registry
+│   ├── gen_auth_secret.sh      generate AUTH_SECRET
+│   ├── db.sh                   migrate | rollback | seed | studio | reset
+│   └── link-account.sh         CLI helper to pair the dev WA mock account
+├── envs/
+│   ├── .env.example            documented template
+│   ├── .env.development        dev TG bot, mock WA account, dev DB
+│   └── .env.production         prod TG bot, real accounts, prod DB
+└── docs/
+    └── superpowers/specs/
+        └── 2026-05-03-whatsapp-bot-design.md
+```
+
+## 7. Environment separation
+
+| Concern | Local dev | Production |
+|---|---|---|
+| Compose file | `base + dev.yml` | `base + prod.yml` |
+| Image source | local build | `gitea.04080616.xyz/yiekheng/cm-whatsapp-{web,bot}:${IMAGE_TAG}` |
+| Postgres database | `whatsapp_bot_dev` on `192.168.0.210` | `whatsapp_bot_prod` on `192.168.0.210` |
+| Postgres role | dev role with limited grants | prod role |
+| Telegram bot | separate dev bot (`@..._dev_bot`) — operator's QR codes never go to prod chat | production bot |
+| WhatsApp accounts | mock/test phone | operator's real 10+ accounts |
+| Web URL | `http://localhost:3000` | `https://bot.04080616.xyz` (subdomain to be confirmed) |
+| Hot reload | yes (Next.js HMR + tsx watch) | no |
+| Volumes | `./dev-data/{media,sessions}` bind mounts | named volumes |
+
+The `bot` service runs on its own internal port (8081) for health checks; not exposed externally in either env.
+
+Env-validation runs at startup via zod. Missing or malformed env values cause an immediate fast-fail exit with a clear message — both services refuse to come up half-configured.
+
+## 8. Deploy flow
+
+```
+dev machine                    Gitea (192.168.0.215)          home server (Portainer)
+────────────                   ──────────────────────         ──────────────────────
+git push                  ───► cm_whatsapp_bot_v1.git
+scripts/publish.sh v1.0.0 ───► gitea.04080616.xyz/
+                                 yiekheng/
+                                   cm-whatsapp-web:v1.0.0
+                                   cm-whatsapp-bot:v1.0.0
+                                                              Portainer stack →
+                                                                docker-compose.prod.yml
+                                                                IMAGE_TAG=v1.0.0
+                                                                pulls images, runs containers
+                                                              aaPanel proxy →
+                                                                bot.04080616.xyz → web:3000
+```
+
+Image tags:
+
+- `latest` — current main HEAD (manual publish for now; CI later if needed).
+- `vX.Y.Z` — release tags for production rollouts; pin in `.env.production`.
+- `dev-` — ad-hoc images for testing on the home server before cutting a release.
+
+Rollback = change `IMAGE_TAG` in `.env.production` and recreate the stack in Portainer.
+
+## 9. Data model
+
+ORM: Drizzle. Migrations versioned in `packages/db/migrations/`.
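Before the table listing, here is a minimal sketch of how the two run-status vocabularies defined below (`reminder_run_targets.status` and `reminder_runs.status`) might be mirrored as TypeScript unions, together with the roll-up rule from section 11 (all sent → `success`, all failed → `failed`, mix → `partial`). The function name is hypothetical, and two behaviors are assumptions the spec does not settle: an empty or all-skipped target list rolls up to `skipped`, and any other combination involving `skipped` counts as `partial`.

```typescript
// Status vocabularies copied from the reminder_run_targets and
// reminder_runs tables in this spec.
type TargetStatus = "sent" | "failed" | "skipped";
type RunStatus = "success" | "partial" | "failed" | "skipped";

// Hypothetical helper: derives the run-level status from per-target outcomes.
function rollUpRunStatus(targets: readonly TargetStatus[]): RunStatus {
  // Assumption: no targets, or every target skipped, means the run was skipped.
  if (targets.length === 0 || targets.every((t) => t === "skipped")) {
    return "skipped";
  }
  // From section 11: all sent -> 'success'.
  if (targets.every((t) => t === "sent")) return "success";
  // From section 11: all failed -> 'failed'.
  if (targets.every((t) => t === "failed")) return "failed";
  // From section 11: any mix -> 'partial'.
  // Assumption: a mix that includes 'skipped' is treated the same way.
  return "partial";
}
```

Keeping the rule in one pure function would let the unit-test layer in section 13 cover it without any database or WhatsApp session.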
+
+### Tables
+
+```
+operators — people who can use the bot
+─────────────────────────────────
+id                  uuid pk
+telegram_user_id    bigint unique         — primary identity (whitelist key)
+display_name        text
+role                text                  — 'admin' (only role for v1)
+default_timezone    text                  — IANA, e.g. 'Asia/Kuala_Lumpur'
+created_at          timestamptz
+
+whatsapp_accounts — each WA account the operator manages
+─────────────────────────────────
+id                  uuid pk
+operator_id         uuid fk → operators
+label               text                  — operator-defined, e.g. "Sales 1"
+phone_number        text nullable         — populated after pairing
+status              text                  — pending | connecting | connected
+                                            | disconnected | logged_out | banned
+last_connected_at   timestamptz nullable
+last_qr_at          timestamptz nullable
+created_at          timestamptz
+unique(operator_id, label)
+
+whatsapp_groups — groups discovered per account
+─────────────────────────────────
+id                  uuid pk
+account_id          uuid fk → whatsapp_accounts
+wa_group_jid        text                  — WhatsApp's group JID
+name                text
+participant_count   int
+is_archived         bool default false
+last_synced_at      timestamptz
+unique(account_id, wa_group_jid)
+
+media_files — uploaded photos/videos/documents
+─────────────────────────────────
+id                  uuid pk
+operator_id         uuid fk → operators
+filename_original   text
+mime_type           text
+size_bytes          bigint
+sha256              text
+storage_path        text                  — relative to /data/media/
+created_at          timestamptz
+
+reminders — scheduled sends
+─────────────────────────────────
+id                  uuid pk
+account_id          uuid fk → whatsapp_accounts
+name                text
+schedule_kind       text                  — 'one_off' | 'recurring'
+scheduled_at        timestamptz nullable  — for one_off
+rrule               text nullable         — RFC 5545 rrule string
+timezone            text                  — IANA
+ends_at             timestamptz nullable
+max_runs            int nullable
+status              text                  — 'active' | 'paused' | 'ended'
+created_by          uuid fk → operators
+created_at          timestamptz
+updated_at          timestamptz
+
+reminder_targets — groups a reminder fires into
+─────────────────────────────────
+reminder_id         uuid fk → reminders
+group_id            uuid fk → whatsapp_groups
+position            int
+pk(reminder_id, group_id)
+
+reminder_messages — message parts in send order
+─────────────────────────────────
+id                  uuid pk
+reminder_id         uuid fk → reminders
+position            int
+kind                text                  — 'text' | 'image' | 'video' | 'document'
+text_content        text nullable         — text body or media caption
+media_id            uuid fk → media_files nullable
+
+reminder_runs — execution records
+─────────────────────────────────
+id                  uuid pk
+reminder_id         uuid fk → reminders
+fired_at            timestamptz
+status              text                  — 'success' | 'partial' | 'failed' | 'skipped'
+error_summary       text nullable
+
+reminder_run_targets — per-target outcomes
+─────────────────────────────────
+run_id              uuid fk → reminder_runs
+group_id            uuid fk → whatsapp_groups
+status              text                  — 'sent' | 'failed' | 'skipped'
+wa_message_id       text nullable
+error               text nullable
+latency_ms          int nullable
+pk(run_id, group_id)
+
+audit_log — append-only action history
+─────────────────────────────────
+id                  uuid pk
+operator_id         uuid fk → operators nullable
+source              text                  — 'web' | 'telegram' | 'system'
+action              text                  — 'reminder.create' | 'account.pair' | ...
+target_type         text nullable
+target_id           uuid nullable
+payload             jsonb
+created_at          timestamptz
+
+auth_sessions — web UI cookies
+─────────────────────────────────
+id                  uuid pk
+operator_id         uuid fk → operators
+token_hash          text unique           — SHA-256 of cookie value
+created_at          timestamptz
+expires_at          timestamptz
+last_used_at        timestamptz
+ip_address          inet nullable
+user_agent          text nullable
+```
+
+`pg-boss` creates and owns its own `pgboss.*` schema in the same database — namespace-isolated, no manual setup required beyond initial migration.
+
+### Key model decisions
+
+- **Recurring schedules use RRULE (RFC 5545), not cron.** RRULE expresses "every Monday and Wednesday at 9am, 20 occurrences" naturally; cron cannot. Library: `rrule` on Node.
+- **Timezone is per-reminder, not per-account.** Operator may run accounts spanning markets in different time zones. Default fills in from operator's `default_timezone`.
+- **Baileys auth state on disk, not in Postgres.** Path `/data/sessions//`, using Baileys `useMultiFileAuthState`. That's the upstream-supported path; the file set is many small frequently-mutating files (signal protocol keys); Postgres is the wrong shape. Volume is part of host backup strategy.
+- **Audit log is append-only.** Never updated, only inserted. Powers "who created this", "when did this account get paired", etc.
+- **Media in object-store-like layout on disk.** Path `/data/media/{yyyy/mm}/{uuid}.{ext}`. Postgres holds metadata only. Sweeper deletes media unreferenced by any reminder after configurable retention (default 90 days). Migration path to MinIO/S3 later: only the storage adapter changes.
+- **Web auth via Telegram magic link.** Operator types `/login` to the Telegram bot → bot replies with a one-time URL → click sets a session cookie via `auth_sessions`. No passwords. The operator pool is exactly the Telegram-whitelisted set.
+
+### Out of v1 (YAGNI; easy to add later)
+
+- Templates with variable substitution (`{customer_name}`, `{day}`).
+- Multi-tenant operator isolation beyond the existing whitelist.
+- Per-customer message personalization.
+- Conversation threads beyond the reminder firing log.
+- A/B testing of reminder content.
+- Web push notifications (Telegram already pushes alerts).
+
+## 10. QR pairing flow (headline UX)
+
+```
+1. Operator (Telegram): /pair "Sales Account 3"
+2. Bot inserts whatsapp_accounts row { status: 'pending', label: 'Sales Account 3' }
+3. Bot starts Baileys session for that account_id
+   ├─ session dir: /data/sessions//
+   └─ uses useMultiFileAuthState (auto-persists creds + signal keys)
+4. Baileys emits connection.update { qr: '...' }
+5. Bot renders QR string → PNG, sends to operator's TG chat
+   "📱 Scan with WhatsApp on Sales Account 3. Expires in 30s."
+   (Baileys re-emits QR every ~20s; bot edits the same TG message via editMessageMedia)
+6. Operator scans → Baileys emits connection.update { connection: 'open' }
+7. Bot updates row { status: 'connected', phone_number: '+60xxx', last_connected_at: now }
+   Bot sends TG: "✅ Sales Account 3 connected as +60xxxxxxx"
+   Bot pgNotify('web.event', { type: 'session.connected', account_id })
+8. Bot triggers group-sync → upserts whatsapp_groups
+   Bot sends TG: "Synced 12 groups. Ready to send."
+```
+
+### Pairing edge cases
+
+| Situation | Behavior |
+|---|---|
+| QR expires (no scan in ~30s) | Baileys re-emits; bot edits same TG message with new QR. After 5 cycles (~2.5 min): timeout, mark account `pending`, TG: "Pairing timed out — try `/pair` again." |
+| Bot container restart mid-pairing | Startup sweeper drops any `pending` accounts with stale `last_qr_at`; operator re-runs `/pair`. |
+| `/pair` on already-connected label | Reject: "Account 'X' already connected. Use `/unpair X` first." |
+| WA logout from phone (linked-device removed) | Baileys `connection.close` with reason `loggedOut`. Bot marks `logged_out`, sends TG alert with re-pair instruction. Reminders for that account skip with reason `account_logged_out`. |
+| Network drop on connected session | Baileys auto-reconnects (built-in). Alert only if downtime >5 min. |
+| Web-initiated pair | Same flow; QR PNG also streamed to the open web modal via SSE so operator can scan from web instead of phone-Telegram. |
+
+## 11. Reminder execution flow
+
+```
+On reminder create/edit (from web or Telegram):
+  → DB row inserted/updated (transaction with reminder_targets, reminder_messages)
+  → pgNotify('bot.command', { type: 'reminder.upsert', id })
+  → bot.scheduler upserts the reminder into pg-boss:
+      one_off   → schedule single delayed job at scheduled_at
+      recurring → compute next occurrence from rrule, schedule delayed job;
+                  on completion, fire-reminder schedules the next occurrence
+
+When pg-boss fires the job:
+  fire-reminder.handler:
+    1. Load reminder + targets + messages from DB
+    2. Insert reminder_runs { status: 'pending', fired_at: now }
+    3. Acquire account session from session-manager
+       - If not connected: mark all targets 'skipped', update run status, exit
+    4. For each target group:
+       a. For each message part in position order:
+          - text  → sendTextMessage
+          - media → load /data/media/, sendMedia with optional caption
+       b. Insert reminder_run_targets { status, wa_message_id, latency_ms }
+       c. Throttle: jitter between targets to stay under WA rate limits
+    5. Roll up reminder_runs.status:
+       all sent → 'success'; all failed → 'failed'; mix → 'partial'
+    6. pgNotify('web.event', { type: 'reminder.fired', run_id })
+    7. If recurring and not at ends_at / max_runs:
+         schedule next occurrence in pg-boss
+       Else if at end:
+         update reminder.status = 'ended'
+```
+
+## 12. Error handling
+
+| Failure | Detection | Response |
+|---|---|---|
+| WA send transient (timeout, network) | Baileys throws / promise rejects | Retry via pg-boss with exponential backoff (3 tries: 30s/2m/10m). Final failure → mark `reminder_run_targets.status='failed'`, dashboard + TG alert. |
+| WA send permanent (group not found, banned account) | Specific error codes | No retry. Mark target failed with reason. If account banned → mark `whatsapp_accounts.status='banned'`, urgent TG alert. |
+| WA session disconnect | `connection.update` event | Auto-reconnect. Downtime >5 min → TG alert. Reminders during downtime → `skipped`. |
+| WA logout | reason `loggedOut` | `status='logged_out'`. Stop reconnect attempts. TG: "Account X logged out — re-pair." |
+| Telegram delivery failure | grammy throws | Retry once. Then log to `audit_log` only — don't recurse via TG (TG itself might be down). |
+| Postgres connection lost | Drizzle errors | Both services exit non-zero (Docker restarts them). Health checks fail loudly during outage. |
+| Media file missing on disk | `fs.stat` fails before send | Mark target `media_missing`, don't send placeholder. TG alert. |
+| pg-boss job lost / corrupted | pg-boss own retry → dead-letter | Surface in admin "failed jobs" view; manual retry button. |
+| WA rate limit | Specific error | Throttle sender to 1 send / 3 sec per account, jitter between. Backoff longer. |
+| Unauthorized Telegram user | Whitelist middleware | Reply: "Sorry, this bot is private." Log to `audit_log`. No state change. |
+| Web session expired | Cookie validation fails | Redirect to `/login`. |
+
+### Observability
+
+- **Logs:** `pino` JSON to stdout, captured by Docker.
+- **Health endpoints:**
+  - `web`: `GET /api/health` — DB ping + uptime + commit SHA.
+  - `bot`: internal port 8081, `GET /health` — DB ping + per-session counts (`{ connected: 8, disconnected: 1, pending: 0 }`).
+- **Per-reminder audit trail:** `reminder_runs` + `reminder_run_targets` history, exposed in dashboard. Every fire is fully reconstructable.
+
+## 13. Testing strategy
+
+| Layer | Tool | Scope |
+|---|---|---|
+| Unit | Vitest | rrule helpers, message-part assembly, audit log builders, env validation, error classifiers. No I/O. |
+| Integration (DB) | Vitest + local dev Postgres (or Testcontainers) | Drizzle queries, pg-boss schedule sync, LISTEN/NOTIFY round-trip. Per-test schema with teardown. |
+| Bot session logic | Vitest with mocked Baileys | Session-manager state transitions, QR rendering, group-sync upsert. No real WA connection. |
+| Telegram | Vitest with mocked grammy | Command parsing, whitelist middleware, error responses. |
+| Web E2E | Playwright (deferred) | Login (stubbed magic link), reminder create wizard, dashboard. Add when CI exists. |
+| Pairing flow | Manual checklist | Real WA pairing requires a real phone — documented in `docs/superpowers/specs/manual-test-pairing.md`. Run before each release. |
+
+### CI
+
+Out of scope for v1. `pnpm test` and `pnpm lint` will run via husky + lint-staged on `git push`. Gitea Actions can be wired later.
+
+## 14. Scripts
+
+All scripts live in `scripts/`.
+Patterned on `cm_bot_v2`.
+
+| Script | Purpose |
+|---|---|
+| `dev.sh` | `up \| down \| logs \| status \| reset-db` against `docker-compose.dev.yml`. Pre-flight checks for `.env.development`. Honors `NO_SUDO=1`. `reset-db` truncates only `whatsapp_bot_dev` with a confirmation prompt. |
+| `publish.sh` | Build + push images to `gitea.04080616.xyz/yiekheng/cm-whatsapp-{web,bot}:`. Default tag `latest`. Same auth-error guidance as the cm_bot_v2 reference. |
+| `gen_auth_secret.sh` | Generate `AUTH_SECRET` (32 hex bytes). `--write [path]` mode appends/replaces in env file. |
+| `db.sh` | Drizzle migration wrapper: `migrate \| rollback \| seed \| studio \| reset`. `reset` is dev-only, refuses if env points at prod DB. |
+| `link-account.sh` | CLI helper to start a WA pairing flow without going through Telegram. Emits QR straight to the terminal. Useful for the dev mock account. |
+| `local_build.sh` | One-liner foreground compose up. Convenience. |
+
+## 15. Open questions for implementation phase
+
+- Confirm subdomain choice: `bot.04080616.xyz` vs `whatsapp.04080616.xyz` vs other.
+- Confirm Postgres connectivity from Docker bridge (`172.16.0.0/12`) is allowed in the existing `pg_hba.conf` on `192.168.0.210`. If not, add the entry before first deploy.
+- Confirm operator's IANA timezone for `default_timezone` seed value.
+- Decide media retention default (proposing 90 days; sweeper job runs daily).
+- Decide whether to enforce a minimum interval between recurring fires (proposing 5 minutes).
+
+These don't block design approval — they're settled during the writing-plans phase or first implementation step.
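For the last open question, one possible shape for a minimum-interval guard is sketched below. It assumes the caller has already expanded the reminder's RRULE into concrete occurrence dates (that step is omitted here), and it uses the proposed 5-minute floor as a default rather than a settled value. `findIntervalViolation` and `MIN_INTERVAL_MS` are hypothetical names.

```typescript
// Proposed floor from the open questions above; not a final decision.
const MIN_INTERVAL_MS = 5 * 60 * 1000;

// Hypothetical helper: returns the first pair of consecutive occurrences
// that are closer together than minIntervalMs, or null if none are.
function findIntervalViolation(
  occurrences: readonly Date[],
  minIntervalMs: number = MIN_INTERVAL_MS,
): [Date, Date] | null {
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...occurrences].sort((a, b) => a.getTime() - b.getTime());
  let prev: Date | undefined;
  for (const cur of sorted) {
    if (prev !== undefined && cur.getTime() - prev.getTime() < minIntervalMs) {
      return [prev, cur];
    }
    prev = cur;
  }
  return null;
}
```

A create or edit action could run this over the first few expanded occurrences and reject the reminder when a pair comes back, which keeps the check independent of how the recurrence rule itself is written.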