cm_whatsapp_bot_v1/docs/superpowers/specs/2026-05-03-whatsapp-bot-design.md
yiekheng 0f949284b1 docs: lock in subdomain, timezone, retention defaults
Replace open-questions section with confirmed values from brainstorming
review: wabot.04080616.xyz subdomain, Asia/Kuala_Lumpur default timezone,
90-day media retention, 5-minute minimum recurrence interval. Postgres
pg_hba check kept as a pre-deploy verification step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:40:37 +08:00

WhatsApp Reminder Bot — Design

Status: Draft
Date: 2026-05-03
Author: yiekheng (developer); operator: brother (single end-user)

1. Purpose

Self-hosted WhatsApp reminder bot. The operator manages 10+ WhatsApp accounts (each tied to a different business responsibility), schedules one-off and recurring reminder messages (text, photos, videos) to specific WhatsApp groups, and receives login QR codes through a private Telegram bot. The system runs 24/7 on the operator's home Docker server behind a reverse proxy, with images pulled from a self-hosted Gitea container registry.

2. Stakeholders & access

  • Developer (you): builds and maintains. Has full access to dev environment (mock WA account, dev Telegram bot).
  • Operator (brother): the single end-user in production. Pairs all real WA accounts, creates and manages reminders, receives QR codes via Telegram. Holds all production credentials.
  • Customers (in WA groups): unaware of the bot — they just receive messages from the WA accounts the operator owns.

Access to the bot is gated by Telegram user ID whitelist (configured in env). Web UI access requires a Telegram-issued magic link, so only Telegram-trusted operators can sign in to the dashboard.
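
The whitelist gate described above can be sketched as a small pure check. This is a minimal sketch, not the project's implementation: the comma-separated format and the helper names `parseWhitelist` and `isAllowed` are assumptions, since the spec only says the whitelist is configured in env.

```typescript
// Hypothetical helpers: the spec does not name the env variable or its format.
// Assumed format: comma-separated numeric Telegram user IDs, e.g. "111,222,333".
function parseWhitelist(raw: string): Set<string> {
  const ids = raw
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);

  for (const id of ids) {
    // Telegram user IDs are positive integers; reject anything else at startup.
    if (!/^\d+$/.test(id)) {
      throw new Error(`Invalid Telegram user ID in whitelist: "${id}"`);
    }
  }
  return new Set(ids);
}

// IDs are compared as strings so number and bigint inputs behave the same way.
function isAllowed(whitelist: Set<string>, telegramUserId: number | bigint): boolean {
  return whitelist.has(String(telegramUserId));
}
```

Both the Telegram middleware and the magic-link issuer could call `isAllowed`, so there is a single definition of who counts as an operator.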

3. Constraints accepted up front

  • Unofficial WhatsApp protocol. Built on Baileys (@whiskeysockets/baileys). Violates WhatsApp ToS. Account ban risk is non-zero, especially for spam-pattern usage. Acceptable for this customer-reminder use case where messages go to known groups.
  • Self-hosted infrastructure. Postgres at 192.168.0.210 (already running). Home Docker server runs Portainer; reverse proxy is aaPanel. Domain 04080616.xyz is available for the web UI subdomain.
  • Self-hosted Gitea. Git remote at http://192.168.0.215:3000/yiekheng/cm_whatsapp_bot_v1.git. Container registry at gitea.04080616.xyz/yiekheng.
  • Single-operator threat model. No tenant isolation. Both developer and operator are effectively admins. The repo is private to the developer. .env files may be committed to the private Gitea (operator's choice — documented trade-off below).

4. High-level architecture

Two app containers + one external dependency. Communication between apps goes through Postgres only.

┌─────────────────────────────────────────────────────────────────────┐
│                        Home Docker server                           │
│                                                                     │
│  ┌─────────────┐         ┌─────────────┐                            │
│  │    web      │         │    bot      │                            │
│  │  (Next.js)  │◄───────►│ (Node.js)   │                            │
│  │             │  via    │             │                            │
│  │  PWA        │ Postgres│  Baileys    │                            │
│  │  Dashboard  │ LISTEN/ │  Telegram   │                            │
│  │  API routes │ NOTIFY  │  pg-boss    │                            │
│  └──────┬──────┘         └──────┬──────┘                            │
│         │                       │                                   │
│         │   shared volume:      │                                   │
│         │   /data/media         │   /data/sessions/<account_id>/    │
│         │                       │   (Baileys auth state)            │
│         │                       │                                   │
│         └───────────┬───────────┘                                   │
│                     │                                               │
│                     ▼                                               │
│         aaPanel reverse proxy ─► wabot.04080616.xyz                 │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
                      │                         │
                      ▼                         ▼
        Postgres at 192.168.0.210     Telegram Bot API (cloud)
        (whatsapp_bot_dev /            via grammy long-polling
         whatsapp_bot_prod)            (or webhook later)

Service responsibilities

| Service | Stateless | Owns |
|---|---|---|
| web | Yes (can restart freely) | UI, auth, server actions, media upload, SSE for live updates |
| bot | No (long-lived sessions) | All Baileys WhatsApp sessions, Telegram bot, pg-boss scheduler, reminder firing, group sync |

Why split: Next.js is built for stateless request/response; WhatsApp sessions are long-lived stateful WebSockets that must survive web deploys. Splitting lets us redeploy web (frontend changes) without dropping any active WA sessions.

Why Postgres-as-bus instead of Redis or HTTP

  • One less service to run, one less dependency to monitor.
  • All bot↔web communication shares the same transactional boundary as data writes — if a write commits, downstream listeners see it.
  • pg-boss provides BullMQ-equivalent functionality (delayed jobs, recurring jobs, retries, dead-letter) at a scale (10+ accounts, hundreds of reminders/day max) where Redis throughput advantages are irrelevant.
  • LISTEN/NOTIFY covers live UI updates (e.g., "session connected" toast).
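
As an illustration of the Postgres-as-bus idea, the sketch below encodes and decodes the small JSON messages that would travel over NOTIFY. The event shapes are taken from the flows later in this spec; the helper names `encodeEvent` and `decodeEvent` are assumptions. The size guard reflects the default Postgres limit on NOTIFY payloads (under 8000 bytes), which is why events carry IDs rather than full rows.

```typescript
// Event shapes mirror the pgNotify calls in the pairing and reminder flows.
type BusEvent =
  | { type: "session.connected"; account_id: string }
  | { type: "reminder.fired"; run_id: string }
  | { type: "reminder.upsert"; id: string };

// In the default configuration Postgres requires NOTIFY payloads shorter than 8000 bytes.
const MAX_NOTIFY_BYTES = 7999;

function encodeEvent(event: BusEvent): string {
  const payload = JSON.stringify(event);
  if (new TextEncoder().encode(payload).length > MAX_NOTIFY_BYTES) {
    throw new Error("NOTIFY payload too large; send an ID and load the row instead");
  }
  return payload;
}

// Returns null for anything that is not a well-formed known event,
// so a listener can ignore stray notifications instead of crashing.
function decodeEvent(payload: string): BusEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(payload);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const fields = parsed as Record<string, unknown>;
  const type = fields.type;
  const accountId = fields.account_id;
  const runId = fields.run_id;
  const id = fields.id;

  if (type === "session.connected" && typeof accountId === "string") {
    return { type: "session.connected", account_id: accountId };
  }
  if (type === "reminder.fired" && typeof runId === "string") {
    return { type: "reminder.fired", run_id: runId };
  }
  if (type === "reminder.upsert" && typeof id === "string") {
    return { type: "reminder.upsert", id };
  }
  return null;
}
```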

5. Tech stack

  • Language: TypeScript everywhere.
  • Frontend: Next.js 16 (App Router), Server Components + Server Actions, PWA-installable (manifest + browser service worker for offline app shell). Visual design will be built using the frontend-design:frontend-design skill during implementation.
  • Backend: Node.js 22 + TypeScript for the bot service.
  • WhatsApp: @whiskeysockets/baileys (no browser; pure WebSocket).
  • Telegram: grammy framework (long-polling in dev, can switch to webhook in prod).
  • Database: PostgreSQL (external at 192.168.0.210). Drizzle ORM. Migrations in packages/db/migrations.
  • Job queue: pg-boss (Postgres-native).
  • QR rendering: qrcode library (string → PNG buffer).
  • Recurrence: rrule library (RFC 5545).
  • Logging: pino (JSON to stdout).
  • Validation: zod (env, request bodies, server actions).
  • Build orchestration: pnpm workspaces + Turborepo.

6. Repository layout

cm_whatsapp_bot_v1/
├── apps/
│   ├── web/                      Next.js — UI + API routes + PWA
│   └── bot/                      Node — Baileys + Telegram + scheduler
├── packages/
│   ├── db/                       Drizzle schema, migrations, queries
│   └── shared/                   cross-app types, rrule helpers, paths
├── docker/
│   ├── web.Dockerfile
│   └── bot.Dockerfile
├── docker-compose.base.yml       service definitions, networks
├── docker-compose.dev.yml        dev overrides: hot reload, exposed ports
├── docker-compose.prod.yml       prod: registry images, named volumes
├── scripts/
│   ├── dev.sh                    up | down | logs | status | reset-db
│   ├── publish.sh                build & push to Gitea registry
│   ├── gen_auth_secret.sh        generate AUTH_SECRET
│   ├── db.sh                     migrate | rollback | seed | studio | reset
│   ├── link-account.sh           CLI helper to pair the dev WA mock account
│   └── local_build.sh            one-liner foreground compose up
├── envs/
│   ├── .env.example              documented template
│   ├── .env.development          dev TG bot, mock WA account, dev DB
│   └── .env.production           prod TG bot, real accounts, prod DB
└── docs/
    └── superpowers/specs/
        └── 2026-05-03-whatsapp-bot-design.md

7. Environment separation

| Concern | Local dev | Production |
|---|---|---|
| Compose file | base + dev.yml | base + prod.yml |
| Image source | local build | gitea.04080616.xyz/yiekheng/cm-whatsapp-{web,bot}:${IMAGE_TAG} |
| Postgres database | whatsapp_bot_dev on 192.168.0.210 | whatsapp_bot_prod on 192.168.0.210 |
| Postgres role | dev role with limited grants | prod role |
| Telegram bot | separate dev bot (@..._dev_bot); operator's QR codes never go to prod chat | production bot |
| WhatsApp accounts | mock/test phone | operator's real 10+ accounts |
| Web URL | http://localhost:3000 | https://wabot.04080616.xyz |
| Hot reload | yes (Next.js HMR + tsx watch) | no |
| Volumes | ./dev-data/{media,sessions} bind mounts | named volumes |

The bot service runs on its own internal port (8081) for health checks; not exposed externally in either env.

Env-validation runs at startup via zod. Missing or malformed env values cause an immediate fast-fail exit with a clear message — both services refuse to come up half-configured.
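
The fail-fast behavior can be illustrated without any dependency. The spec uses zod for this; the hand-rolled validator below is a stand-in that only shows the shape of the check (collect every problem, then refuse to start). `AUTH_SECRET` appears elsewhere in this spec, while `DATABASE_URL` and `TELEGRAM_BOT_TOKEN` are hypothetical variable names.

```typescript
// Stand-in for the zod schema; variable names other than AUTH_SECRET are assumptions.
type AppEnv = {
  DATABASE_URL: string;
  AUTH_SECRET: string;
  TELEGRAM_BOT_TOKEN: string;
};

function validateEnv(env: Record<string, string | undefined>): AppEnv {
  const problems: string[] = [];

  // Read a required variable, recording a problem instead of throwing immediately
  // so the operator sees every missing value in one message.
  const read = (key: keyof AppEnv): string => {
    const value = env[key]?.trim() ?? "";
    if (value === "") problems.push(`${key} is missing`);
    return value;
  };

  const DATABASE_URL = read("DATABASE_URL");
  const AUTH_SECRET = read("AUTH_SECRET");
  const TELEGRAM_BOT_TOKEN = read("TELEGRAM_BOT_TOKEN");

  if (DATABASE_URL !== "" && !/^postgres(ql)?:\/\//.test(DATABASE_URL)) {
    problems.push("DATABASE_URL must start with postgres:// or postgresql://");
  }

  if (problems.length > 0) {
    throw new Error(`Invalid environment:\n- ${problems.join("\n- ")}`);
  }
  return { DATABASE_URL, AUTH_SECRET, TELEGRAM_BOT_TOKEN };
}
```

At startup each service would call `validateEnv(process.env)` once, log the thrown message, and exit non-zero so Docker reports the container as failed rather than half-configured.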

8. Deploy flow

dev machine                         Gitea (192.168.0.215)        home server (Portainer)
────────────                        ──────────────────────       ──────────────────────
git push                       ───► cm_whatsapp_bot_v1.git
scripts/publish.sh v1.0.0      ───► gitea.04080616.xyz/
                                      yiekheng/
                                      cm-whatsapp-web:v1.0.0
                                      cm-whatsapp-bot:v1.0.0
                                                                Portainer stack →
                                                                  docker-compose.prod.yml
                                                                  IMAGE_TAG=v1.0.0
                                                                  pulls images, runs containers
                                                                aaPanel proxy →
                                                                  wabot.04080616.xyz → web:3000

Image tags:

  • latest — current main HEAD (manual publish for now; CI later if needed).
  • vX.Y.Z — release tags for production rollouts; pin in .env.production.
  • dev-<short-sha> — ad-hoc images for testing on the home server before cutting a release.

Rollback = change IMAGE_TAG in .env.production and recreate the stack in Portainer.
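
A publish or deploy script could guard against typos in IMAGE_TAG with a check like the one below. This is a sketch under assumptions: the helper names are hypothetical, and the 7 to 40 hex character range for the short SHA is a guess, since the spec does not fix its length.

```typescript
type ImageTagKind = "latest" | "release" | "dev";

// Classify a tag against the three conventions listed above; null means unrecognized.
function classifyImageTag(tag: string): ImageTagKind | null {
  if (tag === "latest") return "latest";
  if (/^v\d+\.\d+\.\d+$/.test(tag)) return "release";
  // Assumption: a short SHA is 7 to 40 lowercase hex characters.
  if (/^dev-[0-9a-f]{7,40}$/.test(tag)) return "dev";
  return null;
}

// Build the full registry reference used by docker-compose.prod.yml.
function imageRef(service: "web" | "bot", tag: string): string {
  if (classifyImageTag(tag) === null) {
    throw new Error(`Unrecognized image tag: ${tag}`);
  }
  return `gitea.04080616.xyz/yiekheng/cm-whatsapp-${service}:${tag}`;
}
```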

9. Data model

ORM: Drizzle. Migrations versioned in packages/db/migrations/.

Tables

operators                       — people who can use the bot
─────────────────────────────────
id                  uuid pk
telegram_user_id    bigint unique  — primary identity (whitelist key)
display_name        text
role                text           — 'admin' (only role for v1)
default_timezone    text           — IANA, default 'Asia/Kuala_Lumpur'
created_at          timestamptz

whatsapp_accounts               — each WA account the operator manages
─────────────────────────────────
id                  uuid pk
operator_id         uuid fk → operators
label               text           — operator-defined, e.g. "Sales 1"
phone_number        text nullable  — populated after pairing
status              text           — pending | connecting | connected
                                   | disconnected | logged_out | banned
last_connected_at   timestamptz nullable
last_qr_at          timestamptz nullable
created_at          timestamptz
unique(operator_id, label)

whatsapp_groups                 — groups discovered per account
─────────────────────────────────
id                  uuid pk
account_id          uuid fk → whatsapp_accounts
wa_group_jid        text           — WhatsApp's group JID
name                text
participant_count   int
is_archived         bool default false
last_synced_at      timestamptz
unique(account_id, wa_group_jid)

media_files                     — uploaded photos/videos/documents
─────────────────────────────────
id                  uuid pk
operator_id         uuid fk → operators
filename_original   text
mime_type           text
size_bytes          bigint
sha256              text
storage_path        text           — relative to /data/media/
created_at          timestamptz

reminders                       — scheduled sends
─────────────────────────────────
id                  uuid pk
account_id          uuid fk → whatsapp_accounts
name                text
schedule_kind       text           — 'one_off' | 'recurring'
scheduled_at        timestamptz nullable    — for one_off
rrule               text nullable           — RFC 5545 rrule string
timezone            text                    — IANA
ends_at             timestamptz nullable
max_runs            int nullable
status              text           — 'active' | 'paused' | 'ended'
created_by          uuid fk → operators
created_at          timestamptz
updated_at          timestamptz

reminder_targets                — groups a reminder fires into
─────────────────────────────────
reminder_id         uuid fk → reminders
group_id            uuid fk → whatsapp_groups
position            int
pk(reminder_id, group_id)

reminder_messages               — message parts in send order
─────────────────────────────────
id                  uuid pk
reminder_id         uuid fk → reminders
position            int
kind                text           — 'text' | 'image' | 'video' | 'document'
text_content        text nullable  — text body or media caption
media_id            uuid fk → media_files nullable

reminder_runs                   — execution records
─────────────────────────────────
id                  uuid pk
reminder_id         uuid fk → reminders
fired_at            timestamptz
status              text           — 'pending' | 'success' | 'partial' | 'failed' | 'skipped'
error_summary       text nullable

reminder_run_targets            — per-target outcomes
─────────────────────────────────
run_id              uuid fk → reminder_runs
group_id            uuid fk → whatsapp_groups
status              text           — 'sent' | 'failed' | 'skipped'
wa_message_id       text nullable
error               text nullable
latency_ms          int nullable
pk(run_id, group_id)

audit_log                       — append-only action history
─────────────────────────────────
id                  uuid pk
operator_id         uuid fk → operators nullable
source              text           — 'web' | 'telegram' | 'system'
action              text           — 'reminder.create' | 'account.pair' | ...
target_type         text nullable
target_id           uuid nullable
payload             jsonb
created_at          timestamptz

auth_sessions                   — web UI cookies
─────────────────────────────────
id                  uuid pk
operator_id         uuid fk → operators
token_hash          text unique    — SHA-256 of cookie value
created_at          timestamptz
expires_at          timestamptz
last_used_at        timestamptz
ip_address          inet nullable
user_agent          text nullable
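
Because auth_sessions.token_hash stores only the SHA-256 of the cookie value, a database leak does not expose usable cookies. A minimal sketch of that handling follows; the function names, the 32-byte token size, and the row shape are assumptions, not part of the schema above.

```typescript
import { createHash, randomBytes } from "node:crypto";

// SHA-256 of the cookie value, hex-encoded, as stored in auth_sessions.token_hash.
function hashToken(cookieValue: string): string {
  return createHash("sha256").update(cookieValue, "utf8").digest("hex");
}

// Assumption: 32 random bytes, base64url-encoded, is used as the cookie value.
function newSessionToken(): { cookieValue: string; tokenHash: string } {
  const cookieValue = randomBytes(32).toString("base64url");
  return { cookieValue, tokenHash: hashToken(cookieValue) };
}

// Hypothetical row shape: only the two columns needed for validation.
function isSessionValid(
  row: { tokenHash: string; expiresAt: Date },
  cookieValue: string,
  now: Date,
): boolean {
  return row.tokenHash === hashToken(cookieValue) && now.getTime() < row.expiresAt.getTime();
}
```

In practice the web service would look the row up by `hashToken(cookie)` through the unique index, then apply the expiry check and refresh last_used_at.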

pg-boss creates and owns its own pgboss.* schema in the same database — namespace-isolated, no manual setup required beyond initial migration.

Key model decisions

  • Recurring schedules use RRULE (RFC 5545), not cron. RRULE expresses "every Monday and Wednesday at 9am, 20 occurrences" naturally; cron cannot. Library: rrule on Node.
  • Timezone is per-reminder, not per-account. Operator may run accounts spanning markets in different time zones. Default fills in from operator's default_timezone.
  • Baileys auth state on disk, not in Postgres. Path /data/sessions/<whatsapp_account_id>/, using Baileys useMultiFileAuthState. The helper ships with Baileys and matches its auth state, which is many small, frequently mutating files (Signal protocol keys) and a poor shape for Postgres rows at this scale; upstream's own notes suggest a custom database-backed store for heavier production use. The volume is part of the host backup strategy.
  • Audit log is append-only. Never updated, only inserted. Powers "who created this", "when did this account get paired", etc.
  • Media in object-store-like layout on disk. Path /data/media/{yyyy/mm}/{uuid}.{ext}. Postgres holds metadata only. Sweeper deletes media unreferenced by any reminder after configurable retention (default 90 days). Migration path to MinIO/S3 later: only the storage adapter changes.
  • Web auth via Telegram magic link. Operator types /login to the Telegram bot → bot replies with a one-time URL → click sets a session cookie via auth_sessions. No passwords. The operator pool is exactly the Telegram-whitelisted set.
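
The media layout decision above (/data/media/{yyyy/mm}/{uuid}.{ext}, metadata in Postgres) could be sketched as follows. The helper names are hypothetical, and using the UTC year and month for the directory is an assumption; the spec does not say which clock the path uses.

```typescript
import { randomUUID } from "node:crypto";
import * as path from "node:path";

// Conservative extension filter so user-supplied names cannot smuggle path separators.
const ALLOWED_EXT = /^[a-z0-9]{1,8}$/;

// Returns the value for media_files.storage_path, relative to /data/media/.
function mediaStoragePath(ext: string, now: Date, id: string = randomUUID()): string {
  const cleanExt = ext.replace(/^\./, "").toLowerCase();
  if (!ALLOWED_EXT.test(cleanExt)) {
    throw new Error(`Unsupported file extension: "${ext}"`);
  }
  const yyyy = String(now.getUTCFullYear());
  const mm = String(now.getUTCMonth() + 1).padStart(2, "0");
  return path.posix.join(yyyy, mm, `${id}.${cleanExt}`);
}

// Resolve a stored relative path against the media root, refusing anything that escapes it.
function resolveMediaPath(root: string, storagePath: string): string {
  const base = path.posix.resolve(root);
  const full = path.posix.resolve(base, storagePath);
  if (!full.startsWith(base + "/")) {
    throw new Error("storage_path escapes the media root");
  }
  return full;
}
```

Keeping both functions behind one storage adapter is what would make a later move to MinIO/S3 a local change.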

Out of v1 (YAGNI; easy to add later)

  • Templates with variable substitution ({customer_name}, {day}).
  • Multi-tenant operator isolation beyond the existing whitelist.
  • Per-customer message personalization.
  • Conversation threads beyond the reminder firing log.
  • A/B testing of reminder content.
  • Web push notifications (Telegram already pushes alerts).

10. QR pairing flow (headline UX)

1. Operator (Telegram): /pair "Sales Account 3"
2. Bot inserts whatsapp_accounts row { status: 'pending', label: 'Sales Account 3' }
3. Bot starts Baileys session for that account_id
   ├─ session dir: /data/sessions/<account_id>/
   └─ uses useMultiFileAuthState (auto-persists creds + signal keys)
4. Baileys emits connection.update { qr: '...' }
5. Bot renders QR string → PNG, sends to operator's TG chat
   "📱 Scan with WhatsApp on Sales Account 3. Expires in 30s."
   (Baileys re-emits QR every ~20s; bot edits the same TG message via editMessageMedia)
6. Operator scans → Baileys emits connection.update { connection: 'open' }
7. Bot updates row { status: 'connected', phone_number: '+60xxx', last_connected_at: now }
   Bot sends TG: "✅ Sales Account 3 connected as +60xxxxxxx"
   Bot pgNotify('web.event', { type: 'session.connected', account_id })
8. Bot triggers group-sync → upserts whatsapp_groups
   Bot sends TG: "Synced 12 groups. Ready to send."

Pairing edge cases

| Situation | Behavior |
|---|---|
| QR expires (no scan in ~30s) | Baileys re-emits; bot edits same TG message with new QR. After 5 cycles (~2.5 min): timeout, mark account pending, TG: "Pairing timed out; try /pair again." |
| Bot container restart mid-pairing | Startup sweeper drops any pending accounts with stale last_qr_at; operator re-runs /pair. |
| /pair on already-connected label | Reject: "Account 'X' already connected. Use /unpair X first." |
| WA logout from phone (linked-device removed) | Baileys emits connection.update with connection: 'close' and reason loggedOut. Bot marks logged_out, sends TG alert with re-pair instruction. Reminders for that account skip with reason account_logged_out. |
| Network drop on connected session | Baileys auto-reconnects (built-in). Alert only if downtime >5 min. |
| Web-initiated pair | Same flow; QR PNG also streamed to the open web modal via SSE so operator can scan from web instead of phone-Telegram. |
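
The pairing rules above can be modeled as a small state reducer, which is easy to unit-test without a real WhatsApp connection. This is a sketch under assumptions: the event names, the `PairingState` shape, and the choice to ignore events after a timeout are illustrative, not taken from Baileys or from the spec.

```typescript
type AccountStatus =
  | "pending" | "connecting" | "connected"
  | "disconnected" | "logged_out" | "banned";

// Hypothetical in-memory state for one pairing attempt.
type PairingState = { status: AccountStatus; qrCount: number; timedOut: boolean };
type PairingEvent = { kind: "qr" } | { kind: "open" } | { kind: "logged_out" };

// Five QR cycles are shown before the attempt is abandoned.
const MAX_QR_CYCLES = 5;

function initialPairingState(): PairingState {
  return { status: "pending", qrCount: 0, timedOut: false };
}

function onPairingEvent(state: PairingState, event: PairingEvent): PairingState {
  // Assumption: once an attempt has timed out, late events are ignored
  // and the operator starts over with /pair.
  if (state.timedOut) return state;

  switch (event.kind) {
    case "qr":
      // A QR arriving after the fifth one means the budget is spent: time out.
      if (state.qrCount >= MAX_QR_CYCLES) {
        return { status: "pending", qrCount: state.qrCount, timedOut: true };
      }
      return { ...state, qrCount: state.qrCount + 1 };
    case "open":
      return { status: "connected", qrCount: state.qrCount, timedOut: false };
    case "logged_out":
      return { ...state, status: "logged_out" };
  }
}
```

The session manager would translate Baileys connection.update payloads into these events and perform the side effects (editing the Telegram message, updating the row) based on the returned state.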

11. Reminder execution flow

On reminder create/edit (from web or Telegram):
  → DB row inserted/updated (transaction with reminder_targets, reminder_messages)
  → pgNotify('bot.command', { type: 'reminder.upsert', id })
  → bot.scheduler upserts the reminder into pg-boss:
      one_off    → schedule single delayed job at scheduled_at
      recurring  → compute next occurrence from rrule, schedule delayed job;
                   on completion, fire-reminder schedules the next occurrence

When pg-boss fires the job:
  fire-reminder.handler:
    1. Load reminder + targets + messages from DB
    2. Insert reminder_runs { status: 'pending', fired_at: now }
    3. Acquire account session from session-manager
       - If not connected: mark all targets 'skipped', update run status, exit
    4. For each target group:
        a. For each message part in position order:
             - text  → sendTextMessage
             - media → load /data/media/<path>, sendMedia with optional caption
        b. Insert reminder_run_targets { status, wa_message_id, latency_ms }
        c. Throttle: jitter between targets to stay under WA rate limits
    5. Roll up reminder_runs.status:
       all sent → 'success';  all failed → 'failed';  mix → 'partial'
    6. pgNotify('web.event', { type: 'reminder.fired', run_id })
    7. If recurring and not at ends_at / max_runs:
         schedule next occurrence in pg-boss
       Else if at end:
         update reminder.status = 'ended'
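
Step 5 of the handler (the status roll-up) is a pure function and can be sketched directly. The flow above does not say what an all-skipped or empty run rolls up to; treating both as 'skipped' is an assumption here.

```typescript
type TargetStatus = "sent" | "failed" | "skipped";
type RunStatus = "success" | "partial" | "failed" | "skipped";

// Roll per-target outcomes up into one reminder_runs.status value.
function rollUpRunStatus(targets: TargetStatus[]): RunStatus {
  // Assumption: a run with no targets, or with every target skipped
  // (e.g. account not connected), is recorded as 'skipped'.
  if (targets.every((status) => status === "skipped")) return "skipped";
  if (targets.every((status) => status === "sent")) return "success";
  if (targets.every((status) => status === "failed")) return "failed";
  // Any mixture of outcomes counts as partial.
  return "partial";
}
```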

12. Error handling

| Failure | Detection | Response |
|---|---|---|
| WA send transient (timeout, network) | Baileys throws / promise rejects | Retry via pg-boss with exponential backoff (3 tries: 30s/2m/10m). Final failure → set reminder_run_targets.status = 'failed', dashboard + TG alert. |
| WA send permanent (group not found, banned account) | Specific error codes | No retry. Mark target failed with reason. If account banned → mark whatsapp_accounts.status='banned', urgent TG alert. |
| WA session disconnect | connection.update event | Auto-reconnect. Downtime >5 min → TG alert. Reminders during downtime → skipped. |
| WA logout | reason loggedOut | status='logged_out'. Stop reconnect attempts. TG: "Account X logged out. Re-pair." |
| Telegram delivery failure | grammy throws | Retry once, then log to audit_log only. Don't recurse via TG (TG itself might be down). |
| Postgres connection lost | Drizzle errors | Both services exit non-zero (Docker restarts them). Health checks fail loudly during outage. |
| Media file missing on disk | fs.stat fails before send | Mark target failed with reason media_missing; don't send a placeholder. TG alert. |
| pg-boss job lost / corrupted | pg-boss own retry → dead-letter | Surface in admin "failed jobs" view; manual retry button. |
| WA rate limit | Specific error | Throttle sender to 1 send / 3 sec per account, with jitter between sends. Back off longer after a rate-limit error. |
| Unauthorized Telegram user | Whitelist middleware | Reply: "Sorry, this bot is private." Log to audit_log. No state change. |
| Web session expired | Cookie validation fails | Redirect to /login. |
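
The retry schedule and send throttle from the table can be written down as two small functions. This is a sketch: the reading of "3 tries" as three retries after the initial attempt, and the 1-second jitter ceiling, are assumptions; pg-boss would normally be configured with equivalent retry options rather than calling code like this directly.

```typescript
// Delays before retry 1, 2 and 3 of a transient send failure: 30s, 2m, 10m.
const RETRY_DELAYS_MS = [30_000, 120_000, 600_000];

// failedAttempts is the number of failures so far (1 after the first failure).
// Returns the wait before the next try, or null when retries are exhausted.
function nextRetryDelayMs(failedAttempts: number): number | null {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new RangeError("failedAttempts must be a positive integer");
  }
  return RETRY_DELAYS_MS[failedAttempts - 1] ?? null;
}

// Per-account throttle: at most one send every 3 seconds, plus random jitter.
const MIN_SEND_INTERVAL_MS = 3_000;
const MAX_JITTER_MS = 1_000; // assumption: the spec does not give a jitter range

// The random source is injectable so tests can make the delay deterministic.
function throttleDelayMs(random: () => number = Math.random): number {
  return MIN_SEND_INTERVAL_MS + Math.floor(random() * MAX_JITTER_MS);
}
```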

Observability

  • Logs: pino JSON to stdout, captured by Docker.
  • Health endpoints:
    • web: GET /api/health — DB ping + uptime + commit SHA.
    • bot: internal port 8081, GET /health — DB ping + per-session counts ({ connected: 8, disconnected: 1, pending: 0 }).
  • Per-reminder audit trail: reminder_runs + reminder_run_targets history, exposed in dashboard. Every fire is fully reconstructable.
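
The per-session counts in the bot health response could be produced by a helper like the one below. The function name is hypothetical, and returning a zero for every status (not only the three shown in the example above) is a choice made for this sketch.

```typescript
type AccountStatus =
  | "pending" | "connecting" | "connected"
  | "disconnected" | "logged_out" | "banned";

const ALL_STATUSES: AccountStatus[] = [
  "pending", "connecting", "connected", "disconnected", "logged_out", "banned",
];

// Count sessions per status, starting every status at zero so the
// health response always has the same keys.
function countSessions(statuses: AccountStatus[]): Record<AccountStatus, number> {
  const counts = Object.fromEntries(
    ALL_STATUSES.map((status) => [status, 0]),
  ) as Record<AccountStatus, number>;
  for (const status of statuses) {
    counts[status] += 1;
  }
  return counts;
}
```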

13. Testing strategy

| Layer | Tool | Scope |
|---|---|---|
| Unit | Vitest | rrule helpers, message-part assembly, audit log builders, env validation, error classifiers. No I/O. |
| Integration (DB) | Vitest + local dev Postgres (or Testcontainers) | Drizzle queries, pg-boss schedule sync, LISTEN/NOTIFY round-trip. Per-test schema with teardown. |
| Bot session logic | Vitest with mocked Baileys | Session-manager state transitions, QR rendering, group-sync upsert. No real WA connection. |
| Telegram | Vitest with mocked grammy | Command parsing, whitelist middleware, error responses. |
| Web E2E | Playwright (deferred) | Login (stubbed magic link), reminder create wizard, dashboard. Add when CI exists. |
| Pairing flow | Manual checklist | Real WA pairing requires a real phone; documented in docs/superpowers/specs/manual-test-pairing.md. Run before each release. |

CI

Out of scope for v1. pnpm test and pnpm lint will run via husky + lint-staged on git push. Gitea Actions can be wired later.

14. Scripts

All scripts live in scripts/. Patterned on cm_bot_v2.

| Script | Purpose |
|---|---|
| dev.sh | Subcommands up, down, logs, status, reset-db against docker-compose.dev.yml. Pre-flight checks for .env.development. Honors NO_SUDO=1. reset-db truncates only whatsapp_bot_dev with a confirmation prompt. |
| publish.sh | Build + push images to gitea.04080616.xyz/yiekheng/cm-whatsapp-{web,bot}:<tag>. Default tag latest. Same auth-error guidance as the cm_bot_v2 reference. |
| gen_auth_secret.sh | Generate AUTH_SECRET (32 hex bytes). --write [path] mode appends/replaces in env file. |
| db.sh | Drizzle migration wrapper with subcommands migrate, rollback, seed, studio, reset. reset is dev-only, refuses if env points at prod DB. |
| link-account.sh | CLI helper to start a WA pairing flow without going through Telegram. Emits QR straight to the terminal. Useful for the dev mock account. |
| local_build.sh | One-liner foreground compose up. Convenience. |

15. Confirmed values & remaining pre-deploy checks

Confirmed during brainstorming:

  • Web URL: https://wabot.04080616.xyz.
  • Default timezone: Asia/Kuala_Lumpur (seeded into operators.default_timezone).
  • Media retention: 90 days. A sweeper job runs daily and deletes media files older than the retention period that no reminder references.
  • Minimum interval between recurring fires: 5 minutes (enforced at the schedule-validation layer to prevent runaway loops).
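
The two numeric limits above lend themselves to small pure checks. The sketch below assumes the validation layer has already expanded the rrule into a handful of upcoming occurrences, and that media age is measured from media_files.created_at; neither detail is stated in the spec, and the function names are hypothetical.

```typescript
const MIN_INTERVAL_MS = 5 * 60 * 1000;            // 5-minute minimum between fires
const RETENTION_MS = 90 * 24 * 60 * 60 * 1000;    // 90-day media retention

// True when any two consecutive occurrences are closer than the minimum interval.
function violatesMinimumInterval(occurrences: Date[]): boolean {
  const times = occurrences.map((date) => date.getTime()).sort((a, b) => a - b);
  let previous: number | null = null;
  for (const time of times) {
    if (previous !== null && time - previous < MIN_INTERVAL_MS) return true;
    previous = time;
  }
  return false;
}

// True when the sweeper may delete the file: unreferenced and past retention.
function isSweepable(
  media: { createdAt: Date; referencedByReminder: boolean },
  now: Date,
): boolean {
  const ageMs = now.getTime() - media.createdAt.getTime();
  return !media.referencedByReminder && ageMs > RETENTION_MS;
}
```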

Pre-deploy check (not blocking design; verified during first implementation step):

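  • Postgres connectivity: confirm pg_hba.conf on 192.168.0.210 allows the address the containers actually connect from, and that listen_addresses covers the LAN interface. If Postgres runs on a different machine than the Docker host, outbound container traffic is normally NATed to the Docker host's LAN IP under default bridge networking, so that IP needs the entry; the Docker bridge range (within 172.16.0.0/12) applies when Postgres runs on the Docker host itself. Add the entry before first deploy if missing.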