WhatsApp Reminder Bot — Design
Status: Draft
Date: 2026-05-03
Author: yiekheng (developer); operator: brother (single end-user)
1. Purpose
Self-hosted WhatsApp reminder bot. The operator manages 10+ WhatsApp accounts (each tied to a different business responsibility), schedules one-off and recurring reminder messages — text, photos, videos — to specific WhatsApp groups, and receives login QR codes through a private Telegram bot. The system runs 24/7 on the operator's home Docker server, behind a reverse proxy, on a self-hosted Gitea registry.
2. Stakeholders & access
- Developer (you): builds and maintains. Has full access to dev environment (mock WA account, dev Telegram bot).
- Operator (brother): the single end-user in production. Pairs all real WA accounts, creates and manages reminders, receives QR codes via Telegram. Holds all production credentials.
- Customers (in WA groups): unaware of the bot — they just receive messages from the WA accounts the operator owns.
Access to the bot is gated by Telegram user ID whitelist (configured in env). Web UI access requires a Telegram-issued magic link, so only Telegram-trusted operators can sign in to the dashboard.
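The whitelist gate can be sketched as below. This is a minimal illustration, assuming the IDs arrive as a comma-separated env value; the format and the helper names `parseWhitelist` and `isAllowed` are hypothetical, since the design only states that the whitelist is configured in env.

```typescript
// Hypothetical helper: parse a comma-separated whitelist such as "111,222".
// The comma-separated format is an assumption, not part of the design.
function parseWhitelist(raw: string): Set<number> {
  const ids = new Set<number>();
  for (const part of raw.split(",")) {
    const text = part.trim();
    if (text === "") continue; // tolerate trailing commas
    const id = Number(text);
    if (!/^\d+$/.test(text) || !Number.isSafeInteger(id)) {
      throw new Error(`Invalid Telegram user ID in whitelist: "${text}"`);
    }
    ids.add(id);
  }
  if (ids.size === 0) {
    throw new Error("Whitelist must contain at least one Telegram user ID");
  }
  return ids;
}

// Returns true only for users present in the whitelist.
function isAllowed(whitelist: Set<number>, telegramUserId: number): boolean {
  return whitelist.has(telegramUserId);
}
```

Failing on an empty or malformed list keeps a misconfigured deployment from silently locking everyone out or letting everyone in.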
3. Constraints accepted up front
- Unofficial WhatsApp protocol. Built on Baileys (@whiskeysockets/baileys). Violates WhatsApp ToS. Account ban risk is non-zero, especially for spam-pattern usage. Acceptable for this customer-reminder use case where messages go to known groups.
- Self-hosted infrastructure. Postgres at 192.168.0.210 (already running). Home Docker server runs Portainer; reverse proxy is aaPanel. Domain 04080616.xyz is available for the web UI subdomain.
- Self-hosted Gitea. Git remote at http://192.168.0.215:3000/yiekheng/cm_whatsapp_bot_v1.git. Container registry at gitea.04080616.xyz/yiekheng.
- Single-operator threat model. No tenant isolation. Both developer and operator are effectively admins. The repo is private to the developer. .env files may be committed to the private Gitea (operator's choice; documented trade-off below).
4. High-level architecture
Two app containers + one external dependency. Communication between apps goes through Postgres only.
┌─────────────────────────────────────────────────────────────────────┐
│                      Home Docker server                             │
│                                                                     │
│   ┌─────────────┐         ┌─────────────┐                           │
│   │    web      │         │    bot      │                           │
│   │  (Next.js)  │◄───────►│  (Node.js)  │                           │
│   │             │   via   │             │                           │
│   │  PWA        │ Postgres│  Baileys    │                           │
│   │  Dashboard  │ LISTEN/ │  Telegram   │                           │
│   │  API routes │ NOTIFY  │  pg-boss    │                           │
│   └──────┬──────┘         └──────┬──────┘                           │
│          │                       │                                  │
│          │   shared volume:      │                                  │
│          │   /data/media         │  /data/sessions/<account_id>/    │
│          │                       │  (Baileys auth state)            │
│          │                       │                                  │
│          └───────────┬───────────┘                                  │
│                      │                                              │
│                      ▼                                              │
│       aaPanel reverse proxy ─► wabot.04080616.xyz                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
              │                              │
              ▼                              ▼
   Postgres at 192.168.0.210        Telegram Bot API (cloud)
   (whatsapp_bot_dev /              via grammy long-polling
    whatsapp_bot_prod)              (or webhook later)
Service responsibilities
| Service | Stateless | Owns |
|---|---|---|
| web | Yes (can restart freely) | UI, auth, server actions, media upload, SSE for live updates |
| bot | No (long-lived sessions) | All Baileys WhatsApp sessions, Telegram bot, pg-boss scheduler, reminder firing, group sync |
Why split: Next.js is built for stateless request/response; WhatsApp sessions are long-lived stateful WebSockets that must survive web deploys. Splitting lets us redeploy web (frontend changes) without dropping any active WA sessions.
Why Postgres-as-bus instead of Redis or HTTP
- One less service to run, one less dependency to monitor.
- All bot↔web communication shares the same transactional boundary as data writes — if a write commits, downstream listeners see it.
- pg-boss provides BullMQ-equivalent functionality (delayed jobs, recurring jobs, retries, dead-letter) at a scale (10+ accounts, hundreds of reminders/day max) where Redis throughput advantages are irrelevant.
- LISTEN/NOTIFY covers live UI updates (e.g., "session connected" toast).
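A sketch of how events might be encoded for the Postgres bus. The two event shapes come from the flows in this document; the helper names and the ID-only envelope are assumptions. The size check reflects Postgres's documented limit: in the default configuration a NOTIFY payload must be shorter than 8000 bytes.

```typescript
// Event shapes taken from this document's flows; the helpers are hypothetical.
type WebEvent =
  | { type: "session.connected"; account_id: string }
  | { type: "reminder.fired"; run_id: string };

// NOTIFY payloads must be shorter than 8000 bytes by default, so events
// carry IDs only and listeners re-read the rows they need.
const MAX_NOTIFY_BYTES = 7999;

function encodeEvent(event: WebEvent): string {
  const payload = JSON.stringify(event);
  const bytes = new TextEncoder().encode(payload).length;
  if (bytes > MAX_NOTIFY_BYTES) {
    throw new Error(`NOTIFY payload too large: ${bytes} bytes`);
  }
  return payload;
}

// Validates an incoming payload before any listener acts on it.
function decodeEvent(payload: string): WebEvent {
  const parsed: unknown = JSON.parse(payload);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Event must be a JSON object");
  }
  const fields = parsed as Record<string, unknown>;
  const kind = fields["type"];
  const accountId = fields["account_id"];
  const runId = fields["run_id"];
  if (kind === "session.connected" && typeof accountId === "string") {
    return { type: "session.connected", account_id: accountId };
  }
  if (kind === "reminder.fired" && typeof runId === "string") {
    return { type: "reminder.fired", run_id: runId };
  }
  throw new Error("Unknown or malformed event");
}
```

Keeping payloads to IDs also means a listener always reads committed state from the tables rather than trusting a copy embedded in the notification.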
5. Tech stack
- Language: TypeScript everywhere.
- Frontend: Next.js 16 (App Router), Server Components + Server Actions, PWA-installable (manifest + browser service worker for offline app shell). Visual design will be built using the frontend-design:frontend-design skill during implementation.
- Backend: Node.js 22 + TypeScript for the bot service.
- WhatsApp: @whiskeysockets/baileys (no browser; pure WebSocket).
- Telegram: grammy framework (long-polling in dev, can switch to webhook in prod).
- Database: PostgreSQL (external at 192.168.0.210). Drizzle ORM. Migrations in packages/db/migrations.
- Job queue: pg-boss (Postgres-native).
- QR rendering: qrcode library (string → PNG buffer).
- Recurrence: rrule library (RFC 5545).
- Logging: pino (JSON to stdout).
- Validation: zod (env, request bodies, server actions).
- Build orchestration: pnpm workspaces + Turborepo.
6. Repository layout
cm_whatsapp_bot_v1/
├── apps/
│   ├── web/                    Next.js: UI + API routes + PWA
│   └── bot/                    Node: Baileys + Telegram + scheduler
├── packages/
│   ├── db/                     Drizzle schema, migrations, queries
│   └── shared/                 cross-app types, rrule helpers, paths
├── docker/
│   ├── web.Dockerfile
│   └── bot.Dockerfile
├── docker-compose.base.yml     service definitions, networks
├── docker-compose.dev.yml      dev overrides: hot reload, exposed ports
├── docker-compose.prod.yml     prod: registry images, named volumes
├── scripts/
│   ├── dev.sh                  up | down | logs | status | reset-db
│   ├── publish.sh              build & push to Gitea registry
│   ├── gen_auth_secret.sh      generate AUTH_SECRET
│   ├── db.sh                   migrate | rollback | seed | studio | reset
│   └── link-account.sh         CLI helper to pair the dev WA mock account
├── envs/
│   ├── .env.example            documented template
│   ├── .env.development        dev TG bot, mock WA account, dev DB
│   └── .env.production         prod TG bot, real accounts, prod DB
└── docs/
    └── superpowers/specs/
        └── 2026-05-03-whatsapp-bot-design.md
7. Environment separation
| Concern | Local dev | Production |
|---|---|---|
| Compose file | base + dev.yml | base + prod.yml |
| Image source | local build | gitea.04080616.xyz/yiekheng/cm-whatsapp-{web,bot}:${IMAGE_TAG} |
| Postgres database | whatsapp_bot_dev on 192.168.0.210 | whatsapp_bot_prod on 192.168.0.210 |
| Postgres role | dev role with limited grants | prod role |
| Telegram bot | separate dev bot (@..._dev_bot), so dev QR codes never land in the operator's prod chat | production bot |
| WhatsApp accounts | mock/test phone | operator's real 10+ accounts |
| Web URL | http://localhost:3000 | https://wabot.04080616.xyz |
| Hot reload | yes (Next.js HMR + tsx watch) | no |
| Volumes | ./dev-data/{media,sessions} bind mounts | named volumes |
The bot service runs on its own internal port (8081) for health checks; not exposed externally in either env.
Env validation runs at startup via zod. Missing or malformed values cause an immediate exit with a clear error message, so neither service comes up half-configured.
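The fail-fast behaviour can be sketched without dependencies as below. This is a hand-rolled stand-in for the zod schema the design calls for, not the zod API. DATABASE_URL and TELEGRAM_BOT_TOKEN are assumed variable names, and the AUTH_SECRET length rule is an assumption as well.

```typescript
// Dependency-free stand-in for the zod schema named in the design.
// DATABASE_URL and TELEGRAM_BOT_TOKEN are assumed names; AUTH_SECRET is from the scripts section.
interface AppEnv {
  databaseUrl: string;
  telegramBotToken: string;
  authSecret: string;
}

function validateEnv(env: Record<string, string | undefined>): AppEnv {
  const problems: string[] = [];

  // Reads one variable and records a problem instead of throwing,
  // so every issue is reported in a single message.
  const read = (key: string, isValid: (value: string) => boolean, hint: string): string => {
    const value = env[key];
    if (value === undefined || value === "") {
      problems.push(`${key} is missing`);
      return "";
    }
    if (!isValid(value)) {
      problems.push(`${key} is malformed (${hint})`);
      return "";
    }
    return value;
  };

  const databaseUrl = read("DATABASE_URL", (v) => /^postgres(ql)?:\/\//.test(v), "expected a postgres:// URL");
  const telegramBotToken = read("TELEGRAM_BOT_TOKEN", (v) => /^\d+:[\w-]+$/.test(v), "expected <digits>:<token>");
  // Assumption: the secret is a hex string of at least 32 characters.
  const authSecret = read("AUTH_SECRET", (v) => /^[0-9a-f]{32,}$/i.test(v), "expected 32+ hex characters");

  if (problems.length > 0) {
    throw new Error(`Invalid environment:\n- ${problems.join("\n- ")}`);
  }
  return { databaseUrl, telegramBotToken, authSecret };
}
```

At startup each service would call `validateEnv(process.env)` once, log the thrown message, and exit non-zero before opening any connection.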
8. Deploy flow
dev machine                    Gitea (192.168.0.215)          home server (Portainer)
────────────                   ──────────────────────         ──────────────────────
git push                  ───► cm_whatsapp_bot_v1.git
scripts/publish.sh v1.0.0 ───► gitea.04080616.xyz/
                                 yiekheng/
                                   cm-whatsapp-web:v1.0.0
                                   cm-whatsapp-bot:v1.0.0
                                                              Portainer stack →
                                                                docker-compose.prod.yml
                                                                IMAGE_TAG=v1.0.0
                                                                pulls images, runs containers
                                                              aaPanel proxy →
                                                                wabot.04080616.xyz → web:3000
Image tags:
- latest: current main HEAD (manual publish for now; CI later if needed).
- vX.Y.Z: release tags for production rollouts; pin in .env.production.
- dev-<short-sha>: ad-hoc images for testing on the home server before cutting a release.
Rollback = change IMAGE_TAG in .env.production and recreate the stack in Portainer.
9. Data model
ORM: Drizzle. Migrations versioned in packages/db/migrations/.
Tables
operators — people who can use the bot
─────────────────────────────────
id uuid pk
telegram_user_id bigint unique — primary identity (whitelist key)
display_name text
role text — 'admin' (only role for v1)
default_timezone text — IANA, default 'Asia/Kuala_Lumpur'
created_at timestamptz
whatsapp_accounts — each WA account the operator manages
─────────────────────────────────
id uuid pk
operator_id uuid fk → operators
label text — operator-defined, e.g. "Sales 1"
phone_number text nullable — populated after pairing
status text — pending | connecting | connected
| disconnected | logged_out | banned
last_connected_at timestamptz nullable
last_qr_at timestamptz nullable
created_at timestamptz
unique(operator_id, label)
whatsapp_groups — groups discovered per account
─────────────────────────────────
id uuid pk
account_id uuid fk → whatsapp_accounts
wa_group_jid text — WhatsApp's group JID
name text
participant_count int
is_archived bool default false
last_synced_at timestamptz
unique(account_id, wa_group_jid)
media_files — uploaded photos/videos/documents
─────────────────────────────────
id uuid pk
operator_id uuid fk → operators
filename_original text
mime_type text
size_bytes bigint
sha256 text
storage_path text — relative to /data/media/
created_at timestamptz
reminders — scheduled sends
─────────────────────────────────
id uuid pk
account_id uuid fk → whatsapp_accounts
name text
schedule_kind text — 'one_off' | 'recurring'
scheduled_at timestamptz nullable — for one_off
rrule text nullable — RFC 5545 rrule string
timezone text — IANA
ends_at timestamptz nullable
max_runs int nullable
status text — 'active' | 'paused' | 'ended'
created_by uuid fk → operators
created_at timestamptz
updated_at timestamptz
reminder_targets — groups a reminder fires into
─────────────────────────────────
reminder_id uuid fk → reminders
group_id uuid fk → whatsapp_groups
position int
pk(reminder_id, group_id)
reminder_messages — message parts in send order
─────────────────────────────────
id uuid pk
reminder_id uuid fk → reminders
position int
kind text — 'text' | 'image' | 'video' | 'document'
text_content text nullable — text body or media caption
media_id uuid fk → media_files nullable
reminder_runs — execution records
─────────────────────────────────
id uuid pk
reminder_id uuid fk → reminders
fired_at timestamptz
status text — 'success' | 'partial' | 'failed' | 'skipped'
error_summary text nullable
reminder_run_targets — per-target outcomes
─────────────────────────────────
run_id uuid fk → reminder_runs
group_id uuid fk → whatsapp_groups
status text — 'sent' | 'failed' | 'skipped'
wa_message_id text nullable
error text nullable
latency_ms int nullable
pk(run_id, group_id)
audit_log — append-only action history
─────────────────────────────────
id uuid pk
operator_id uuid fk → operators nullable
source text — 'web' | 'telegram' | 'system'
action text — 'reminder.create' | 'account.pair' | ...
target_type text nullable
target_id uuid nullable
payload jsonb
created_at timestamptz
auth_sessions — web UI cookies
─────────────────────────────────
id uuid pk
operator_id uuid fk → operators
token_hash text unique — SHA-256 of cookie value
created_at timestamptz
expires_at timestamptz
last_used_at timestamptz
ip_address inet nullable
user_agent text nullable
pg-boss creates and owns its own pgboss.* schema in the same database — namespace-isolated, no manual setup required beyond initial migration.
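The auth_sessions table above stores only a SHA-256 hash of the cookie value. A minimal sketch of that handling follows, using Node's built-in crypto module. The helper names and the 32-byte token length are assumptions; the design fixes only the hash algorithm.

```typescript
import { createHash, randomBytes } from "node:crypto";

// SHA-256 of the cookie value, hex-encoded, as stored in auth_sessions.token_hash.
function hashToken(cookieValue: string): string {
  return createHash("sha256").update(cookieValue, "utf8").digest("hex");
}

// Hypothetical helper: mint a new session token.
// Only tokenHash is persisted; cookieValue goes to the browser and is never stored.
function newSessionToken(): { cookieValue: string; tokenHash: string } {
  const cookieValue = randomBytes(32).toString("hex"); // 32 random bytes is an assumption
  return { cookieValue, tokenHash: hashToken(cookieValue) };
}

// A session is usable only strictly before expires_at.
function isSessionValid(expiresAt: Date, now: Date): boolean {
  return now.getTime() < expiresAt.getTime();
}
```

On each request the server would hash the incoming cookie and look the result up by the unique token_hash column, so a leaked database row cannot be replayed as a cookie.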
Key model decisions
- Recurring schedules use RRULE (RFC 5545), not cron. RRULE expresses "every Monday and Wednesday at 9am, 20 occurrences" naturally; cron cannot. Library: rrule on Node.
- Timezone is per-reminder, not per-account. Operator may run accounts spanning markets in different time zones. Default fills in from operator's default_timezone.
- Baileys auth state on disk, not in Postgres. Path /data/sessions/<whatsapp_account_id>/, using Baileys useMultiFileAuthState. That's the upstream-supported path; the file set is many small frequently-mutating files (signal protocol keys); Postgres is the wrong shape. Volume is part of host backup strategy.
- Audit log is append-only. Never updated, only inserted. Powers "who created this", "when did this account get paired", etc.
- Media in object-store-like layout on disk. Path /data/media/{yyyy/mm}/{uuid}.{ext}. Postgres holds metadata only. Sweeper deletes media unreferenced by any reminder after configurable retention (default 90 days). Migration path to MinIO/S3 later: only the storage adapter changes.
- Web auth via Telegram magic link. Operator types /login to the Telegram bot → bot replies with a one-time URL → click sets a session cookie via auth_sessions. No passwords. The operator pool is exactly the Telegram-whitelisted set.
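The media layout and retention rule from the list above can be sketched as two small pure functions. The MIME-to-extension table, the use of the UTC upload date for the yyyy/mm folder, and the function names are assumptions. The sweep rule follows one reading of the retention text: a file is removed once it is older than the retention period and no reminder references it.

```typescript
// Assumed MIME-to-extension mapping; the design does not list supported types.
const EXTENSIONS: Record<string, string> = {
  "image/jpeg": "jpg",
  "image/png": "png",
  "video/mp4": "mp4",
  "application/pdf": "pdf",
};

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns storage_path relative to /data/media/, following {yyyy/mm}/{uuid}.{ext}.
function mediaStoragePath(id: string, mimeType: string, createdAt: Date): string {
  const ext = EXTENSIONS[mimeType];
  if (ext === undefined) throw new Error(`Unsupported MIME type: ${mimeType}`);
  // Accepting only UUIDs keeps user input out of the path.
  if (!UUID_PATTERN.test(id)) throw new Error("id must be a UUID");
  const year = String(createdAt.getUTCFullYear());
  const month = createdAt.getUTCMonth() + 1; // getUTCMonth is zero-based
  const mm = month < 10 ? `0${month}` : String(month);
  return `${year}/${mm}/${id}.${ext}`;
}

// True when the sweeper may delete the file: unreferenced and past retention.
function isSweepable(createdAt: Date, isReferenced: boolean, now: Date, retentionDays: number = 90): boolean {
  if (isReferenced) return false;
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs > retentionDays * 24 * 60 * 60 * 1000;
}
```

Because the path is derived only from the row's id, MIME type, and timestamp, a later move to MinIO or S3 could reuse the same string as the object key.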
Out of v1 (YAGNI; easy to add later)
- Templates with variable substitution ({customer_name}, {day}).
- Multi-tenant operator isolation beyond the existing whitelist.
- Per-customer message personalization.
- Conversation threads beyond the reminder firing log.
- A/B testing of reminder content.
- Web push notifications (Telegram already pushes alerts).
10. QR pairing flow (headline UX)
1. Operator (Telegram): /pair "Sales Account 3"
2. Bot inserts whatsapp_accounts row { status: 'pending', label: 'Sales Account 3' }
3. Bot starts Baileys session for that account_id
├─ session dir: /data/sessions/<account_id>/
└─ uses useMultiFileAuthState (auto-persists creds + signal keys)
4. Baileys emits connection.update { qr: '...' }
5. Bot renders QR string → PNG, sends to operator's TG chat
"📱 Scan with WhatsApp on Sales Account 3. Expires in 30s."
(Baileys re-emits QR every ~20s; bot edits the same TG message via editMessageMedia)
6. Operator scans → Baileys emits connection.update { connection: 'open' }
7. Bot updates row { status: 'connected', phone_number: '+60xxx', last_connected_at: now }
Bot sends TG: "✅ Sales Account 3 connected as +60xxxxxxx"
Bot pgNotify('web.event', { type: 'session.connected', account_id })
8. Bot triggers group-sync → upserts whatsapp_groups
Bot sends TG: "Synced 12 groups. Ready to send."
Pairing edge cases
| Situation | Behavior |
|---|---|
| QR expires (no scan in ~30s) | Baileys re-emits; bot edits same TG message with new QR. After 5 cycles (~2.5 min): timeout, mark account pending, TG: "Pairing timed out — try /pair again." |
| Bot container restart mid-pairing | Startup sweeper drops any pending accounts with stale last_qr_at; operator re-runs /pair. |
| /pair on already-connected label | Reject: "Account 'X' already connected. Use /unpair X first." |
| WA logout from phone (linked-device removed) | Baileys emits connection.update with connection: 'close' and disconnect reason loggedOut. Bot marks the account logged_out and sends a TG alert with re-pair instructions. Reminders for that account skip with reason account_logged_out. |
| Network drop on connected session | Baileys auto-reconnects (built-in). Alert only if downtime >5 min. |
| Web-initiated pair | Same flow; QR PNG also streamed to the open web modal via SSE so operator can scan from web instead of phone-Telegram. |
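The QR timeout rule in the table can be modelled as a small state machine. This sketch assumes "5 cycles" means five QR codes are shown and the pairing times out when a sixth would be needed. The state and event names are hypothetical and are not Baileys types.

```typescript
type PairingState =
  | { kind: "waiting"; qrCycles: number }
  | { kind: "connected" }
  | { kind: "timed_out" };

// "qr" stands for a connection.update carrying a QR string,
// "open" for a connection.update with connection: 'open'.
type PairingEvent = { kind: "qr" } | { kind: "open" };

const MAX_QR_CYCLES = 5; // from the edge-case table

function nextPairingState(state: PairingState, event: PairingEvent): PairingState {
  // Terminal states ignore any further events.
  if (state.kind !== "waiting") return state;
  if (event.kind === "open") return { kind: "connected" };
  // Each QR emission counts as one cycle; exceeding the limit ends the attempt.
  const qrCycles = state.qrCycles + 1;
  return qrCycles > MAX_QR_CYCLES ? { kind: "timed_out" } : { kind: "waiting", qrCycles };
}
```

Keeping the transition pure makes it straightforward to unit-test without a real WhatsApp connection, which matches the mocked-Baileys layer in the testing strategy.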
11. Reminder execution flow
On reminder create/edit (from web or Telegram):
→ DB row inserted/updated (transaction with reminder_targets, reminder_messages)
→ pgNotify('bot.command', { type: 'reminder.upsert', id })
→ bot.scheduler upserts the reminder into pg-boss:
one_off → schedule single delayed job at scheduled_at
recurring → compute next occurrence from rrule, schedule delayed job;
on completion, fire-reminder schedules the next occurrence
When pg-boss fires the job:
fire-reminder.handler:
1. Load reminder + targets + messages from DB
2. Insert reminder_runs { status: 'pending', fired_at: now }
3. Acquire account session from session-manager
- If not connected: mark all targets 'skipped', update run status, exit
4. For each target group:
a. For each message part in position order:
- text → sendTextMessage
- media → load /data/media/<path>, sendMedia with optional caption
b. Insert reminder_run_targets { status, wa_message_id, latency_ms }
c. Throttle: jitter between targets to stay under WA rate limits
5. Roll up reminder_runs.status:
all sent → 'success'; all failed → 'failed'; mix → 'partial'
6. pgNotify('web.event', { type: 'reminder.fired', run_id })
7. If recurring and not at ends_at / max_runs:
schedule next occurrence in pg-boss
Else if at end:
update reminder.status = 'ended'
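Steps 5 and 7 of the handler reduce to two pure decisions, sketched here. The flow above does not say how skipped targets roll up, so treating an all-skipped or empty run as 'skipped' and any other mix as 'partial' is an assumption. The function names are hypothetical.

```typescript
type TargetStatus = "sent" | "failed" | "skipped";
type RunStatus = "success" | "partial" | "failed" | "skipped";

// Step 5: roll per-target outcomes up into one run status.
function rollUpRunStatus(targets: TargetStatus[]): RunStatus {
  const sent = targets.filter((t) => t === "sent").length;
  const failed = targets.filter((t) => t === "failed").length;
  // Assumption: no targets, or only skipped targets, counts as a skipped run.
  if (sent === 0 && failed === 0) return "skipped";
  if (sent === targets.length) return "success";
  if (failed === targets.length) return "failed";
  return "partial";
}

// Step 7: decide whether another occurrence should be scheduled.
// `next` is the next occurrence computed from the rrule, or null when the rule is exhausted.
function shouldScheduleNext(
  next: Date | null,
  endsAt: Date | null,
  maxRuns: number | null,
  completedRuns: number,
): boolean {
  if (next === null) return false;
  if (maxRuns !== null && completedRuns >= maxRuns) return false;
  if (endsAt !== null && next.getTime() > endsAt.getTime()) return false;
  return true;
}
```

When `shouldScheduleNext` returns false for a recurring reminder, the handler would set the reminder's status to 'ended' instead of enqueueing another job.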
12. Error handling
| Failure | Detection | Response |
|---|---|---|
| WA send transient (timeout, network) | Baileys throws / promise rejects | Retry via pg-boss with exponential backoff (3 tries: 30s/2m/10m). Final failure → mark reminder_run_targets.failed, dashboard + TG alert. |
| WA send permanent (group not found, banned account) | Specific error codes | No retry. Mark target failed with reason. If account banned → mark whatsapp_accounts.status='banned', urgent TG alert. |
| WA session disconnect | connection.update event | Auto-reconnect. Downtime >5 min → TG alert. Reminders during downtime → skipped. |
| WA logout | reason loggedOut | status='logged_out'. Stop reconnect attempts. TG: "Account X logged out - re-pair." |
| Telegram delivery failure | grammy throws | Retry once. Then log to audit_log only — don't recurse via TG (TG itself might be down). |
| Postgres connection lost | Drizzle errors | Both services exit non-zero (Docker restarts them). Health checks fail loudly during outage. |
| Media file missing on disk | fs.stat fails before send | Mark target media_missing, don't send placeholder. TG alert. |
| pg-boss job lost / corrupted | pg-boss own retry → dead-letter | Surface in admin "failed jobs" view; manual retry button. |
| WA rate limit | Specific error | Throttle the sender to 1 send per 3 seconds per account, with jitter between sends, and use a longer backoff before retrying. |
| Unauthorized Telegram user | Whitelist middleware | Reply: "Sorry, this bot is private." Log to audit_log. No state change. |
| Web session expired | Cookie validation fails | Redirect to /login. |
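The retry schedule and per-account throttle from the table can be expressed as two small helpers. The delays (30s / 2m / 10m) and the 3-second base spacing come from the table; the jitter range and the function names are assumptions. How this maps onto pg-boss retry options is not shown here.

```typescript
// Delays before retry 1, 2 and 3, in seconds (30s / 2m / 10m).
const RETRY_DELAYS_SECONDS = [30, 120, 600];

// Returns the delay before the given retry (1-based), or null once retries are exhausted.
function retryDelaySeconds(attempt: number): number | null {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  const delay = RETRY_DELAYS_SECONDS[attempt - 1];
  return delay === undefined ? null : delay;
}

// Spacing before the next send on one account: 3 s base plus random jitter.
// The 1.5 s jitter ceiling is an assumption; `random` is injectable for tests.
function nextSendDelayMs(
  random: () => number = Math.random,
  baseMs: number = 3000,
  maxJitterMs: number = 1500,
): number {
  return baseMs + Math.floor(random() * maxJitterMs);
}
```

A `null` from `retryDelaySeconds` is the point where the target would be marked failed and the dashboard and Telegram alerts raised.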
Observability
- Logs: pino JSON to stdout, captured by Docker.
- Health endpoints:
  - web: GET /api/health, returning DB ping + uptime + commit SHA.
  - bot: internal port 8081, GET /health, returning DB ping + per-session counts ({ connected: 8, disconnected: 1, pending: 0 }).
- Per-reminder audit trail: reminder_runs + reminder_run_targets history, exposed in dashboard. Every fire is fully reconstructable.
13. Testing strategy
| Layer | Tool | Scope |
|---|---|---|
| Unit | Vitest | rrule helpers, message-part assembly, audit log builders, env validation, error classifiers. No I/O. |
| Integration (DB) | Vitest + local dev Postgres (or Testcontainers) | Drizzle queries, pg-boss schedule sync, LISTEN/NOTIFY round-trip. Per-test schema with teardown. |
| Bot session logic | Vitest with mocked Baileys | Session-manager state transitions, QR rendering, group-sync upsert. No real WA connection. |
| Telegram | Vitest with mocked grammy | Command parsing, whitelist middleware, error responses. |
| Web E2E | Playwright (deferred) | Login (stubbed magic link), reminder create wizard, dashboard. Add when CI exists. |
| Pairing flow | Manual checklist | Real WA pairing requires a real phone — documented in docs/superpowers/specs/manual-test-pairing.md. Run before each release. |
CI
Out of scope for v1. pnpm test and pnpm lint will run via husky + lint-staged on git push. Gitea Actions can be wired later.
14. Scripts
All scripts live in scripts/. Patterned on cm_bot_v2.
| Script | Purpose |
|---|---|
| dev.sh | up / down / logs / status / reset-db against docker-compose.dev.yml. Pre-flight checks for .env.development. Honors NO_SUDO=1. reset-db truncates only whatsapp_bot_dev with a confirmation prompt. |
| publish.sh | Build + push images to gitea.04080616.xyz/yiekheng/cm-whatsapp-{web,bot}:<tag>. Default tag latest. Same auth-error guidance as the cm_bot_v2 reference. |
| gen_auth_secret.sh | Generate AUTH_SECRET (32 hex bytes). --write [path] mode appends/replaces in env file. |
| db.sh | Drizzle migration wrapper: migrate / rollback / seed / studio / reset. reset is dev-only, refuses if env points at prod DB. |
| link-account.sh | CLI helper to start a WA pairing flow without going through Telegram. Emits QR straight to the terminal. Useful for the dev mock account. |
| local_build.sh | One-liner foreground compose up. Convenience. |
15. Confirmed values & remaining pre-deploy checks
Confirmed during brainstorming:
- Web URL: https://wabot.04080616.xyz.
- Default timezone: Asia/Kuala_Lumpur (seeded into operators.default_timezone).
- Media retention: 90 days. A sweeper job runs daily and deletes media files older than the retention period that no reminder references.
- Minimum interval between recurring fires: 5 minutes (enforced at the schedule-validation layer to prevent runaway loops).
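The 5-minute minimum can be checked against a handful of upcoming occurrences, as sketched below. Expanding the RRULE into dates (for example with the rrule library) is left out; this helper only assumes it receives the first few occurrences. The function name is hypothetical.

```typescript
const MIN_INTERVAL_MS = 5 * 60 * 1000; // 5 minutes, from the confirmed values

// Returns true when any two consecutive occurrences are closer than the minimum.
function violatesMinInterval(occurrences: Date[], minMs: number = MIN_INTERVAL_MS): boolean {
  const times = occurrences.map((d) => d.getTime()).sort((a, b) => a - b);
  let previous: number | null = null;
  for (const current of times) {
    if (previous !== null && current - previous < minMs) return true;
    previous = current;
  }
  return false;
}
```

Schedule validation would reject a reminder on create or edit when this returns true for its first few expanded occurrences.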
Pre-deploy check (not blocking design; verified during first implementation step):
- Postgres connectivity: confirm pg_hba.conf on 192.168.0.210 allows the Docker bridge subnet (172.16.0.0/12) and listen_addresses covers the LAN interface. Add the entry before first deploy if missing.