From c906a9fa3af9218e080bf21a8d17b317db128f57 Mon Sep 17 00:00:00 2001
From: yiekheng
Date: Sun, 10 May 2026 21:45:03 +0800
Subject: [PATCH] docs: refresh README + add docs/runbook.md for v1 sign-off
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- README rewritten to reflect v1 reality: auth bootstrap, AES-GCM
  cookies, three-layer rate limit, duplicate-pair detection,
  logout-before-delete, journal-monotonic guard, the new test counts
  (482 web + 88 bot), and the right scripts (set-password,
  create-user). Drops the telegram-era 'Status' paragraph and the
  earlier 'Auth deferred' bullet.
- docs/runbook.md is a new manual end-to-end smoke checklist organised
  by section: pre-flight, auth bootstrap, user management, account
  pairing (incl. back→re-pair + duplicate-phone regression checks),
  reminder lifecycle (incl. triple-fire + reschedule regression
  checks), account lifecycle, sign-out + token-version kill,
  cross-tenant isolation, log sweep, plus a troubleshooting cheatsheet.

Closes P3/T23 + P3/T24.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md       | 120 +++++++++++++++++++++--------
 docs/runbook.md | 200 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 289 insertions(+), 31 deletions(-)
 create mode 100644 docs/runbook.md

diff --git a/README.md b/README.md
index 5fa79d9..c933bdf 100644
--- a/README.md
+++ b/README.md
@@ -6,24 +6,36 @@ the run history all from a phone home-screen icon.

 ## Status

-**Plans 1, 2, and 3 complete.** The web app at `wabot.04080616.xyz` is
-the primary control surface; the Telegram bot has been removed.
+**v1 production-ready.** The web app at `wabot.04080616.xyz` is the
+primary control surface; the Telegram bot has been removed.

 What's working today:

+- **Username + password auth** with role-based access (admin / user).
+  HttpOnly + Secure session cookies, encrypted with AES-256-GCM (so a
+  leaked cookie reveals nothing about userId / role) and bound to the
+  `OPERATOR_TOKEN_VERSION` env so a single env bump kills every
+  outstanding session.
+- **Three-layer login rate limit** — per-IP + per-username (lower-cased
+  so case-rotation doesn't help) + a global backstop, so a residential-
+  proxy attacker can't brute one account by hopping IPs.
 - **Self-hosted Next.js 16 PWA** — installable on a phone home screen.
   Mobile-first single-row header with a slide-out drawer; desktop
-  sidebar.
+  sidebar. Login lives outside the shell on a bare-header surface.
 - **Live QR pairing** — server-side Baileys session feeds the QR
   payload directly into the browser via Server-Sent Events. Scan, see
   "✅ Connected" within seconds, auto-redirect.
+- **Duplicate-pair detection** — scanning a QR with a phone already
+  linked to another account row surfaces a clear "already paired as
+  &lt;label&gt;" message instead of fighting Baileys for the device.
 - **Multi-account, multi-group reminders** — 5-step wizard (Account →
   Message → When → Groups → Review) plus per-section edit pages so
   you don't have to walk the wizard end-to-end to fix one
-  field. Active recurrence picker covers Daily / Weekly / Monthly /
-  Yearly with multi-rule support and per-rule fire-time pickers; the
-  rendered description reads as plain English ("Every week on Mon,
-  Wed, Fri at 09:00") not raw cron.
+  field. Recurrence picker covers Daily / Weekly / Monthly / Yearly
+  with multi-rule support and per-rule fire-time pickers; the rendered
+  description reads as plain English ("Every week on Mon, Wed, Fri at
+  09:00") not raw cron. Optional "Pause sending by" deadline that
+  defaults OFF — operators have to opt in explicitly.
 - **Multi-message stacks** — a reminder can carry multiple ordered
   parts (text + media), fired in sequence with a 1.5 s gap. Media
   files swap at any time from the Edit Message page.
@@ -33,19 +45,29 @@ What's working today:
   as a downloadable file instead of failing silently.
 - **Swipe-to-act rows** — on mobile, swipe a reminder or activity
   row left for Delete or right for Pause/Restart/Archive. iOS-Mail
-  style.
+  style. Click vs drag is disambiguated by a 6-px tap threshold so a
+  swipe doesn't accidentally trigger the row's link.
 - **Activity tab** — last 200 runs with status filters (Success /
-  Partial / Failed / Skipped) plus an Archived tab. Archive a noisy
-  run to keep the main list readable; restore later. Hard-delete
-  always available. Run history survives a reminder deletion.
+  Paused / Failed / Archived). Partial runs surface under both Paused
+  and Failed; Skipped runs collapse into Archived. Hard-delete and
+  archive both available; run history survives a reminder deletion.
 - **Auto-reconnect on transient drops; restart-survival via Baileys
   session persistence.** Pair once, the device stays linked across
-  container restarts.
-- **All actions audited.** Reminder run history queryable from the
-  UI; per-run target results (sent / failed / skipped) preserved
-  even when the underlying group is removed.
+  container restarts. Logout-on-delete cleans the operator's
+  linked-devices list on the WhatsApp side too.
+- **Hardened pg-boss scheduling** — three-tier dedupe so a triple-
+  click Save or microsecond-spaced enqueue doesn't fire a reminder
+  multiple times. Reschedule cancels stale jobs by singletonKey first
+  so a recurring next-fire never gets silently dropped.
+- **Drizzle journal monotonicity guard** — `pnpm migrate` refuses to
+  run if the `_journal.json` `when` timestamps aren't strictly
+  increasing (a recurring foot-gun where drizzle would silently skip
+  a freshly-generated migration). CI tests + the migrate runner both
+  enforce it.
+- **All actions audited.** Per-run target results (sent / failed /
+  skipped) preserved even when the underlying group is removed.

-Test count: **249 web + 31 shared + 26 bot = 306** passing.
+Test count: **482 web + 88 bot = 570** passing.

 ## Host requirements

@@ -79,24 +101,28 @@ Prerequisites: Docker, the `wabot` database + `waBot` role on
 # 1. Configure env
 cp envs/.env.example .env.development
 # edit .env.development: real DATABASE_URL, plus the LAN host to expose
-scripts/gen_auth_secret.sh --write
+scripts/gen_auth_secret.sh --write  # writes AUTH_SECRET to .env.development

 # 2. Bring up the stack, install deps
 NO_SUDO=1 scripts/dev.sh up
 NO_SUDO=1 scripts/dev.sh pnpm install

-# 3. Apply migrations and seed your operator row
+# 3. Apply migrations and seed the bootstrap operator row
 NO_SUDO=1 scripts/db.sh migrate
 NO_SUDO=1 scripts/db.sh seed

-# 4. Open the web app
+# 4. Set the bootstrap admin password (NO password is set by seed)
+echo 'change-me-now' | scripts/set-password.sh admin
+
+# 5. Open the web app and sign in as `admin` with the password above
 # Local: http://localhost:9000
-# LAN: http://:9000 (e.g. http://192.168.0.253:9000)
-# Public: https://wabot.04080616.xyz (whatever your reverse proxy serves)
+# LAN: http://:9000
+# Public: https://wabot.04080616.xyz
 ```

-Pair an account: `/accounts` → "New Account" → enter a label →
-"Pair WhatsApp" → scan the QR with WhatsApp's "Linked Devices".
+Inside the app: `/settings/users` → Add user → invite teammates with
+`user` role; promote / demote / reset password / delete from the same
+page. The "Admin" nav entry is admin-only.

 PWA install: phone Chrome → menu → "Install App" / "Add to Home
 Screen". Launches fullscreen.
@@ -108,6 +134,9 @@ group (the default for this repo). Drop it if you need `sudo docker`.

 End-to-end checks that unit tests can't cover (live Baileys,
 WhatsApp delivery, swipe gestures):
+[`docs/runbook.md`](docs/runbook.md).
+
+The earlier wizard-only checklist still lives at
 [`docs/superpowers/specs/manual-test-web.md`](docs/superpowers/specs/manual-test-web.md).

 ## Layout
@@ -118,11 +147,14 @@ WhatsApp delivery, swipe gestures):
 - `packages/db/` — Drizzle schema and migrations
 - `packages/shared/` — cross-app helpers (rrule, media paths,
   timezones, WhatsApp media classifier)
-- `docs/superpowers/specs/` — design specs and manual test runbooks
+- `docs/runbook.md` — manual end-to-end smoke checklist
+- `docs/superpowers/specs/` — design specs and earlier manual test
+  runbooks
 - `docs/superpowers/plans/` — implementation plans
 - `docker/` — Dockerfiles (`tools.Dockerfile`, `bot.Dockerfile`,
   `web.Dockerfile`)
-- `scripts/` — `dev.sh`, `db.sh`, `gen_auth_secret.sh`
+- `scripts/` — `dev.sh`, `db.sh`, `gen_auth_secret.sh`,
+  `set-password.sh`, `create-user.sh`

 ## Scripts

@@ -134,17 +166,43 @@ container, so no host Node is needed.
 | `scripts/dev.sh up\|down\|logs\|status\|build\|exec\|pnpm\|shell\|restart-bot` | Stack lifecycle and tools-container shell |
 | `scripts/db.sh migrate\|generate\|studio\|seed\|reset` | Drizzle migration helper |
 | `scripts/gen_auth_secret.sh [--write]` | Generate `AUTH_SECRET` (host-only, no Node needed) |
+| `scripts/set-password.sh ` | Set / reset a user's password (reads stdin) |
+| `scripts/create-user.sh ` | Create a user from CLI (admin / user) |

 Set `NO_SUDO=1` if your user is in the docker group (recommended).

+## Auth + admin model
+
+- One bootstrap operator (`admin`) is created by the seed; its
+  password is set via `scripts/set-password.sh admin` on first launch.
+- Two roles: `admin` (full access including user management) and
+  `user` (everything except `/settings/users`). Role-based nav
+  filtering is enforced in middleware + the AppShell + every server
+  action that mutates user state.
+- Every user gets an isolated workspace — accounts, reminders,
+  groups, and run history all scope by `operator_id`. The admin
+  panel is the only cross-tenant surface.
+- Sessions: AES-256-GCM-encrypted cookie keyed off `AUTH_SECRET`,
+  HttpOnly + Secure-in-prod + SameSite=Lax, 30-day TTL. The
+  `OPERATOR_TOKEN_VERSION` env (defaults to `"1"`) is the kill switch
+  — bumping it invalidates every outstanding cookie globally on the
+  next request.
+- Login rate limits: 10 / 5 min per-IP + 5 / 15 min per-username + a
+  100 / min global backstop. The error message is identical for all
+  three so the limit-which-tripped isn't leaked.
+
 ## Deferred

 - **Standalone media library** browser (currently media is uploaded
   per-reminder).
 - **E2E browser tests** (Playwright) on the swipe and pairing flows.
-- **Auth** (passkeys / email-password) — bring back if URL exposure
-  becomes a concern. Today the app trusts whatever's in front of the
-  reverse proxy.
-- **Multi-operator** — schema supports `operator_id` on every row,
-  but the seed runs as a single operator and there's no /signup or
-  invite flow yet.
+- **Search-as-you-type in the wizard's groups picker** — at 3 000+
+  groups per account the picker still loads the alphabetical
+  top-200; operators with >200 groups need to use the list page's
+  search to find anything past 'L'.
+- **Composite index on `(account_id, name)`** for the groups list
+  page's `ORDER BY name LIMIT 200` query — currently a sort + limit;
+  the GIN trigram on `name` plus the unique on `(account_id,
+  wa_group_jid)` already cover most cases.
+- **Self-service password reset** (email link, etc.) — out of scope
+  for v1; admins use the Users page.
diff --git a/docs/runbook.md b/docs/runbook.md
new file mode 100644
index 0000000..5d503d8
--- /dev/null
+++ b/docs/runbook.md
@@ -0,0 +1,200 @@
+# Manual end-to-end runbook (v1)
+
+Smoke checklist for verifying a fresh deploy. Unit tests don't catch
+the live-Baileys / live-Postgres / browser-gesture path; this is what
+you run before declaring a release good.
+
+Time budget: ~10 minutes if everything works, ~30 if a step fails.
+
+---
+
+## Pre-flight
+
+- [ ] **Stack up.**
+      `docker ps | grep cmbot` → expect `cmbot-tools`, `cmbot-bot`,
+      `cmbot-web` all `Up`.
+- [ ] **Migrations clean.**
+      `NO_SUDO=1 scripts/db.sh migrate` → "Migrations applied." (and
+      *not* "Refusing to run drizzle migrate" — that's the journal
+      monotonicity guard tripping).
+- [ ] **Web reachable.**
+      `curl -sf http://localhost:9000/api/health` → 200.
+- [ ] **Bot reachable.**
+      `curl -sf http://localhost:8081/health` → 200.
+
+If any pre-flight fails, fix before continuing.
+
+---
+
+## 1. Auth bootstrap
+
+- [ ] `scripts/db.sh seed` (idempotent — only inserts the `admin`
+      operator if missing).
+- [ ] `echo 'change-me-now' | scripts/set-password.sh admin` → "Password
+      updated."
+- [ ] Open `http://localhost:9000/login` → enter `admin` / the password
+      → redirected to `/`.
+- [ ] **Wrong password past the per-username limit** (5 / 15 min)
+      returns the generic "Too many attempts" message; nothing
+      reveals which limit (IP / username / global) tripped.
+- [ ] Hit `/admin` URL while signed out → redirected to `/login` with
+      `?next=/admin`. After a successful login, lands back on `/admin`.
+
+---
+
+## 2. User management (admin-only)
+
+- [ ] **Sidebar / drawer**: only one nav entry highlights at a time.
+      On `/settings/users`, only `Admin` lights up; `Settings` does
+      not.
+- [ ] `/settings/users` → Add user → username `alice`, password
+      `alpha7!`, role `user` → "User created."
+- [ ] `alice` row shows: username + `you` chip if applicable, role
+      pill, Promote / Reset / Delete buttons on row 2.
+- [ ] Promote `alice` to admin → page revalidates, badge flips to
+      `admin`.
+- [ ] Demote back to `user`.
+- [ ] **Last-admin guard**: Demote / Delete on the only remaining
+      admin row are both disabled.
+- [ ] Delete `alice` via the confirm dialog (Cancel + Delete user
+      buttons; **no third "Close" button** — the static guard test
+      catches that regression but eyeball it anyway).
+
+---
+
+## 3. Account pairing
+
+- [ ] `/accounts` → New Account → label `WaBot Test` → Pair WhatsApp.
+      Land on the live QR page within ~2 s.
+- [ ] Login screen header is JUST the centered brand mark — no nav,
+      no menu drawer.
+- [ ] Scan with WhatsApp → "Linked Devices" → "Link a device".
+- [ ] **Connection success.** Page transitions through `qr` → (brief
+      `restart-required` close handled silently) → `connected` with
+      a green check and `+60xxx` phone number → auto-redirect to
+      `/accounts/` after 3 s.
+- [ ] **Refresh Groups** button on `/accounts//groups` → spinner
+      during the sync, page auto-refreshes when the bot pushes
+      `groups.synced` over SSE. No manual reload needed.
+
+### Pair regression checks (these caught real bugs)
+
+- [ ] **Back → Re-pair**: from a live QR, click ← Back → Pair again
+      from the account detail page. Should NOT instantly flash
+      "Pairing timed out". A new QR appears and the countdown
+      restarts at 5:00.
+- [ ] **Duplicate phone**: with one phone already paired, scan its QR
+      from a *second* account row → see the amber "Phone already
+      linked" panel naming the existing account. The original
+      account's session stays intact.
+
+---
+
+## 4. Reminder lifecycle
+
+- [ ] `/reminders` → New Reminder → walk the wizard:
+      - Step 1: pick `WaBot Test`.
+      - Step 2: enter a short text message ("smoke test &lt;timestamp&gt;").
+      - Step 3: pick `Daily` recurrence, fire ~2 minutes from now.
+        Confirm "Pause sending by" checkbox is **unchecked by default**.
+      - Step 4: select 1 group.
+      - Step 5: review → Save.
+- [ ] Reminder appears on `/reminders` with status `Active`.
+      Recurrence column shows the human-readable description; long
+      descriptions truncate with `…`.
+- [ ] **Wait for the fire window.** When the time hits, the message
+      lands in the WhatsApp group **exactly once**.
+- [ ] `/activity` → the run shows under `Success`. Default tab is
+      Success (no `All` tab).
+- [ ] Swipe-left a row → Delete shelf appears. Swipe-right → Pause /
+      Restart shelf. Tapping a row navigates to its detail; dragging
+      does NOT navigate (6-px threshold).
+- [ ] Pause the reminder → status flips to `Paused` immediately and
+      the next-fire-time disappears.
+- [ ] Restart → fires on the next scheduled occurrence.
+
+### Reminder regression checks
+
+- [ ] **Triple-fire repro** (only if you have a tame group): edit
+      the reminder repeatedly within microseconds of each other (e.g.
+      the wizard Save button hammered three times). The message must
+      land **exactly once**. The bot logs should show
+      "duplicate fire detected inside mutex" warnings on the second
+      and third attempts.
+- [ ] **Reschedule under existing job**: edit a recurring reminder's
+      schedule to a NEW time before its next-fire arrives. The new
+      time must fire (the old `created` job is now `cancelled` in
+      `pgboss.job`; verify with `select state, count(*) from
+      pgboss.job where name='reminder.fire' group by state`).
+
+---
+
+## 5. Account lifecycle
+
+- [ ] **Unpair** the account from `/accounts/`. Confirm dialog
+      (Cancel + Yes, unpair). The account row stays in the list with
+      "Unpaired" status; groups disappear from the picker (they're
+      soft-archived, not deleted).
+- [ ] **Re-pair** the same account → groups come back via the
+      on-conflict upsert flipping `is_archived` back to false.
+- [ ] **Delete** the account from `/accounts/` → Confirm dialog →
+      the account vanishes from `/accounts`. Check on the *phone*'s
+      WhatsApp Linked Devices list — the entry is gone (the
+      logout-before-stop flow tells WhatsApp to drop it).
+
+---
+
+## 6. Sign-out + session lifetime
+
+- [ ] **Sign out** from the sidebar / drawer footer → land on `/login`.
+- [ ] Hit any protected URL → redirected to login.
+- [ ] **Token-version kill switch**: set `OPERATOR_TOKEN_VERSION=2`
+      in `.env.development`, restart the web container. Every
+      previously-issued cookie is now invalid; every authenticated
+      request bounces to `/login`. Reset to `1` after.
+
+---
+
+## 7. Cross-tenant isolation
+
+- [ ] Sign in as `admin`. Note dashboard counter values.
+- [ ] As admin, create a second user `bob` and give them a fresh
+      account / reminder / fire it once.
+- [ ] Sign out, sign in as `bob`. Dashboard counters MUST show only
+      bob's numbers (not admin's). `/reminders` lists only bob's
+      reminders. `/accounts` only bob's accounts.
+
+---
+
+## 8. Sweep
+
+- [ ] `docker logs cmbot-web --since 10m | grep -iE 'error|⨯'` — no
+      output (or only Baileys "Stream Errored (restart required)"
+      noise; that's upstream).
+- [ ] `docker logs cmbot-bot --since 10m | grep -iE 'error|fatal'` —
+      no output beyond the same Baileys upstream noise.
+- [ ] `git status` clean (no leftover `_check.ts` or temp files).
+
+---
+
+## When a step fails
+
+- **Migration refused** with "Refusing to run drizzle migrate":
+  open `packages/db/migrations/meta/_journal.json` and bump the
+  flagged entry's `when` to the suggested value. Re-run.
+- **Pair shows immediate timeout**: bot logs should mention "ignoring
+  close from previous attempt while warming up" — that's the fix
+  working, but check a stale Baileys session isn't gummed up. Last
+  resort: `rm -rf dev-data/sessions/` and re-pair.
+- **Reminder fires twice**: check `pgboss.queue.policy` for
+  `reminder.fire` — must be `standard`, not `stately` (stately drops
+  reschedules silently). The `registerReminderJobs` boot hook
+  force-flips this on every bot start.
+- **Delete didn't remove the linked-device entry on the phone**:
+  the bot's `socket.logout()` is best-effort — if the socket was
+  already disconnected when delete fired, the operator removes the
+  entry manually from WhatsApp's UI.
+
+If any of the regression checks (Back→Re-pair, duplicate phone,
+triple-fire, reschedule) fail, that's a real bug — capture the bot
+log and file an issue before shipping.
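
Reviewer note on the "Migration refused" troubleshooting entry: the journal monotonicity check it refers to could look roughly like the sketch below. This is an illustration only, not the guard that ships in this repo. The function name `findNonMonotonicEntries` and the `JournalViolation` shape are hypothetical, and the sketch assumes the `entries[].tag` / `entries[].when` layout that drizzle-kit normally writes to `_journal.json`. The "suggested value" here is simply the previous accepted `when` plus one; the real runner may suggest something different.

```typescript
// Hypothetical sketch of a journal monotonicity check.
// Assumes each journal entry carries a `tag` and a `when` (epoch ms).
interface JournalEntry {
  idx: number;
  tag: string;
  when: number;
}

// Hypothetical result shape: which entry is out of order and a value
// that would restore strict ordering.
interface JournalViolation {
  tag: string;
  when: number;
  suggestedWhen: number;
}

// Walks the entries in order and reports every entry whose `when` is
// not strictly greater than the previous accepted `when`. After a
// violation, the suggested value is treated as the new baseline, so
// later entries are checked against the corrected timeline.
function findNonMonotonicEntries(entries: JournalEntry[]): JournalViolation[] {
  const violations: JournalViolation[] = [];
  let previousWhen = Number.NEGATIVE_INFINITY;

  for (const entry of entries) {
    if (entry.when <= previousWhen) {
      const suggestedWhen = previousWhen + 1;
      violations.push({ tag: entry.tag, when: entry.when, suggestedWhen });
      previousWhen = suggestedWhen;
    } else {
      previousWhen = entry.when;
    }
  }
  return violations;
}
```

A migrate wrapper built on this idea would parse the journal file, call the function on its `entries` array, and exit non-zero with the flagged tags when the returned list is non-empty.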