docs: refresh README + add docs/runbook.md for v1 sign-off

- README rewritten to reflect v1 reality: auth bootstrap, AES-GCM
  cookies, three-layer rate limit, duplicate-pair detection,
  logout-before-delete, journal-monotonic guard, the new test
  counts (482 web + 88 bot), and the right scripts (set-password,
  create-user). Drops the telegram-era 'Status' paragraph and the
  earlier 'Auth deferred' bullet.
- docs/runbook.md is a new manual end-to-end smoke checklist
  organised by section: pre-flight, auth bootstrap, user
  management, account pairing (incl. back→re-pair + duplicate-phone
  regression checks), reminder lifecycle (incl. triple-fire +
  reschedule regression checks), account lifecycle, sign-out +
  token-version kill, cross-tenant isolation, log sweep, plus a
  troubleshooting cheatsheet.

Closes P3/T23 + P3/T24.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yiekheng 2026-05-10 21:45:03 +08:00
parent 47d7c53fda
commit c906a9fa3a
2 changed files with 289 additions and 31 deletions

120
README.md

@@ -6,24 +6,36 @@ the run history all from a phone home-screen icon.
 ## Status
-**Plans 1, 2, and 3 complete.** The web app at `wabot.04080616.xyz` is
-the primary control surface; the Telegram bot has been removed.
+**v1 production-ready.** The web app at `wabot.04080616.xyz` is the
+primary control surface; the Telegram bot has been removed.
 What's working today:
+- **Username + password auth** with role-based access (admin / user).
+HttpOnly + Secure session cookies, encrypted with AES-256-GCM (so a
+leaked cookie reveals nothing about userId / role) and bound to the
+`OPERATOR_TOKEN_VERSION` env so a single env bump kills every
+outstanding session.
+- **Three-layer login rate limit** — per-IP + per-username (lower-cased
+so case-rotation doesn't help) + a global backstop, so a residential-
+proxy attacker can't brute one account by hopping IPs.
 - **Self-hosted Next.js 16 PWA** — installable on a phone home screen.
 Mobile-first single-row header with a slide-out drawer; desktop
-sidebar.
+sidebar. Login lives outside the shell on a bare-header surface.
 - **Live QR pairing** — server-side Baileys session feeds the QR
 payload directly into the browser via Server-Sent Events. Scan,
 see "✅ Connected" within seconds, auto-redirect.
+- **Duplicate-pair detection** — scanning a QR with a phone already
+linked to another account row surfaces a clear "already paired as
+&lt;label&gt;" message instead of fighting Baileys for the device.
 - **Multi-account, multi-group reminders** — 5-step wizard
 (Account → Message → When → Groups → Review) plus per-section edit
 pages so you don't have to walk the wizard end-to-end to fix one
-field. Active recurrence picker covers Daily / Weekly / Monthly /
-Yearly with multi-rule support and per-rule fire-time pickers; the
-rendered description reads as plain English ("Every week on Mon,
-Wed, Fri at 09:00") not raw cron.
+field. Recurrence picker covers Daily / Weekly / Monthly / Yearly
+with multi-rule support and per-rule fire-time pickers; the rendered
+description reads as plain English ("Every week on Mon, Wed, Fri at
+09:00") not raw cron. Optional "Pause sending by" deadline that
+defaults OFF — operators have to opt in explicitly.
 - **Multi-message stacks** — a reminder can carry multiple ordered
 parts (text + media), fired in sequence with a 1.5 s gap. Media
 files swap at any time from the Edit Message page.
@@ -33,19 +45,29 @@ What's working today:
 as a downloadable file instead of failing silently.
 - **Swipe-to-act rows** — on mobile, swipe a reminder or activity
 row left for Delete or right for Pause/Restart/Archive. iOS-Mail
-style.
+style. Click vs drag is disambiguated by a 6-px tap threshold so a
+swipe doesn't accidentally trigger the row's link.
 - **Activity tab** — last 200 runs with status filters (Success /
-Partial / Failed / Skipped) plus an Archived tab. Archive a noisy
-run to keep the main list readable; restore later. Hard-delete
-always available. Run history survives a reminder deletion.
+Paused / Failed / Archived). Partial runs surface under both Paused
+and Failed; Skipped runs collapse into Archived. Hard-delete and
+archive both available; run history survives a reminder deletion.
 - **Auto-reconnect on transient drops; restart-survival via Baileys
 session persistence.** Pair once, the device stays linked across
-container restarts.
-- **All actions audited.** Reminder run history queryable from the
-UI; per-run target results (sent / failed / skipped) preserved
-even when the underlying group is removed.
+container restarts. Logout-on-delete cleans the operator's
+linked-devices list on the WhatsApp side too.
+- **Hardened pg-boss scheduling** — three-tier dedupe so a triple-
+click Save or microsecond-spaced enqueue doesn't fire a reminder
+multiple times. Reschedule cancels stale jobs by singletonKey first
+so a recurring next-fire never gets silently dropped.
+- **Drizzle journal monotonicity guard** — `pnpm migrate` refuses to
+run if the `_journal.json` `when` timestamps aren't strictly
+increasing (a recurring foot-gun where drizzle would silently skip
+a freshly-generated migration). CI tests + the migrate runner both
+enforce.
+- **All actions audited.** Per-run target results (sent / failed /
+skipped) preserved even when the underlying group is removed.
-Test count: **249 web + 31 shared + 26 bot = 306** passing.
+Test count: **482 web + 88 bot = 570** passing.
 ## Host requirements
@@ -79,24 +101,28 @@ Prerequisites: Docker, the `wabot` database + `waBot` role on
 # 1. Configure env
 cp envs/.env.example .env.development
 # edit .env.development: real DATABASE_URL, plus the LAN host to expose
-scripts/gen_auth_secret.sh --write
+scripts/gen_auth_secret.sh --write # writes AUTH_SECRET to .env.development
 # 2. Bring up the stack, install deps
 NO_SUDO=1 scripts/dev.sh up
 NO_SUDO=1 scripts/dev.sh pnpm install
-# 3. Apply migrations and seed your operator row
+# 3. Apply migrations and seed the bootstrap operator row
 NO_SUDO=1 scripts/db.sh migrate
 NO_SUDO=1 scripts/db.sh seed
-# 4. Open the web app
+# 4. Set the bootstrap admin password (NO password is set by seed)
+echo 'change-me-now' | scripts/set-password.sh admin
+# 5. Open the web app and sign in as `admin` with the password above
 # Local: http://localhost:9000
-# LAN: http://<host-ip>:9000 (e.g. http://192.168.0.253:9000)
-# Public: https://wabot.04080616.xyz (whatever your reverse proxy serves)
+# LAN: http://<host-ip>:9000
+# Public: https://wabot.04080616.xyz
 ```
-Pair an account: `/accounts` → "New Account" → enter a label →
-"Pair WhatsApp" → scan the QR with WhatsApp's "Linked Devices".
+Inside the app: `/settings/users` → Add user → invite teammates with
+`user` role; promote / demote / reset password / delete from the same
+page. The "Admin" nav entry is admin-only.
 PWA install: phone Chrome → menu → "Install App" / "Add to Home
 Screen". Launches fullscreen.
@@ -108,6 +134,9 @@ group (the default for this repo). Drop it if you need `sudo docker`.
 End-to-end checks that unit tests can't cover (live Baileys,
 WhatsApp delivery, swipe gestures):
+[`docs/runbook.md`](docs/runbook.md).
+The earlier wizard-only checklist still lives at
 [`docs/superpowers/specs/manual-test-web.md`](docs/superpowers/specs/manual-test-web.md).
 ## Layout
@@ -118,11 +147,14 @@ WhatsApp delivery, swipe gestures):
 - `packages/db/` — Drizzle schema and migrations
 - `packages/shared/` — cross-app helpers (rrule, media paths,
 timezones, WhatsApp media classifier)
-- `docs/superpowers/specs/` — design specs and manual test runbooks
+- `docs/runbook.md` — manual end-to-end smoke checklist
+- `docs/superpowers/specs/` — design specs and earlier manual test
+runbooks
 - `docs/superpowers/plans/` — implementation plans
 - `docker/` — Dockerfiles (`tools.Dockerfile`, `bot.Dockerfile`,
 `web.Dockerfile`)
-- `scripts/` — `dev.sh`, `db.sh`, `gen_auth_secret.sh`
+- `scripts/` — `dev.sh`, `db.sh`, `gen_auth_secret.sh`,
+`set-password.sh`, `create-user.sh`
 ## Scripts
@@ -134,17 +166,43 @@ container, so no host Node is needed.
 | `scripts/dev.sh up\|down\|logs\|status\|build\|exec\|pnpm\|shell\|restart-bot` | Stack lifecycle and tools-container shell |
 | `scripts/db.sh migrate\|generate\|studio\|seed\|reset` | Drizzle migration helper |
 | `scripts/gen_auth_secret.sh [--write]` | Generate `AUTH_SECRET` (host-only, no Node needed) |
+| `scripts/set-password.sh <username>` | Set / reset a user's password (reads stdin) |
+| `scripts/create-user.sh <username> <role>` | Create a user from CLI (admin / user) |
 Set `NO_SUDO=1` if your user is in the docker group (recommended).
+## Auth + admin model
+- One bootstrap operator (`admin`) is created by the seed; its
+password is set via `scripts/set-password.sh admin` on first launch.
+- Two roles: `admin` (full access including user management) and
+`user` (everything except `/settings/users`). Role-based nav
+filtering is enforced in middleware + the AppShell + every server
+action that mutates user state.
+- Every user gets an isolated workspace — accounts, reminders,
+groups, and run history all scope by `operator_id`. The admin
+panel is the only cross-tenant surface.
+- Sessions: AES-256-GCM-encrypted cookie keyed off `AUTH_SECRET`,
+HttpOnly + Secure-in-prod + SameSite=Lax, 30-day TTL. The
+`OPERATOR_TOKEN_VERSION` env (defaults to `"1"`) is the kill switch
+— bumping it invalidates every outstanding cookie globally on the
+next request.
+- Login rate limits: 10 / 5 min per-IP + 5 / 15 min per-username + a
+100 / min global backstop. The error message is identical for all
+three so the limit-which-tripped isn't leaked.
 ## Deferred
 - **Standalone media library** browser (currently media is uploaded
 per-reminder).
 - **E2E browser tests** (Playwright) on the swipe and pairing flows.
-- **Auth** (passkeys / email-password) — bring back if URL exposure
-becomes a concern. Today the app trusts whatever's in front of the
-reverse proxy.
-- **Multi-operator** — schema supports `operator_id` on every row,
-but the seed runs as a single operator and there's no /signup or
-invite flow yet.
+- **Search-as-you-type in the wizard's groups picker** — at 3 000+
+groups per account the picker still loads the alphabetical
+top-200; operators with >200 groups need to use the list page's
+search to find anything past 'L'.
+- **Composite index on `(account_id, name)`** for the groups list
+page's `ORDER BY name LIMIT 200` query — currently a sort + limit;
+the GIN trigram on `name` plus the unique on `(account_id,
+wa_group_jid)` already cover most cases.
+- **Self-service password reset** (email link, etc.) — out of scope
+for v1; admins use the Users page.

200
docs/runbook.md Normal file

@@ -0,0 +1,200 @@
# Manual end-to-end runbook (v1)
Smoke checklist for verifying a fresh deploy. Unit tests don't catch
the live-Baileys / live-Postgres / browser-gesture path; this is what
you run before declaring a release good.
Time budget: ~10 minutes if everything works, ~30 if a step fails.
---
## Pre-flight
- [ ] **Stack up.**
`docker ps | grep cmbot` → expect `cmbot-tools`, `cmbot-bot`,
`cmbot-web` all `Up`.
- [ ] **Migrations clean.**
`NO_SUDO=1 scripts/db.sh migrate` → "Migrations applied." (and
*not* "Refusing to run drizzle migrate" — that's the journal
monotonicity guard tripping).
- [ ] **Web reachable.**
`curl -sf http://localhost:9000/api/health` → 200.
- [ ] **Bot reachable.**
`curl -sf http://localhost:8081/health` → 200.
If any pre-flight fails, fix before continuing.
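For readers unfamiliar with the journal monotonicity guard mentioned in the migrations check, the idea can be sketched roughly as follows. This is an illustration only, not the project's migrate runner: the function name `findNonMonotonicEntries`, the `suggestedWhen` field, the sample entries, and the printed message are all assumptions. The only thing taken from the runbook is that each `_journal.json` entry carries a `when` timestamp and that those timestamps must be strictly increasing.

```typescript
// Subset of one entry in migrations/meta/_journal.json that this check reads.
interface JournalEntry {
  idx: number;
  when: number; // epoch milliseconds
  tag: string;
}

// Hypothetical report shape for an out-of-order entry.
interface JournalViolation {
  tag: string;
  when: number;
  suggestedWhen: number;
}

// Returns every entry whose `when` is not strictly greater than the entry
// before it, together with the smallest value that would restore the order.
function findNonMonotonicEntries(entries: JournalEntry[]): JournalViolation[] {
  const violations: JournalViolation[] = [];
  let previousWhen = Number.NEGATIVE_INFINITY;
  for (const entry of entries) {
    if (entry.when <= previousWhen) {
      const suggestedWhen = previousWhen + 1;
      violations.push({ tag: entry.tag, when: entry.when, suggestedWhen });
      // Judge later entries against the repaired sequence, not the broken one.
      previousWhen = suggestedWhen;
    } else {
      previousWhen = entry.when;
    }
  }
  return violations;
}

// Made-up journal: the third migration was generated with an older timestamp.
const sampleJournal: JournalEntry[] = [
  { idx: 0, when: 1700000000000, tag: "0000_init" },
  { idx: 1, when: 1700000500000, tag: "0001_add_users" },
  { idx: 2, when: 1700000400000, tag: "0002_add_index" },
];

for (const v of findNonMonotonicEntries(sampleJournal)) {
  console.log(`${v.tag}: when=${v.when} is out of order; bump it to at least ${v.suggestedWhen}`);
}
```

A guard of this shape would run before the migration tool is invoked and abort when the returned list is non-empty.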
---
## 1. Auth bootstrap
- [ ] `scripts/db.sh seed` (idempotent — only inserts the `admin`
operator if missing).
- [ ] `echo 'change-me-now' | scripts/set-password.sh admin` → "Password
updated."
- [ ] Open `http://localhost:9000/login` → enter `admin` / the password
→ redirected to `/`.
- [ ] **Wrong password three times in a row** still rate-limits but
with the generic "Too many attempts" message — no leak about
which limit (IP / username / global) tripped.
- [ ] Hit `/admin` URL while signed out → redirected to `/login` with
`?next=/admin`. After a successful login, lands back on `/admin`.
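The rate-limit check above is easier to reason about with a model of the three layers in hand. The sketch below is a simplified in-memory version under stated assumptions: it uses fixed windows, counts every attempt against all three layers, and borrows the limits quoted in the README's auth section (10 per 5 min per IP, 5 per 15 min per username, 100 per min globally). The class names and the windowing strategy are illustrative and may differ from what the app actually does.

```typescript
const MINUTE_MS = 60 * 1000;
const GENERIC_MESSAGE = "Too many attempts";

// Fixed-window counter: at most `limit` hits per `windowMs` for each key.
class WindowCounter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Records one hit and reports whether the key is still within its limit.
  hit(key: string, nowMs: number): boolean {
    const current = this.windows.get(key);
    if (current === undefined || nowMs - current.start >= this.windowMs) {
      this.windows.set(key, { start: nowMs, count: 1 });
      return true;
    }
    current.count += 1;
    return current.count <= this.limit;
  }
}

class LoginRateLimiter {
  private readonly perIp = new WindowCounter(10, 5 * MINUTE_MS);
  private readonly perUsername = new WindowCounter(5, 15 * MINUTE_MS);
  private readonly everyone = new WindowCounter(100, 1 * MINUTE_MS);

  // Returns null when the attempt may proceed, otherwise the same generic
  // message regardless of which layer tripped.
  check(ip: string, username: string, nowMs: number): string | null {
    const ipOk = this.perIp.hit(ip, nowMs);
    // Lower-casing makes "Admin" and "aDMIN" share one counter.
    const userOk = this.perUsername.hit(username.toLowerCase(), nowMs);
    const globalOk = this.everyone.hit("global", nowMs);
    return ipOk && userOk && globalOk ? null : GENERIC_MESSAGE;
  }
}

// An attacker hops to a new IP and rotates letter case on every attempt.
const limiter = new LoginRateLimiter();
const results: Array<string | null> = [];
for (let i = 0; i < 6; i++) {
  const username = i % 2 === 0 ? "Admin" : "aDMIN";
  results.push(limiter.check(`203.0.113.${i}`, username, 1000 * i));
}
console.log(results);
```

In this model the first five attempts pass and the sixth is rejected by the per-username layer even though no single IP was reused, which is the behaviour the per-username layer exists to provide.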
---
## 2. User management (admin-only)
- [ ] **Sidebar / drawer**: only one nav entry highlights at a time.
On `/settings/users`, only `Admin` lights up; `Settings` does
not.
- [ ] `/settings/users` → Add user → username `alice`, password
`alpha7!`, role `user` → "User created."
- [ ] `alice` row shows: username + `you` chip if applicable, role
pill, Promote / Reset / Delete buttons on row 2.
- [ ] Promote `alice` to admin → page revalidates, badge flips to
`admin`.
- [ ] Demote back to `user`.
- [ ] **Last-admin guard**: Demote / Delete on the only remaining
admin row are both disabled.
- [ ] Delete `alice` via the confirm dialog (Cancel + Delete user
buttons; **no third "Close" button** — the static guard test
catches that regression but eyeball it anyway).
---
## 3. Account pairing
- [ ] `/accounts` → New Account → label `WaBot Test` → Pair WhatsApp.
Land on the live QR page within ~2 s.
- [ ] Login screen header is JUST the centered brand mark — no nav,
no menu drawer.
- [ ] Scan with WhatsApp → "Linked Devices" → "Link a device".
- [ ] **Connection success.** Page transitions through `qr` → (brief
`restart-required` close handled silently) → `connected` with
a green check and `+60xxx` phone number → auto-redirect to
`/accounts/<id>` after 3 s.
- [ ] **Refresh Groups** button on `/accounts/<id>/groups` → spinner
during the sync, page auto-refreshes when the bot pushes
`groups.synced` over SSE. No manual reload needed.
### Pair regression checks (these caught real bugs)
- [ ] **Back → Re-pair**: from a live QR, click ← Back → Pair again
from the account detail page. Should NOT instantly flash
"Pairing timed out". A new QR appears and the countdown
restarts at 5:00.
- [ ] **Duplicate phone**: with one phone already paired, scan its QR
from a *second* account row → see the amber "Phone already
linked" panel naming the existing account. The original
account's session stays intact.
---
## 4. Reminder lifecycle
- [ ] `/reminders` → New Reminder → walk the wizard:
- Step 1: pick `WaBot Test`.
- Step 2: enter a short text message ("smoke test &lt;timestamp&gt;").
- Step 3: pick `Daily` recurrence, fire ~2 minutes from now.
Confirm "Pause sending by" checkbox is **unchecked by default**.
- Step 4: select 1 group.
- Step 5: review → Save.
- [ ] Reminder appears on `/reminders` with status `Active`.
Recurrence column shows the human-readable description; long
descriptions truncate with `…`.
- [ ] **Wait for the fire window.** When the time hits, the message
lands in the WhatsApp group **exactly once**.
- [ ] `/activity` → the run shows under `Success`. Default tab is
Success (no `All` tab).
- [ ] Swipe-left a row → Delete shelf appears. Swipe-right → Pause /
Restart shelf. Tapping a row navigates to its detail; dragging
does NOT navigate (6-px threshold).
- [ ] Pause the reminder → status flips to `Paused` immediately and
the next-fire-time disappears.
- [ ] Restart → fires on the next scheduled occurrence.
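The tap-versus-drag behaviour in the swipe check can be pictured with a small classifier. This is a sketch under assumptions, not the app's gesture code: the function name, the `scroll` category for mostly vertical movement, and the choice of a strict "less than 6 px" comparison are all illustrative. Only the 6-px threshold and the left/right meanings come from the checklist.

```typescript
interface Point {
  x: number;
  y: number;
}

type Gesture = "tap" | "swipe-left" | "swipe-right" | "scroll";

const TAP_THRESHOLD_PX = 6;

// Classifies a pointer-down / pointer-up pair.
function classifyGesture(down: Point, up: Point): Gesture {
  const dx = up.x - down.x;
  const dy = up.y - down.y;
  // Movement below the threshold counts as a tap, so the row link may fire.
  if (Math.hypot(dx, dy) < TAP_THRESHOLD_PX) return "tap";
  // Mostly vertical movement is treated as page scrolling (an assumption).
  if (Math.abs(dy) > Math.abs(dx)) return "scroll";
  // Left reveals the Delete shelf, right reveals Pause / Restart.
  return dx < 0 ? "swipe-left" : "swipe-right";
}

console.log(classifyGesture({ x: 100, y: 50 }, { x: 103, y: 52 }));
console.log(classifyGesture({ x: 100, y: 50 }, { x: 40, y: 52 }));
console.log(classifyGesture({ x: 100, y: 50 }, { x: 180, y: 50 }));
```

Only a `tap` result should be allowed to navigate; every other result suppresses the click so that a drag never opens the row's detail page.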
### Reminder regression checks
- [ ] **Triple-fire repro** (only if you have a tame group): edit
the reminder repeatedly within microseconds of each other (e.g.
the wizard Save button hammered three times). The message must
land **exactly once**. The bot logs should show
"duplicate fire detected inside mutex" warnings on the second
and third attempts.
- [ ] **Reschedule under existing job**: edit a recurring reminder's
schedule to a NEW time before its next-fire arrives. The new
time must fire (the old `created` job is now `cancelled` in
`pgboss.job`; verify with `select state, count(*) from
pgboss.job where name='reminder.fire' group by state`).
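One way to think about the exactly-once expectation in the triple-fire check is an idempotency key per occurrence: the first worker to claim `(reminder, scheduled time)` sends, and everyone else backs off. The sketch below shows that idea with an in-memory set. It is not the project's three-tier dedupe and does not use pg-boss; `FireLedger`, `tryClaim`, and the log text are hypothetical names chosen for illustration.

```typescript
// Remembers which (reminder, occurrence) pairs have already been claimed.
class FireLedger {
  private readonly claimed = new Set<string>();

  // Returns true for the first claim of an occurrence, false for duplicates.
  tryClaim(reminderId: string, scheduledFor: Date): boolean {
    const key = `${reminderId}@${scheduledFor.toISOString()}`;
    if (this.claimed.has(key)) return false;
    this.claimed.add(key);
    return true;
  }
}

const ledger = new FireLedger();
const fireAt = new Date("2026-05-10T09:00:00.000Z");
const sent: string[] = [];

// Three fire attempts for the same occurrence, as a hammered Save might cause.
for (let attempt = 1; attempt <= 3; attempt++) {
  if (ledger.tryClaim("reminder-1", fireAt)) {
    sent.push(`attempt ${attempt}`);
  } else {
    console.warn(`attempt ${attempt}: duplicate fire detected, skipping`);
  }
}
console.log(sent);
```

A production version would need the claim to live somewhere durable and shared (for example a unique constraint in Postgres), since an in-process set is lost on restart and is invisible to other workers.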
---
## 5. Account lifecycle
- [ ] **Unpair** the account from `/accounts/<id>`. Confirm dialog
(Cancel + Yes, unpair). The account row stays in the list with
"Unpaired" status; groups disappear from the picker (they're
soft-archived, not deleted).
- [ ] **Re-pair** the same account → groups come back via the
on-conflict upsert flipping `is_archived` back to false.
- [ ] **Delete** the account from `/accounts/<id>` → Confirm dialog →
the account vanishes from `/accounts`. Check on the *phone*'s
WhatsApp Linked Devices list — the entry is gone (the
logout-before-stop flow tells WhatsApp to drop it).
---
## 6. Sign-out + session lifetime
- [ ] **Sign out** from the sidebar / drawer footer → land on `/login`.
- [ ] Hit any protected URL → redirected to login.
- [ ] **Token-version kill switch**: set `OPERATOR_TOKEN_VERSION=2`
in `.env.development`, restart the web container. Every
previously-issued cookie is now invalid; every authenticated
request bounces to `/login`. Reset to `1` after.
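The kill switch works because each session carries the token version it was issued under, and that value is compared with the server's current one on every request. A minimal sketch of that comparison follows, assuming the cookie has already been decrypted into a payload. The `SessionPayload` field names and the function name are hypothetical; the 30-day TTL and the default version of `"1"` are the values stated in the README.

```typescript
// Hypothetical shape of a decrypted session cookie.
interface SessionPayload {
  userId: string;
  role: "admin" | "user";
  tokenVersion: string;
  issuedAtMs: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const SESSION_TTL_MS = 30 * DAY_MS;

// A session is valid only if it was issued under the current token version
// and has not outlived its TTL.
function isSessionValid(
  payload: SessionPayload,
  currentTokenVersion: string,
  nowMs: number,
): boolean {
  if (payload.tokenVersion !== currentTokenVersion) return false;
  return nowMs - payload.issuedAtMs < SESSION_TTL_MS;
}

const issuedAt = Date.UTC(2026, 4, 10);
const session: SessionPayload = {
  userId: "u-1",
  role: "admin",
  tokenVersion: "1",
  issuedAtMs: issuedAt,
};

console.log(isSessionValid(session, "1", issuedAt + DAY_MS)); // same version, fresh cookie
console.log(isSessionValid(session, "2", issuedAt + DAY_MS)); // version bumped
console.log(isSessionValid(session, "1", issuedAt + 31 * DAY_MS)); // past the TTL
```

Because the comparison happens per request, bumping the version needs no session store and no per-user bookkeeping; the cost is that every user is signed out at once.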
---
## 7. Cross-tenant isolation
- [ ] Sign in as `admin`. Note dashboard counter values.
- [ ] As admin, create a second user `bob` and give them a fresh
account / reminder / fire it once.
- [ ] Sign out, sign in as `bob`. Dashboard counters MUST show only
bob's numbers (not admin's). `/reminders` lists only bob's
reminders. `/accounts` only bob's accounts.
---
## 8. Sweep
- [ ] `docker logs cmbot-web --since 10m | grep -iE 'error'` → no
output (or only Baileys "Stream Errored (restart required)"
noise; that's upstream).
- [ ] `docker logs cmbot-bot --since 10m | grep -iE 'error|fatal'` →
no output beyond the same Baileys upstream noise.
- [ ] `git status` clean (no leftover `_check.ts` or temp files).
---
## When a step fails
- **Migration refused** with "Refusing to run drizzle migrate":
open `packages/db/migrations/meta/_journal.json` and bump the
flagged entry's `when` to the suggested value. Re-run.
- **Pair shows immediate timeout**: bot logs should mention "ignoring
close from previous attempt while warming up" — that's the fix
working, but check a stale Baileys session isn't gummed up. Last
resort: `rm -rf dev-data/sessions/<accountId>` and re-pair.
- **Reminder fires twice**: check `pgboss.queue.policy` for
`reminder.fire` — must be `standard`, not `stately` (stately drops
reschedules silently). The `registerReminderJobs` boot hook
force-flips this on every bot start.
- **Delete didn't remove the linked-device entry on the phone**:
the bot's `socket.logout()` is best-effort — if the socket was
already disconnected when delete fired, the operator removes the
entry manually from WhatsApp's UI.
If any of the regression checks (Back→Re-pair, duplicate phone,
triple-fire, reschedule) fail, that's a real bug — capture the bot
log and file an issue before shipping.