yiekheng c906a9fa3a docs: refresh README + add docs/runbook.md for v1 sign-off
- README rewritten to reflect v1 reality: auth bootstrap, AES-GCM
  cookies, three-layer rate limit, duplicate-pair detection,
  logout-before-delete, journal-monotonic guard, the new test
  counts (482 web + 88 bot), and the right scripts (set-password,
  create-user). Drops the telegram-era 'Status' paragraph and the
  earlier 'Auth deferred' bullet.
- docs/runbook.md is a new manual end-to-end smoke checklist
  organised by section: pre-flight, auth bootstrap, user
  management, account pairing (incl. back→re-pair + duplicate-phone
  regression checks), reminder lifecycle (incl. triple-fire +
  reschedule regression checks), account lifecycle, sign-out +
  token-version kill, cross-tenant isolation, log sweep, plus a
  troubleshooting cheatsheet.

Closes P3/T23 + P3/T24.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:45:03 +08:00

# Manual end-to-end runbook (v1)
Smoke checklist for verifying a fresh deploy. Unit tests don't catch
the live-Baileys / live-Postgres / browser-gesture path; this is what
you run before declaring a release good.
Time budget: ~10 minutes if everything works, ~30 if a step fails.

---
## Pre-flight
- [ ] **Stack up.**
`docker ps | grep cmbot` → expect `cmbot-tools`, `cmbot-bot`,
`cmbot-web` all `Up`.
- [ ] **Migrations clean.**
`NO_SUDO=1 scripts/db.sh migrate` → "Migrations applied." (and
*not* "Refusing to run drizzle migrate" — that's the journal
monotonicity guard tripping).
- [ ] **Web reachable.**
`curl -sf http://localhost:9000/api/health` → 200.
- [ ] **Bot reachable.**
`curl -sf http://localhost:8081/health` → 200.

If any pre-flight check fails, fix it before continuing.
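
The pre-flight checks above can be scripted. The following is a minimal sketch, not part of the repository: `check_containers` and `preflight` are hypothetical helper names, and the sketch assumes `docker ps` is run with `--format '{{.Names}} {{.Status}}'` so that each line starts with the container name followed by its status. Only the container names, URLs, and the migrate command come from the checklist.

```shell
# Reads `docker ps --format '{{.Names}} {{.Status}}'` output on stdin.
# Succeeds only when all three containers from the checklist report "Up".
check_containers() (
  ps_output=$(cat)
  rc=0
  for name in cmbot-tools cmbot-bot cmbot-web; do
    # Exact match on the name in column 1, "Up" in column 2.
    if printf '%s\n' "$ps_output" |
      awk -v n="$name" '$1 == n && $2 == "Up" { found = 1 } END { exit !found }'
    then
      echo "ok   $name"
    else
      echo "FAIL $name"
      rc=1
    fi
  done
  exit "$rc"
)

# Hypothetical wrapper: runs each pre-flight probe and stops at the first failure.
# Usage: preflight
preflight() {
  docker ps --format '{{.Names}} {{.Status}}' | check_containers &&
    NO_SUDO=1 scripts/db.sh migrate &&
    curl -sf -o /dev/null http://localhost:9000/api/health &&
    curl -sf -o /dev/null http://localhost:8081/health &&
    echo "pre-flight OK"
}
```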
---
## 1. Auth bootstrap
- [ ] `scripts/db.sh seed` (idempotent — only inserts the `admin`
operator if missing).
- [ ] `echo 'change-me-now' | scripts/set-password.sh admin` → "Password
updated."
- [ ] Open `http://localhost:9000/login` → enter `admin` / the password
→ redirected to `/`.
- [ ] **Wrong password three times in a row** → rate-limited with the
generic "Too many attempts" message; nothing reveals which limit
(IP / username / global) tripped.
- [ ] Open `/admin` while signed out → redirected to `/login` with
`?next=/admin`. After a successful login, you land back on `/admin`.
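
The signed-out redirect can also be checked from the command line. This is a sketch under assumptions: `is_login_redirect` is a hypothetical helper, and it accepts the `next` value either raw (`/admin`) or percent-encoded (`%2Fadmin`), since the runbook does not say which form the app emits.

```shell
# Succeeds when a redirect target points at /login and carries the original
# path in ?next=, with slashes either raw or percent-encoded.
is_login_redirect() (
  location=$1
  target=$2
  # Decode %2F / %2f so both encodings compare equal to the raw path.
  normalised=$(printf '%s' "$location" | sed 's,%2[Ff],/,g')
  case "$normalised" in
    */login\?next="$target") exit 0 ;;
    *) exit 1 ;;
  esac
)

# Usage (run without a session cookie; %{redirect_url} is curl's write-out
# variable for the URL a redirect would lead to when -L is not given):
#   location=$(curl -s -o /dev/null -w '%{redirect_url}' http://localhost:9000/admin)
#   is_login_redirect "$location" /admin && echo "redirect OK"
```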
---
## 2. User management (admin-only)
- [ ] **Sidebar / drawer**: only one nav entry highlights at a time.
On `/settings/users`, only `Admin` lights up; `Settings` does
not.
- [ ] `/settings/users` → Add user → username `alice`, password
`alpha7!`, role `user` → "User created."
- [ ] `alice` row shows: username + `you` chip if applicable, role
pill, Promote / Reset / Delete buttons on row 2.
- [ ] Promote `alice` to admin → page revalidates, badge flips to
`admin`.
- [ ] Demote back to `user`.
- [ ] **Last-admin guard**: Demote / Delete on the only remaining
admin row are both disabled.
- [ ] Delete `alice` via the confirm dialog (Cancel + Delete user
buttons; **no third "Close" button** — the static guard test
catches that regression but eyeball it anyway).
---
## 3. Account pairing
- [ ] `/accounts` → New Account → label `WaBot Test` → Pair WhatsApp.
Land on the live QR page within ~2 s.
- [ ] Login screen header is JUST the centered brand mark — no nav,
no menu drawer.
- [ ] Scan with WhatsApp → "Linked Devices" → "Link a device".
- [ ] **Connection success.** Page transitions through `qr` → (brief
`restart-required` close handled silently) → `connected` with
a green check and `+60xxx` phone number → auto-redirect to
`/accounts/<id>` after 3 s.
- [ ] **Refresh Groups** button on `/accounts/<id>/groups` → spinner
during the sync, page auto-refreshes when the bot pushes
`groups.synced` over SSE. No manual reload needed.
### Pair regression checks (these caught real bugs)
- [ ] **Back → Re-pair**: from a live QR, click ← Back → Pair again
from the account detail page. Should NOT instantly flash
"Pairing timed out". A new QR appears and the countdown
restarts at 5:00.
- [ ] **Duplicate phone**: with one phone already paired, scan its QR
from a *second* account row → see the amber "Phone already
linked" panel naming the existing account. The original
account's session stays intact.
---
## 4. Reminder lifecycle
- [ ] `/reminders` → New Reminder → walk the wizard:
- Step 1: pick `WaBot Test`.
- Step 2: enter a short text message (`smoke test <timestamp>`).
- Step 3: pick `Daily` recurrence, fire ~2 minutes from now.
Confirm "Pause sending by" checkbox is **unchecked by default**.
- Step 4: select 1 group.
- Step 5: review → Save.
- [ ] Reminder appears on `/reminders` with status `Active`.
Recurrence column shows the human-readable description; long
descriptions truncate with `…`.
- [ ] **Wait for the fire window.** When the time hits, the message
lands in the WhatsApp group **exactly once**.
- [ ] `/activity` → the run shows under `Success`. Default tab is
Success (no `All` tab).
- [ ] Swipe-left a row → Delete shelf appears. Swipe-right → Pause /
Restart shelf. Tapping a row navigates to its detail; dragging
does NOT navigate (6-px threshold).
- [ ] Pause the reminder → status flips to `Paused` immediately and
the next-fire-time disappears.
- [ ] Restart → fires on the next scheduled occurrence.
### Reminder regression checks
- [ ] **Triple-fire repro** (only if you have a tame group): trigger
several edits of the reminder in rapid succession (e.g. hammer the
wizard Save button three times). The message must land **exactly
once**. The bot logs should show "duplicate fire detected inside
mutex" warnings on the second and third attempts.
- [ ] **Reschedule under existing job**: edit a recurring reminder's
schedule to a NEW time before its next-fire arrives. The new
time must fire (the old `created` job is now `cancelled` in
`pgboss.job`; verify with `select state, count(*) from
pgboss.job where name='reminder.fire' group by state`).
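
The state query in the reschedule check can be summarised mechanically. The sketch below is an illustration only: `summarise_job_states` is a hypothetical helper, the `psql` invocation and `$DATABASE_URL` are assumptions about how the database is reached, and "at least one `created` and one `cancelled` row" is one reading of the check above. Old cancelled jobs from earlier runs would also satisfy it, so treat a pass as a hint rather than proof.

```shell
# Reads "state count" rows on stdin, for example from (connection is hypothetical):
#   psql "$DATABASE_URL" -At -F ' ' \
#     -c "select state, count(*) from pgboss.job where name='reminder.fire' group by state" |
#     summarise_job_states
# Prints the two counts that matter and succeeds when both are at least 1.
summarise_job_states() (
  awk '
    { count[$1] = $2 }
    END {
      printf "created=%d cancelled=%d\n", count["created"], count["cancelled"]
      exit !(count["created"] >= 1 && count["cancelled"] >= 1)
    }'
)
```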
---
## 5. Account lifecycle
- [ ] **Unpair** the account from `/accounts/<id>`. Confirm dialog
(Cancel + Yes, unpair). The account row stays in the list with
"Unpaired" status; groups disappear from the picker (they're
soft-archived, not deleted).
- [ ] **Re-pair** the same account → groups come back via the
on-conflict upsert flipping `is_archived` back to false.
- [ ] **Delete** the account from `/accounts/<id>` → Confirm dialog →
the account vanishes from `/accounts`. Check on the *phone*'s
WhatsApp Linked Devices list — the entry is gone (the
logout-before-stop flow tells WhatsApp to drop it).
---
## 6. Sign-out + session lifetime
- [ ] **Sign out** from the sidebar / drawer footer → land on `/login`.
- [ ] Hit any protected URL → redirected to login.
- [ ] **Token-version kill switch**: set `OPERATOR_TOKEN_VERSION=2`
in `.env.development`, restart the web container. Every
previously-issued cookie is now invalid; every authenticated
request bounces to `/login`. Reset to `1` after.
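
Flipping the token version by hand is easy to get wrong, so here is a small sketch for it. `set_token_version` is a hypothetical helper, not a repository script. It rewrites the `OPERATOR_TOKEN_VERSION` line in the given env file, or appends one if the key is absent (assuming the file ends with a newline). Restarting the web container afterwards is left to whatever the stack normally uses.

```shell
# Sets OPERATOR_TOKEN_VERSION in an env file, replacing the existing line
# or appending a new one when the key is missing.
# Usage: set_token_version .env.development 2   # then restart web; reset to 1 after
set_token_version() (
  env_file=$1
  version=$2
  tmp=$(mktemp) || exit 1
  if grep -q '^OPERATOR_TOKEN_VERSION=' "$env_file"; then
    sed "s/^OPERATOR_TOKEN_VERSION=.*/OPERATOR_TOKEN_VERSION=$version/" "$env_file" > "$tmp"
  else
    cat "$env_file" > "$tmp"
    echo "OPERATOR_TOKEN_VERSION=$version" >> "$tmp"
  fi
  # Write back through the original path so its permissions are preserved.
  cat "$tmp" > "$env_file"
  rm -f "$tmp"
)
```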
---
## 7. Cross-tenant isolation
- [ ] Sign in as `admin`. Note dashboard counter values.
- [ ] As admin, create a second user `bob` and give them a fresh
account / reminder / fire it once.
- [ ] Sign out, sign in as `bob`. Dashboard counters MUST show only
bob's numbers (not admin's). `/reminders` lists only bob's
reminders. `/accounts` only bob's accounts.
---
## 8. Sweep
- [ ] `docker logs cmbot-web --since 10m | grep -iE 'error'` → no
output (or only Baileys "Stream Errored (restart required)"
noise; that's upstream).
- [ ] `docker logs cmbot-bot --since 10m | grep -iE 'error|fatal'` → no
output beyond the same Baileys upstream noise.
- [ ] `git status` clean (no leftover `_check.ts` or temp files).
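
To avoid eyeballing the Baileys noise, the log sweep can filter it out. This is a sketch: `sweep_log` is a hypothetical helper that applies the `error|fatal` pattern from the bot check to both containers and drops only the exact "Stream Errored (restart required)" line quoted above. Note that `docker logs` replays the container's stderr on its own stderr, so the usage lines merge the two streams before filtering.

```shell
# Reads log lines on stdin, prints error/fatal lines that are NOT the known
# upstream Baileys noise, and succeeds only when nothing is left to print.
sweep_log() {
  if grep -iE 'error|fatal' | grep -vF 'Stream Errored (restart required)'; then
    return 1
  fi
  return 0
}

# Usage:
#   docker logs cmbot-web --since 10m 2>&1 | sweep_log && echo "web clean"
#   docker logs cmbot-bot --since 10m 2>&1 | sweep_log && echo "bot clean"
```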
---
## When a step fails
- **Migration refused** with "Refusing to run drizzle migrate":
open `packages/db/migrations/meta/_journal.json` and bump the
flagged entry's `when` to the suggested value. Re-run.
- **Pair shows immediate timeout**: bot logs should mention "ignoring
close from previous attempt while warming up" — that's the fix
working, but check a stale Baileys session isn't gummed up. Last
resort: `rm -rf dev-data/sessions/<accountId>` and re-pair.
- **Reminder fires twice**: check `pgboss.queue.policy` for
`reminder.fire` — must be `standard`, not `stately` (stately drops
reschedules silently). The `registerReminderJobs` boot hook
force-flips this on every bot start.
- **Delete didn't remove the linked-device entry on the phone**:
the bot's `socket.logout()` is best-effort — if the socket was
already disconnected when delete fired, the operator removes the
entry manually from WhatsApp's UI.

If any of the regression checks (Back→Re-pair, duplicate phone,
triple-fire, reschedule) fails, that's a real bug: capture the bot
log and file an issue before shipping.
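
For the "Migration refused" case, it can help to locate the offending journal entry before editing it. The sketch below is not the guard itself: `find_journal_regression` is a hypothetical helper, and it assumes "monotonic" means each entry's `when` must be strictly greater than the previous one and that entries appear in `idx` order starting at 0. The value to bump to is still the one the guard suggests.

```shell
# Scans a drizzle _journal.json for `when` values that do not increase.
# Prints one line per offending entry and fails if any are found.
find_journal_regression() (
  journal=$1
  grep -oE '"when"[[:space:]]*:[[:space:]]*[0-9]+' "$journal" |
    grep -oE '[0-9]+$' |
    awk '
      NR > 1 && $1 + 0 <= prev {
        printf "idx %d: when=%s is not after %s\n", NR - 1, $1, prev_text
        bad = 1
      }
      { prev = $1 + 0; prev_text = $1 }
      END { exit bad + 0 }
    '
)
```

Run it as `find_journal_regression packages/db/migrations/meta/_journal.json`; no output and a zero exit status mean the sketch found nothing out of order.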