yiekheng c906a9fa3a docs: refresh README + add docs/runbook.md for v1 sign-off
- README rewritten to reflect v1 reality: auth bootstrap, AES-GCM
  cookies, three-layer rate limit, duplicate-pair detection,
  logout-before-delete, journal-monotonic guard, the new test
  counts (482 web + 88 bot), and the right scripts (set-password,
  create-user). Drops the telegram-era 'Status' paragraph and the
  earlier 'Auth deferred' bullet.
- docs/runbook.md is a new manual end-to-end smoke checklist
  organised by section: pre-flight, auth bootstrap, user
  management, account pairing (incl. back→re-pair + duplicate-phone
  regression checks), reminder lifecycle (incl. triple-fire +
  reschedule regression checks), account lifecycle, sign-out +
  token-version kill, cross-tenant isolation, log sweep, plus a
  troubleshooting cheatsheet.

Closes P3/T23 + P3/T24.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:45:03 +08:00


Manual end-to-end runbook (v1)

Smoke checklist for verifying a fresh deploy. Unit tests don't catch the live-Baileys / live-Postgres / browser-gesture path; this is what you run before declaring a release good.

Time budget: ~10 minutes if everything works, ~30 if a step fails.


Pre-flight

  • Stack up. docker ps | grep cmbot → expect cmbot-tools, cmbot-bot, cmbot-web all Up.
  • Migrations clean. NO_SUDO=1 scripts/db.sh migrate → "Migrations applied." (and not "Refusing to run drizzle migrate" — that's the journal monotonicity guard tripping).
  • Web reachable. curl -sf http://localhost:9000/api/health → 200.
  • Bot reachable. curl -sf http://localhost:8081/health → 200.

If any pre-flight fails, fix before continuing.
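
The journal monotonicity guard itself is not reproduced in this runbook. As a rough sketch of what such a check might look like, assuming the journal holds an ordered list of entries that each carry a `tag` and a millisecond `when` timestamp (the `JournalEntry` shape, `findNonMonotonic`, and the "previous + 1" suggestion are hypothetical, not taken from scripts/db.sh):

```typescript
// Hypothetical shape of one journal entry; only `tag` and `when` matter here.
interface JournalEntry {
  idx: number;
  tag: string;
  when: number; // millisecond timestamp
}

interface Violation {
  tag: string;
  when: number;
  suggestedWhen: number;
}

// Returns every entry whose `when` is not strictly greater than the
// effective `when` of the entry before it.
function findNonMonotonic(entries: JournalEntry[]): Violation[] {
  const violations: Violation[] = [];
  let previous = Number.NEGATIVE_INFINITY;
  for (let i = 0; i < entries.length; i++) {
    const entry = entries[i];
    if (entry.when <= previous) {
      // Assumption: the smallest valid value is suggested as the fix.
      const suggestedWhen = previous + 1;
      violations.push({ tag: entry.tag, when: entry.when, suggestedWhen });
      previous = suggestedWhen; // continue as if the entry had been fixed
    } else {
      previous = entry.when;
    }
  }
  return violations;
}
```

An empty result would correspond to the "Migrations applied." path; any violation would correspond to the refusal.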


1. Auth bootstrap

  • scripts/db.sh seed (idempotent — only inserts the admin operator if missing).
  • echo 'change-me-now' | scripts/set-password.sh admin → "Password updated."
  • Open http://localhost:9000/login → enter admin / the password → redirected to /.
  • Enter a wrong password three times in a row → the login still rate-limits, but only with the generic "Too many attempts" message; nothing reveals which limit (IP / username / global) tripped.
  • Hit /admin URL while signed out → redirected to /login with ?next=/admin. After a successful login, lands back on /admin.
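
The "no leak" property of the three-layer rate limit can be illustrated with a small sketch. The thresholds, the `Counts` shape, and both function names are assumptions for illustration; only the three layer names and the generic message come from this runbook.

```typescript
type Layer = "ip" | "username" | "global";
type Counts = Record<Layer, number>;

// Hypothetical thresholds; the real values are not stated in this runbook.
const LIMITS: Counts = { ip: 20, username: 3, global: 200 };

// Internal helper: which layer tripped first, if any. Suitable for server
// logs only, never for the response body.
function trippedLayer(failures: Counts, limits: Counts = LIMITS): Layer | null {
  const order: Layer[] = ["ip", "username", "global"];
  for (let i = 0; i < order.length; i++) {
    const layer = order[i];
    if (failures[layer] >= limits[layer]) return layer;
  }
  return null;
}

interface LoginRejection {
  status: number;
  message: string;
}

// The rejection is identical no matter which layer tripped.
function rateLimitResponse(failures: Counts, limits: Counts = LIMITS): LoginRejection | null {
  if (trippedLayer(failures, limits) === null) return null;
  return { status: 429, message: "Too many attempts" };
}
```

Because the response is built without reference to the tripped layer, a client cannot distinguish an IP limit from a username limit by inspecting it.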

2. User management (admin-only)

  • Sidebar / drawer: only one nav entry highlights at a time. On /settings/users, only Admin lights up; Settings does not.
  • /settings/users → Add user → username alice, password alpha7!, role user → "User created."
  • alice row shows: username + you chip if applicable, role pill, Promote / Reset / Delete buttons on row 2.
  • Promote alice to admin → page revalidates, badge flips to admin.
  • Demote back to user.
  • Last-admin guard: Demote / Delete on the only remaining admin row are both disabled.
  • Delete alice via the confirm dialog (Cancel + Delete user buttons; no third "Close" button — the static guard test catches that regression but eyeball it anyway).
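
The last-admin guard boils down to one predicate. A minimal sketch, assuming a `UserRow` shape with a two-value role (the type and function names are hypothetical):

```typescript
type Role = "admin" | "user";

interface UserRow {
  id: string;
  username: string;
  role: Role;
}

// True when `targetId` is an admin and no other admin exists.
function isLastAdmin(users: UserRow[], targetId: string): boolean {
  const admins = users.filter((u) => u.role === "admin");
  return admins.length === 1 && admins[0].id === targetId;
}

// Which destructive buttons should be disabled for a given row.
function rowActions(users: UserRow[], targetId: string): { demoteDisabled: boolean; deleteDisabled: boolean } {
  const locked = isLastAdmin(users, targetId);
  return { demoteDisabled: locked, deleteDisabled: locked };
}
```

A UI-only check like this would normally be paired with the same check on the server, since a disabled button does not stop a hand-crafted request.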

3. Account pairing

  • /accounts → New Account → label WaBot Test → Pair WhatsApp. Land on the live QR page within ~2 s.
  • Login screen header is JUST the centered brand mark — no nav, no menu drawer.
  • Scan with WhatsApp → "Linked Devices" → "Link a device".
  • Connection success. Page transitions through qr → (brief restart-required close handled silently) → connected with a green check and +60xxx phone number → auto-redirect to /accounts/<id> after 3 s.
  • Refresh Groups button on /accounts/<id>/groups → spinner during the sync, page auto-refreshes when the bot pushes groups.synced over SSE. No manual reload needed.

Pair regression checks (these caught real bugs)

  • Back → Re-pair: from a live QR, click ← Back → Pair again from the account detail page. Should NOT instantly flash "Pairing timed out". A new QR appears and the countdown restarts at 5:00.
  • Duplicate phone: with one phone already paired, scan its QR from a second account row → see the amber "Phone already linked" panel naming the existing account. The original account's session stays intact.
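
One plausible way to keep a late close event from the abandoned QR attempt from failing the new one is to tag every attempt with an id and ignore events from older ids. The sketch below shows that idea under assumptions; `PairingAttempts` and its methods are hypothetical and not the bot's actual code.

```typescript
type CloseOutcome = "handled" | "ignored-stale";

class PairingAttempts {
  private current = 0;

  // Called when the operator clicks Pair; returns the id of the new attempt.
  start(): number {
    this.current += 1;
    return this.current;
  }

  // Called when a socket closes. `attemptId` is the attempt the socket
  // belonged to; closes from anything but the latest attempt are dropped.
  onClose(attemptId: number): CloseOutcome {
    return attemptId === this.current ? "handled" : "ignored-stale";
  }
}

// Usage: Back then Re-pair produces two attempts; only the second counts.
const attempts = new PairingAttempts();
const firstAttempt = attempts.start();
const secondAttempt = attempts.start();
const staleOutcome = attempts.onClose(firstAttempt);
const liveOutcome = attempts.onClose(secondAttempt);
```

Here `staleOutcome` is "ignored-stale" and `liveOutcome` is "handled", which is the behaviour the Back → Re-pair check looks for from the outside.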

4. Reminder lifecycle

  • /reminders → New Reminder → walk the wizard:
      - Step 1: pick WaBot Test.
      - Step 2: enter a short text message ("smoke test <timestamp>").
      - Step 3: pick Daily recurrence, fire ~2 minutes from now. Confirm "Pause sending by" checkbox is unchecked by default.
      - Step 4: select 1 group.
      - Step 5: review → Save.
  • Reminder appears on /reminders with status Active. Recurrence column shows the human-readable description; long descriptions truncate with an ellipsis (…).
  • Wait for the fire window. When the time hits, the message lands in the WhatsApp group exactly once.
  • /activity → the run shows under Success. Default tab is Success (no All tab).
  • Swipe-left a row → Delete shelf appears. Swipe-right → Pause / Restart shelf. Tapping a row navigates to its detail; dragging does NOT navigate (6-px threshold).
  • Pause the reminder → status flips to Paused immediately and the next-fire-time disappears.
  • Restart → fires on the next scheduled occurrence.
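
The tap-versus-drag rule can be expressed as a distance check. This is a sketch under assumptions: the runbook only states a 6-px threshold, so the use of straight-line distance and the exclusive boundary are guesses, and `isTap` is a hypothetical name.

```typescript
interface Point {
  x: number;
  y: number;
}

const DRAG_THRESHOLD_PX = 6;

// A pointer-up counts as a tap only if the pointer moved less than the
// threshold since pointer-down; anything further is treated as a drag.
function isTap(down: Point, up: Point, threshold: number = DRAG_THRESHOLD_PX): boolean {
  const dx = up.x - down.x;
  const dy = up.y - down.y;
  return Math.sqrt(dx * dx + dy * dy) < threshold;
}
```

With this rule, a row navigates to its detail page only when `isTap` returns true, so a swipe that reveals a shelf never triggers navigation.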

Reminder regression checks

  • Triple-fire repro (only if you have a tame group): edit the reminder repeatedly within milliseconds of each other (e.g. the wizard Save button hammered three times). The message must land exactly once. The bot logs should show "duplicate fire detected inside mutex" warnings on the second and third attempts.
  • Reschedule under existing job: edit a recurring reminder's schedule to a NEW time before its next fire arrives. The reminder must fire at the new time, and the old job (previously in state created) must now be cancelled in pgboss.job; verify with select state, count(*) from pgboss.job where name='reminder.fire' group by state.
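
The triple-fire check exercises a "dedupe inside a mutex" pattern. The sketch below illustrates that pattern in isolation; the `Mutex` and `ReminderFirer` classes, the key format, and the in-memory bookkeeping are all assumptions. Only the warning text is taken from this runbook.

```typescript
// Minimal promise-chain mutex: each task starts after the previous one settles.
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain usable even if a task rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

class ReminderFirer {
  readonly warnings: string[] = [];
  private readonly mutex = new Mutex();
  private readonly fired: Record<string, boolean> = {};
  private readonly send: (reminderId: string) => Promise<void>;

  constructor(send: (reminderId: string) => Promise<void>) {
    this.send = send;
  }

  // Resolves to true if the message was sent, false if it was a duplicate.
  fire(reminderId: string, scheduledFor: string): Promise<boolean> {
    const key = `${reminderId}@${scheduledFor}`;
    return this.mutex.run(async () => {
      // The check and the claim happen under the lock, so concurrent
      // callers cannot both see "not fired yet".
      if (this.fired[key]) {
        this.warnings.push(`duplicate fire detected inside mutex: ${key}`);
        return false;
      }
      this.fired[key] = true;
      await this.send(reminderId);
      return true;
    });
  }
}
```

Three concurrent `fire` calls for the same occurrence would produce one send and two warnings, matching what the regression check expects to see in the group and in the bot log. A multi-process deployment would need the claim to live in shared storage rather than in memory.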

5. Account lifecycle

  • Unpair the account from /accounts/<id>. Confirm dialog (Cancel + Yes, unpair). The account row stays in the list with "Unpaired" status; groups disappear from the picker (they're soft-archived, not deleted).
  • Re-pair the same account → groups come back via the on-conflict upsert flipping is_archived back to false.
  • Delete the account from /accounts/<id> → Confirm dialog → the account vanishes from /accounts. Check on the phone's WhatsApp Linked Devices list — the entry is gone (the logout-before-stop flow tells WhatsApp to drop it).
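
The unpair and re-pair behaviour of groups (soft-archive, then an upsert that clears the archive flag) can be modelled in memory. This is an illustrative model only: `GroupStore`, its fields, and the use of a group `jid` as the conflict key are assumptions, and the real implementation is a database upsert.

```typescript
interface GroupRow {
  accountId: string;
  jid: string;
  name: string;
  isArchived: boolean;
}

class GroupStore {
  private rows: GroupRow[] = [];

  // Models "insert ... on conflict do update": a matching row is updated
  // and un-archived, otherwise a new row is inserted.
  upsert(accountId: string, jid: string, name: string): void {
    const existing = this.rows.filter((r) => r.accountId === accountId && r.jid === jid)[0];
    if (existing) {
      existing.name = name;
      existing.isArchived = false;
    } else {
      this.rows.push({ accountId, jid, name, isArchived: false });
    }
  }

  // Unpair: rows are kept but hidden from the picker.
  archiveAll(accountId: string): void {
    this.rows.forEach((r) => {
      if (r.accountId === accountId) r.isArchived = true;
    });
  }

  pickerGroups(accountId: string): GroupRow[] {
    return this.rows.filter((r) => r.accountId === accountId && !r.isArchived);
  }

  totalRows(): number {
    return this.rows.length;
  }
}
```

The point of the model is that re-pairing reuses the archived rows instead of creating duplicates, so anything that references a group row would keep pointing at the same row across an unpair and re-pair cycle.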

6. Sign-out + session lifetime

  • Sign out from the sidebar / drawer footer → land on /login.
  • Hit any protected URL → redirected to login.
  • Token-version kill switch: set OPERATOR_TOKEN_VERSION=2 in .env.development, restart the web container. Every previously-issued cookie is now invalid; every authenticated request bounces to /login. Reset to 1 after.
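
The kill switch works because each session carries the version it was issued under. A minimal sketch of the comparison, assuming the decrypted cookie exposes a `tokenVersion` field and that an unset variable falls back to 1 (both assumptions; the function names are hypothetical):

```typescript
interface SessionPayload {
  username: string;
  tokenVersion: number;
}

// Parses the raw OPERATOR_TOKEN_VERSION value. The fallback of 1 for an
// unset variable is an assumption.
function parseTokenVersion(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return 1;
  const value = parseInt(raw, 10);
  if (isNaN(value) || value < 1) {
    throw new Error(`Invalid OPERATOR_TOKEN_VERSION: ${raw}`);
  }
  return value;
}

// A session is honoured only if it was issued under the current version.
function isSessionValid(payload: SessionPayload, currentVersion: number): boolean {
  return payload.tokenVersion === currentVersion;
}
```

Bumping the version from 1 to 2 makes `isSessionValid` false for every cookie issued under 1, which is why all existing sessions bounce to /login at once.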

7. Cross-tenant isolation

  • Sign in as admin. Note dashboard counter values.
  • As admin, create a second user bob, then give bob a fresh account and reminder and fire the reminder once.
  • Sign out, sign in as bob. Dashboard counters MUST show only bob's numbers (not admin's). /reminders lists only bob's reminders. /accounts only bob's accounts.
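
What this section verifies is that every query is scoped to the signed-in user. As a sketch of the invariant, assuming rows carry an `ownerId` column (the column name and both helpers are hypothetical):

```typescript
interface Owned {
  ownerId: string;
}

// Returns only the rows that belong to `userId`.
function scopeToOwner<T extends Owned>(rows: T[], userId: string): T[] {
  return rows.filter((row) => row.ownerId === userId);
}

// Dashboard counters computed from already-scoped data.
function dashboardCounters(
  userId: string,
  accounts: Owned[],
  reminders: Owned[],
): { accounts: number; reminders: number } {
  return {
    accounts: scopeToOwner(accounts, userId).length,
    reminders: scopeToOwner(reminders, userId).length,
  };
}
```

In a real deployment the filter would live in the SQL `where` clause rather than in application memory, so another tenant's rows never leave the database.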

8. Sweep

  • docker logs cmbot-web --since 10m | grep -iE 'error' → no output (or only Baileys "Stream Errored (restart required)" noise; that's upstream).
  • docker logs cmbot-bot --since 10m | grep -iE 'error|fatal' — no output beyond the same Baileys upstream noise.
  • git status clean (no leftover _check.ts or temp files).
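
If the log sweep is ever automated, the filtering rule could look like the sketch below: keep lines that match the error pattern, then drop the known upstream Baileys noise. The function name is hypothetical, and the pattern mirrors the bot sweep above.

```typescript
const PROBLEM_PATTERN = /error|fatal/i;
const KNOWN_NOISE = "Stream Errored (restart required)";

// Returns the log lines an operator still needs to look at.
function suspiciousLines(log: string): string[] {
  return log
    .split("\n")
    .filter((line) => PROBLEM_PATTERN.test(line))
    .filter((line) => line.indexOf(KNOWN_NOISE) === -1);
}
```

An empty array corresponds to a clean sweep; anything else is worth reading before sign-off.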

When a step fails

  • Migration refused with "Refusing to run drizzle migrate": open packages/db/migrations/meta/_journal.json and bump the flagged entry's when to the suggested value. Re-run.
  • Pair shows immediate timeout: bot logs should mention "ignoring close from previous attempt while warming up" — that's the fix working, but check a stale Baileys session isn't gummed up. Last resort: rm -rf dev-data/sessions/<accountId> and re-pair.
  • Reminder fires twice: check pgboss.queue.policy for reminder.fire — must be standard, not stately (stately drops reschedules silently). The registerReminderJobs boot hook force-flips this on every bot start.
  • Delete didn't remove the linked-device entry on the phone: the bot's socket.logout() is best-effort — if the socket was already disconnected when delete fired, the operator removes the entry manually from WhatsApp's UI.

If any of the regression checks (Back→Re-pair, duplicate phone, triple-fire, reschedule) fail, that's a real bug — capture the bot log and file an issue before shipping.