- README rewritten to reflect v1 reality: auth bootstrap, AES-GCM cookies, three-layer rate limit, duplicate-pair detection, logout-before-delete, journal-monotonic guard, the new test counts (482 web + 88 bot), and the right scripts (set-password, create-user). Drops the telegram-era 'Status' paragraph and the earlier 'Auth deferred' bullet.
- docs/runbook.md is a new manual end-to-end smoke checklist organised by section: pre-flight, auth bootstrap, user management, account pairing (incl. back→re-pair + duplicate-phone regression checks), reminder lifecycle (incl. triple-fire + reschedule regression checks), account lifecycle, sign-out + token-version kill, cross-tenant isolation, log sweep, plus a troubleshooting cheatsheet.

Closes P3/T23 + P3/T24.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
201 lines
8.2 KiB
Markdown
# Manual end-to-end runbook (v1)

Smoke checklist for verifying a fresh deploy. Unit tests don't catch
the live-Baileys / live-Postgres / browser-gesture path; this is what
you run before declaring a release good.

Time budget: ~10 minutes if everything works, ~30 if a step fails.

---

## Pre-flight

- [ ] **Stack up.**
      `docker ps | grep cmbot` → expect `cmbot-tools`, `cmbot-bot`,
      `cmbot-web` all `Up`.
- [ ] **Migrations clean.**
      `NO_SUDO=1 scripts/db.sh migrate` → "Migrations applied." (and
      *not* "Refusing to run drizzle migrate"; that message means the
      journal monotonicity guard tripped).
- [ ] **Web reachable.**
      `curl -sf http://localhost:9000/api/health` → 200.
- [ ] **Bot reachable.**
      `curl -sf http://localhost:8081/health` → 200.

If any pre-flight fails, fix before continuing.
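
The four pre-flight checks can be chained into one script. What follows is a minimal sketch in POSIX sh, assuming the container names and URLs listed above. The function names `missing_containers` and `preflight` are hypothetical, and the migrate step relies only on the two messages quoted in the checklist.

```sh
#!/bin/sh
# Pre-flight sketch. Function names are hypothetical; container names,
# URLs and messages are the ones quoted in the checklist.

# Reads `docker ps --format '{{.Names}} {{.Status}}'` output on stdin and
# prints every expected container that is not reported as "Up".
missing_containers() {
  input=$(cat)
  for name in cmbot-tools cmbot-bot cmbot-web; do
    printf '%s\n' "$input" | grep -q "^$name Up" || printf '%s\n' "$name"
  done
}

preflight() {
  missing=$(docker ps --format '{{.Names}} {{.Status}}' | missing_containers)
  [ -z "$missing" ] || { echo "not Up: $missing" >&2; return 1; }

  # Judge the migrate step by its output rather than by an assumed exit code.
  out=$(NO_SUDO=1 scripts/db.sh migrate 2>&1)
  case "$out" in
    *"Refusing to run drizzle migrate"*) echo "journal guard tripped" >&2; return 1 ;;
    *"Migrations applied."*) ;;
    *) echo "unexpected migrate output" >&2; return 1 ;;
  esac

  curl -sf http://localhost:9000/api/health >/dev/null \
    || { echo "web health check failed" >&2; return 1; }
  curl -sf http://localhost:8081/health >/dev/null \
    || { echo "bot health check failed" >&2; return 1; }
  echo "pre-flight OK"
}
```

Source the file and run `preflight` from the repository root. It stops at the first failing check, which matches the "fix before continuing" rule.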

---

## 1. Auth bootstrap

- [ ] `scripts/db.sh seed` (idempotent: only inserts the `admin`
      operator if missing).
- [ ] `echo 'change-me-now' | scripts/set-password.sh admin` → "Password
      updated."
- [ ] Open `http://localhost:9000/login` → enter `admin` / the password
      → redirected to `/`.
- [ ] **Wrong password three times in a row** trips the rate limit and
      shows only the generic "Too many attempts" message, with no leak
      about which limit (IP / username / global) tripped.
- [ ] Hit the `/admin` URL while signed out → redirected to `/login` with
      `?next=/admin`. After a successful login, you land back on `/admin`.
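
The signed-out redirect can also be checked from the command line. This is a sketch only: it assumes the server answers with an HTTP redirect (not a client-side navigation), and the helper names are hypothetical. The percent-encoded form of `next` is accepted because the exact encoding is not stated in this checklist.

```sh
#!/bin/sh
# Succeeds when a redirect target points at /login with next=/admin,
# accepting both the raw and the percent-encoded form of the path.
is_login_redirect_for_admin() {
  case "$1" in
    */login\?next=/admin | */login\?next=%2Fadmin | */login\?next=%2fadmin) return 0 ;;
    *) return 1 ;;
  esac
}

check_admin_redirect() {
  # No cookie jar is passed, so the request is signed out.
  # %{redirect_url} is the URL curl would follow if -L were given.
  target=$(curl -s -o /dev/null -w '%{redirect_url}' http://localhost:9000/admin)
  is_login_redirect_for_admin "$target" \
    || { echo "unexpected redirect: '$target'" >&2; return 1; }
}
```

This covers only the first half of the item. Landing back on `/admin` after login still needs the browser.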

---

## 2. User management (admin-only)

- [ ] **Sidebar / drawer**: only one nav entry highlights at a time.
      On `/settings/users`, only `Admin` lights up; `Settings` does
      not.
- [ ] `/settings/users` → Add user → username `alice`, password
      `alpha7!`, role `user` → "User created."
- [ ] `alice` row shows: username + `you` chip if applicable, role
      pill, Promote / Reset / Delete buttons on row 2.
- [ ] Promote `alice` to admin → page revalidates, badge flips to
      `admin`.
- [ ] Demote back to `user`.
- [ ] **Last-admin guard**: Demote / Delete on the only remaining
      admin row are both disabled.
- [ ] Delete `alice` via the confirm dialog (Cancel + Delete user
      buttons; **no third "Close" button**. The static guard test
      catches that regression, but eyeball it anyway).

---

## 3. Account pairing

- [ ] `/accounts` → New Account → label `WaBot Test` → Pair WhatsApp.
      Land on the live QR page within ~2 s.
- [ ] Login screen header is JUST the centered brand mark: no nav,
      no menu drawer.
- [ ] Scan with WhatsApp → "Linked Devices" → "Link a device".
- [ ] **Connection success.** Page transitions through `qr` → (brief
      `restart-required` close handled silently) → `connected` with
      a green check and `+60xxx` phone number → auto-redirect to
      `/accounts/<id>` after 3 s.
- [ ] **Refresh Groups** button on `/accounts/<id>/groups` → spinner
      during the sync, page auto-refreshes when the bot pushes
      `groups.synced` over SSE. No manual reload needed.

### Pair regression checks (these caught real bugs)

- [ ] **Back → Re-pair**: from a live QR, click ← Back → Pair again
      from the account detail page. It should NOT instantly flash
      "Pairing timed out". A new QR appears and the countdown
      restarts at 5:00.
- [ ] **Duplicate phone**: with one phone already paired, use that
      phone to scan the QR of a *second* account row → see the amber
      "Phone already linked" panel naming the existing account. The
      original account's session stays intact.

---

## 4. Reminder lifecycle

- [ ] `/reminders` → New Reminder → walk the wizard:
      - Step 1: pick `WaBot Test`.
      - Step 2: enter a short text message ("smoke test <timestamp>").
      - Step 3: pick `Daily` recurrence, fire ~2 minutes from now.
        Confirm "Pause sending by" checkbox is **unchecked by default**.
      - Step 4: select 1 group.
      - Step 5: review → Save.
- [ ] Reminder appears on `/reminders` with status `Active`.
      Recurrence column shows the human-readable description; long
      descriptions truncate with `…`.
- [ ] **Wait for the fire window.** When the time hits, the message
      lands in the WhatsApp group **exactly once**.
- [ ] `/activity` → the run shows under `Success`. Default tab is
      Success (no `All` tab).
- [ ] Swipe-left a row → Delete shelf appears. Swipe-right → Pause /
      Restart shelf. Tapping a row navigates to its detail; dragging
      does NOT navigate (6-px threshold).
- [ ] Pause the reminder → status flips to `Paused` immediately and
      the next-fire-time disappears.
- [ ] Restart → fires on the next scheduled occurrence.

### Reminder regression checks

- [ ] **Triple-fire repro** (only if you have a tame group): save
      the reminder several times in rapid succession (e.g. hammer
      the wizard Save button three times). The message must
      land **exactly once**. The bot logs should show
      "duplicate fire detected inside mutex" warnings on the second
      and third attempts.
- [ ] **Reschedule under existing job**: edit a recurring reminder's
      schedule to a NEW time before its next-fire arrives. The new
      time must fire, and the old `created` job must now be
      `cancelled` in `pgboss.job`. Verify with `select state, count(*)
      from pgboss.job where name='reminder.fire' group by state`.
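
The reschedule check can be scripted around the query quoted above. This is a sketch under assumptions: `DATABASE_URL` is a hypothetical variable pointing at the deploy's Postgres, and both function names are made up for illustration. It asserts only what the checklist states, namely that at least one `reminder.fire` job is `cancelled` after the edit.

```sh
#!/bin/sh
# Prints one "state count" line per job state for reminder.fire.
# DATABASE_URL is an assumption; substitute your own connection string.
reminder_job_states() {
  psql "$DATABASE_URL" -At -F ' ' -c \
    "select state, count(*) from pgboss.job where name='reminder.fire' group by state"
}

# Reads "state count" lines on stdin and succeeds only if at least one
# job is in the cancelled state.
has_cancelled_job() {
  awk '$1 == "cancelled" && $2 + 0 > 0 { found = 1 } END { exit !found }'
}
```

Run `reminder_job_states | has_cancelled_job` right after saving the new schedule. A non-zero exit means the old job was not cancelled.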

---

## 5. Account lifecycle

- [ ] **Unpair** the account from `/accounts/<id>`. Confirm dialog
      (Cancel + Yes, unpair). The account row stays in the list with
      "Unpaired" status; groups disappear from the picker (they're
      soft-archived, not deleted).
- [ ] **Re-pair** the same account → groups come back via the
      on-conflict upsert flipping `is_archived` back to false.
- [ ] **Delete** the account from `/accounts/<id>` → Confirm dialog →
      the account vanishes from `/accounts`. Check the *phone*'s
      WhatsApp Linked Devices list: the entry is gone (the
      logout-before-stop flow tells WhatsApp to drop it).

---

## 6. Sign-out + session lifetime

- [ ] **Sign out** from the sidebar / drawer footer → land on `/login`.
- [ ] Hit any protected URL → redirected to login.
- [ ] **Token-version kill switch**: set `OPERATOR_TOKEN_VERSION=2`
      in `.env.development`, restart the web container. Every
      previously-issued cookie is now invalid; every authenticated
      request bounces to `/login`. Reset to `1` after.
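
The kill-switch steps can be sketched in shell. Everything named here beyond `OPERATOR_TOKEN_VERSION`, `.env.development` and the web URL is an assumption: `set_token_version` and `old_cookie_rejected` are hypothetical helpers, and `cookies.txt` stands for a curl cookie jar saved from a login made before the bump. The restart itself is left to whatever your deploy uses.

```sh
#!/bin/sh
# Rewrites OPERATOR_TOKEN_VERSION in an env file. Any existing line for the
# key is dropped and the new value is appended at the end of the file.
set_token_version() {
  file=$1
  version=$2
  tmp=$(mktemp) || return 1
  grep -v '^OPERATOR_TOKEN_VERSION=' "$file" > "$tmp" || true
  printf 'OPERATOR_TOKEN_VERSION=%s\n' "$version" >> "$tmp"
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Succeeds when a request carrying the pre-bump cookie is sent to /login.
# cookies.txt is a hypothetical cookie jar from an earlier login.
old_cookie_rejected() {
  target=$(curl -s -o /dev/null -w '%{redirect_url}' -b cookies.txt http://localhost:9000/)
  case "$target" in
    */login*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A typical run is `set_token_version .env.development 2`, restart the web container, `old_cookie_rejected`, then `set_token_version .env.development 1` and restart again.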

---

## 7. Cross-tenant isolation

- [ ] Sign in as `admin`. Note dashboard counter values.
- [ ] As admin, create a second user `bob`, then set them up with a
      fresh account and reminder and fire it once.
- [ ] Sign out, sign in as `bob`. Dashboard counters MUST show only
      bob's numbers (not admin's). `/reminders` lists only bob's
      reminders. `/accounts` lists only bob's accounts.

---

## 8. Sweep

- [ ] `docker logs cmbot-web --since 10m 2>&1 | grep -iE 'error|⨯'` → no
      output (or only Baileys "Stream Errored (restart required)"
      noise; that's upstream).
- [ ] `docker logs cmbot-bot --since 10m 2>&1 | grep -iE 'error|fatal'` →
      no output beyond the same Baileys upstream noise.
- [ ] `git status` clean (no leftover `_check.ts` or temp files).
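
The sweep is easy to script so that the known Baileys noise is filtered out and anything left fails the run. A minimal sketch follows, with hypothetical function names. `docker logs` replays the container's stderr on its own stderr, so the streams are merged with `2>&1` before filtering.

```sh
#!/bin/sh
# Drops the upstream Baileys noise named in the checklist; anything that
# survives this filter deserves a look.
drop_known_noise() {
  grep -vF 'Stream Errored (restart required)' || true
}

sweep_logs() {
  web=$(docker logs cmbot-web --since 10m 2>&1 | grep -iE 'error|⨯' | drop_known_noise)
  bot=$(docker logs cmbot-bot --since 10m 2>&1 | grep -iE 'error|fatal' | drop_known_noise)
  if [ -n "$web$bot" ]; then
    printf '%s\n' "$web" "$bot" >&2
    return 1
  fi
  # --porcelain prints nothing when the working tree is clean.
  [ -z "$(git status --porcelain)" ] || { echo "working tree not clean" >&2; return 1; }
  echo "sweep OK"
}
```

Run `sweep_logs` from the repository root once the earlier sections are done.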

---

## When a step fails

- **Migration refused** with "Refusing to run drizzle migrate":
  open `packages/db/migrations/meta/_journal.json` and bump the
  flagged entry's `when` to the suggested value. Re-run.
- **Pair shows immediate timeout**: bot logs should mention "ignoring
  close from previous attempt while warming up". That's the fix
  working, but check that a stale Baileys session isn't gummed up.
  Last resort: `rm -rf dev-data/sessions/<accountId>` and re-pair.
- **Reminder fires twice**: check `pgboss.queue.policy` for
  `reminder.fire`. It must be `standard`, not `stately` (stately drops
  reschedules silently). The `registerReminderJobs` boot hook
  force-flips this on every bot start.
- **Delete didn't remove the linked-device entry on the phone**:
  the bot's `socket.logout()` is best-effort. If the socket was
  already disconnected when delete fired, remove the entry manually
  from WhatsApp's UI.
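
Two of these checks can be done from a shell. This is a sketch under assumptions: the journal is taken to be a JSON object with an `entries` array whose items carry `idx` and `when`, the guard is taken to require strictly increasing `when` values, `jq` is assumed to be installed, and `DATABASE_URL` is a hypothetical connection string. The function names are made up for illustration.

```sh
#!/bin/sh
# Reads "idx when" pairs on stdin and prints the idx of every entry whose
# `when` is not strictly greater than the previous entry's.
non_monotonic_entries() {
  awk 'NR > 1 && $2 + 0 <= prev + 0 { print $1 } { prev = $2 }'
}

# Lists journal entries that would trip a strictly-increasing guard.
check_journal() {
  jq -r '.entries[] | "\(.idx) \(.when)"' packages/db/migrations/meta/_journal.json \
    | non_monotonic_entries
}

# Prints the queue policy for reminder.fire; the runbook expects "standard".
queue_policy() {
  psql "$DATABASE_URL" -At -c \
    "select policy from pgboss.queue where name='reminder.fire'"
}
```

`check_journal` prints nothing when the journal is in order. The value suggested by the guard's own message remains the authoritative fix.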

If any of the regression checks (Back→Re-pair, duplicate phone,
triple-fire, reschedule) fails, that's a real bug. Capture the bot
log and file an issue before shipping.