docs: refresh README + add docs/runbook.md for v1 sign-off

- README rewritten to reflect v1 reality: auth bootstrap, AES-GCM cookies,
  three-layer rate limit, duplicate-pair detection, logout-before-delete,
  journal-monotonic guard, the new test counts (482 web + 88 bot), and the
  right scripts (set-password, create-user). Drops the telegram-era 'Status'
  paragraph and the earlier 'Auth deferred' bullet.
- docs/runbook.md is a new manual end-to-end smoke checklist organised by
  section: pre-flight, auth bootstrap, user management, account pairing
  (incl. back→re-pair + duplicate-phone regression checks), reminder
  lifecycle (incl. triple-fire + reschedule regression checks), account
  lifecycle, sign-out + token-version kill, cross-tenant isolation, log
  sweep, plus a troubleshooting cheatsheet.

Closes P3/T23 + P3/T24.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 47d7c53fda
commit c906a9fa3a

README.md (120 lines changed)
@@ -6,24 +6,36 @@ the run history all from a phone home-screen icon.

 ## Status

-**Plans 1, 2, and 3 complete.** The web app at `wabot.04080616.xyz` is
-the primary control surface; the Telegram bot has been removed.
+**v1 production-ready.** The web app at `wabot.04080616.xyz` is the
+primary control surface; the Telegram bot has been removed.

 What's working today:

+- **Username + password auth** with role-based access (admin / user).
+HttpOnly + Secure session cookies, encrypted with AES-256-GCM (so a
+leaked cookie reveals nothing about userId / role) and bound to the
+`OPERATOR_TOKEN_VERSION` env so a single env bump kills every
+outstanding session.
+- **Three-layer login rate limit** — per-IP + per-username (lower-cased
+so case-rotation doesn't help) + a global backstop, so a residential-
+proxy attacker can't brute one account by hopping IPs.
 - **Self-hosted Next.js 16 PWA** — installable on a phone home screen.
 Mobile-first single-row header with a slide-out drawer; desktop
-sidebar.
+sidebar. Login lives outside the shell on a bare-header surface.
 - **Live QR pairing** — server-side Baileys session feeds the QR
 payload directly into the browser via Server-Sent Events. Scan,
 see "✅ Connected" within seconds, auto-redirect.
+- **Duplicate-pair detection** — scanning a QR with a phone already
+linked to another account row surfaces a clear "already paired as
+<label>" message instead of fighting Baileys for the device.
 - **Multi-account, multi-group reminders** — 5-step wizard
 (Account → Message → When → Groups → Review) plus per-section edit
 pages so you don't have to walk the wizard end-to-end to fix one
-field. Active recurrence picker covers Daily / Weekly / Monthly /
-Yearly with multi-rule support and per-rule fire-time pickers; the
-rendered description reads as plain English ("Every week on Mon,
-Wed, Fri at 09:00") not raw cron.
+field. Recurrence picker covers Daily / Weekly / Monthly / Yearly
+with multi-rule support and per-rule fire-time pickers; the rendered
+description reads as plain English ("Every week on Mon, Wed, Fri at
+09:00") not raw cron. Optional "Pause sending by" deadline that
+defaults OFF — operators have to opt in explicitly.
 - **Multi-message stacks** — a reminder can carry multiple ordered
 parts (text + media), fired in sequence with a 1.5 s gap. Media
 files swap at any time from the Edit Message page.
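The auth bullet in this hunk says each session cookie is bound to `OPERATOR_TOKEN_VERSION`, so a single env bump kills every outstanding session. A minimal sketch of what that check might look like once the cookie has been decrypted is shown here. The payload shape and the names `SessionPayload`, `issueSession`, and `isSessionValid` are assumptions for illustration, not the project's actual code, and the AES-256-GCM step is deliberately left out.

```typescript
// Hypothetical shape of a session payload after the cookie has been decrypted.
// Field names are assumptions for illustration, not the project's schema.
interface SessionPayload {
  userId: string;
  role: "admin" | "user";
  tokenVersion: string; // copy of OPERATOR_TOKEN_VERSION at login time
  expiresAtMs: number; // absolute expiry, epoch milliseconds
}

// 30-day TTL, the figure the README's auth section states for sessions.
const SESSION_TTL_MS = 30 * 24 * 60 * 60 * 1000;

function issueSession(
  userId: string,
  role: "admin" | "user",
  currentTokenVersion: string,
  nowMs: number,
): SessionPayload {
  return {
    userId,
    role,
    tokenVersion: currentTokenVersion,
    expiresAtMs: nowMs + SESSION_TTL_MS,
  };
}

// A session is accepted only if it has not expired AND it was issued under
// the token version the server is currently configured with.
function isSessionValid(
  session: SessionPayload,
  currentTokenVersion: string,
  nowMs: number,
): boolean {
  if (nowMs >= session.expiresAtMs) return false;
  return session.tokenVersion === currentTokenVersion;
}

// Usage: a cookie issued under version "1" stops validating after a bump to "2".
const issuedAt = Date.UTC(2025, 0, 1);
const session = issueSession("u_1", "admin", "1", issuedAt);
const validBeforeBump = isSessionValid(session, "1", issuedAt + 1000);
const validAfterBump = isSessionValid(session, "2", issuedAt + 1000);
```

Because the version travels inside the encrypted payload, this style of kill switch needs no server-side session store; the trade-off is that it can only revoke all sessions at once, never a single one.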
@@ -33,19 +45,29 @@ What's working today:
 as a downloadable file instead of failing silently.
 - **Swipe-to-act rows** — on mobile, swipe a reminder or activity
 row left for Delete or right for Pause/Restart/Archive. iOS-Mail
-style.
+style. Click vs drag is disambiguated by a 6-px tap threshold so a
+swipe doesn't accidentally trigger the row's link.
 - **Activity tab** — last 200 runs with status filters (Success /
-Partial / Failed / Skipped) plus an Archived tab. Archive a noisy
-run to keep the main list readable; restore later. Hard-delete
-always available. Run history survives a reminder deletion.
+Paused / Failed / Archived). Partial runs surface under both Paused
+and Failed; Skipped runs collapse into Archived. Hard-delete and
+archive both available; run history survives a reminder deletion.
 - **Auto-reconnect on transient drops; restart-survival via Baileys
 session persistence.** Pair once, the device stays linked across
-container restarts.
-- **All actions audited.** Reminder run history queryable from the
-UI; per-run target results (sent / failed / skipped) preserved
-even when the underlying group is removed.
+container restarts. Logout-on-delete cleans the operator's
+linked-devices list on the WhatsApp side too.
+- **Hardened pg-boss scheduling** — three-tier dedupe so a triple-
+click Save or microsecond-spaced enqueue doesn't fire a reminder
+multiple times. Reschedule cancels stale jobs by singletonKey first
+so a recurring next-fire never gets silently dropped.
+- **Drizzle journal monotonicity guard** — `pnpm migrate` refuses to
+run if the `_journal.json` `when` timestamps aren't strictly
+increasing (a recurring foot-gun where drizzle would silently skip
+a freshly-generated migration). CI tests + the migrate runner both
+enforce.
+- **All actions audited.** Per-run target results (sent / failed /
+skipped) preserved even when the underlying group is removed.

-Test count: **249 web + 31 shared + 26 bot = 306** passing.
+Test count: **482 web + 88 bot = 570** passing.

 ## Host requirements

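The swipe-to-act bullet in this hunk mentions a 6-px tap threshold that separates a click from a drag. A rough sketch of that kind of classification follows. Only the 6-px figure comes from the README; the names `classifyGesture` and `shouldNavigate`, the use of straight-line distance, and the strict `<` comparison are assumptions, and the real component may measure horizontal travel only.

```typescript
// 6-px threshold taken from the README bullet; everything else is assumed.
const TAP_THRESHOLD_PX = 6;

interface Point {
  x: number;
  y: number;
}

type Gesture = "tap" | "drag";

// Classify a pointer interaction by how far it travelled between
// pointer-down and pointer-up. Movement below the threshold counts as a tap.
function classifyGesture(
  start: Point,
  end: Point,
  thresholdPx: number = TAP_THRESHOLD_PX,
): Gesture {
  const distance = Math.hypot(end.x - start.x, end.y - start.y);
  return distance < thresholdPx ? "tap" : "drag";
}

// Only a tap should follow the row's link; a drag opens the swipe shelf.
function shouldNavigate(start: Point, end: Point): boolean {
  return classifyGesture(start, end) === "tap";
}

// Usage: a 5-px wobble still counts as a tap, a 30-px swipe does not.
const wobble = classifyGesture({ x: 0, y: 0 }, { x: 3, y: 4 });
const swipe = shouldNavigate({ x: 10, y: 10 }, { x: 40, y: 12 });
```

A small dead zone like this tolerates the few pixels a finger typically drifts during a press, which is why a threshold of zero would make rows hard to tap on touch screens.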
@@ -79,24 +101,28 @@ Prerequisites: Docker, the `wabot` database + `waBot` role on
 # 1. Configure env
 cp envs/.env.example .env.development
 # edit .env.development: real DATABASE_URL, plus the LAN host to expose
-scripts/gen_auth_secret.sh --write
+scripts/gen_auth_secret.sh --write # writes AUTH_SECRET to .env.development

 # 2. Bring up the stack, install deps
 NO_SUDO=1 scripts/dev.sh up
 NO_SUDO=1 scripts/dev.sh pnpm install

-# 3. Apply migrations and seed your operator row
+# 3. Apply migrations and seed the bootstrap operator row
 NO_SUDO=1 scripts/db.sh migrate
 NO_SUDO=1 scripts/db.sh seed

-# 4. Open the web app
+# 4. Set the bootstrap admin password (NO password is set by seed)
+echo 'change-me-now' | scripts/set-password.sh admin
+
+# 5. Open the web app and sign in as `admin` with the password above
 # Local: http://localhost:9000
-# LAN: http://<host-ip>:9000 (e.g. http://192.168.0.253:9000)
-# Public: https://wabot.04080616.xyz (whatever your reverse proxy serves)
+# LAN: http://<host-ip>:9000
+# Public: https://wabot.04080616.xyz
 ```

-Pair an account: `/accounts` → "New Account" → enter a label →
-"Pair WhatsApp" → scan the QR with WhatsApp's "Linked Devices".
+Inside the app: `/settings/users` → Add user → invite teammates with
+`user` role; promote / demote / reset password / delete from the same
+page. The "Admin" nav entry is admin-only.

 PWA install: phone Chrome → menu → "Install App" / "Add to Home
 Screen". Launches fullscreen.
@@ -108,6 +134,9 @@ group (the default for this repo). Drop it if you need `sudo docker`.

 End-to-end checks that unit tests can't cover (live Baileys,
 WhatsApp delivery, swipe gestures):
+[`docs/runbook.md`](docs/runbook.md).
+
+The earlier wizard-only checklist still lives at
 [`docs/superpowers/specs/manual-test-web.md`](docs/superpowers/specs/manual-test-web.md).

 ## Layout
@@ -118,11 +147,14 @@ WhatsApp delivery, swipe gestures):
 - `packages/db/` — Drizzle schema and migrations
 - `packages/shared/` — cross-app helpers (rrule, media paths,
 timezones, WhatsApp media classifier)
-- `docs/superpowers/specs/` — design specs and manual test runbooks
+- `docs/runbook.md` — manual end-to-end smoke checklist
+- `docs/superpowers/specs/` — design specs and earlier manual test
+runbooks
 - `docs/superpowers/plans/` — implementation plans
 - `docker/` — Dockerfiles (`tools.Dockerfile`, `bot.Dockerfile`,
 `web.Dockerfile`)
-- `scripts/` — `dev.sh`, `db.sh`, `gen_auth_secret.sh`
+- `scripts/` — `dev.sh`, `db.sh`, `gen_auth_secret.sh`,
+`set-password.sh`, `create-user.sh`

 ## Scripts

@@ -134,17 +166,43 @@ container, so no host Node is needed.
 | `scripts/dev.sh up\|down\|logs\|status\|build\|exec\|pnpm\|shell\|restart-bot` | Stack lifecycle and tools-container shell |
 | `scripts/db.sh migrate\|generate\|studio\|seed\|reset` | Drizzle migration helper |
 | `scripts/gen_auth_secret.sh [--write]` | Generate `AUTH_SECRET` (host-only, no Node needed) |
+| `scripts/set-password.sh <username>` | Set / reset a user's password (reads stdin) |
+| `scripts/create-user.sh <username> <role>` | Create a user from CLI (admin / user) |

 Set `NO_SUDO=1` if your user is in the docker group (recommended).

+## Auth + admin model
+
+- One bootstrap operator (`admin`) is created by the seed; its
+password is set via `scripts/set-password.sh admin` on first launch.
+- Two roles: `admin` (full access including user management) and
+`user` (everything except `/settings/users`). Role-based nav
+filtering is enforced in middleware + the AppShell + every server
+action that mutates user state.
+- Every user gets an isolated workspace — accounts, reminders,
+groups, and run history all scope by `operator_id`. The admin
+panel is the only cross-tenant surface.
+- Sessions: AES-256-GCM-encrypted cookie keyed off `AUTH_SECRET`,
+HttpOnly + Secure-in-prod + SameSite=Lax, 30-day TTL. The
+`OPERATOR_TOKEN_VERSION` env (defaults to `"1"`) is the kill switch
+— bumping it invalidates every outstanding cookie globally on the
+next request.
+- Login rate limits: 10 / 5 min per-IP + 5 / 15 min per-username + a
+100 / min global backstop. The error message is identical for all
+three so the limit-which-tripped isn't leaked.
+
 ## Deferred

 - **Standalone media library** browser (currently media is uploaded
 per-reminder).
 - **E2E browser tests** (Playwright) on the swipe and pairing flows.
-- **Auth** (passkeys / email-password) — bring back if URL exposure
-becomes a concern. Today the app trusts whatever's in front of the
-reverse proxy.
-- **Multi-operator** — schema supports `operator_id` on every row,
-but the seed runs as a single operator and there's no /signup or
-invite flow yet.
+- **Search-as-you-type in the wizard's groups picker** — at 3 000+
+groups per account the picker still loads the alphabetical
+top-200; operators with >200 groups need to use the list page's
+search to find anything past 'L'.
+- **Composite index on `(account_id, name)`** for the groups list
+page's `ORDER BY name LIMIT 200` query — currently a sort + limit;
+the GIN trigram on `name` plus the unique on `(account_id,
+wa_group_jid)` already cover most cases.
+- **Self-service password reset** (email link, etc.) — out of scope
+for v1; admins use the Users page.
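The "Auth + admin model" section added in this hunk gives concrete login limits: 10 per 5 min per IP, 5 per 15 min per username, and 100 per min globally. Here is a small in-memory sketch of how three such layers could be combined. The limits and the generic "Too many attempts" message come from the diff; the fixed-window counting, the class names `FixedWindowCounter` and `LoginRateLimiter`, and the choice to update all three counters on every attempt are assumptions rather than the project's implementation.

```typescript
interface LimitRule {
  max: number;
  windowMs: number;
}

// Limits as stated in the README; the fixed-window algorithm is an assumption.
const RULES: Record<"ip" | "username" | "global", LimitRule> = {
  ip: { max: 10, windowMs: 5 * 60 * 1000 },
  username: { max: 5, windowMs: 15 * 60 * 1000 },
  global: { max: 100, windowMs: 60 * 1000 },
};

// Same message for every layer, so the caller cannot tell which one tripped.
const GENERIC_ERROR = "Too many attempts";

class FixedWindowCounter {
  private readonly rule: LimitRule;
  private readonly hits = new Map<string, { count: number; windowStart: number }>();

  constructor(rule: LimitRule) {
    this.rule = rule;
  }

  // Records one attempt for `key` and reports whether it is within the limit.
  hit(key: string, nowMs: number): boolean {
    const entry = this.hits.get(key);
    if (!entry || nowMs - entry.windowStart >= this.rule.windowMs) {
      this.hits.set(key, { count: 1, windowStart: nowMs });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.rule.max;
  }
}

class LoginRateLimiter {
  private readonly perIp = new FixedWindowCounter(RULES.ip);
  private readonly perUsername = new FixedWindowCounter(RULES.username);
  private readonly global = new FixedWindowCounter(RULES.global);

  // Returns null when the attempt may proceed, or the generic error otherwise.
  // Usernames are lower-cased so case rotation maps to one counter.
  check(ip: string, username: string, nowMs: number): string | null {
    const ipOk = this.perIp.hit(ip, nowMs);
    const userOk = this.perUsername.hit(username.toLowerCase(), nowMs);
    const globalOk = this.global.hit("*", nowMs);
    return ipOk && userOk && globalOk ? null : GENERIC_ERROR;
  }
}

// Usage: one username attacked from a fresh IP each time, with rotated casing.
const limiter = new LoginRateLimiter();
const outcomes: (string | null)[] = [];
for (let i = 0; i < 6; i++) {
  const name = i % 2 === 0 ? "Alice" : "alice";
  outcomes.push(limiter.check(`10.0.0.${i}`, name, i * 1000));
}
```

In this sketch the first five attempts pass and the sixth is refused by the per-username layer even though every request came from a different IP, which is the IP-hopping case the README calls out.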
docs/runbook.md (new file, 200 lines added)
@@ -0,0 +1,200 @@
+# Manual end-to-end runbook (v1)
+
+Smoke checklist for verifying a fresh deploy. Unit tests don't catch
+the live-Baileys / live-Postgres / browser-gesture path; this is what
+you run before declaring a release good.
+
+Time budget: ~10 minutes if everything works, ~30 if a step fails.
+
+---
+
+## Pre-flight
+
+- [ ] **Stack up.**
+`docker ps | grep cmbot` → expect `cmbot-tools`, `cmbot-bot`,
+`cmbot-web` all `Up`.
+- [ ] **Migrations clean.**
+`NO_SUDO=1 scripts/db.sh migrate` → "Migrations applied." (and
+*not* "Refusing to run drizzle migrate" — that's the journal
+monotonicity guard tripping).
+- [ ] **Web reachable.**
+`curl -sf http://localhost:9000/api/health` → 200.
+- [ ] **Bot reachable.**
+`curl -sf http://localhost:8081/health` → 200.
+
+If any pre-flight fails, fix before continuing.
+
+---
+
+## 1. Auth bootstrap
+
+- [ ] `scripts/db.sh seed` (idempotent — only inserts the `admin`
+operator if missing).
+- [ ] `echo 'change-me-now' | scripts/set-password.sh admin` → "Password
+updated."
+- [ ] Open `http://localhost:9000/login` → enter `admin` / the password
+→ redirected to `/`.
+- [ ] **Wrong password three times in a row** still rate-limits but
+with the generic "Too many attempts" message — no leak about
+which limit (IP / username / global) tripped.
+- [ ] Hit `/admin` URL while signed out → redirected to `/login` with
+`?next=/admin`. After a successful login, lands back on `/admin`.
+
+---
+
+## 2. User management (admin-only)
+
+- [ ] **Sidebar / drawer**: only one nav entry highlights at a time.
+On `/settings/users`, only `Admin` lights up; `Settings` does
+not.
+- [ ] `/settings/users` → Add user → username `alice`, password
+`alpha7!`, role `user` → "User created."
+- [ ] `alice` row shows: username + `you` chip if applicable, role
+pill, Promote / Reset / Delete buttons on row 2.
+- [ ] Promote `alice` to admin → page revalidates, badge flips to
+`admin`.
+- [ ] Demote back to `user`.
+- [ ] **Last-admin guard**: Demote / Delete on the only remaining
+admin row are both disabled.
+- [ ] Delete `alice` via the confirm dialog (Cancel + Delete user
+buttons; **no third "Close" button** — the static guard test
+catches that regression but eyeball it anyway).
+
+---
+
+## 3. Account pairing
+
+- [ ] `/accounts` → New Account → label `WaBot Test` → Pair WhatsApp.
+Land on the live QR page within ~2 s.
+- [ ] Login screen header is JUST the centered brand mark — no nav,
+no menu drawer.
+- [ ] Scan with WhatsApp → "Linked Devices" → "Link a device".
+- [ ] **Connection success.** Page transitions through `qr` → (brief
+`restart-required` close handled silently) → `connected` with
+a green check and `+60xxx` phone number → auto-redirect to
+`/accounts/<id>` after 3 s.
+- [ ] **Refresh Groups** button on `/accounts/<id>/groups` → spinner
+during the sync, page auto-refreshes when the bot pushes
+`groups.synced` over SSE. No manual reload needed.
+
+### Pair regression checks (these caught real bugs)
+
+- [ ] **Back → Re-pair**: from a live QR, click ← Back → Pair again
+from the account detail page. Should NOT instantly flash
+"Pairing timed out". A new QR appears and the countdown
+restarts at 5:00.
+- [ ] **Duplicate phone**: with one phone already paired, scan its QR
+from a *second* account row → see the amber "Phone already
+linked" panel naming the existing account. The original
+account's session stays intact.
+
+---
+
+## 4. Reminder lifecycle
+
+- [ ] `/reminders` → New Reminder → walk the wizard:
+- Step 1: pick `WaBot Test`.
+- Step 2: enter a short text message ("smoke test <timestamp>").
+- Step 3: pick `Daily` recurrence, fire ~2 minutes from now.
+Confirm "Pause sending by" checkbox is **unchecked by default**.
+- Step 4: select 1 group.
+- Step 5: review → Save.
+- [ ] Reminder appears on `/reminders` with status `Active`.
+Recurrence column shows the human-readable description; long
+descriptions truncate with `…`.
+- [ ] **Wait for the fire window.** When the time hits, the message
+lands in the WhatsApp group **exactly once**.
+- [ ] `/activity` → the run shows under `Success`. Default tab is
+Success (no `All` tab).
+- [ ] Swipe-left a row → Delete shelf appears. Swipe-right → Pause /
+Restart shelf. Tapping a row navigates to its detail; dragging
+does NOT navigate (6-px threshold).
+- [ ] Pause the reminder → status flips to `Paused` immediately and
+the next-fire-time disappears.
+- [ ] Restart → fires on the next scheduled occurrence.
+
+### Reminder regression checks
+
+- [ ] **Triple-fire repro** (only if you have a tame group): edit
+the reminder repeatedly within microseconds of each other (e.g.
+the wizard Save button hammered three times). The message must
+land **exactly once**. The bot logs should show
+"duplicate fire detected inside mutex" warnings on the second
+and third attempts.
+- [ ] **Reschedule under existing job**: edit a recurring reminder's
+schedule to a NEW time before its next-fire arrives. The new
+time must fire (the old `created` job is now `cancelled` in
+`pgboss.job`; verify with `select state, count(*) from
+pgboss.job where name='reminder.fire' group by state`).
+
+---
+
+## 5. Account lifecycle
+
+- [ ] **Unpair** the account from `/accounts/<id>`. Confirm dialog
+(Cancel + Yes, unpair). The account row stays in the list with
+"Unpaired" status; groups disappear from the picker (they're
+soft-archived, not deleted).
+- [ ] **Re-pair** the same account → groups come back via the
+on-conflict upsert flipping `is_archived` back to false.
+- [ ] **Delete** the account from `/accounts/<id>` → Confirm dialog →
+the account vanishes from `/accounts`. Check on the *phone*'s
+WhatsApp Linked Devices list — the entry is gone (the
+logout-before-stop flow tells WhatsApp to drop it).
+
+---
+
+## 6. Sign-out + session lifetime
+
+- [ ] **Sign out** from the sidebar / drawer footer → land on `/login`.
+- [ ] Hit any protected URL → redirected to login.
+- [ ] **Token-version kill switch**: set `OPERATOR_TOKEN_VERSION=2`
+in `.env.development`, restart the web container. Every
+previously-issued cookie is now invalid; every authenticated
+request bounces to `/login`. Reset to `1` after.
+
+---
+
+## 7. Cross-tenant isolation
+
+- [ ] Sign in as `admin`. Note dashboard counter values.
+- [ ] As admin, create a second user `bob` and give them a fresh
+account / reminder / fire it once.
+- [ ] Sign out, sign in as `bob`. Dashboard counters MUST show only
+bob's numbers (not admin's). `/reminders` lists only bob's
+reminders. `/accounts` only bob's accounts.
+
+---
+
+## 8. Sweep
+
+- [ ] `docker logs cmbot-web --since 10m | grep -iE 'error|⨯'` — no
+output (or only Baileys "Stream Errored (restart required)"
+noise; that's upstream).
+- [ ] `docker logs cmbot-bot --since 10m | grep -iE 'error|fatal'` —
+no output beyond the same Baileys upstream noise.
+- [ ] `git status` clean (no leftover `_check.ts` or temp files).
+
+---
+
+## When a step fails
+
+- **Migration refused** with "Refusing to run drizzle migrate":
+open `packages/db/migrations/meta/_journal.json` and bump the
+flagged entry's `when` to the suggested value. Re-run.
+- **Pair shows immediate timeout**: bot logs should mention "ignoring
+close from previous attempt while warming up" — that's the fix
+working, but check a stale Baileys session isn't gummed up. Last
+resort: `rm -rf dev-data/sessions/<accountId>` and re-pair.
+- **Reminder fires twice**: check `pgboss.queue.policy` for
+`reminder.fire` — must be `standard`, not `stately` (stately drops
+reschedules silently). The `registerReminderJobs` boot hook
+force-flips this on every bot start.
+- **Delete didn't remove the linked-device entry on the phone**:
+the bot's `socket.logout()` is best-effort — if the socket was
+already disconnected when delete fired, the operator removes the
+entry manually from WhatsApp's UI.
+
+If any of the regression checks (Back→Re-pair, duplicate phone,
+triple-fire, reschedule) fail, that's a real bug — capture the bot
+log and file an issue before shipping.
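The "Migration refused" entry in the troubleshooting list refers to the guard that rejects a `_journal.json` whose `when` timestamps are not strictly increasing, and to a suggested replacement value. One way such a check could work is sketched below. The `JournalEntry` fields are an assumed subset of a drizzle-kit journal entry, the function names are hypothetical, and the "previous value plus one" suggestion rule is a guess, since the diff does not say how the suggested value is computed.

```typescript
// Assumed subset of a drizzle-kit journal entry; only what this check reads.
interface JournalEntry {
  idx: number;
  when: number; // epoch milliseconds
  tag: string;
}

interface Violation {
  tag: string;
  when: number;
  suggestedWhen: number;
}

// Walks entries in order and flags any whose `when` is not strictly greater
// than the (possibly corrected) `when` of the entry before it.
function findNonMonotonic(entries: readonly JournalEntry[]): Violation[] {
  const violations: Violation[] = [];
  let previousWhen = Number.NEGATIVE_INFINITY;
  for (const entry of entries) {
    if (entry.when <= previousWhen) {
      const suggestedWhen = previousWhen + 1; // assumed suggestion rule
      violations.push({ tag: entry.tag, when: entry.when, suggestedWhen });
      previousWhen = suggestedWhen;
    } else {
      previousWhen = entry.when;
    }
  }
  return violations;
}

// Throws with the message prefix quoted in the runbook when the journal is bad.
function assertJournalMonotonic(entries: readonly JournalEntry[]): void {
  const violations = findNonMonotonic(entries);
  if (violations.length > 0) {
    const detail = violations
      .map((v) => `${v.tag}: when=${v.when}, suggested ${v.suggestedWhen}`)
      .join("; ");
    throw new Error(`Refusing to run drizzle migrate: ${detail}`);
  }
}

// Usage: the third migration was generated with a stale timestamp.
const journal: JournalEntry[] = [
  { idx: 0, when: 1700000000000, tag: "0000_init" },
  { idx: 1, when: 1700000500000, tag: "0001_auth" },
  { idx: 2, when: 1700000400000, tag: "0002_late" },
];
const flagged = findNonMonotonic(journal);
```

Tracking the corrected value rather than the raw one means a run of several stale entries gets distinct, increasing suggestions instead of the same number repeated.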