diff --git a/docs/superpowers/specs/2026-05-10-auth-and-prod-hardening-design.md b/docs/superpowers/specs/2026-05-10-auth-and-prod-hardening-design.md new file mode 100644 index 0000000..4db8769 --- /dev/null +++ b/docs/superpowers/specs/2026-05-10-auth-and-prod-hardening-design.md @@ -0,0 +1,437 @@ +# Auth + Production Hardening Design + +> Spec for closing the production-readiness gap before promoting the +> bot to public-internet exposure at `wabot.04080616.xyz`. Covers the +> session-cookie auth model with username + password + role, plus the +> hygiene work that has to land alongside it (robots, env, container +> non-root) so the public surface is safe in one change. + +## Goal + +Add operator authentication to the web app so the public URL stops +being a foothold for anyone who finds it, and at the same time close +the highest-risk production gaps surfaced in the v1.1.0 audit: +indexable content, committed credentials, root-running containers, +and four un-rate-limited Server Actions. + +## Constraints + +- Single-host self-hosted deployment, public-internet via reverse + proxy + TLS at `wabot.04080616.xyz`. +- Up to a handful of users today, with room to grow. One must be + `admin`; the rest are `user`. +- Mobile PWA homescreen workflow: 30-day cookie, no friction at + re-open, no third-party identity provider. +- No new infra dependencies. Postgres + Docker compose stay the + whole platform. No NextAuth / Auth.js, no external KV, no SMS. +- The 66 call sites that currently use `getSeededOperator()` must be + retrofitted cleanly, with none of them breaking. +- All code changes covered by unit tests; no test relies on a live + Postgres or browser. + +## Approach: roll-our-own session cookie + +A library would be heavy for one role gate and one cookie. We pick +up `bcrypt` for password hashing (battle-tested) and Web Crypto's +HMAC for cookie signing (stdlib, edge-runtime compatible). All other +code is domain-owned and exhaustively tested.
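As a rough illustration of the Web Crypto HMAC signing mentioned above, the sketch below signs a JSON payload and verifies it again. It is not the spec's `signSession` / `verifySession`; the names `signValue`, `verifyValue`, and the helper functions are hypothetical, and the checks on `exp`, `iat`, `v`, and `role` are deliberately left out. It assumes a runtime that exposes the global `crypto.subtle`, `btoa`, and `atob`.

```ts
// Hypothetical sketch: HMAC-SHA256 signing of a base64url payload with Web Crypto.
const enc = new TextEncoder();

function toBase64Url(bytes: Uint8Array): string {
  let bin = "";
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

function fromBase64Url(s: string): Uint8Array {
  const b64 = s.replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  const bin = atob(padded); // throws on malformed input; callers catch
  const out = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) out[i] = bin.charCodeAt(i);
  return out;
}

async function hmacKey(secret: string): Promise<CryptoKey> {
  return crypto.subtle.importKey(
    "raw",
    enc.encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign", "verify"],
  );
}

// Returns "<base64url payload>.<base64url signature>".
async function signValue(payload: object, secret: string): Promise<string> {
  const body = toBase64Url(enc.encode(JSON.stringify(payload)));
  const sig = await crypto.subtle.sign("HMAC", await hmacKey(secret), enc.encode(body));
  return `${body}.${toBase64Url(new Uint8Array(sig))}`;
}

// Returns the parsed payload, or null on bad shape or signature mismatch.
async function verifyValue(value: string, secret: string): Promise<unknown | null> {
  const dot = value.indexOf(".");
  if (dot < 0) return null;
  const body = value.slice(0, dot);
  const sig = value.slice(dot + 1);
  try {
    // subtle.verify performs the signature comparison itself.
    const ok = await crypto.subtle.verify(
      "HMAC",
      await hmacKey(secret),
      fromBase64Url(sig),
      enc.encode(body),
    );
    if (!ok) return null;
    return JSON.parse(new TextDecoder().decode(fromBase64Url(body)));
  } catch {
    return null;
  }
}
```

Delegating the comparison to `crypto.subtle.verify` avoids a hand-written byte comparison, which is one way to keep the module free of Node-only APIs.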
+ +The model: the user posts username + password to a Server Action, +the action verifies against a per-user `password_hash` row, and the +response sets a signed cookie carrying `{ userId, role, iat, exp, v }`. +Middleware verifies the cookie on every request; Server Actions +double-check via `requireUser()` / `requireAdmin()` so a forgotten +middleware path can't bypass the gate. + +## Schema migration (`0010_add_user_auth.sql`) + +```sql +ALTER TABLE operators + ADD COLUMN username text, + ADD COLUMN password_hash text; +CREATE UNIQUE INDEX operators_username_uq + ON operators (lower(username)); +-- Backfill the seed row so it has a username; password_hash stays NULL +-- so the operator is forced to set one via the CLI before they can sign +-- in. Sets a clear "you have to do this before going live" gate. +UPDATE operators + SET username = 'admin' + WHERE username IS NULL; +ALTER TABLE operators + ALTER COLUMN username SET NOT NULL; +``` + +`telegramUserId` stays for now (it's referenced from existing migrations +and seed flow) but no longer drives auth. `defaultTimezone` and `role` +are unchanged. `operators.role` already defaults to `"admin"`. + +## Roles + +Two values, no enum constraint at the DB layer (text — same as +existing). + +| role | can do | +| ----- | ------------------------------------------------------------- | +| admin | everything in the app + user management (CRUD other users) | +| user | everything except `/settings/users` and the user-mgmt actions | + +A third "viewer" role isn't worth it today; can be added later by +extending the role check. 
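The role table above can be read as a single predicate. The sketch below is only an illustration under assumptions: `ADMIN_ONLY_PREFIXES`, `isRole`, and `canAccess` are hypothetical names, and only the `/settings/users` path comes from the spec.

```ts
// Hypothetical sketch of the two-role gate described in the table above.
type Role = "admin" | "user";

// Only /settings/users is named by the spec; the list shape is an assumption.
const ADMIN_ONLY_PREFIXES: readonly string[] = ["/settings/users"];

// Narrow an untrusted value (e.g. a cookie field) to a known role.
function isRole(value: unknown): value is Role {
  return value === "admin" || value === "user";
}

// Admins can reach everything; users can reach everything except admin-only paths.
function canAccess(role: Role, path: string): boolean {
  const adminOnly = ADMIN_ONLY_PREFIXES.some(
    (prefix) => path === prefix || path.startsWith(prefix + "/"),
  );
  return role === "admin" || !adminOnly;
}
```

A later "viewer" role would extend the `Role` union and add a branch in `canAccess`, which matches the note about extending the role check.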
+ +## Cookie format + +Header value: `session=<base64url payload>.<HMAC signature>` + +```ts +type SessionPayload = { + userId: string; // operators.id (uuid) + role: "admin" | "user"; + iat: number; // issued-at, unix seconds + exp: number; // expires-at, unix seconds (iat + 30 days) + v: string; // OPERATOR_TOKEN_VERSION at issue time (env values are strings) +}; +``` + +HMAC is HMAC-SHA256 over the base64url-encoded payload string with +`AUTH_SECRET` as the key. Verification rejects on: + +- Bad shape (no `.`, base64 decode fails, JSON parse fails). +- HMAC mismatch (uses constant-time compare). +- `exp <= now`. +- `iat > now + 60` (clock-skew guard, 60s tolerance). +- `v !== (process.env.OPERATOR_TOKEN_VERSION ?? "1")`. +- `role` not one of `"admin"` / `"user"`. + +Cookie attributes: `HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=2592000`. +`Max-Age=0` on logout to clear. + +`OPERATOR_TOKEN_VERSION` env var (default `"1"`) is the global +session-invalidation lever. Bumping it on the host instantly logs out +every user without any DB writes, which is useful after a host +compromise or a known-shared password. + +## Login flow + +Page: `apps/web/src/app/login/page.tsx`. Single form with: + +- Username input (`type=text`, autocomplete `username`) +- Password input (`type=password`, autocomplete `current-password`) +- Submit button "Sign in" +- Error slot for the generic message +- A small note: "First time? Run `./scripts/set-password.sh <username>` + in your tools container to set a password." + +Server action `loginAction(formData: FormData)`: + +```text +1. Read username, password from FormData. +2. Reject if either >256 chars (DoS guard, no bcrypt). +3. Reject if either empty. +4. Apply rate limit: checkRateLimit("login:" + ip, { max: 10, windowSec: 300 }). + On exhaustion → return { ok: false, error: "Too many attempts, try later." } +5. Look up user: select * from operators where lower(username)=lower($1) +6.
If user not found OR user.password_hash IS NULL: + await bcrypt.compare(password, DUMMY_HASH); // timing equivalence + return { ok: false, error: "Invalid username or password." } +7. await bcrypt.compare(password, user.password_hash) + if false: return { ok: false, error: "Invalid username or password." } +8. Issue cookie: signSession({ userId, role, iat: now, exp: now + 30d, v: TOKEN_VERSION }) +9. Redirect to safe(next) ?? "/" +``` + +`safe(next)`: must be a string starting with `/` AND not starting +with `//`. Otherwise return `null`. + +Logout action `logoutAction()`: clear the cookie via +`cookies().set("session", "", { maxAge: 0, ... })` and redirect to +`/login`. + +## Middleware gate + +`apps/web/src/middleware.ts` extends the existing API allowlist with +the auth check. + +```text +For every request: + - If path is in allowlist (auth-free): + /login, /logout, /api/health, /manifest.webmanifest, + /icon-*, /favicon.ico, /_next/static/*, /_next/image + → NextResponse.next() + - Read session cookie. Verify (HMAC, exp, iat-skew, version, role shape). + - On valid: NextResponse.next() + - On invalid + path starts with /api/: 401, no body + - On invalid + page request: 302 to /login?next= +``` + +`/api/events` and `/api/qr/[accountId]` are explicitly removed from +the unauth allowlist — middleware now requires a session for them. + +The middleware imports the verifier from `@/lib/auth-cookie` (a +dependency-free module that runs on the edge runtime — no bcrypt, +no DB). + +## Server-action defense-in-depth + +`apps/web/src/lib/auth.ts` (Node runtime — DB access OK): + +```ts +export async function getCurrentUser(): Promise +export async function requireUser(): Promise // throws Response 401 / redirects +export async function requireAdmin(): Promise // requireUser + role === "admin" +``` + +`getSeededOperator()` is renamed to `getCurrentUser()` (and rewired +to read the verified cookie + look up the user). All 66 call sites +swap mechanically. 
Existing typing stays compatible because the +returned shape is a superset. + +Every Server Action begins with `await requireUser()` (or +`requireAdmin()` for admin-only). This is the second layer; the +middleware is the first. Both must agree before any state mutates. + +## User management surface + +Admin-only, gated by `requireAdmin()` at every entry point. + +- `/settings/users` (page) — list of users with role chip + createdAt; + inline "Reset password", "Demote/Promote", "Delete" buttons. New + user form at top. +- `createUserAction({ username, password, role })` — validate inputs, + bcrypt the password, insert. +- `setUserRoleAction({ userId, role })` — guard: if `userId === self.id` + AND `role !== "admin"`, refuse with "you can't demote yourself". +- `resetUserPasswordAction({ userId, newPassword })` — bcrypt + update. + Does NOT change cookies — the affected user keeps their existing + session until expiry or a token-version bump. +- `deleteUserAction({ userId })` — guard: refuse self-delete. + Additional guard: if deleting the last admin, refuse with "promote + another user to admin first". + +All admin actions fan out a refresh of `/settings/users` via +`revalidatePath`. + +## CLI bootstrap + +The actual hashing happens in a small TSX script (so it can `import +bcrypt` from the workspace), wrapped by a one-line bash launcher +that runs it through the `tools` container. Two pieces: + +`packages/db/src/scripts/set-password.ts` — reads `username` from +argv, prompts for password on stdin (echo off via `readline`'s +`writeMask`), bcrypts at 12 rounds, runs an `UPDATE operators SET +password_hash = $1 WHERE lower(username) = lower($2)`, exits +non-zero if no rows matched. + +`packages/db/src/scripts/create-user.ts` — same pattern, but +INSERTs a fresh row with `username`, `role`, `password_hash`, +default timezone, and a synthetic `telegramUserId` (current time- +millis) since the column is still NOT NULL until a future cleanup +migration. 
+ +`scripts/set-password.sh` and `scripts/create-user.sh` are thin +wrappers that invoke the TSX scripts via `pnpm --filter @cmbot/db +exec tsx ...` inside the tools container, matching the existing +script-runner pattern. + +These scripts are used to bootstrap the first admin and to recover +when an admin loses their password. After bootstrap, all user +management happens through the web UI. + +## Rate limits added + +| action | limit | +| ---------------------------- | --------------------------------- | +| loginAction | 10 / 5 min per IP | +| sendTestAction | 3 / 60 s per groupId | +| resumeReminderRunAction | 30 / 10 s per IP (existing infra) | +| cancelReminderRunAction | 30 / 10 s per IP | +| createUserAction | 5 / 60 s per IP | +| resetUserPasswordAction | 5 / 60 s per IP | + +`checkRateLimit` is the existing Postgres-backed helper. + +## Robots / noindex + +`apps/web/src/app/robots.ts`: + +```ts +import type { MetadataRoute } from "next"; +export default function robots(): MetadataRoute.Robots { + return { rules: [{ userAgent: "*", disallow: "/" }] }; +} +``` + +Plus `metadata.robots = { index: false, follow: false }` in the root +`apps/web/src/app/layout.tsx`. Two layers: `robots.txt` is advisory +and only discourages crawling, while the meta tag is the directive +search engines use to keep a fetched page out of their index. + +## Env hygiene + +- Add `.env*` to `.gitignore` with a `!.env.example` exception (the + file already excludes `.env.local` and `.env.*.local`; this widens + it to every `.env*` except `.env.example`). +- `git rm --cached .env.development` and recreate locally without + committing. +- New `.env.example` documents every required key with placeholder + values, including the new `OPERATOR_TOKEN_VERSION`. +- After this change ships, the operator rotates the leaked + `AUTH_SECRET` and Postgres password (manual step, called out in + the upgrade notes).
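One way to make the `.env.example` contract enforceable is a startup check that fails fast when a required key is missing. The sketch below is an assumption, not part of the spec: `missingEnvKeys` and `tokenVersion` are hypothetical helpers, and `DATABASE_URL` in the usage note is only a placeholder key name. The `"1"` default for `OPERATOR_TOKEN_VERSION` does come from the spec.

```ts
// Hypothetical sketch: validate required env keys without touching process.env directly.
type Env = Record<string, string | undefined>;

// Returns the required keys that are unset or blank.
function missingEnvKeys(env: Env, required: readonly string[]): string[] {
  return required.filter((key) => (env[key] ?? "").trim() === "");
}

// OPERATOR_TOKEN_VERSION defaults to "1" when unset or blank, as the spec states.
function tokenVersion(env: Env): string {
  const raw = (env["OPERATOR_TOKEN_VERSION"] ?? "").trim();
  return raw === "" ? "1" : raw;
}
```

At boot this might be called as `missingEnvKeys(process.env, ["AUTH_SECRET", "DATABASE_URL"])`, throwing if the result is non-empty. The actual key list would come from `.env.example`.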
+ +## Container hardening + +Both Dockerfiles: + +```dockerfile +RUN useradd -m -u 1000 -s /usr/sbin/nologin app && \ + mkdir -p /data/sessions /data/media && \ + chown -R app:app /app /data && \ + chmod 700 /data/sessions +USER app +``` + +The `dev-data:/data` volume mount in `docker-compose.dev.yml` keeps +working since the host UID matches the in-container `app` UID 1000. + +## Origin allowlist + +`next.config.ts` adds: + +```ts +experimental: { + serverActions: { + allowedOrigins: ["wabot.04080616.xyz", "localhost:9000"], + }, +}, +``` + +Next.js compares the `Origin` of a Server Action POST with the +request host and rejects mismatches. Listing the public host here +keeps that check passing behind the reverse proxy, while a POST from +any other domain that tries to invoke an action on the back of the +victim's session cookie is still refused. + +## Test plan (38 tests) + +### `auth-cookie.test.ts`: pure HMAC + verification logic + +1. `signSession` then `verifySession` round-trips. +2. Tampered payload → verify rejects. +3. Tampered signature → verify rejects. +4. Wrong secret → verify rejects. +5. Constant-time compare prevents char-by-char timing leak (assert + a constant-time primitive is used, e.g. `crypto.subtle.verify`; + Node's `crypto.timingSafeEqual` may be unavailable on the edge + runtime). +6. Cookie expired (`exp <= now`) → reject. +7. Cookie issued in the future (`iat > now + 60`) → reject (clock-skew). +8. Cookie with stale `v` (TOKEN_VERSION bumped after issue) → reject. +9. Cookie with bad `role` value (`"superadmin"`) → reject. +10. Cookie missing fields → reject. + +### `login-action.test.ts`: login flow + +11. Valid credentials → cookie issued with right shape. +12. Wrong password → no cookie, generic error. +13. Wrong username → no cookie, generic error, dummy-bcrypt called + (timing equivalence). +14. `password_hash IS NULL` user → no cookie, same generic error, + dummy-bcrypt called (the CLI hint lives in the login page note, + not in the error). +15. Empty username or password → 400-equivalent (no DB hit). +16. Username/password >256 chars → rejected before bcrypt. +17. Username case-insensitive (`Admin` matches `admin`). +18. 11th login attempt within window → 429 (rate-limited). +19. After window expiry, attempts succeed. +20.
Failed login logs warning with username + IP, no password. +21. Cookie sets correct attrs (HttpOnly, Secure, SameSite, Path, + Max-Age). + +### `middleware.test.ts` — gate behavior + +22. No cookie + page request → 302 to `/login?next=`. +23. No cookie + `/api/...` (non-allowlisted) → 401. +24. Valid cookie + page → next(). +25. Tampered cookie → 302 to `/login`. +26. Allowlisted (`/login`, `/api/health`, manifest, icons) bypasses. +27. `/api/events` and `/api/qr/[id]` are NOT in allowlist (regression + against the audit's Critical findings). + +### `next-param.test.ts` — open-redirect prevention + +28. `/dashboard` → preserved. +29. `//evil.com` → falls back to `/`. +30. `https://evil.com` → falls back to `/`. +31. `javascript:alert(1)` → falls back to `/`. +32. `/path?with=query&extra=fine` → preserved verbatim. + +### `require-helpers.test.ts` — Server-action gates + +33. `requireUser()` throws with no session. +34. `requireUser()` returns the user with valid session. +35. `requireAdmin()` throws when role === "user". +36. `requireAdmin()` returns the user when role === "admin". + +### `user-management.test.ts` — admin guards + +37. Self-demote (`setUserRoleAction({ userId: self, role: "user" })`) + → ok:false with clear error. +38. Last-admin delete (deleting only admin user) → ok:false with + "promote another user first". + +## Migration risk + +`getSeededOperator()` is the one big touch. The 66 call sites are +mostly Server Actions and queries that read `.id` and +`.defaultTimezone` off the returned object — the new shape is a +superset, so the change is mechanical. + +To keep churn off the existing test suite (~12 tests mock +`@/lib/operator`), `apps/web/src/lib/operator.ts` keeps its export +but reimplements `getSeededOperator` as a thin pass-through to +`getCurrentUser` from `@/lib/auth`. Existing mocks that target +`@/lib/operator` keep working unchanged. 
New code uses +`getCurrentUser` / `requireUser` / `requireAdmin` directly; the old +name is kept as a compatibility shim and removed in a follow-up +once all sites are swept. + +A `DUMMY_HASH` constant lives at the top of the login action. It is +a precomputed bcrypt hash of a known throwaway string (`"x"`), +generated once and committed, at the same cost factor (12 rounds) +as real hashes. We compare against it on the user-not-found and +missing-password paths so timing closely matches the wrong-password +path. Generating a fresh dummy hash per request would double the +bcrypt work and create its own timing signal. + +## Out of scope (deferred) + +- WebAuthn / passkeys. +- 2FA / TOTP. +- Email-based password recovery (a lost password is reset with the + CLI `set-password` script by someone with host access; bumping + `OPERATOR_TOKEN_VERSION` only logs sessions out and does not + restore access). +- Account lockout (rate limit is enough for one operator's threat + model). +- SSO / OAuth providers. +- Audit-log surface for "who logged in when". The pino warn line + is the minimum; a structured audit table is later work. +- A "remember this device" feature distinct from the 30-day cookie. + +## Acceptance + +- The bot can be exposed at `wabot.04080616.xyz` and any + unauthenticated request to a non-allowlisted path returns 401 + (API) or redirects to `/login` (page). +- A correct username + password issues a 30-day cookie that survives + reload, browser restart, and PWA homescreen launches. +- A wrong username, a wrong password, and a missing-password user + all produce the same generic "Invalid username or password" + error and the same wall-clock duration (timing-equivalent). +- Bumping `OPERATOR_TOKEN_VERSION` on the host invalidates every + active session immediately. +- An attacker tampering with the cookie payload, signature, or + issued-at can't pass middleware. +- Eleven login attempts from the same IP within five minutes + produce a 429 on the eleventh. +- A `user`-role session can browse, schedule, and resume reminders + but cannot reach `/settings/users`.
+- An admin can't demote or delete their own row, and can't delete + the last admin. +- `robots.txt` returns `Disallow: /` and the rendered HTML carries + `<meta name="robots" content="noindex, nofollow">`. +- Both containers run as UID 1000, and the sessions directory has + mode `700`. +- `.env.development` is gone from the repo and `.gitignore` excludes + every `.env*` except `.env.example`. +- All 38 tests in the plan pass; existing 471 tests still pass.
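For reference, the `safe(next)` rule from the login flow and the `next-param.test.ts` cases can be sketched as a small pure function. The name `safeNext` is hypothetical, and the backslash check is an extra precaution that the spec does not require (some browsers treat `/\` like `//`).

```ts
// Hypothetical sketch of the open-redirect guard: accept only same-site absolute paths.
function safeNext(next: unknown): string | null {
  if (typeof next !== "string") return null;
  // Must be a path on this site: starts with "/" but not "//" (protocol-relative URL).
  if (!next.startsWith("/") || next.startsWith("//")) return null;
  // Assumption beyond the spec: also refuse "/\" as a "//" look-alike.
  if (next.startsWith("/\\")) return null;
  return next;
}

// Mirrors step 9 of the login flow: redirect to safe(next) ?? "/".
function redirectTarget(next: unknown): string {
  return safeNext(next) ?? "/";
}
```

Keeping the guard as a pure function means the five `next-param.test.ts` cases need no request mocking.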