# Auth + Production Hardening Design

> Spec for closing the production-readiness gap before promoting the
> bot to public-internet exposure at `wabot.04080616.xyz`. Covers the
> session-cookie auth model with username + password + role, plus the
> hygiene work that has to land alongside it (robots, env, container
> non-root) so the public surface is safe in one change.

## Goal

Add operator authentication to the web app so the public URL stops being
a foothold for anyone who finds it, and at the same time close the
highest-risk production gaps surfaced in the v1.1.0 audit: indexable
content, committed credentials, root-running containers, and four
un-rate-limited Server Actions.

## Constraints

- Single-host self-hosted deployment, public-internet via reverse proxy + TLS at `wabot.04080616.xyz`.
- Up to a handful of users today, with room to grow. One must be `admin`; the rest are `user`.
- Mobile PWA homescreen workflow: 30-day cookie, no friction at re-open, no third-party identity provider.
- No new infra dependencies. Postgres + Docker compose stay the whole platform. No NextAuth / Auth.js, no external KV, no SMS.
- The 66 call sites that currently use `getSeededOperator()` must be retrofitted cleanly, without breakage.
- All code changes covered by unit tests; no test relies on a live Postgres or browser.

## Approach: roll-our-own session cookie

A library would be heavy for one role gate and one cookie. We pick up
`bcrypt` for password hashing (battle-tested) and Web Crypto's HMAC for
cookie signing (stdlib, edge-runtime compatible). All other code is
domain-owned and exhaustively tested.

The model: the user posts username + password to a Server Action, the
action verifies against a per-user `password_hash` row, and the response
sets a signed cookie carrying `{ userId, role, iat, exp, v }`.
Middleware verifies the cookie on every request; Server Actions
double-check via `requireUser()` / `requireAdmin()` so a forgotten
middleware path can't bypass the gate.

## Schema migration (`0010_add_user_auth.sql`)

```sql
ALTER TABLE operators
  ADD COLUMN username text,
  ADD COLUMN password_hash text;

CREATE UNIQUE INDEX operators_username_uq ON operators (lower(username));

-- Backfill the seed row so it has a username; password_hash stays NULL
-- so the operator is forced to set one via the CLI before they can sign
-- in. Sets a clear "you have to do this before going live" gate.
UPDATE operators SET username = 'admin' WHERE username IS NULL;

ALTER TABLE operators
  ALTER COLUMN username SET NOT NULL;
```

`telegramUserId` stays for now (it's referenced from existing migrations
and seed flow) but no longer drives auth. `defaultTimezone` and `role`
are unchanged. `operators.role` already defaults to `"admin"`.

## Roles

Two values, no enum constraint at the DB layer (text, same as existing).

| role  | can do                                                        |
| ----- | ------------------------------------------------------------- |
| admin | everything in the app + user management (CRUD other users)    |
| user  | everything except `/settings/users` and the user-mgmt actions |

A third "viewer" role isn't worth it today; it can be added later by
extending the role check.

## Cookie format

Header value: `session=<payload>.<hmac>`

```ts
type SessionPayload = {
  userId: string;         // operators.id (uuid)
  role: "admin" | "user";
  iat: number;            // issued-at, unix seconds
  exp: number;            // expires-at, unix seconds (iat + 30 days)
  v: number;              // OPERATOR_TOKEN_VERSION at issue time
};
```

HMAC is HMAC-SHA256 over the base64url-encoded payload string with
`AUTH_SECRET` as the key. Verification rejects on:

- Bad shape (no `.`, base64 decode fails, JSON parse fails).
- HMAC mismatch (uses constant-time compare).
- `exp <= now`.
- `iat > now + 60` (clock-skew guard, 60s tolerance).
- `v` does not match `OPERATOR_TOKEN_VERSION` (defaults to `"1"`; the env value is parsed to a number before comparing, since `v` is numeric).
- `role` not one of `"admin"` / `"user"`.

Cookie attributes: `HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=2592000`.
`Max-Age=0` on logout to clear.

`OPERATOR_TOKEN_VERSION` env var (default `"1"`) is the global
session-invalidation lever. Bumping it on the host instantly logs out
every user with no DB writes, which is useful after a host compromise or
a known-shared password.

## Login flow

Page: `apps/web/src/app/login/page.tsx`. Single form with:

- Username input (`type=text`, autocomplete `username`)
- Password input (`type=password`, autocomplete `current-password`)
- Submit button "Sign in"
- Error slot for the generic message
- A small note: "First time? Run `./scripts/set-password.sh <username>` in your tools container to set a password."

Server action `loginAction(formData: FormData)`:

```text
1. Read username, password from FormData.
2. Reject if either >256 chars (DoS guard, no bcrypt).
3. Reject if either empty.
4. Apply rate limit: checkRateLimit("login:" + ip, { max: 10, windowSec: 300 }).
   On exhaustion → return { ok: false, error: "Too many attempts, try later." }
5. Look up user: select * from operators where lower(username)=lower($1)
6. If user not found OR user.password_hash IS NULL:
     await bcrypt.compare(password, DUMMY_HASH);  // timing equivalence
     return { ok: false, error: "Invalid username or password." }
7. await bcrypt.compare(password, user.password_hash)
   if false: return { ok: false, error: "Invalid username or password." }
8. Issue cookie: signSession({ userId, role, iat: now, exp: now + 30d, v: TOKEN_VERSION })
9. Redirect to safe(next) ?? "/"
```

`safe(next)`: must be a string starting with `/` AND not starting with
`//`. Otherwise return `null`.

Logout action `logoutAction()`: clear the cookie via
`cookies().set("session", "", { maxAge: 0, ... })` and redirect to `/login`.

## Middleware gate

`apps/web/src/middleware.ts` extends the existing API allowlist with the
auth check.
```text
For every request:
  - If path is in allowlist (auth-free):
      /login, /logout, /api/health, /manifest.webmanifest,
      /icon-*, /favicon.ico, /_next/static/*, /_next/image
    → NextResponse.next()
  - Read session cookie. Verify (HMAC, exp, iat-skew, version, role shape).
  - On valid: NextResponse.next()
  - On invalid + path starts with /api/: 401, no body
  - On invalid + page request: 302 to /login?next=<path>
```

`/api/events` and `/api/qr/[accountId]` are explicitly removed from the
unauth allowlist; middleware now requires a session for them.

The middleware imports the verifier from `@/lib/auth-cookie` (a
dependency-free module that runs on the edge runtime: no bcrypt, no DB).

## Server-action defense-in-depth

`apps/web/src/lib/auth.ts` (Node runtime, DB access OK):

```ts
export async function getCurrentUser(): Promise
export async function requireUser(): Promise // throws Response 401 / redirects
export async function requireAdmin(): Promise // requireUser + role === "admin"
```

`getSeededOperator()` is renamed to `getCurrentUser()` (and rewired to
read the verified cookie + look up the user). All 66 call sites swap
mechanically. Existing typing stays compatible because the returned
shape is a superset.

Every Server Action begins with `await requireUser()` (or
`requireAdmin()` for admin-only). This is the second layer; the
middleware is the first. Both must agree before any state mutates.

## User management surface

Admin-only, gated by `requireAdmin()` at every entry point.

- `/settings/users` (page): list of users with role chip + createdAt; inline "Reset password", "Demote/Promote", "Delete" buttons. New user form at top.
- `createUserAction({ username, password, role })`: validate inputs, bcrypt the password, insert.
- `setUserRoleAction({ userId, role })`: guard: if `userId === self.id` AND `role !== "admin"`, refuse with "you can't demote yourself".
- `resetUserPasswordAction({ userId, newPassword })`: bcrypt + update. Does NOT change cookies; the affected user keeps their existing session until expiry or a token-version bump.
- `deleteUserAction({ userId })`: guard: refuse self-delete. Additional guard: if deleting the last admin, refuse with "promote another user to admin first".

All admin actions fan out a refresh of `/settings/users` via
`revalidatePath`.

## CLI bootstrap

The actual hashing happens in a small TSX script (so it can
`import bcrypt` from the workspace), wrapped by a one-line bash launcher
that runs it through the `tools` container. Two pieces:

`packages/db/src/scripts/set-password.ts`: reads `username` from argv,
prompts for password on stdin (echo off via `readline`'s `writeMask`),
bcrypts at 12 rounds, runs an
`UPDATE operators SET password_hash = $1 WHERE lower(username) = lower($2)`,
exits non-zero if no rows matched.

`packages/db/src/scripts/create-user.ts`: same pattern, but INSERTs a
fresh row with `username`, `role`, `password_hash`, default timezone, and
a synthetic `telegramUserId` (current time in milliseconds) since the
column is still NOT NULL until a future cleanup migration.

`scripts/set-password.sh` and `scripts/create-user.sh`: thin wrappers
that invoke the TSX scripts via `pnpm --filter @cmbot/db exec tsx ...`
inside the tools container, matching the existing script-runner pattern.

Used to bootstrap the first admin and to recover when an admin loses
their password. After bootstrap, all user management happens through the
web UI.

## Rate limits added

| action                  | limit                             |
| ----------------------- | --------------------------------- |
| loginAction             | 10 / 5 min per IP                 |
| sendTestAction          | 3 / 60 s per groupId              |
| resumeReminderRunAction | 30 / 10 s per IP (existing infra) |
| cancelReminderRunAction | 30 / 10 s per IP                  |
| createUserAction        | 5 / 60 s per IP                   |
| resetUserPasswordAction | 5 / 60 s per IP                   |

`checkRateLimit` is the existing Postgres-backed helper.
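To make the table's "max per window" semantics concrete, here is a minimal in-memory sketch. It is not the Postgres-backed `checkRateLimit`: the name `createRateLimiter`, the explicit `nowSec` argument, and the fixed-window behavior are all assumptions made for illustration, and the real helper may count differently (for example with a sliding window).

```typescript
// Hypothetical in-memory stand-in for the Postgres-backed checkRateLimit.
// Assumes fixed windows; the real helper's windowing is not specified here.
type Limit = { max: number; windowSec: number };
type Bucket = { windowStart: number; count: number };

function createRateLimiter() {
  const buckets = new Map<string, Bucket>();

  // Returns true when the call is allowed, false once the window is exhausted.
  return function check(key: string, limit: Limit, nowSec: number): boolean {
    const bucket = buckets.get(key);

    // Start a fresh window on first use or once the previous one has elapsed.
    if (!bucket || nowSec - bucket.windowStart >= limit.windowSec) {
      buckets.set(key, { windowStart: nowSec, count: 1 });
      return true;
    }

    if (bucket.count >= limit.max) return false;
    bucket.count += 1;
    return true;
  };
}

// Usage: the loginAction row (10 attempts per 5 minutes per IP).
const check = createRateLimiter();
const loginLimit: Limit = { max: 10, windowSec: 300 };

const results: boolean[] = [];
for (let i = 0; i < 11; i++) {
  results.push(check("login:203.0.113.7", loginLimit, 1000 + i));
}

// Once the 300 s window has elapsed, the same key is allowed again.
const allowedAfterWindow = check("login:203.0.113.7", loginLimit, 1000 + 300);
```

In this sketch the first ten calls are allowed and the eleventh is refused. Passing the clock in as an argument keeps the behavior deterministic in unit tests, which fits the constraint that no test relies on a live Postgres.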
## Robots / noindex

`apps/web/src/app/robots.ts`:

```ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return { rules: [{ userAgent: "*", disallow: "/" }] };
}
```

Plus `metadata.robots = { index: false, follow: false }` in the root
`apps/web/src/app/layout.tsx`. Two layers: robots.txt is advisory, the
meta is authoritative.

## Env hygiene

- Add `.env*` to `.gitignore` (already excludes `.env.local`, `.env.*.local`; this widens to all `.env*` outside `.env.example`).
- `git rm --cached .env.development` and recreate locally without committing.
- New `.env.example` documents every required key with placeholder values, including the new `OPERATOR_TOKEN_VERSION`.
- After this change ships, the operator rotates the leaked `AUTH_SECRET` and Postgres password (manual step, called out in the upgrade notes).

## Container hardening

Both Dockerfiles:

```dockerfile
RUN useradd -m -u 1000 -s /usr/sbin/nologin app && \
    mkdir -p /data/sessions /data/media && \
    chown -R app:app /app /data && \
    chmod 700 /data/sessions
USER app
```

The `dev-data:/data` volume mount in `docker-compose.dev.yml` keeps
working since the host UID matches the in-container `app` UID 1000.

## Origin allowlist

`next.config.ts` adds:

```ts
experimental: {
  serverActions: {
    allowedOrigins: ["wabot.04080616.xyz", "localhost:9000"],
  },
},
```

Same-origin Server Action posts already work; this guards against
cross-origin POSTs from another domain attempting to invoke an action
via a known cookie.

## Test plan (38 tests)

### `auth-cookie.test.ts`: pure HMAC + verification logic

1. `signSession` then `verifySession` round-trips.
2. Tampered payload → verify rejects.
3. Tampered signature → verify rejects.
4. Wrong secret → verify rejects.
5. Constant-time compare prevents char-by-char timing leak (assert `crypto.timingSafeEqual` is used).
6. Cookie expired (`exp <= now`) → reject.
7. Cookie issued in the future (`iat > now + 60`) → reject (clock-skew).
8. Cookie with stale `v` (TOKEN_VERSION bumped after issue) → reject.
9. Cookie with bad `role` value (`"superadmin"`) → reject.
10. Cookie missing fields → reject.

### `login-action.test.ts`: login flow

11. Valid credentials → cookie issued with right shape.
12. Wrong password → no cookie, generic error.
13. Wrong username → no cookie, generic error, dummy-bcrypt called (timing equivalence).
14. `password_hash IS NULL` user → no cookie, generic error, dummy-bcrypt called (same path as an unknown username).
15. Empty username or password → 400-equivalent (no DB hit).
16. Username/password >256 chars → rejected before bcrypt.
17. Username case-insensitive (`Admin` matches `admin`).
18. 11th login attempt within window → 429 (rate-limited).
19. After window expiry, attempts succeed.
20. Failed login logs warning with username + IP, no password.
21. Cookie sets correct attrs (HttpOnly, Secure, SameSite, Path, Max-Age).

### `middleware.test.ts`: gate behavior

22. No cookie + page request → 302 to `/login?next=<path>`.
23. No cookie + `/api/...` (non-allowlisted) → 401.
24. Valid cookie + page → next().
25. Tampered cookie → 302 to `/login`.
26. Allowlisted (`/login`, `/api/health`, manifest, icons) bypasses.
27. `/api/events` and `/api/qr/[id]` are NOT in allowlist (regression against the audit's Critical findings).

### `next-param.test.ts`: open-redirect prevention

28. `/dashboard` → preserved.
29. `//evil.com` → falls back to `/`.
30. `https://evil.com` → falls back to `/`.
31. `javascript:alert(1)` → falls back to `/`.
32. `/path?with=query&extra=fine` → preserved verbatim.

### `require-helpers.test.ts`: Server-action gates

33. `requireUser()` throws with no session.
34. `requireUser()` returns the user with valid session.
35. `requireAdmin()` throws when role === "user".
36. `requireAdmin()` returns the user when role === "admin".

### `user-management.test.ts`: admin guards

37. Self-demote (`setUserRoleAction({ userId: self, role: "user" })`) → ok:false with clear error.
38. Last-admin delete (deleting only admin user) → ok:false with "promote another user first".

## Migration risk

`getSeededOperator()` is the one big touch. The 66 call sites are mostly
Server Actions and queries that read `.id` and `.defaultTimezone` off the
returned object; the new shape is a superset, so the change is
mechanical.

To keep churn off the existing test suite (~12 tests mock
`@/lib/operator`), `apps/web/src/lib/operator.ts` keeps its export but
reimplements `getSeededOperator` as a thin pass-through to
`getCurrentUser` from `@/lib/auth`. Existing mocks that target
`@/lib/operator` keep working unchanged. New code uses `getCurrentUser` /
`requireUser` / `requireAdmin` directly; the old name is kept as a
compatibility shim and removed in a follow-up once all sites are swept.

A `DUMMY_HASH` constant lives at the top of the login action: it is a
precomputed bcrypt hash of a known throwaway string (`"x"`), generated
once at build time and committed. We compare against it on the
user-not-found path so timing is identical to the wrong-password path.
Generating a fresh dummy hash per request would double the bcrypt work
and create its own timing signal.

## Out of scope (deferred)

- WebAuthn / passkeys.
- 2FA / TOTP.
- Email-based password recovery (operator restarts container with a new env var `OPERATOR_TOKEN_VERSION` if all admins lose their passwords; CLI helps the rest).
- Account lockout (rate limit is enough for one operator's threat model).
- SSO / OAuth providers.
- Audit-log surface for "who logged in when". The pino warn line is the minimum; a structured audit table is later work.
- A "remember this device" feature distinct from the 30-day cookie.

## Acceptance

- The bot can be exposed at `wabot.04080616.xyz` and any unauthenticated request to a non-allowlisted path returns 401 (API) or redirects to `/login` (page).
- A correct username + password issues a 30-day cookie that survives reload, browser restart, and PWA homescreen launches.
- A wrong username, a wrong password, and a missing-password user all produce the same generic "Invalid username or password" error and the same wall-clock duration (timing-equivalent).
- Bumping `OPERATOR_TOKEN_VERSION` on the host invalidates every active session immediately.
- An attacker tampering with the cookie payload, signature, or issued-at can't pass middleware.
- Eleven login attempts from the same IP within five minutes produce a 429 on the eleventh.
- A `user`-role session can browse, schedule, and resume reminders but cannot reach `/settings/users`.
- An admin can't demote or delete their own row, and can't delete the last admin.
- `robots.txt` returns `Disallow: /` and the rendered HTML carries `<meta name="robots" content="noindex, nofollow">`.
- Both containers run as UID 1000, sessions dir is `chmod 700`.
- `.env.development` is gone from the repo and `.gitignore` excludes every `.env*` except `.env.example`.
- All 38 tests in the plan pass; existing 471 tests still pass.
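
The session-invalidation and tamper-resistance criteria above rest on the claim checks listed under "Cookie format" (expiry, clock skew, token version, role shape). The sketch below shows only those checks as a pure function, assuming the payload has already been base64url-decoded, JSON-parsed, and HMAC-verified. The names `checkClaims` and `CLOCK_SKEW_SEC` are hypothetical, and passing the token version as a number assumes the env var is parsed before use.

```typescript
type Role = "admin" | "user";
type SessionPayload = { userId: string; role: Role; iat: number; exp: number; v: number };

const CLOCK_SKEW_SEC = 60; // tolerance for iat in the future

function isFiniteNumber(x: unknown): x is number {
  return typeof x === "number" && Number.isFinite(x);
}

// Hypothetical helper: validates a decoded payload whose signature was
// already verified. Returns the typed payload, or null on any rejection.
function checkClaims(
  payload: unknown,
  nowSec: number,
  tokenVersion: number,
): SessionPayload | null {
  if (typeof payload !== "object" || payload === null) return null;
  const { userId, role, iat, exp, v } =
    payload as Partial<Record<keyof SessionPayload, unknown>>;

  // Shape checks: missing or mistyped fields are rejected.
  if (typeof userId !== "string") return null;
  if (role !== "admin" && role !== "user") return null;
  if (!isFiniteNumber(iat) || !isFiniteNumber(exp) || !isFiniteNumber(v)) return null;

  if (exp <= nowSec) return null;                 // expired
  if (iat > nowSec + CLOCK_SKEW_SEC) return null; // issued too far in the future
  if (v !== tokenVersion) return null;            // token version was bumped

  return { userId, role, iat, exp, v };
}

// Usage with a fixed clock so the outcome does not depend on wall time.
const now = 1_700_000_000;
const good = { userId: "u1", role: "admin", iat: now, exp: now + 30 * 24 * 60 * 60, v: 1 };

const accepted = checkClaims(good, now, 1);
const expired = checkClaims({ ...good, exp: now }, now, 1);
const fromFuture = checkClaims({ ...good, iat: now + 61 }, now, 1);
const staleVersion = checkClaims(good, now, 2);
const badRole = checkClaims({ ...good, role: "superadmin" }, now, 1);
```

Only `accepted` is non-null here; the other four calls return `null`. Keeping these checks free of crypto and I/O would let them run unchanged in both the edge middleware and the Node-side helpers.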