cm_whatsapp_bot_v1/docs/superpowers/specs/2026-05-10-auth-and-prod-hardening-design.md
yiekheng feffe419db docs: design spec — auth + production hardening for v1.1.x → v1.2.0
Drives the work that closes the v1.1.0 production-readiness audit
findings: username + password + role auth on the web app, gated
SSE / QR endpoints, robots/noindex, env hygiene, container non-
root, and rate limits on the four currently-naked Server Actions.

Auth design highlights:
* Roll-our-own session cookie (no NextAuth) — bcrypt password +
  HMAC-SHA256 signed cookie; edge-runtime middleware verifies on
  every request; defense-in-depth requireUser / requireAdmin in
  every Server Action.
* Username + password + 2-role model (admin / user). Schema
  migration adds username + password_hash to existing operators
  table.
* CLI bootstrap (scripts/set-password.sh) sets the first admin's
  password before going live; user management UI gates everything
  else.
* OPERATOR_TOKEN_VERSION env var as a global session-invalidation
  lever.
* 38 unit tests covering brute-force / cookie tampering / replay /
  expiry / fixation / open redirect / timing leak / rate limit /
  origin-allowlist / unauth API regression / role gates / self-
  demote and last-admin guards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 17:09:46 +08:00


Auth + Production Hardening Design

Spec for closing the production-readiness gap before promoting the bot to public-internet exposure at wabot.04080616.xyz. Covers the session-cookie auth model with username + password + role, plus the hygiene work that has to land alongside it (robots, env, container non-root) so the public surface is safe in one change.

Goal

Add operator authentication to the web app so the public URL stops being a foothold for anyone who finds it, and at the same time close the highest-risk production gaps surfaced in the v1.1.0 audit: indexable content, committed credentials, root-running containers, and four un-rate-limited Server Actions.

Constraints

  • Single-host self-hosted deployment, public-internet via reverse proxy + TLS at wabot.04080616.xyz.
  • Up to a handful of users today, with room to grow. One must be admin; the rest are user.
  • Mobile PWA homescreen workflow: 30-day cookie, no friction at re-open, no third-party identity provider.
  • No new infra dependencies. Postgres + Docker compose stay the whole platform. No NextAuth / Auth.js, no external KV, no SMS.
  • The 66 call sites that currently use getSeededOperator() must be retrofitted cleanly, without behavior changes beyond the new auth check.
  • All code changes covered by unit tests; no test relies on a live Postgres or browser.

A full auth library would be heavy for one role gate and one cookie. We use bcrypt for password hashing (battle-tested) and Web Crypto's HMAC for cookie signing (built into the runtime, edge-runtime compatible). All other code is domain-owned and exhaustively tested.

The model: the user posts username + password to a Server Action, the action verifies against a per-user password_hash row, and the response sets a signed cookie carrying { userId, role, iat, exp, v }. Middleware verifies the cookie on every request; Server Actions double-check via requireUser() / requireAdmin() so a forgotten middleware path can't bypass the gate.

Schema migration (0010_add_user_auth.sql)

ALTER TABLE operators
  ADD COLUMN username text,
  ADD COLUMN password_hash text;
CREATE UNIQUE INDEX operators_username_uq
  ON operators (lower(username));
-- Backfill the seed row so it has a username; password_hash stays NULL
-- so the operator is forced to set one via the CLI before they can sign
-- in. Sets a clear "you have to do this before going live" gate.
UPDATE operators
  SET username = 'admin'
  WHERE username IS NULL;
ALTER TABLE operators
  ALTER COLUMN username SET NOT NULL;

telegramUserId stays for now (it's referenced from existing migrations and seed flow) but no longer drives auth. defaultTimezone and role are unchanged. operators.role already defaults to "admin".

Roles

Two values, no enum constraint at the DB layer (text — same as existing).

| role | can do |
| --- | --- |
| admin | everything in the app + user management (CRUD other users) |
| user | everything except /settings/users and the user-mgmt actions |

A third "viewer" role isn't worth it today; it can be added later by extending the role check.

Header value: session=<base64url(payload)>.<base64url(hmac)>

type SessionPayload = {
  userId: string;     // operators.id (uuid)
  role: "admin" | "user";
  iat: number;        // issued-at, unix seconds
  exp: number;        // expires-at, unix seconds (iat + 30 days)
  v: number;          // OPERATOR_TOKEN_VERSION at issue time
};

HMAC is HMAC-SHA256 over the base64url-encoded payload string with AUTH_SECRET as the key. Verification rejects on:

  • Bad shape (no ., base64 decode fails, JSON parse fails).
  • HMAC mismatch (uses constant-time compare).
  • exp <= now.
  • iat > now + 60 (clock-skew guard, 60s tolerance).
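  • v !== Number(process.env.OPERATOR_TOKEN_VERSION ?? "1") (the env var is a string and defaults to "1"; the payload's v is a number, so the env value is parsed before comparing).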
  • role not one of "admin" / "user".

Cookie attributes: HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=2592000. Max-Age=0 on logout to clear.
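
The signing and verification rules above can be sketched as follows. This is a minimal illustration, not the module itself: the helper names (b64urlEncode, b64urlDecode, hmacKey) and the explicit secret / tokenVersion / nowSec parameters are assumptions made so the sketch is testable, and it imports webcrypto from node:crypto so it runs under plain Node (on the edge runtime the global crypto.subtle exposes the same API). It leaves the MAC comparison to subtle.verify rather than comparing strings by hand.

```typescript
import { webcrypto } from "node:crypto";

type SessionPayload = {
  userId: string;
  role: "admin" | "user";
  iat: number; // issued-at, unix seconds
  exp: number; // expires-at, unix seconds
  v: number;   // token version at issue time
};

const subtle = webcrypto.subtle;
const enc = new TextEncoder();

// Hypothetical helper: bytes -> unpadded base64url.
function b64urlEncode(bytes: Uint8Array): string {
  let bin = "";
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Hypothetical helper: base64url -> bytes. Throws on malformed input.
function b64urlDecode(s: string) {
  const b64 = s.replace(/-/g, "+").replace(/_/g, "/");
  const bin = atob(b64 + "=".repeat((4 - (b64.length % 4)) % 4));
  return Uint8Array.from(bin, (c) => c.charCodeAt(0));
}

async function hmacKey(secret: string) {
  return subtle.importKey(
    "raw", enc.encode(secret), { name: "HMAC", hash: "SHA-256" }, false, ["sign", "verify"],
  );
}

// Produces <base64url(payload)>.<base64url(hmac)>; the HMAC covers the encoded payload string.
async function signSession(p: SessionPayload, secret: string): Promise<string> {
  const body = b64urlEncode(enc.encode(JSON.stringify(p)));
  const sig = await subtle.sign("HMAC", await hmacKey(secret), enc.encode(body));
  return `${body}.${b64urlEncode(new Uint8Array(sig))}`;
}

// Returns the payload when every check passes, otherwise null.
async function verifySession(
  token: string, secret: string, tokenVersion: number, nowSec: number,
): Promise<SessionPayload | null> {
  const parts = token.split(".");
  if (parts.length !== 2) return null; // bad shape
  try {
    const ok = await subtle.verify(
      "HMAC", await hmacKey(secret), b64urlDecode(parts[1]), enc.encode(parts[0]),
    );
    if (!ok) return null; // HMAC mismatch
    const p = JSON.parse(new TextDecoder().decode(b64urlDecode(parts[0]))) as SessionPayload;
    if (typeof p.userId !== "string") return null;
    if (typeof p.iat !== "number" || typeof p.exp !== "number") return null;
    if (p.role !== "admin" && p.role !== "user") return null;
    if (p.exp <= nowSec) return null;     // expired
    if (p.iat > nowSec + 60) return null; // issued in the future beyond 60s skew
    if (p.v !== tokenVersion) return null; // stale token version
    return p;
  } catch {
    return null; // base64 decode or JSON parse failed
  }
}
```

Passing nowSec and tokenVersion in explicitly, rather than reading the clock and environment inside the verifier, is one way to keep the expiry and version tests deterministic.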

OPERATOR_TOKEN_VERSION env var (default "1") is the global session-invalidation lever. Bumping it on the host instantly logs out every user — no DB writes — useful after a host compromise or a known-shared password.

Login flow

Page: apps/web/src/app/login/page.tsx. Single form with:

  • Username input (type=text, autocomplete username)
  • Password input (type=password, autocomplete current-password)
  • Submit button "Sign in"
  • Error slot for the generic message
  • A small note: "First time? Run ./scripts/set-password.sh <username> in your tools container to set a password."

Server action loginAction(formData: FormData):

1. Read username, password from FormData.
2. Reject if either >256 chars (DoS guard, no bcrypt).
3. Reject if either empty.
4. Apply rate limit: checkRateLimit("login:" + ip, { max: 10, windowSec: 300 }).
   On exhaustion → return { ok: false, error: "Too many attempts, try later." }
5. Look up user: select * from operators where lower(username)=lower($1)
6. If user not found OR user.password_hash IS NULL:
   await bcrypt.compare(password, DUMMY_HASH);  // timing equivalence
   return { ok: false, error: "Invalid username or password." }
7. await bcrypt.compare(password, user.password_hash)
   if false: return { ok: false, error: "Invalid username or password." }
8. Issue cookie: signSession({ userId, role, iat: now, exp: now + 30d, v: TOKEN_VERSION })
9. Redirect to safe(next) ?? "/"
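
The control flow of steps 2 to 7 can be sketched with the database, bcrypt, and the rate limiter injected as plain functions. Everything named here (LoginDeps, findUser, compare, allow, dummyHash, the login function itself) is a hypothetical stand-in for illustration; in the real action compare would be bcrypt.compare and allow would wrap checkRateLimit. The error returned for oversized or empty input is not specified above, so the sketch assumes the generic message.

```typescript
type Role = "admin" | "user";
type UserRow = { id: string; role: Role; passwordHash: string | null };

// Hypothetical dependency bundle so the flow can run without Postgres or bcrypt.
type LoginDeps = {
  findUser: (usernameLower: string) => Promise<UserRow | null>;
  compare: (password: string, hash: string) => Promise<boolean>; // e.g. bcrypt.compare
  allow: (key: string) => Promise<boolean>;                      // stand-in for checkRateLimit
  dummyHash: string;                                             // stand-in for DUMMY_HASH
};

type LoginResult =
  | { ok: true; userId: string; role: Role }
  | { ok: false; error: string };

const GENERIC = "Invalid username or password.";

async function login(
  username: string, password: string, ip: string, deps: LoginDeps,
): Promise<LoginResult> {
  // Steps 2-3: cheap rejections before any bcrypt or DB work.
  if (username.length > 256 || password.length > 256) return { ok: false, error: GENERIC };
  if (username === "" || password === "") return { ok: false, error: GENERIC };

  // Step 4: rate limit keyed by client IP.
  if (!(await deps.allow("login:" + ip))) {
    return { ok: false, error: "Too many attempts, try later." };
  }

  // Step 5: case-insensitive lookup.
  const user = await deps.findUser(username.toLowerCase());

  // Step 6: unknown user or unset password still pays for one hash comparison.
  if (!user || user.passwordHash === null) {
    await deps.compare(password, deps.dummyHash);
    return { ok: false, error: GENERIC };
  }

  // Step 7: real comparison.
  if (!(await deps.compare(password, user.passwordHash))) {
    return { ok: false, error: GENERIC };
  }

  // Step 8 (cookie issuance) is left to the caller in this sketch.
  return { ok: true, userId: user.id, role: user.role };
}
```

Every failing branch after the rate limit performs exactly one comparison, which is the property the timing-equivalence tests are meant to pin down.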

safe(next): must be a string starting with / AND not starting with //. Otherwise return null.
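
A minimal sketch of that rule, using the hypothetical name safeNext for the helper the flow calls safe(next):

```typescript
// Accepts only same-site absolute paths: a string that starts with "/" but not "//".
// Anything else (full URLs, scheme-relative URLs, javascript: URIs, non-strings) yields null.
function safeNext(next: unknown): string | null {
  if (typeof next !== "string") return null;
  if (!next.startsWith("/")) return null;
  if (next.startsWith("//")) return null;
  return next;
}

// The caller supplies the fallback, as in step 9 of the login flow.
const redirectTarget = safeNext("//evil.com") ?? "/";
```

The rule as written accepts a value beginning with a slash followed by a backslash. URL parsers treat a backslash like a forward slash for http(s) URLs, so whether to reject that prefix as well is a choice this spec does not address.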

Logout action logoutAction(): clear the cookie via cookies().set("session", "", { maxAge: 0, ... }) and redirect to /login.

Middleware gate

apps/web/src/middleware.ts extends the existing API allowlist with the auth check.

For every request:
  - If path is in allowlist (auth-free):
      /login, /logout, /api/health, /manifest.webmanifest,
      /icon-*, /favicon.ico, /_next/static/*, /_next/image
    → NextResponse.next()
  - Read session cookie. Verify (HMAC, exp, iat-skew, version, role shape).
    - On valid: NextResponse.next()
    - On invalid + path starts with /api/: 401, no body
    - On invalid + page request: 302 to /login?next=<encoded path>

/api/events and /api/qr/[accountId] are explicitly removed from the unauth allowlist — middleware now requires a session for them.

The middleware imports the verifier from @/lib/auth-cookie (a dependency-free module that runs on the edge runtime — no bcrypt, no DB).
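
The gate's decision can be separated from the Next.js plumbing as a pure function, which keeps it testable without a request object. The sketch below is one possible shape under stated assumptions: the names Gate, isAllowlisted, and gate are hypothetical, the cookie check is reduced to a boolean the caller computes, and the query string is carried into next alongside the path.

```typescript
type Gate =
  | { kind: "next" }
  | { kind: "unauthorized"; status: 401 }
  | { kind: "redirect"; status: 302; location: string };

// Auth-free paths from the allowlist: exact matches and prefixes.
const EXACT = new Set([
  "/login", "/logout", "/api/health", "/manifest.webmanifest", "/favicon.ico", "/_next/image",
]);
const PREFIXES = ["/icon-", "/_next/static/"];

function isAllowlisted(pathname: string): boolean {
  return EXACT.has(pathname) || PREFIXES.some((p) => pathname.startsWith(p));
}

// sessionValid is the result of verifying the session cookie (HMAC, exp, iat skew, version, role).
function gate(pathname: string, search: string, sessionValid: boolean): Gate {
  if (isAllowlisted(pathname) || sessionValid) return { kind: "next" };
  if (pathname.startsWith("/api/")) return { kind: "unauthorized", status: 401 };
  return {
    kind: "redirect",
    status: 302,
    location: "/login?next=" + encodeURIComponent(pathname + search),
  };
}
```

In the middleware itself, each Gate variant would map onto NextResponse.next(), a bodiless 401 response, or a redirect to the computed location.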

Server-action defense-in-depth

apps/web/src/lib/auth.ts (Node runtime — DB access OK):

export async function getCurrentUser(): Promise<User | null>
export async function requireUser(): Promise<User>     // throws Response 401 / redirects
export async function requireAdmin(): Promise<User>    // requireUser + role === "admin"

getSeededOperator() is superseded by getCurrentUser(), which reads the verified cookie and looks up the user. All 66 call sites swap mechanically, and the old name stays available as a temporary shim (see Migration risk). Existing typing stays compatible because the returned shape is a superset.

Every Server Action begins with await requireUser() (or requireAdmin() for admin-only). This is the second layer; the middleware is the first. Both must agree before any state mutates.
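
The second-layer helpers could look roughly like this. The sketch takes the current user as a parameter instead of reading the cookie, so it does not match the zero-argument signatures above; AuthError and the 403 status for a non-admin are assumptions made for illustration.

```typescript
type User = { id: string; role: "admin" | "user" };

// Hypothetical error type; the real helpers may throw a Response or redirect instead.
class AuthError extends Error {
  readonly status: 401 | 403;
  constructor(status: 401 | 403) {
    super(status === 401 ? "unauthenticated" : "forbidden");
    this.status = status;
  }
}

// Throws when there is no verified session user.
function requireUser(current: User | null): User {
  if (!current) throw new AuthError(401);
  return current;
}

// requireUser plus the role check.
function requireAdmin(current: User | null): User {
  const user = requireUser(current);
  if (user.role !== "admin") throw new AuthError(403);
  return user;
}
```

A Server Action would call one of these on its first line, so a path that slipped past the middleware still fails before any state mutates.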

User management surface

Admin-only, gated by requireAdmin() at every entry point.

  • /settings/users (page) — list of users with role chip + createdAt; inline "Reset password", "Demote/Promote", "Delete" buttons. New user form at top.
  • createUserAction({ username, password, role }) — validate inputs, bcrypt the password, insert.
  • setUserRoleAction({ userId, role }) — guard: if userId === self.id AND role !== "admin", refuse with "you can't demote yourself".
  • resetUserPasswordAction({ userId, newPassword }) — bcrypt + update. Does NOT change cookies — the affected user keeps their existing session until expiry or a token-version bump.
  • deleteUserAction({ userId }) — guard: refuse self-delete. Additional guard: if deleting the last admin, refuse with "promote another user to admin first".

All admin actions fan out a refresh of /settings/users via revalidatePath.
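
The self-demote, self-delete, and last-admin guards are small enough to express as pure checks that the actions call before touching the database. The function names (canSetRole, canDelete), the GuardResult shape, and the self-delete message are hypothetical.

```typescript
type Role = "admin" | "user";
type UserRow = { id: string; role: Role };
type GuardResult = { ok: true } | { ok: false; error: string };

// Guard for setUserRoleAction: an admin may not remove their own admin role.
function canSetRole(selfId: string, targetId: string, newRole: Role): GuardResult {
  if (targetId === selfId && newRole !== "admin") {
    return { ok: false, error: "you can't demote yourself" };
  }
  return { ok: true };
}

// Guard for deleteUserAction: no self-delete, and never remove the only remaining admin.
function canDelete(selfId: string, targetId: string, users: UserRow[]): GuardResult {
  if (targetId === selfId) {
    return { ok: false, error: "you can't delete yourself" }; // message is an assumption
  }
  const target = users.find((u) => u.id === targetId);
  const adminCount = users.filter((u) => u.role === "admin").length;
  if (target?.role === "admin" && adminCount <= 1) {
    return { ok: false, error: "promote another user to admin first" };
  }
  return { ok: true };
}
```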

CLI bootstrap

The actual hashing happens in a small TypeScript script run with tsx (so it can import bcrypt from the workspace), wrapped by a one-line bash launcher that runs it through the tools container. Two pieces:

packages/db/src/scripts/set-password.ts — reads username from argv, prompts for password on stdin (echo off via readline's writeMask), bcrypts at 12 rounds, runs an UPDATE operators SET password_hash = $1 WHERE lower(username) = lower($2), exits non-zero if no rows matched.

packages/db/src/scripts/create-user.ts — same pattern, but INSERTs a fresh row with username, role, password_hash, default timezone, and a synthetic telegramUserId (the current time in milliseconds) since the column is still NOT NULL until a future cleanup migration.

scripts/set-password.sh and scripts/create-user.sh — thin wrappers that invoke the TSX scripts via pnpm --filter @cmbot/db exec tsx ... inside the tools container, matching the existing script-runner pattern.

The CLI is used to bootstrap the first admin and to recover when an admin loses their password. After bootstrap, all user management happens through the web UI.

Rate limits added

| action | limit |
| --- | --- |
| loginAction | 10 / 5 min per IP |
| sendTestAction | 3 / 60 s per groupId |
| resumeReminderRunAction | 30 / 10 s per IP (existing infra) |
| cancelReminderRunAction | 30 / 10 s per IP |
| createUserAction | 5 / 60 s per IP |
| resetUserPasswordAction | 5 / 60 s per IP |

checkRateLimit is the existing Postgres-backed helper.
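
To make the "max per window" semantics concrete, here is an in-memory stand-in. It is not the Postgres-backed helper: the function name checkRateLimitInMemory is hypothetical, and the fixed-window algorithm is an assumption, since this spec does not say whether the real helper uses fixed or sliding windows.

```typescript
type Limit = { max: number; windowSec: number };

// One counter per key; the window starts at the first hit and resets after windowSec.
const buckets = new Map<string, { count: number; resetAt: number }>();

// Returns true when the call is allowed, false once the key has exceeded max in the window.
function checkRateLimitInMemory(key: string, limit: Limit, nowMs: number): boolean {
  const bucket = buckets.get(key);
  if (!bucket || nowMs >= bucket.resetAt) {
    buckets.set(key, { count: 1, resetAt: nowMs + limit.windowSec * 1000 });
    return true;
  }
  bucket.count += 1;
  return bucket.count <= limit.max;
}

// Login limit from the table: 10 attempts per 5 minutes per IP.
const LOGIN_LIMIT: Limit = { max: 10, windowSec: 300 };
```

With LOGIN_LIMIT, ten calls for the same key inside the window are allowed and the eleventh is refused, which is the behavior the login tests and the acceptance criteria describe.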

Robots / noindex

apps/web/src/app/robots.ts:

import type { MetadataRoute } from "next";
export default function robots(): MetadataRoute.Robots {
  return { rules: [{ userAgent: "*", disallow: "/" }] };
}

Plus metadata.robots = { index: false, follow: false } in the root apps/web/src/app/layout.tsx. Two layers: robots.txt only asks crawlers not to fetch, while the noindex meta tag tells a crawler that does fetch a page to keep it out of the index.

Env hygiene

  • Add .env* to .gitignore (already excludes .env.local, .env.*.local — this widens to all .env* outside .env.example).
  • git rm --cached .env.development and recreate locally without committing.
  • New .env.example documents every required key with placeholder values, including the new OPERATOR_TOKEN_VERSION.
  • After this change ships, the operator rotates the leaked AUTH_SECRET and Postgres password (manual step, called out in the upgrade notes).

Container hardening

Both Dockerfiles:

RUN useradd -m -u 1000 -s /usr/sbin/nologin app && \
    mkdir -p /data/sessions /data/media && \
    chown -R app:app /app /data && \
    chmod 700 /data/sessions
USER app

The dev-data:/data volume mount in docker-compose.dev.yml keeps working since the host UID matches the in-container app UID 1000.

Origin allowlist

next.config.ts adds:

experimental: {
  serverActions: {
    allowedOrigins: ["wabot.04080616.xyz", "localhost:9000"],
  },
},

Same-origin Server Action posts already work; this guards against cross-origin POSTs from another domain attempting to invoke an action via a known cookie.

Test plan (38 tests)

auth-cookie.test.ts — pure HMAC + verification logic

  1. signSession then verifySession round-trips.
  2. Tampered payload → verify rejects.
  3. Tampered signature → verify rejects.
  4. Wrong secret → verify rejects.
  5. Constant-time compare prevents char-by-char timing leak (assert crypto.timingSafeEqual is used).
  6. Cookie expired (exp <= now) → reject.
  7. Cookie issued in the future (iat > now + 60) → reject (clock-skew).
  8. Cookie with stale v (TOKEN_VERSION bumped after issue) → reject.
  9. Cookie with bad role value ("superadmin") → reject.
  10. Cookie missing fields → reject.

login-action.test.ts — login flow

  1. Valid credentials → cookie issued with right shape.
  2. Wrong password → no cookie, generic error.
  3. Wrong username → no cookie, generic error, dummy-bcrypt called (timing equivalence).
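  4. password_hash IS NULL user → no cookie, same generic error as a wrong password, dummy-bcrypt called (matches step 6 of the login flow and the acceptance criteria).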
  5. Empty username or password → 400-equivalent (no DB hit).
  6. Username/password >256 chars → rejected before bcrypt.
  7. Username case-insensitive (Admin matches admin).
  8. 11th login attempt within window → 429 (rate-limited).
  9. After window expiry, attempts succeed.
  10. Failed login logs warning with username + IP, no password.
  11. Cookie sets correct attrs (HttpOnly, Secure, SameSite, Path, Max-Age).

middleware.test.ts — gate behavior

  1. No cookie + page request → 302 to /login?next=<path>.
  2. No cookie + /api/... (non-allowlisted) → 401.
  3. Valid cookie + page → next().
  4. Tampered cookie → 302 to /login.
  5. Allowlisted (/login, /api/health, manifest, icons) bypasses.
  6. /api/events and /api/qr/[id] are NOT in allowlist (regression against the audit's Critical findings).

next-param.test.ts — open-redirect prevention

  1. /dashboard → preserved.
  2. //evil.com → falls back to /.
  3. https://evil.com → falls back to /.
  4. javascript:alert(1) → falls back to /.
  5. /path?with=query&extra=fine → preserved verbatim.

require-helpers.test.ts — Server-action gates

  1. requireUser() throws with no session.
  2. requireUser() returns the user with valid session.
  3. requireAdmin() throws when role === "user".
  4. requireAdmin() returns the user when role === "admin".

user-management.test.ts — admin guards

  1. Self-demote (setUserRoleAction({ userId: self, role: "user" })) → ok:false with clear error.
  2. Last-admin delete (deleting only admin user) → ok:false with "promote another user first".

Migration risk

getSeededOperator() is the one big touch. The 66 call sites are mostly Server Actions and queries that read .id and .defaultTimezone off the returned object — the new shape is a superset, so the change is mechanical.

To keep churn off the existing test suite (~12 tests mock @/lib/operator), apps/web/src/lib/operator.ts keeps its export but reimplements getSeededOperator as a thin pass-through to getCurrentUser from @/lib/auth. Existing mocks that target @/lib/operator keep working unchanged. New code uses getCurrentUser / requireUser / requireAdmin directly; the old name is kept as a compatibility shim and removed in a follow-up once all sites are swept.

A DUMMY_HASH constant lives at the top of the login action. It is a precomputed bcrypt hash of a known throwaway string ("x"), generated once at build time and committed. We compare against it on the user-not-found path so timing is identical to the wrong-password path. Generating a fresh dummy hash per request would double the bcrypt work and create its own timing signal.

Out of scope (deferred)

  • WebAuthn / passkeys.
  • 2FA / TOTP.
  • Email-based password recovery (if every admin loses their password, the operator resets one via the CLI and can restart the container with a bumped OPERATOR_TOKEN_VERSION to drop existing sessions).
  • Account lockout (rate limit is enough for one operator's threat model).
  • SSO / OAuth providers.
  • Audit-log surface for "who logged in when". The pino warn line is the minimum; a structured audit table is later work.
  • A "remember this device" feature distinct from the 30-day cookie.

Acceptance

  • The bot can be exposed at wabot.04080616.xyz and any unauthenticated request to a non-allowlisted path returns 401 (API) or redirects to /login (page).
  • A correct username + password issues a 30-day cookie that survives reload, browser restart, and PWA homescreen launches.
  • A wrong username, a wrong password, and a missing-password user all produce the same generic "Invalid username or password" error and the same wall-clock duration (timing-equivalent).
  • Bumping OPERATOR_TOKEN_VERSION on the host invalidates every active session immediately.
  • An attacker tampering with the cookie payload, signature, or issued-at can't pass middleware.
  • Eleven login attempts from the same IP within five minutes produce a 429 on the eleventh.
  • A user-role session can browse, schedule, and resume reminders but cannot reach /settings/users.
  • An admin can't demote or delete their own row, and can't delete the last admin.
  • robots.txt returns Disallow: / and the rendered HTML carries <meta name="robots" content="noindex, nofollow">.
  • Both containers run as UID 1000, sessions dir is chmod 700.
  • .env.development is gone from the repo and .gitignore excludes every .env* except .env.example.
  • All 38 tests in the plan pass; existing 471 tests still pass.