docs: design spec — auth + production hardening for v1.1.x → v1.2.0

Drives the work that closes the v1.1.0 production-readiness audit
findings: username + password + role auth on the web app, gated
SSE / QR endpoints, robots/noindex, env hygiene, container non-
root, and rate limits on the four currently-naked Server Actions.

Auth design highlights:
* Roll-our-own session cookie (no NextAuth) — bcrypt password +
  HMAC-SHA256 signed cookie; edge-runtime middleware verifies on
  every request; defense-in-depth requireUser / requireAdmin in
  every Server Action.
* Username + password + 2-role model (admin / user). Schema
  migration adds username + password_hash to existing operators
  table.
* CLI bootstrap (scripts/set-password.sh) sets the first admin's
  password before going live; user management UI gates everything
  else.
* OPERATOR_TOKEN_VERSION env var as a global session-invalidation
  lever.
* 38 unit tests covering brute-force / cookie tampering / replay /
  expiry / fixation / open redirect / timing leak / rate limit /
  origin-allowlist / unauth API regression / role gates / self-
  demote and last-admin guards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yiekheng 2026-05-10 17:09:46 +08:00
parent 4cb4015666
commit feffe419db


@ -0,0 +1,437 @@
# Auth + Production Hardening Design
> Spec for closing the production-readiness gap before promoting the
> bot to public-internet exposure at `wabot.04080616.xyz`. Covers the
> session-cookie auth model with username + password + role, plus the
> hygiene work that has to land alongside it (robots, env, container
> non-root) so the public surface is safe in one change.
## Goal
Add operator authentication to the web app so the public URL stops
being a foothold for anyone who finds it, and at the same time close
the highest-risk production gaps surfaced in the v1.1.0 audit:
indexable content, committed credentials, root-running containers,
and four un-rate-limited Server Actions.
## Constraints
- Single-host self-hosted deployment, public-internet via reverse
proxy + TLS at `wabot.04080616.xyz`.
- Up to a handful of users today, with room to grow. One must be
`admin`; the rest are `user`.
- Mobile PWA homescreen workflow: 30-day cookie, no friction at
re-open, no third-party identity provider.
- No new infra dependencies. Postgres + Docker compose stay the
whole platform. No NextAuth / Auth.js, no external KV, no SMS.
- Existing call sites must be cleanly retrofitted without breaking
the 66 call sites that currently use `getSeededOperator()`.
- All code changes covered by unit tests; no test relies on a live
Postgres or browser.
## Approach: roll-our-own session cookie
A full auth library would be heavy for one role gate and one cookie. We pick
up `bcrypt` for password hashing (battle-tested) and Web Crypto's
HMAC for cookie signing (stdlib, edge-runtime compatible). All other
code is domain-owned and exhaustively tested.
The model: the user posts username + password to a Server Action,
the action verifies against a per-user `password_hash` row, and the
response sets a signed cookie carrying `{ userId, role, iat, exp, v }`.
Middleware verifies the cookie on every request; Server Actions
double-check via `requireUser()` / `requireAdmin()` so a forgotten
middleware path can't bypass the gate.
## Schema migration (`0010_add_user_auth.sql`)
```sql
ALTER TABLE operators
  ADD COLUMN username text,
  ADD COLUMN password_hash text;

CREATE UNIQUE INDEX operators_username_uq
  ON operators (lower(username));

-- Backfill the seed row so it has a username; password_hash stays NULL
-- so the operator is forced to set one via the CLI before they can sign
-- in. Sets a clear "you have to do this before going live" gate.
UPDATE operators
  SET username = 'admin'
  WHERE username IS NULL;

ALTER TABLE operators
  ALTER COLUMN username SET NOT NULL;
```
`telegramUserId` stays for now (it's referenced from existing migrations
and seed flow) but no longer drives auth. `defaultTimezone` and `role`
are unchanged. `operators.role` already defaults to `"admin"`.
## Roles
Two values, no enum constraint at the DB layer (text — same as
existing).
| role | can do |
| ----- | ------------------------------------------------------------- |
| admin | everything in the app + user management (CRUD other users) |
| user | everything except `/settings/users` and the user-mgmt actions |
A third "viewer" role isn't worth it today; can be added later by
extending the role check.
## Cookie format
Header value: `session=<base64url(payload)>.<base64url(hmac)>`
```ts
type SessionPayload = {
  userId: string;          // operators.id (uuid)
  role: "admin" | "user";
  iat: number;             // issued-at, unix seconds
  exp: number;             // expires-at, unix seconds (iat + 30 days)
  v: number;               // OPERATOR_TOKEN_VERSION at issue time
};
```
HMAC is HMAC-SHA256 over the base64url-encoded payload string with
`AUTH_SECRET` as the key. Verification rejects on:
- Bad shape (no `.`, base64 decode fails, JSON parse fails).
- HMAC mismatch (uses constant-time compare).
- `exp <= now`.
- `iat > now + 60` (clock-skew guard, 60s tolerance).
- `v !== Number(process.env.OPERATOR_TOKEN_VERSION ?? "1")` (the env var is a
string and `v` is a number, so it is parsed before comparing).
- `role` not one of `"admin"` / `"user"`.
Cookie attributes: `HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=2592000`.
`Max-Age=0` on logout to clear.
`OPERATOR_TOKEN_VERSION` env var (default `"1"`) is the global
session-invalidation lever. Bumping it on the host instantly logs out
every user — no DB writes — useful after a host compromise or a
known-shared password.
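
The signing and verification rules above could be sketched roughly as follows. This is a minimal illustration, not the actual `@/lib/auth-cookie` module: the helper names `toB64Url` / `fromB64Url` / `hmacKey` and the explicit `nowSec` / `tokenVersion` parameters are assumptions made so the example stays pure and testable. It imports `webcrypto` from `node:crypto` so it runs under plain Node; in the edge runtime the global `crypto` would presumably be used instead.

```ts
import { webcrypto } from "node:crypto"; // assumption: edge runtime would use the global `crypto`

type SessionPayload = {
  userId: string;
  role: "admin" | "user";
  iat: number;
  exp: number;
  v: number;
};

const enc = new TextEncoder();

// base64url without padding, built on btoa/atob (hypothetical helpers)
function toB64Url(bytes: Uint8Array): string {
  let bin = "";
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

function fromB64Url(s: string): Uint8Array {
  const bin = atob(s.replace(/-/g, "+").replace(/_/g, "/")); // throws on bad input
  return Uint8Array.from(bin, (c) => c.charCodeAt(0));
}

async function hmacKey(secret: string) {
  return webcrypto.subtle.importKey(
    "raw", enc.encode(secret), { name: "HMAC", hash: "SHA-256" }, false, ["sign", "verify"],
  );
}

async function signSession(p: SessionPayload, secret: string): Promise<string> {
  const body = toB64Url(enc.encode(JSON.stringify(p)));
  // The HMAC covers the base64url-encoded payload string, as the spec states.
  const sig = await webcrypto.subtle.sign("HMAC", await hmacKey(secret), enc.encode(body));
  return `${body}.${toB64Url(new Uint8Array(sig))}`;
}

async function verifySession(
  token: string, secret: string, nowSec: number, tokenVersion: number,
): Promise<SessionPayload | null> {
  const parts = token.split(".");
  if (parts.length !== 2) return null; // bad shape
  const [body, sig] = parts;
  try {
    // subtle.verify performs the signature comparison itself.
    const ok = await webcrypto.subtle.verify(
      "HMAC", await hmacKey(secret), fromB64Url(sig), enc.encode(body),
    );
    if (!ok) return null;
    const p = JSON.parse(new TextDecoder().decode(fromB64Url(body))) as SessionPayload;
    if (typeof p.userId !== "string") return null;
    if (typeof p.iat !== "number" || typeof p.exp !== "number") return null;
    if (p.role !== "admin" && p.role !== "user") return null;
    if (p.exp <= nowSec) return null;       // expired
    if (p.iat > nowSec + 60) return null;   // issued in the future (60s skew)
    if (p.v !== tokenVersion) return null;  // stale token version
    return p;
  } catch {
    return null; // base64 or JSON decode failure
  }
}
```

Passing `nowSec` and `tokenVersion` in explicitly keeps the verifier free of `Date.now()` and `process.env`, which makes the expiry, clock-skew, and version-bump cases straightforward to unit test.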
## Login flow
Page: `apps/web/src/app/login/page.tsx`. Single form with:
- Username input (`type=text`, autocomplete `username`)
- Password input (`type=password`, autocomplete `current-password`)
- Submit button "Sign in"
- Error slot for the generic message
- A small note: "First time? Run `./scripts/set-password.sh <username>`
in your tools container to set a password."
Server action `loginAction(formData: FormData)`:
```text
1. Read username, password from FormData.
2. Reject if either >256 chars (DoS guard, no bcrypt).
3. Reject if either empty.
4. Apply rate limit: checkRateLimit("login:" + ip, { max: 10, windowSec: 300 }).
   On exhaustion → return { ok: false, error: "Too many attempts, try later." }
5. Look up user: select * from operators where lower(username)=lower($1)
6. If user not found OR user.password_hash IS NULL:
     await bcrypt.compare(password, DUMMY_HASH);  // timing equivalence
     return { ok: false, error: "Invalid username or password." }
7. await bcrypt.compare(password, user.password_hash)
   if false: return { ok: false, error: "Invalid username or password." }
8. Issue cookie: signSession({ userId, role, iat: now, exp: now + 30d, v: TOKEN_VERSION })
9. Redirect to safe(next) ?? "/"
```
`safe(next)`: must be a string starting with `/` AND not starting
with `//`. Otherwise return `null`.
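
A minimal sketch of that check, under the name `safeNext` (hypothetical; the spec only calls it `safe(next)`). The backslash rejection is an extra hardening step that is not in the spec: it is included here as an assumption because browsers commonly normalize `\` to `/`, which would turn `/\evil.com` into a protocol-relative URL.

```ts
// Returns the path if it is a safe same-site redirect target, else null.
function safeNext(next: unknown): string | null {
  if (typeof next !== "string") return null;
  if (!next.startsWith("/")) return null;   // rejects absolute URLs and "javascript:"
  if (next.startsWith("//")) return null;   // rejects protocol-relative URLs
  if (next.includes("\\")) return null;     // assumption: extra guard, not in the spec
  return next;
}

// Callers fall back to "/" when the value is rejected.
const target = safeNext("//evil.com") ?? "/";
console.log(target);
```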
Logout action `logoutAction()`: clear the cookie via
`cookies().set("session", "", { maxAge: 0, ... })` and redirect to
`/login`.
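
The login steps can be sketched as a pure function with its dependencies injected, which is also roughly how the unit tests could exercise it without a live Postgres. Everything named here (`LoginDeps`, `login`, the `rateLimitOk` / `findUser` / `compare` callbacks) is hypothetical; `compare` stands in for `bcrypt.compare`, and returning the generic error for oversized or empty input is an assumption, since the spec does not name the message for those cases.

```ts
type Role = "admin" | "user";
type UserRow = { id: string; role: Role; passwordHash: string | null };

type LoginDeps = {
  rateLimitOk: (key: string) => Promise<boolean>;                 // stand-in for checkRateLimit
  findUser: (username: string) => Promise<UserRow | null>;        // case-insensitive lookup
  compare: (password: string, hash: string) => Promise<boolean>;  // stand-in for bcrypt.compare
  dummyHash: string;                                              // DUMMY_HASH
};

type LoginInput = { username: string; password: string; ip: string };
type LoginResult =
  | { ok: true; userId: string; role: Role }
  | { ok: false; error: string };

const GENERIC_ERROR = "Invalid username or password.";

async function login(deps: LoginDeps, input: LoginInput): Promise<LoginResult> {
  const { username, password, ip } = input;
  // Steps 2-3: cheap input checks before any bcrypt or DB work.
  if (username.length > 256 || password.length > 256) return { ok: false, error: GENERIC_ERROR };
  if (username === "" || password === "") return { ok: false, error: GENERIC_ERROR };
  // Step 4: rate limit keyed by IP.
  if (!(await deps.rateLimitOk("login:" + ip))) {
    return { ok: false, error: "Too many attempts, try later." };
  }
  // Steps 5-6: unknown user and password-less user take the same path,
  // including one bcrypt comparison against the dummy hash.
  const user = await deps.findUser(username);
  if (user === null || user.passwordHash === null) {
    await deps.compare(password, deps.dummyHash);
    return { ok: false, error: GENERIC_ERROR };
  }
  // Step 7: real comparison.
  if (!(await deps.compare(password, user.passwordHash))) {
    return { ok: false, error: GENERIC_ERROR };
  }
  // Step 8 (cookie issuance) and step 9 (redirect) are left to the caller.
  return { ok: true, userId: user.id, role: user.role };
}
```

Keeping cookie issuance and the redirect outside this function means the credential logic can be tested with plain fakes, while the thin Server Action wrapper handles `cookies()` and `redirect()`.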
## Middleware gate
`apps/web/src/middleware.ts` extends the existing API allowlist with
the auth check.
```text
For every request:
  - If path is in allowlist (auth-free):
      /login, /logout, /api/health, /manifest.webmanifest,
      /icon-*, /favicon.ico, /_next/static/*, /_next/image
    → NextResponse.next()
  - Read session cookie. Verify (HMAC, exp, iat-skew, version, role shape).
      - On valid: NextResponse.next()
      - On invalid + path starts with /api/: 401, no body
      - On invalid + page request: 302 to /login?next=<encoded path>
```
`/api/events` and `/api/qr/[accountId]` are explicitly removed from
the unauth allowlist — middleware now requires a session for them.
The middleware imports the verifier from `@/lib/auth-cookie` (a
dependency-free module that runs on the edge runtime — no bcrypt,
no DB).
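
The gate's decision logic can be separated from Next.js types and sketched as a pure function. The names `decide`, `isAllowlisted`, and `GateDecision` are hypothetical, and treating `/icon-*` and `/_next/static/*` as prefix matches is an assumption about how the globs are meant to be read; the real middleware would map the result onto `NextResponse`.

```ts
type GateDecision =
  | { kind: "next" }
  | { kind: "unauthorized"; status: 401 }
  | { kind: "redirect"; status: 302; location: string };

// Auth-free paths from the allowlist above.
const EXACT_ALLOW = new Set([
  "/login", "/logout", "/api/health", "/manifest.webmanifest",
  "/favicon.ico", "/_next/image",
]);
const PREFIX_ALLOW = ["/icon-", "/_next/static/"];

function isAllowlisted(pathname: string): boolean {
  return EXACT_ALLOW.has(pathname) || PREFIX_ALLOW.some((p) => pathname.startsWith(p));
}

// `sessionValid` is the result of verifying the session cookie.
function decide(pathname: string, search: string, sessionValid: boolean): GateDecision {
  if (isAllowlisted(pathname)) return { kind: "next" };
  if (sessionValid) return { kind: "next" };
  if (pathname.startsWith("/api/")) return { kind: "unauthorized", status: 401 };
  const next = encodeURIComponent(pathname + search);
  return { kind: "redirect", status: 302, location: "/login?next=" + next };
}
```

Because `/api/events` and `/api/qr/...` are absent from both allowlist structures, they fall through to the session check, which is the regression the test plan guards.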
## Server-action defense-in-depth
`apps/web/src/lib/auth.ts` (Node runtime — DB access OK):
```ts
export async function getCurrentUser(): Promise<User | null>
export async function requireUser(): Promise<User> // throws Response 401 / redirects
export async function requireAdmin(): Promise<User> // requireUser + role === "admin"
```
`getSeededOperator()` is superseded by `getCurrentUser()`, which reads
the verified cookie and looks up the user. All 66 call sites swap
mechanically, and the old name stays as a pass-through shim until the
sweep is done (see Migration risk). Existing typing stays compatible
because the returned shape is a superset.
Every Server Action begins with `await requireUser()` (or
`requireAdmin()` for admin-only). This is the second layer; the
middleware is the first. Both must agree before any state mutates.
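
A rough sketch of the two helpers, with the user lookup injected so the example is self-contained. The `makeGuards` factory and the `AuthError` class are hypothetical, and using status 403 for a signed-in non-admin is an assumption; the real helpers take no arguments and, per the signatures above, throw a 401 response or redirect.

```ts
type Role = "admin" | "user";
type User = { id: string; username: string; role: Role };

// Hypothetical error type; the real helpers may throw a Response or redirect.
class AuthError extends Error {
  readonly status: 401 | 403;
  constructor(status: 401 | 403) {
    super(status === 401 ? "Not signed in" : "Admin role required");
    this.status = status;
  }
}

function makeGuards(getCurrentUser: () => Promise<User | null>) {
  async function requireUser(): Promise<User> {
    const user = await getCurrentUser();
    if (user === null) throw new AuthError(401);
    return user;
  }
  async function requireAdmin(): Promise<User> {
    const user = await requireUser();
    if (user.role !== "admin") throw new AuthError(403); // assumption: 403 for wrong role
    return user;
  }
  return { requireUser, requireAdmin };
}

// Example Server Action shape: the guard runs before any state changes.
async function exampleAdminAction(guards: ReturnType<typeof makeGuards>): Promise<string> {
  const admin = await guards.requireAdmin();
  return `acting as ${admin.username}`;
}
```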
## User management surface
Admin-only, gated by `requireAdmin()` at every entry point.
- `/settings/users` (page) — list of users with role chip + createdAt;
inline "Reset password", "Demote/Promote", "Delete" buttons. New
user form at top.
- `createUserAction({ username, password, role })` — validate inputs,
bcrypt the password, insert.
- `setUserRoleAction({ userId, role })` — guard: if `userId === self.id`
AND `role !== "admin"`, refuse with "you can't demote yourself".
- `resetUserPasswordAction({ userId, newPassword })` — bcrypt + update.
Does NOT change cookies — the affected user keeps their existing
session until expiry or a token-version bump.
- `deleteUserAction({ userId })` — guard: refuse self-delete.
Additional guard: if deleting the last admin, refuse with "promote
another user to admin first".
All admin actions fan out a refresh of `/settings/users` via
`revalidatePath`.
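
The self-demote, self-delete, and last-admin guards reduce to two small checks. The sketch below is illustrative only: `checkSetRole`, `checkDelete`, and `GuardResult` are hypothetical names, the self-delete message is an assumption, and `adminCount` is assumed to be read from the database in the same transaction as the mutation.

```ts
type Role = "admin" | "user";
type GuardResult = { ok: true } | { ok: false; error: string };

// Guard for setUserRoleAction: an admin may not remove their own admin role.
function checkSetRole(selfId: string, targetId: string, newRole: Role): GuardResult {
  if (targetId === selfId && newRole !== "admin") {
    return { ok: false, error: "you can't demote yourself" };
  }
  return { ok: true };
}

// Guard for deleteUserAction: no self-delete, and never remove the last admin.
function checkDelete(
  selfId: string,
  target: { id: string; role: Role },
  adminCount: number, // assumption: current number of admin rows
): GuardResult {
  if (target.id === selfId) {
    return { ok: false, error: "you can't delete yourself" };
  }
  if (target.role === "admin" && adminCount <= 1) {
    return { ok: false, error: "promote another user to admin first" };
  }
  return { ok: true };
}
```

Since the caller must already be an admin, the last-admin branch would mostly seem to matter under concurrent changes, which is one reason to evaluate it inside the same transaction as the delete.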
## CLI bootstrap
The actual hashing happens in a small TSX script (so it can `import
bcrypt` from the workspace), wrapped by a one-line bash launcher
that runs it through the `tools` container. Two pieces:
`packages/db/src/scripts/set-password.ts` — reads `username` from
argv, prompts for the password on stdin with echo suppressed (for
example by muting the `readline` output stream while reading, since
`readline` has no built-in mask option), bcrypts at 12 rounds, runs an `UPDATE operators SET
password_hash = $1 WHERE lower(username) = lower($2)`, exits
non-zero if no rows matched.
`packages/db/src/scripts/create-user.ts` — same pattern, but
INSERTs a fresh row with `username`, `role`, `password_hash`,
default timezone, and a synthetic `telegramUserId` (current time-
millis) since the column is still NOT NULL until a future cleanup
migration.
`scripts/set-password.sh` and `scripts/create-user.sh` — thin
wrappers that invoke the TSX scripts via `pnpm --filter @cmbot/db
exec tsx ...` inside the tools container, matching the existing
script-runner pattern.
Used to bootstrap the first admin and to recover when an admin
loses their password. After bootstrap, all user management happens
through the web UI.
## Rate limits added
| action | limit |
| ---------------------------- | -------------------------------- |
| loginAction | 10 / 5 min per IP |
| sendTestAction | 3 / 60 s per groupId |
| resumeReminderRunAction | 30 / 10 s per IP (existing infra)|
| cancelReminderRunAction | 30 / 10 s per IP |
| createUserAction | 5 / 60 s per IP |
| resetUserPasswordAction | 5 / 60 s per IP |
`checkRateLimit` is the existing Postgres-backed helper.
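
To make the `max` / `windowSec` semantics concrete, here is an in-memory fixed-window limiter. It is only a stand-in: the real `checkRateLimit` is Postgres-backed, and whether it uses a fixed or sliding window is not stated here, so the windowing strategy, the class name `FixedWindowLimiter`, and the explicit `nowSec` parameter are all assumptions.

```ts
type Limit = { max: number; windowSec: number };

class FixedWindowLimiter {
  private buckets = new Map<string, { windowStart: number; count: number }>();

  // Returns true if the call is allowed, false once the window is exhausted.
  check(key: string, limit: Limit, nowSec: number): boolean {
    const bucket = this.buckets.get(key);
    if (bucket === undefined || nowSec - bucket.windowStart >= limit.windowSec) {
      // First hit, or the previous window has elapsed: start a new window.
      this.buckets.set(key, { windowStart: nowSec, count: 1 });
      return true;
    }
    if (bucket.count >= limit.max) return false;
    bucket.count += 1;
    return true;
  }
}

// loginAction's limit from the table: 10 attempts per 5 minutes per IP.
const limiter = new FixedWindowLimiter();
const loginLimit: Limit = { max: 10, windowSec: 300 };
const results: boolean[] = [];
for (let i = 0; i < 11; i++) {
  results.push(limiter.check("login:203.0.113.7", loginLimit, i));
}
// The first ten attempts pass; the eleventh inside the window is refused.
console.log(results.slice(0, 10).every(Boolean), results[10]);
```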
## Robots / noindex
`apps/web/src/app/robots.ts`:
```ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return { rules: [{ userAgent: "*", disallow: "/" }] };
}
```
Plus `metadata.robots = { index: false, follow: false }` in the root
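`apps/web/src/app/layout.tsx`. Two layers: `robots.txt` asks crawlers
not to fetch anything, and the meta tag tells any crawler that does
fetch a page not to index it. Both depend on crawler cooperation; the
auth gate is what actually keeps content private.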
## Env hygiene
- Add `.env*` to `.gitignore` (already excludes `.env.local`,
`.env.*.local` — this widens to all `.env*` outside `.env.example`).
- `git rm --cached .env.development` and recreate locally without
committing.
- New `.env.example` documents every required key with placeholder
values, including the new `OPERATOR_TOKEN_VERSION`.
- After this change ships, the operator rotates the leaked
`AUTH_SECRET` and Postgres password (manual step, called out in
the upgrade notes).
## Container hardening
Both Dockerfiles:
```dockerfile
RUN useradd -m -u 1000 -s /usr/sbin/nologin app && \
    mkdir -p /data/sessions /data/media && \
    chown -R app:app /app /data && \
    chmod 700 /data/sessions
USER app
```
The `dev-data:/data` volume mount in `docker-compose.dev.yml` keeps
working since the host UID matches the in-container `app` UID 1000.
## Origin allowlist
`next.config.ts` adds:
```ts
experimental: {
  serverActions: {
    allowedOrigins: ["wabot.04080616.xyz", "localhost:9000"],
  },
},
```
Same-origin Server Action posts already work; this guards against
cross-origin POSTs from another domain attempting to invoke an
action via a known cookie.
## Test plan (38 tests)
### `auth-cookie.test.ts` — pure HMAC + verification logic
1. `signSession` then `verifySession` round-trips.
2. Tampered payload → verify rejects.
3. Tampered signature → verify rejects.
4. Wrong secret → verify rejects.
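5. Constant-time compare prevents char-by-char timing leak (assert the
HMAC check goes through a constant-time primitive: `crypto.subtle.verify`
in the edge runtime, or `crypto.timingSafeEqual` under Node).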
6. Cookie expired (`exp <= now`) → reject.
7. Cookie issued in the future (`iat > now + 60`) → reject (clock-skew).
8. Cookie with stale `v` (TOKEN_VERSION bumped after issue) → reject.
9. Cookie with bad `role` value (`"superadmin"`) → reject.
10. Cookie missing fields → reject.
### `login-action.test.ts` — login flow
11. Valid credentials → cookie issued with right shape.
12. Wrong password → no cookie, generic error.
13. Wrong username → no cookie, generic error, dummy-bcrypt called
(timing equivalence).
14. `password_hash IS NULL` user → no cookie, same generic error,
dummy-bcrypt called (matches login step 6 and the Acceptance criteria).
15. Empty username or password → 400-equivalent (no DB hit).
16. Username/password >256 chars → rejected before bcrypt.
17. Username case-insensitive (`Admin` matches `admin`).
18. 11th login attempt within window → rate-limit error (429-equivalent;
the action returns `ok: false` with "Too many attempts, try later.").
19. After window expiry, attempts succeed.
20. Failed login logs warning with username + IP, no password.
21. Cookie sets correct attrs (HttpOnly, Secure, SameSite, Path,
Max-Age).
### `middleware.test.ts` — gate behavior
22. No cookie + page request → 302 to `/login?next=<path>`.
23. No cookie + `/api/...` (non-allowlisted) → 401.
24. Valid cookie + page → next().
25. Tampered cookie → 302 to `/login`.
26. Allowlisted (`/login`, `/api/health`, manifest, icons) bypasses.
27. `/api/events` and `/api/qr/[id]` are NOT in allowlist (regression
against the audit's Critical findings).
### `next-param.test.ts` — open-redirect prevention
28. `/dashboard` → preserved.
29. `//evil.com` → falls back to `/`.
30. `https://evil.com` → falls back to `/`.
31. `javascript:alert(1)` → falls back to `/`.
32. `/path?with=query&extra=fine` → preserved verbatim.
### `require-helpers.test.ts` — Server-action gates
33. `requireUser()` throws with no session.
34. `requireUser()` returns the user with valid session.
35. `requireAdmin()` throws when role === "user".
36. `requireAdmin()` returns the user when role === "admin".
### `user-management.test.ts` — admin guards
37. Self-demote (`setUserRoleAction({ userId: self, role: "user" })`)
→ ok:false with clear error.
38. Last-admin delete (deleting only admin user) → ok:false with
"promote another user first".
## Migration risk
`getSeededOperator()` is the one big touch. The 66 call sites are
mostly Server Actions and queries that read `.id` and
`.defaultTimezone` off the returned object — the new shape is a
superset, so the change is mechanical.
To keep churn off the existing test suite (~12 tests mock
`@/lib/operator`), `apps/web/src/lib/operator.ts` keeps its export
but reimplements `getSeededOperator` as a thin pass-through to
`getCurrentUser` from `@/lib/auth`. Existing mocks that target
`@/lib/operator` keep working unchanged. New code uses
`getCurrentUser` / `requireUser` / `requireAdmin` directly; the old
name is kept as a compatibility shim and removed in a follow-up
once all sites are swept.
A `DUMMY_HASH` constant lives at the top of the login action. It's
a precomputed bcrypt hash of a known throwaway string (`"x"`),
generated once and committed. It should use the same cost factor as
real password hashes (12 rounds in the CLI), because bcrypt's compare
time is set by the cost embedded in the hash. We compare against it
on the user-not-found path so timing matches the wrong-password path.
Generating a fresh dummy hash per request would double the bcrypt
work and create its own timing signal.
## Out of scope (deferred)
- WebAuthn / passkeys.
- 2FA / TOTP.
- Email-based password recovery (an admin who loses their password
recovers via `scripts/set-password.sh`; bumping `OPERATOR_TOKEN_VERSION`
only invalidates sessions and does not reset any password).
- Account lockout (rate limit is enough for one operator's threat
model).
- SSO / OAuth providers.
- Audit-log surface for "who logged in when". The pino warn line
is the minimum; a structured audit table is later work.
- A "remember this device" feature distinct from the 30-day cookie.
## Acceptance
- The bot can be exposed at `wabot.04080616.xyz` and any
unauthenticated request to a non-allowlisted path returns 401
(API) or redirects to `/login` (page).
- A correct username + password issues a 30-day cookie that survives
reload, browser restart, and PWA homescreen launches.
- A wrong username, a wrong password, and a missing-password user
all produce the same generic "Invalid username or password"
error and the same wall-clock duration (timing-equivalent).
- Bumping `OPERATOR_TOKEN_VERSION` on the host invalidates every
active session immediately.
- An attacker tampering with the cookie payload, signature, or
issued-at can't pass middleware.
- Eleven login attempts from the same IP within five minutes
produce a rate-limit error (429-equivalent) on the eleventh.
- A `user`-role session can browse, schedule, and resume reminders
but cannot reach `/settings/users`.
- An admin can't demote or delete their own row, and can't delete
the last admin.
- `robots.txt` returns `Disallow: /` and the rendered HTML carries
`<meta name="robots" content="noindex, nofollow">`.
- Both containers run as UID 1000, sessions dir is `chmod 700`.
- `.env.development` is gone from the repo and `.gitignore` excludes
every `.env*` except `.env.example`.
- All 38 tests in the plan pass; existing 471 tests still pass.