Drives the work that closes the v1.1.0 production-readiness audit findings: username + password + role auth on the web app, gated SSE / QR endpoints, robots/noindex, env hygiene, container non-root, and rate limits on the four currently-naked Server Actions.

Auth design highlights:

* Roll-our-own session cookie (no NextAuth): bcrypt password + HMAC-SHA256 signed cookie; edge-runtime middleware verifies on every request; defense-in-depth requireUser / requireAdmin in every Server Action.
* Username + password + 2-role model (admin / user). Schema migration adds username + password_hash to existing operators table.
* CLI bootstrap (scripts/set-password.sh) sets the first admin's password before going live; user management UI gates everything else.
* OPERATOR_TOKEN_VERSION env var as a global session-invalidation lever.
* 38 unit tests covering brute-force / cookie tampering / replay / expiry / fixation / open redirect / timing leak / rate limit / origin-allowlist / unauth API regression / role gates / self-demote and last-admin guards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Auth + Production Hardening Design

> Spec for closing the production-readiness gap before promoting the
> bot to public-internet exposure at `wabot.04080616.xyz`. Covers the
> session-cookie auth model with username + password + role, plus the
> hygiene work that has to land alongside it (robots, env, container
> non-root) so the public surface is safe in one change.

## Goal

Add operator authentication to the web app so the public URL stops
being a foothold for anyone who finds it, and at the same time close
the highest-risk production gaps surfaced in the v1.1.0 audit:
indexable content, committed credentials, root-running containers,
and four un-rate-limited Server Actions.

## Constraints

- Single-host self-hosted deployment, public-internet via reverse
  proxy + TLS at `wabot.04080616.xyz`.
- Up to a handful of users today, with room to grow. At least one
  must be `admin`; the rest are `user`.
- Mobile PWA homescreen workflow: 30-day cookie, no friction at
  re-open, no third-party identity provider.
- No new infra dependencies. Postgres + Docker compose stay the
  whole platform. No NextAuth / Auth.js, no external KV, no SMS.
- The 66 existing call sites of `getSeededOperator()` must be
  retrofitted cleanly, without breakage.
- All code changes covered by unit tests; no test relies on a live
  Postgres or browser.

## Approach: roll-our-own session cookie

A library would be heavy for one role gate and one cookie. We pick
up `bcrypt` for password hashing (battle-tested) and Web Crypto's
HMAC for cookie signing (stdlib, edge-runtime compatible). All other
code is domain-owned and covered by the test plan in this spec.

The model: the user posts username + password to a Server Action,
the action verifies against a per-user `password_hash` row, and the
response sets a signed cookie carrying `{ userId, role, iat, exp, v }`.
Middleware verifies the cookie on every request; Server Actions
double-check via `requireUser()` / `requireAdmin()` so a forgotten
middleware path can't bypass the gate.

## Schema migration (`0010_add_user_auth.sql`)

```sql
ALTER TABLE operators
  ADD COLUMN username text,
  ADD COLUMN password_hash text;
CREATE UNIQUE INDEX operators_username_uq
  ON operators (lower(username));
-- Backfill the seed row so it has a username; password_hash stays NULL
-- so the operator is forced to set one via the CLI before they can sign
-- in. Sets a clear "you have to do this before going live" gate.
-- Assumes a single pre-existing row: with more than one, the unique
-- index above rejects the second 'admin'.
UPDATE operators
  SET username = 'admin'
  WHERE username IS NULL;
ALTER TABLE operators
  ALTER COLUMN username SET NOT NULL;
```

`telegramUserId` stays for now (it's referenced from existing migrations
and seed flow) but no longer drives auth. `defaultTimezone` and `role`
are unchanged. `operators.role` already defaults to `"admin"`.

## Roles

Two values, no enum constraint at the DB layer (text, same as
existing).

| role  | can do                                                        |
| ----- | ------------------------------------------------------------- |
| admin | everything in the app + user management (CRUD other users)    |
| user  | everything except `/settings/users` and the user-mgmt actions |

A third "viewer" role isn't worth it today; it can be added later by
extending the role check.

## Cookie format

Header value: `session=<base64url(payload)>.<base64url(hmac)>`

```ts
type SessionPayload = {
  userId: string; // operators.id (uuid)
  role: "admin" | "user";
  iat: number; // issued-at, unix seconds
  exp: number; // expires-at, unix seconds (iat + 30 days)
  v: number; // OPERATOR_TOKEN_VERSION at issue time
};
```

HMAC is HMAC-SHA256 over the base64url-encoded payload string with
`AUTH_SECRET` as the key. Verification rejects on:

- Bad shape (no `.`, base64 decode fails, JSON parse fails).
- HMAC mismatch (uses constant-time compare).
- `exp <= now`.
- `iat > now + 60` (clock-skew guard, 60s tolerance).
- `v !== Number(process.env.OPERATOR_TOKEN_VERSION ?? "1")` (the env
  var is a string defaulting to `"1"`, so it is parsed before being
  compared with the numeric `v`).
- `role` not one of `"admin"` / `"user"`.

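The claim checks in the list above can be sketched as a pure function. This is a minimal illustration, not the project's verifier: the names `checkClaims`, `isRole`, and `CLOCK_SKEW_SEC` are hypothetical, and the sketch assumes the shape check and the HMAC comparison have already passed, so it only covers `exp`, `iat`, `v`, and `role`.

```typescript
type Role = "admin" | "user";

interface SessionPayload {
  userId: string;
  role: Role;
  iat: number; // issued-at, unix seconds
  exp: number; // expires-at, unix seconds
  v: number; // token version at issue time
}

// Tolerance for a cookie issued slightly "in the future" (60s, per the spec).
const CLOCK_SKEW_SEC = 60;

function isRole(value: unknown): value is Role {
  return value === "admin" || value === "user";
}

// Returns the typed payload when every claim passes, otherwise null.
// Assumption: the caller has already verified the HMAC over the encoded payload.
function checkClaims(
  raw: unknown,
  nowSec: number,
  tokenVersion: number,
): SessionPayload | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { userId, role, iat, exp, v } = raw as Record<string, unknown>;
  if (typeof userId !== "string" || userId.length === 0) return null;
  if (!isRole(role)) return null;
  if (typeof iat !== "number" || typeof exp !== "number") return null;
  if (typeof v !== "number") return null;
  if (exp <= nowSec) return null; // expired
  if (iat > nowSec + CLOCK_SKEW_SEC) return null; // issued in the future
  if (v !== tokenVersion) return null; // stale token version
  return { userId, role, iat, exp, v };
}
```

A caller would pass something like `Number(process.env.OPERATOR_TOKEN_VERSION ?? "1")` as `tokenVersion`, which keeps the comparison numeric on both sides.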
Cookie attributes: `HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=2592000`.
Logout sends `Max-Age=0` to clear it.

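For reference, the attribute string and the 30-day arithmetic can be written out as below. This is only a sketch of the header value: the helper names are hypothetical, and the real action would more likely go through the framework's cookie API than build the string by hand.

```typescript
// 30 days in seconds: 30 * 24 * 60 * 60 = 2592000.
const SESSION_MAX_AGE_SEC = 30 * 24 * 60 * 60;

// Builds the Set-Cookie value for the session cookie (hypothetical helper).
function sessionCookieHeader(
  value: string,
  maxAgeSec: number = SESSION_MAX_AGE_SEC,
): string {
  return [
    `session=${value}`,
    "HttpOnly",
    "Secure",
    "SameSite=Lax",
    "Path=/",
    `Max-Age=${maxAgeSec}`,
  ].join("; ");
}

// Logout: same attributes, empty value, Max-Age=0 so the browser drops it.
function clearSessionCookieHeader(): string {
  return sessionCookieHeader("", 0);
}
```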
`OPERATOR_TOKEN_VERSION` env var (default `"1"`) is the global
session-invalidation lever. Bumping it on the host (and restarting the
web container so the new value is read) logs out every user at once,
with no DB writes. This is useful after a host compromise or a
known-shared password.

## Login flow

Page: `apps/web/src/app/login/page.tsx`. Single form with:

- Username input (`type=text`, autocomplete `username`)
- Password input (`type=password`, autocomplete `current-password`)
- Submit button "Sign in"
- Error slot for the generic message
- A small note: "First time? Run `./scripts/set-password.sh <username>`
  in your tools container to set a password."

Server action `loginAction(formData: FormData)`:

```text
1. Read username, password from FormData.
2. Reject if either >256 chars (DoS guard, no bcrypt).
3. Reject if either empty.
4. Apply rate limit: checkRateLimit("login:" + ip, { max: 10, windowSec: 300 }).
   On exhaustion → return { ok: false, error: "Too many attempts, try later." }
5. Look up user: select * from operators where lower(username)=lower($1)
6. If user not found OR user.password_hash IS NULL:
     await bcrypt.compare(password, DUMMY_HASH); // timing equivalence
     return { ok: false, error: "Invalid username or password." }
7. await bcrypt.compare(password, user.password_hash)
   if false: return { ok: false, error: "Invalid username or password." }
8. Issue cookie: signSession({ userId, role, iat: now, exp: now + 30d, v: TOKEN_VERSION })
9. Redirect to safe(next) ?? "/"
```

`safe(next)`: must be a string starting with `/` AND not starting
with `//`. Otherwise return `null`.

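The `safe(next)` rule is small enough to sketch directly. The function names here are hypothetical, and the backslash check is an extra precaution that goes beyond the rule as stated (some browsers treat `/\` like `//`), so treat it as an assumption of this sketch rather than part of the spec.

```typescript
// Returns the path when it is a same-site relative path, otherwise null.
function safeNext(next: unknown): string | null {
  if (typeof next !== "string") return null;
  if (!next.startsWith("/")) return null; // rejects absolute URLs and javascript: URIs
  if (next.startsWith("//")) return null; // rejects protocol-relative URLs
  // Assumption beyond the stated rule: also reject "/\" prefixes.
  if (next.startsWith("/\\")) return null;
  return next;
}

// Step 9 of the login flow: fall back to "/" when next is unsafe or absent.
function redirectTarget(next: unknown): string {
  return safeNext(next) ?? "/";
}
```

Paths with a query string pass through untouched, since the only checks are on the leading characters.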
Logout action `logoutAction()`: clear the cookie via
`cookies().set("session", "", { maxAge: 0, ... })` and redirect to
`/login`.

## Middleware gate

`apps/web/src/middleware.ts` extends the existing API allowlist with
the auth check.

```text
For every request:
  - If path is in allowlist (auth-free):
      /login, /logout, /api/health, /robots.txt, /manifest.webmanifest,
      /icon-*, /favicon.ico, /_next/static/*, /_next/image
    → NextResponse.next()
  - Read session cookie. Verify (HMAC, exp, iat-skew, version, role shape).
  - On valid: NextResponse.next()
  - On invalid + path starts with /api/: 401, no body
  - On invalid + page request: 302 to /login?next=<encoded path>
```

`/robots.txt` is on the allowlist so crawlers can read the
`Disallow: /` rule without a session. `/api/events` and
`/api/qr/[accountId]` are explicitly removed from the unauth
allowlist: middleware now requires a session for them.

The middleware imports the verifier from `@/lib/auth-cookie` (a
dependency-free module that runs on the edge runtime: no bcrypt,
no DB).

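The branching above can be sketched as a pure decision function, which is also a convenient shape for unit tests that avoid a live server. The names `gateDecision`, `isAllowlisted`, and the `GateDecision` type are hypothetical; cookie verification is reduced to a boolean input, and `/robots.txt` is included in the allowlist as an assumption so that crawlers can read it without a session.

```typescript
type GateDecision =
  | { kind: "next" }
  | { kind: "unauthorized" } // 401, no body
  | { kind: "redirect"; location: string }; // 302

// Paths that never require a session. "/robots.txt" is an assumption of this sketch.
const EXACT_ALLOW = [
  "/login",
  "/logout",
  "/api/health",
  "/robots.txt",
  "/manifest.webmanifest",
  "/favicon.ico",
  "/_next/image",
];
const PREFIX_ALLOW = ["/icon-", "/_next/static/"];

function isAllowlisted(path: string): boolean {
  return (
    EXACT_ALLOW.some((p) => p === path) ||
    PREFIX_ALLOW.some((p) => path.startsWith(p))
  );
}

// pathWithQuery: request path including any query string.
// sessionValid: result of verifying the session cookie (HMAC + claims).
function gateDecision(pathWithQuery: string, sessionValid: boolean): GateDecision {
  const q = pathWithQuery.indexOf("?");
  const path = q === -1 ? pathWithQuery : pathWithQuery.slice(0, q);
  if (isAllowlisted(path)) return { kind: "next" };
  if (sessionValid) return { kind: "next" };
  if (path.startsWith("/api/")) return { kind: "unauthorized" };
  return {
    kind: "redirect",
    location: `/login?next=${encodeURIComponent(pathWithQuery)}`,
  };
}
```

In the real middleware each variant would map onto a `NextResponse`; keeping the decision separate from the response object is a design choice of this sketch, not something the spec prescribes.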
## Server-action defense-in-depth

`apps/web/src/lib/auth.ts` (Node runtime, DB access OK):

```ts
export declare function getCurrentUser(): Promise<User | null>;
export declare function requireUser(): Promise<User>; // throws Response 401 / redirects
export declare function requireAdmin(): Promise<User>; // requireUser + role === "admin"
```

`getSeededOperator()` is replaced by `getCurrentUser()`, which reads
the verified cookie and looks up the user. All 66 call sites swap
mechanically (the old name survives for a while as a shim; see
"Migration risk"). Existing typing stays compatible because the
returned shape is a superset.

Every Server Action begins with `await requireUser()` (or
`requireAdmin()` for admin-only). This is the second layer; the
middleware is the first. Both must agree before any state mutates.

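The role gate itself can be illustrated without the cookie or database plumbing. In this sketch the already-resolved user is passed in as a parameter, whereas the real helpers are async and resolve it from the session. `AuthError`, `assertUser`, and `assertAdmin` are hypothetical names, and using 403 for the wrong-role case is an assumption; the spec only says the helpers throw or redirect.

```typescript
interface SessionUser {
  id: string;
  role: "admin" | "user";
}

// Hypothetical error type carrying the HTTP status the caller should map to.
class AuthError extends Error {
  readonly httpStatus: 401 | 403;
  constructor(httpStatus: 401 | 403, message: string) {
    super(message);
    this.httpStatus = httpStatus;
  }
}

// Core of requireUser: no session means no access.
function assertUser(user: SessionUser | null): SessionUser {
  if (user === null) throw new AuthError(401, "not signed in");
  return user;
}

// Core of requireAdmin: requireUser plus role === "admin".
function assertAdmin(user: SessionUser | null): SessionUser {
  const u = assertUser(user);
  if (u.role !== "admin") throw new AuthError(403, "admin role required");
  return u;
}
```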
## User management surface

Admin-only, gated by `requireAdmin()` at every entry point.

- `/settings/users` (page): list of users with role chip + createdAt;
  inline "Reset password", "Demote/Promote", "Delete" buttons. New
  user form at top.
- `createUserAction({ username, password, role })`: validate inputs,
  bcrypt the password, insert.
- `setUserRoleAction({ userId, role })`: guard: if `userId === self.id`
  AND `role !== "admin"`, refuse with "you can't demote yourself".
- `resetUserPasswordAction({ userId, newPassword })`: bcrypt + update.
  Does NOT change cookies; the affected user keeps their existing
  session until expiry or a token-version bump.
- `deleteUserAction({ userId })`: guard: refuse self-delete.
  Additional guard: if deleting the last admin, refuse with "promote
  another user to admin first".

All admin actions fan out a refresh of `/settings/users` via
`revalidatePath`.

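The self-demote, self-delete, and last-admin guards listed above reduce to two pure checks. The sketch below uses hypothetical names (`guardRoleChange`, `guardDelete`, `GuardResult`) and a hypothetical self-delete message; the other two messages come from the spec. The last-admin check is evaluated against the current user rows rather than the cookie's role, which is an assumption about how a stale session would be handled.

```typescript
type UserRole = "admin" | "user";

interface UserRow {
  id: string;
  role: UserRole;
}

type GuardResult = { ok: true } | { ok: false; error: string };

// setUserRoleAction guard: an admin cannot demote their own row.
function guardRoleChange(selfId: string, targetId: string, newRole: UserRole): GuardResult {
  if (targetId === selfId && newRole !== "admin") {
    return { ok: false, error: "you can't demote yourself" };
  }
  return { ok: true };
}

// deleteUserAction guards: no self-delete, and never remove the last admin.
function guardDelete(selfId: string, targetId: string, users: UserRow[]): GuardResult {
  if (targetId === selfId) {
    return { ok: false, error: "you can't delete yourself" }; // hypothetical wording
  }
  const target = users.find((u) => u.id === targetId);
  const adminCount = users.filter((u) => u.role === "admin").length;
  if (target !== undefined && target.role === "admin" && adminCount <= 1) {
    return { ok: false, error: "promote another user to admin first" };
  }
  return { ok: true };
}
```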
## CLI bootstrap

The actual hashing happens in a small TypeScript script run with
`tsx` (so it can `import bcrypt` from the workspace), wrapped by a
one-line bash launcher that runs it through the `tools` container.
Two pieces:

`packages/db/src/scripts/set-password.ts`: reads `username` from
argv, prompts for the password on stdin with echo suppressed (Node's
`readline` has no built-in mask option, so the script mutes its
output while the password is typed), bcrypts at 12 rounds, runs
`UPDATE operators SET password_hash = $1 WHERE lower(username) = lower($2)`,
and exits non-zero if no rows matched.

`packages/db/src/scripts/create-user.ts`: same pattern, but
INSERTs a fresh row with `username`, `role`, `password_hash`,
default timezone, and a synthetic `telegramUserId` (current
time in milliseconds) since the column is still NOT NULL until a
future cleanup migration.

`scripts/set-password.sh` and `scripts/create-user.sh`: thin
wrappers that invoke the TypeScript scripts via
`pnpm --filter @cmbot/db exec tsx ...` inside the tools container,
matching the existing script-runner pattern.

Used to bootstrap the first admin and to recover when an admin
loses their password. After bootstrap, all user management happens
through the web UI.

## Rate limits added

| action                  | limit                              |
| ----------------------- | ---------------------------------- |
| loginAction             | 10 / 5 min per IP                  |
| sendTestAction          | 3 / 60 s per groupId               |
| resumeReminderRunAction | 30 / 10 s per IP (existing infra)  |
| cancelReminderRunAction | 30 / 10 s per IP                   |
| createUserAction        | 5 / 60 s per IP                    |
| resetUserPasswordAction | 5 / 60 s per IP                    |

`checkRateLimit` is the existing Postgres-backed helper.

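To make the `max` / `windowSec` semantics in the table concrete, here is an in-memory stand-in for the limiter. It is not the Postgres-backed `checkRateLimit`: the function name, the `Map` storage, and the fixed-window strategy are all assumptions of this sketch, chosen because the spec does not say which windowing algorithm the existing helper uses.

```typescript
interface RateLimitOptions {
  max: number; // allowed attempts per window
  windowSec: number; // window length in seconds
}

interface WindowState {
  windowStartMs: number;
  count: number;
}

// In-memory store keyed by e.g. "login:<ip>" (the real helper persists to Postgres).
const rateWindows = new Map<string, WindowState>();

// Returns true when the attempt is allowed, false once the window is exhausted.
function checkRateLimitInMemory(
  key: string,
  opts: RateLimitOptions,
  nowMs: number,
): boolean {
  const windowMs = opts.windowSec * 1000;
  const state = rateWindows.get(key);
  if (state === undefined || nowMs - state.windowStartMs >= windowMs) {
    // First attempt, or the previous window has fully elapsed: start a new one.
    rateWindows.set(key, { windowStartMs: nowMs, count: 1 });
    return true;
  }
  if (state.count >= opts.max) return false;
  state.count += 1;
  return true;
}
```

With `{ max: 10, windowSec: 300 }` the first ten attempts for a key pass and the eleventh is refused until five minutes after the first attempt, which matches the login row in the table.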
## Robots / noindex

`apps/web/src/app/robots.ts`:

```ts
import type { MetadataRoute } from "next";
export default function robots(): MetadataRoute.Robots {
  return { rules: [{ userAgent: "*", disallow: "/" }] };
}
```

Plus `metadata.robots = { index: false, follow: false }` in the root
`apps/web/src/app/layout.tsx`. Two layers: `robots.txt` only asks
crawlers not to fetch, while the meta tag tells any crawler that does
fetch a page not to index it.

## Env hygiene

- Add `.env*` to `.gitignore` with a `!.env.example` exception (it
  already excludes `.env.local` and `.env.*.local`; this widens the
  rule to every `.env*` file except `.env.example`).
- `git rm --cached .env.development` and recreate locally without
  committing.
- New `.env.example` documents every required key with placeholder
  values, including the new `OPERATOR_TOKEN_VERSION`.
- After this change ships, the operator rotates the leaked
  `AUTH_SECRET` and Postgres password (manual step, called out in
  the upgrade notes).

## Container hardening

Both Dockerfiles:

```dockerfile
RUN useradd -m -u 1000 -s /usr/sbin/nologin app && \
    mkdir -p /data/sessions /data/media && \
    chown -R app:app /app /data && \
    chmod 700 /data/sessions
USER app
```

The `dev-data:/data` volume mount in `docker-compose.dev.yml` keeps
working as long as the host user's UID is 1000, matching the
in-container `app` user.

## Origin allowlist

`next.config.ts` adds:

```ts
experimental: {
  serverActions: {
    allowedOrigins: ["wabot.04080616.xyz", "localhost:9000"],
  },
},
```

Same-origin Server Action posts already work; this guards against
cross-origin POSTs from another domain attempting to invoke an
action via a known cookie.

## Test plan (38 tests)

### `auth-cookie.test.ts`: pure HMAC + verification logic

1. `signSession` then `verifySession` round-trips.
2. Tampered payload → verify rejects.
3. Tampered signature → verify rejects.
4. Wrong secret → verify rejects.
5. Constant-time compare prevents char-by-char timing leak (assert
   the verifier goes through `crypto.subtle.verify` or an equivalent
   constant-time compare; Node's `crypto.timingSafeEqual` is not part
   of Web Crypto, which is all the edge-runtime module can rely on).
6. Cookie expired (`exp <= now`) → reject.
7. Cookie issued in the future (`iat > now + 60`) → reject (clock-skew).
8. Cookie with stale `v` (TOKEN_VERSION bumped after issue) → reject.
9. Cookie with bad `role` value (`"superadmin"`) → reject.
10. Cookie missing fields → reject.

### `login-action.test.ts`: login flow

11. Valid credentials → cookie issued with right shape.
12. Wrong password → no cookie, generic error.
13. Wrong username → no cookie, generic error, dummy-bcrypt called
    (timing equivalence).
14. `password_hash IS NULL` user → no cookie, same generic error,
    dummy-bcrypt called (the CLI hint lives on the login page, not in
    the error).
15. Empty username or password → 400-equivalent (no DB hit).
16. Username/password >256 chars → rejected before bcrypt.
17. Username case-insensitive (`Admin` matches `admin`).
18. 11th login attempt within window → 429-equivalent (`ok: false`,
    "Too many attempts, try later.").
19. After window expiry, attempts succeed.
20. Failed login logs warning with username + IP, no password.
21. Cookie sets correct attrs (HttpOnly, Secure, SameSite, Path,
    Max-Age).

### `middleware.test.ts`: gate behavior

22. No cookie + page request → 302 to `/login?next=<path>`.
23. No cookie + `/api/...` (non-allowlisted) → 401.
24. Valid cookie + page → next().
25. Tampered cookie → 302 to `/login`.
26. Allowlisted paths (`/login`, `/api/health`, manifest, icons) bypass
    the gate.
27. `/api/events` and `/api/qr/[id]` are NOT in allowlist (regression
    against the audit's Critical findings).

### `next-param.test.ts`: open-redirect prevention

28. `/dashboard` → preserved.
29. `//evil.com` → falls back to `/`.
30. `https://evil.com` → falls back to `/`.
31. `javascript:alert(1)` → falls back to `/`.
32. `/path?with=query&extra=fine` → preserved verbatim.

### `require-helpers.test.ts`: Server-action gates

33. `requireUser()` throws with no session.
34. `requireUser()` returns the user with valid session.
35. `requireAdmin()` throws when role === "user".
36. `requireAdmin()` returns the user when role === "admin".

### `user-management.test.ts`: admin guards

37. Self-demote (`setUserRoleAction({ userId: self, role: "user" })`)
    → ok:false with clear error.
38. Last-admin delete (deleting only admin user) → ok:false with
    "promote another user first".

## Migration risk

`getSeededOperator()` is the one big touch. The 66 call sites are
mostly Server Actions and queries that read `.id` and
`.defaultTimezone` off the returned object. The new shape is a
superset, so the change is mechanical.

To keep churn off the existing test suite (~12 tests mock
`@/lib/operator`), `apps/web/src/lib/operator.ts` keeps its export
but reimplements `getSeededOperator` as a thin pass-through to
`getCurrentUser` from `@/lib/auth`. Because `getCurrentUser` can
return `null`, the shim should throw in that case (as `requireUser`
does) so callers that read `.id` directly keep a non-null result.
Existing mocks that target `@/lib/operator` keep working unchanged.
New code uses `getCurrentUser` / `requireUser` / `requireAdmin`
directly; the old name is kept as a compatibility shim and removed in
a follow-up once all sites are swept.

A `DUMMY_HASH` constant lives at the top of the login action. It's
a precomputed bcrypt hash of a known throwaway string (`"x"`),
generated once, at the same cost factor as real password hashes, and
committed. We compare against it on the user-not-found path so timing
is identical to the wrong-password path. Generating a fresh dummy
hash per request would double the bcrypt work and create its own
timing signal.

## Out of scope (deferred)

- WebAuthn / passkeys.
- 2FA / TOTP.
- Email-based password recovery (recovery goes through the CLI: the
  operator runs `scripts/set-password.sh` to reset a lost password,
  and can restart the container with a bumped
  `OPERATOR_TOKEN_VERSION` to drop existing sessions).
- Account lockout (rate limit is enough for one operator's threat
  model).
- SSO / OAuth providers.
- Audit-log surface for "who logged in when". The pino warn line
  is the minimum; a structured audit table is later work.
- A "remember this device" feature distinct from the 30-day cookie.

## Acceptance

- The bot can be exposed at `wabot.04080616.xyz` and any
  unauthenticated request to a non-allowlisted path returns 401
  (API) or redirects to `/login` (page).
- A correct username + password issues a 30-day cookie that survives
  reload, browser restart, and PWA homescreen launches.
- A wrong username, a wrong password, and a missing-password user
  all produce the same generic "Invalid username or password"
  error and the same wall-clock duration (timing-equivalent).
- Bumping `OPERATOR_TOKEN_VERSION` on the host invalidates every
  active session immediately.
- An attacker tampering with the cookie payload, signature, or
  issued-at can't pass middleware.
- Eleven login attempts from the same IP within five minutes
  are rate-limited on the eleventh (429-equivalent error result).
- A `user`-role session can browse, schedule, and resume reminders
  but cannot reach `/settings/users`.
- An admin can't demote or delete their own row, and can't delete
  the last admin.
- `robots.txt` returns `Disallow: /` and the rendered HTML carries
  `<meta name="robots" content="noindex, nofollow">`.
- Both containers run as UID 1000, sessions dir is `chmod 700`.
- `.env.development` is gone from the repo and `.gitignore` excludes
  every `.env*` except `.env.example`.
- All 38 tests in the plan pass; existing 471 tests still pass.