Drives the work that closes the v1.1.0 production-readiness audit findings: username + password + role auth on the web app, gated SSE / QR endpoints, robots/noindex, env hygiene, container non-root, and rate limits on the four currently-naked Server Actions.
Auth design highlights:
* Roll-our-own session cookie (no NextAuth): bcrypt password + HMAC-SHA256 signed cookie; edge-runtime middleware verifies on every request; defense-in-depth requireUser / requireAdmin in every Server Action.
* Username + password + 2-role model (admin / user). Schema migration adds username + password_hash to existing operators table.
* CLI bootstrap (scripts/set-password.sh) sets the first admin's password before going live; user management UI gates everything else.
* OPERATOR_TOKEN_VERSION env var as a global session-invalidation lever.
* 38 unit tests covering brute-force / cookie tampering / replay / expiry / fixation / open redirect / timing leak / rate limit / origin-allowlist / unauth API regression / role gates / self-demote and last-admin guards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auth + Production Hardening Design
Spec for closing the production-readiness gap before promoting the bot to public-internet exposure at
wabot.04080616.xyz. Covers the session-cookie auth model with username + password + role, plus the hygiene work that has to land alongside it (robots, env, container non-root) so the public surface is safe in one change.
Goal
Add operator authentication to the web app so the public URL stops being a foothold for anyone who finds it, and at the same time close the highest-risk production gaps surfaced in the v1.1.0 audit: indexable content, committed credentials, root-running containers, and four un-rate-limited Server Actions.
Constraints
- Single-host self-hosted deployment, public-internet via reverse proxy + TLS at wabot.04080616.xyz.
- Up to a handful of users today, with room to grow. One must be admin; the rest are user.
- Mobile PWA homescreen workflow: 30-day cookie, no friction at re-open, no third-party identity provider.
- No new infra dependencies. Postgres + Docker compose stay the whole platform. No NextAuth / Auth.js, no external KV, no SMS.
- Existing call sites must be cleanly retrofitted without breaking the 66 call sites that currently use getSeededOperator().
- All code changes covered by unit tests; no test relies on a live Postgres or browser.
Approach: roll-our-own session cookie
A library would be heavy for one role gate and one cookie. We pick
up bcrypt for password hashing (battle-tested) and Web Crypto's
HMAC for cookie signing (stdlib, edge-runtime compatible). All other
code is domain-owned and exhaustively tested.
The model: the user posts username + password to a Server Action,
the action verifies against a per-user password_hash row, and the
response sets a signed cookie carrying { userId, role, iat, exp, v }.
Middleware verifies the cookie on every request; Server Actions
double-check via requireUser() / requireAdmin() so a forgotten
middleware path can't bypass the gate.
Schema migration (0010_add_user_auth.sql)
ALTER TABLE operators
ADD COLUMN username text,
ADD COLUMN password_hash text;
CREATE UNIQUE INDEX operators_username_uq
ON operators (lower(username));
-- Backfill the seed row so it has a username; password_hash stays NULL
-- so the operator is forced to set one via the CLI before they can sign
-- in. Sets a clear "you have to do this before going live" gate.
UPDATE operators
SET username = 'admin'
WHERE username IS NULL;
ALTER TABLE operators
ALTER COLUMN username SET NOT NULL;
telegramUserId stays for now (it's referenced from existing migrations
and seed flow) but no longer drives auth. defaultTimezone and role
are unchanged. operators.role already defaults to "admin".
Roles
Two values, no enum constraint at the DB layer (text — same as existing).
| role | can do |
|---|---|
| admin | everything in the app + user management (CRUD other users) |
| user | everything except /settings/users and the user-mgmt actions |
A third "viewer" role isn't worth it today; can be added later by extending the role check.
Cookie format
Header value: session=<base64url(payload)>.<base64url(hmac)>
type SessionPayload = {
userId: string; // operators.id (uuid)
role: "admin" | "user";
iat: number; // issued-at, unix seconds
exp: number; // expires-at, unix seconds (iat + 30 days)
v: number; // OPERATOR_TOKEN_VERSION at issue time
};
HMAC is HMAC-SHA256 over the base64url-encoded payload string with
AUTH_SECRET as the key. Verification rejects on:
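- Bad shape (no "." separator, base64 decode fails, JSON parse fails).
- HMAC mismatch (uses constant-time compare).
- exp <= now.
- iat > now + 60 (clock-skew guard, 60s tolerance).
- v !== Number(process.env.OPERATOR_TOKEN_VERSION ?? "1"). The env var is a string and the payload field is a number, so the comparison happens after parsing.
- role not one of "admin" / "user".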
Cookie attributes: HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=2592000.
Max-Age=0 on logout to clear.
OPERATOR_TOKEN_VERSION env var (default "1") is the global
session-invalidation lever. Bumping it on the host instantly logs out
every user — no DB writes — useful after a host compromise or a
known-shared password.
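The signing and verification rules above can be sketched with Web Crypto alone. This is a minimal illustration, not the project's actual module: the helper names (toBase64Url, fromBase64Url, hmacKey) and the explicit secret, tokenVersion and nowSec parameters are assumptions made so the example runs without environment variables. It also delegates the constant-time comparison to crypto.subtle.verify rather than a hand-written compare.

```typescript
// Sketch only. Helper names and the explicit parameters are assumptions;
// the payload shape and rejection rules follow the spec text.
type SessionPayload = {
  userId: string;
  role: "admin" | "user";
  iat: number; // issued-at, unix seconds
  exp: number; // expires-at, unix seconds
  v: number;   // token version at issue time
};

const encoder = new TextEncoder();

function toBase64Url(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

function fromBase64Url(text: string) {
  const base64 = text.replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
  const binary = atob(padded); // throws on malformed input
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

function hmacKey(secret: string): Promise<CryptoKey> {
  return crypto.subtle.importKey(
    "raw",
    encoder.encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign", "verify"],
  );
}

async function signSession(payload: SessionPayload, secret: string): Promise<string> {
  const body = toBase64Url(encoder.encode(JSON.stringify(payload)));
  // The HMAC covers the base64url-encoded payload string, as the spec states.
  const mac = await crypto.subtle.sign("HMAC", await hmacKey(secret), encoder.encode(body));
  return `${body}.${toBase64Url(new Uint8Array(mac))}`;
}

async function verifySession(
  token: string,
  secret: string,
  tokenVersion: number,
  nowSec: number,
): Promise<SessionPayload | null> {
  const parts = token.split(".");
  if (parts.length !== 2) return null; // bad shape
  const [body, mac] = parts;
  try {
    // verify() recomputes the HMAC and compares it inside the crypto layer.
    const key = await hmacKey(secret);
    const ok = await crypto.subtle.verify("HMAC", key, fromBase64Url(mac), encoder.encode(body));
    if (!ok) return null;
    const p = JSON.parse(new TextDecoder().decode(fromBase64Url(body))) as Partial<SessionPayload>;
    if (typeof p.userId !== "string") return null;
    if (p.role !== "admin" && p.role !== "user") return null;
    if (typeof p.iat !== "number" || typeof p.exp !== "number" || typeof p.v !== "number") return null;
    if (p.exp <= nowSec) return null;     // expired
    if (p.iat > nowSec + 60) return null; // issued in the future beyond 60s skew
    if (p.v !== tokenVersion) return null; // token version was bumped
    return p as SessionPayload;
  } catch {
    return null; // base64 or JSON decoding failed
  }
}
```

A caller would pass AUTH_SECRET, the parsed OPERATOR_TOKEN_VERSION, and Math.floor(Date.now() / 1000); taking them as parameters keeps the verifier pure and easy to unit test.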
Login flow
Page: apps/web/src/app/login/page.tsx. Single form with:
- Username input (type=text, autocomplete username)
- Password input (type=password, autocomplete current-password)
- Submit button "Sign in"
- Error slot for the generic message
- A small note: "First time? Run ./scripts/set-password.sh <username> in your tools container to set a password."
Server action loginAction(formData: FormData):
1. Read username, password from FormData.
2. Reject if either >256 chars (DoS guard, no bcrypt).
3. Reject if either empty.
4. Apply rate limit: checkRateLimit("login:" + ip, { max: 10, windowSec: 300 }).
On exhaustion → return { ok: false, error: "Too many attempts, try later." }
5. Look up user: select * from operators where lower(username)=lower($1)
6. If user not found OR user.password_hash IS NULL:
await bcrypt.compare(password, DUMMY_HASH); // timing equivalence
return { ok: false, error: "Invalid username or password." }
7. await bcrypt.compare(password, user.password_hash)
if false: return { ok: false, error: "Invalid username or password." }
8. Issue cookie: signSession({ userId, role, iat: now, exp: now + 30d, v: TOKEN_VERSION })
9. Redirect to safe(next) ?? "/"
safe(next): must be a string starting with / AND not starting
with //. Otherwise return null.
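The safe(next) rule is small enough to show in full. The sketch below uses the hypothetical name safeNext and adds one check the spec does not mention: rejecting a leading "/\", since URL parsers treat a backslash like a forward slash for http(s) URLs. Treat that extra line as an assumption, not part of the design.

```typescript
// Sketch of the safe(next) rule; the function name is hypothetical.
function safeNext(next: unknown): string | null {
  if (typeof next !== "string") return null;
  if (!next.startsWith("/")) return null;  // must be a same-site path
  if (next.startsWith("//")) return null;  // protocol-relative URL
  if (next.startsWith("/\\")) return null; // extra guard, not in the spec
  return next;
}

// Step 9 of the login flow then reads:
const target = safeNext("//evil.com") ?? "/";
```

Here target is "/" because the protocol-relative value is rejected and the fallback applies.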
Logout action logoutAction(): clear the cookie via
cookies().set("session", "", { maxAge: 0, ... }) and redirect to
/login.
Middleware gate
apps/web/src/middleware.ts extends the existing API allowlist with
the auth check.
For every request:
- If path is in allowlist (auth-free):
/login, /logout, /api/health, /manifest.webmanifest,
/icon-*, /favicon.ico, /_next/static/*, /_next/image
→ NextResponse.next()
- Read session cookie. Verify (HMAC, exp, iat-skew, version, role shape).
- On valid: NextResponse.next()
- On invalid + path starts with /api/: 401, no body
- On invalid + page request: 302 to /login?next=<encoded path>
/api/events and /api/qr/[accountId] are explicitly removed from
the unauth allowlist — middleware now requires a session for them.
The middleware imports the verifier from @/lib/auth-cookie (a
dependency-free module that runs on the edge runtime — no bcrypt,
no DB).
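The gate's decision logic can be separated from the Next.js request and response types, which makes it testable in isolation. The following is a sketch under that assumption: decideGate, isAllowlisted and the GateDecision type are hypothetical names, and the real middleware would translate each decision into NextResponse.next(), a 401, or a 302.

```typescript
// Hypothetical names; only the paths and outcomes come from the spec.
type GateDecision =
  | { kind: "next" }
  | { kind: "unauthorized" }                // 401, no body
  | { kind: "redirect"; location: string }; // 302

const EXACT_ALLOWLIST = [
  "/login",
  "/logout",
  "/api/health",
  "/manifest.webmanifest",
  "/favicon.ico",
  "/_next/image",
];
const PREFIX_ALLOWLIST = ["/icon-", "/_next/static/"];

function isAllowlisted(pathname: string): boolean {
  return (
    EXACT_ALLOWLIST.includes(pathname) ||
    PREFIX_ALLOWLIST.some((prefix) => pathname.startsWith(prefix))
  );
}

// sessionValid is the result of verifying the session cookie.
function decideGate(pathname: string, search: string, sessionValid: boolean): GateDecision {
  if (isAllowlisted(pathname) || sessionValid) return { kind: "next" };
  if (pathname.startsWith("/api/")) return { kind: "unauthorized" };
  const next = encodeURIComponent(pathname + search);
  return { kind: "redirect", location: `/login?next=${next}` };
}
```

Because /api/events and /api/qr/... are absent from both lists, they fall through to the session check, which is the regression the audit asked for.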
Server-action defense-in-depth
apps/web/src/lib/auth.ts (Node runtime — DB access OK):
export async function getCurrentUser(): Promise<User | null>
export async function requireUser(): Promise<User> // throws Response 401 / redirects
export async function requireAdmin(): Promise<User> // requireUser + role === "admin"
getSeededOperator() is renamed to getCurrentUser() (and rewired
to read the verified cookie + look up the user). All 66 call sites
swap mechanically. Existing typing stays compatible because the
returned shape is a superset.
Every Server Action begins with await requireUser() (or
requireAdmin() for admin-only). This is the second layer; the
middleware is the first. Both must agree before any state mutates.
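A minimal sketch of the two helpers follows. To stay self-contained it takes the session lookup as a parameter instead of reading cookies, and it signals failure with a hypothetical AuthError class. The 403 status for a signed-in non-admin is an assumption; the spec only says the helper throws or redirects.

```typescript
// Sketch only: AuthError, SessionLookup and the 403 status are assumptions.
type User = {
  id: string;
  username: string;
  role: "admin" | "user";
  defaultTimezone: string;
};

class AuthError extends Error {
  readonly status: 401 | 403;
  constructor(status: 401 | 403, message: string) {
    super(message);
    this.status = status;
  }
}

// Stand-in for getCurrentUser(), which would verify the cookie and load the row.
type SessionLookup = () => Promise<User | null>;

async function requireUser(getCurrentUser: SessionLookup): Promise<User> {
  const user = await getCurrentUser();
  if (!user) throw new AuthError(401, "Not signed in");
  return user;
}

async function requireAdmin(getCurrentUser: SessionLookup): Promise<User> {
  const user = await requireUser(getCurrentUser);
  if (user.role !== "admin") throw new AuthError(403, "Admin role required");
  return user;
}

// A Server Action would start with the gate before touching any state.
async function exampleAdminAction(getCurrentUser: SessionLookup): Promise<string> {
  const admin = await requireAdmin(getCurrentUser);
  return `ok:${admin.username}`;
}
```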
User management surface
Admin-only, gated by requireAdmin() at every entry point.
- /settings/users (page): list of users with role chip + createdAt; inline "Reset password", "Demote/Promote", "Delete" buttons. New user form at top.
- createUserAction({ username, password, role }): validate inputs, bcrypt the password, insert.
- setUserRoleAction({ userId, role }): guard: if userId === self.id AND role !== "admin", refuse with "you can't demote yourself".
- resetUserPasswordAction({ userId, newPassword }): bcrypt + update. Does NOT change cookies; the affected user keeps their existing session until expiry or a token-version bump.
- deleteUserAction({ userId }): guard: refuse self-delete. Additional guard: if deleting the last admin, refuse with "promote another user to admin first".
All admin actions fan out a refresh of /settings/users via
revalidatePath.
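The self-demote, self-delete and last-admin guards are pure checks over the acting user and the user list, so they can be sketched without a database. The function names, the GuardResult shape, and the self-delete message are hypothetical; the other two messages are quoted from the spec.

```typescript
// Sketch of the guards; function names and result shape are assumptions.
type Role = "admin" | "user";
type UserRow = { id: string; role: Role };
type GuardResult = { ok: true } | { ok: false; error: string };

function checkSetRole(selfId: string, targetId: string, newRole: Role): GuardResult {
  if (targetId === selfId && newRole !== "admin") {
    return { ok: false, error: "you can't demote yourself" };
  }
  return { ok: true };
}

function checkDeleteUser(selfId: string, targetId: string, users: UserRow[]): GuardResult {
  if (targetId === selfId) {
    return { ok: false, error: "you can't delete yourself" };
  }
  const target = users.find((u) => u.id === targetId);
  const adminCount = users.filter((u) => u.role === "admin").length;
  if (target?.role === "admin" && adminCount <= 1) {
    return { ok: false, error: "promote another user to admin first" };
  }
  return { ok: true };
}
```

Since the caller is already an admin and cannot delete their own row, the last-admin check should never fire in normal use; it stays as a second layer in case the self-delete guard is bypassed or the data is inconsistent.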
CLI bootstrap
The actual hashing happens in a small TypeScript script run through tsx (so it can import bcrypt from the workspace), wrapped by a one-line bash launcher
that runs it through the tools container. Two pieces:
packages/db/src/scripts/set-password.ts — reads username from
argv, prompts for password on stdin (echo off via readline's
writeMask), bcrypts at 12 rounds, runs an UPDATE operators SET password_hash = $1 WHERE lower(username) = lower($2), exits
non-zero if no rows matched.
packages/db/src/scripts/create-user.ts follows the same pattern, but
INSERTs a fresh row with username, role, password_hash,
default timezone, and a synthetic telegramUserId (the current time in
milliseconds), since the column is still NOT NULL until a future cleanup
migration.
scripts/set-password.sh and scripts/create-user.sh — thin
wrappers that invoke the TSX scripts via pnpm --filter @cmbot/db exec tsx ... inside the tools container, matching the existing
script-runner pattern.
Used to bootstrap the first admin and to recover when an admin loses their password. After bootstrap, all user management happens through the web UI.
Rate limits added
| action | limit |
|---|---|
| loginAction | 10 / 5 min per IP |
| sendTestAction | 3 / 60 s per groupId |
| resumeReminderRunAction | 30 / 10 s per IP (existing infra) |
| cancelReminderRunAction | 30 / 10 s per IP |
| createUserAction | 5 / 60 s per IP |
| resetUserPasswordAction | 5 / 60 s per IP |
checkRateLimit is the existing Postgres-backed helper.
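To make the "max per window" semantics in the table concrete, here is an in-memory fixed-window limiter with the same call shape as checkRateLimit(key, { max, windowSec }). It is only an illustration: the real helper is Postgres-backed, and whether it uses a fixed or sliding window is not stated in this spec. The createRateLimiter factory and the injected clock are assumptions added for testability.

```typescript
// Illustration only: in-memory, fixed window. The project's helper is
// Postgres-backed and may use a different windowing strategy.
type RateLimitOptions = { max: number; windowSec: number };
type RateLimitResult = { ok: boolean; remaining: number };

function createRateLimiter(nowMs: () => number = () => Date.now()) {
  const windows = new Map<string, { start: number; count: number }>();

  return function checkRateLimit(key: string, opts: RateLimitOptions): RateLimitResult {
    const now = nowMs();
    const current = windows.get(key);
    // Start a fresh window if none exists or the previous one has elapsed.
    if (!current || now - current.start >= opts.windowSec * 1000) {
      windows.set(key, { start: now, count: 1 });
      return { ok: true, remaining: opts.max - 1 };
    }
    if (current.count >= opts.max) return { ok: false, remaining: 0 };
    current.count += 1;
    return { ok: true, remaining: opts.max - current.count };
  };
}

// Login limit from the table: 10 attempts per 5 minutes per IP.
const checkRateLimit = createRateLimiter();
const first = checkRateLimit("login:203.0.113.7", { max: 10, windowSec: 300 });
```

With these settings the first ten calls for a key succeed and the eleventh inside the same window is refused, which matches the acceptance criterion for login.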
Robots / noindex
apps/web/src/app/robots.ts:
import type { MetadataRoute } from "next";
export default function robots(): MetadataRoute.Robots {
return { rules: [{ userAgent: "*", disallow: "/" }] };
}
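Plus metadata.robots = { index: false, follow: false } in the root
apps/web/src/app/layout.tsx. Two layers: robots.txt asks crawlers not
to fetch any path, and the meta tag tells a crawler that fetches a page
anyway not to index it. Both depend on crawler cooperation; the auth
gate is what actually keeps content private. For crawlers to receive
the file, /robots.txt must be reachable without a session, so it needs
an entry in the middleware allowlist, which does not list it yet.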
Env hygiene
- Add .env* to .gitignore (already excludes .env.local, .env.*.local; this widens to all .env* outside .env.example).
- git rm --cached .env.development and recreate locally without committing.
- New .env.example documents every required key with placeholder values, including the new OPERATOR_TOKEN_VERSION.
- After this change ships, the operator rotates the leaked AUTH_SECRET and Postgres password (manual step, called out in the upgrade notes).
Container hardening
Both Dockerfiles:
RUN useradd -m -u 1000 -s /usr/sbin/nologin app && \
mkdir -p /data/sessions /data/media && \
chown -R app:app /app /data && \
chmod 700 /data/sessions
USER app
The dev-data:/data volume mount in docker-compose.dev.yml keeps
working since the host UID matches the in-container app UID 1000.
Origin allowlist
next.config.ts adds:
experimental: {
serverActions: {
allowedOrigins: ["wabot.04080616.xyz", "localhost:9000"],
},
},
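Next.js already rejects Server Action POSTs whose Origin header does not match the request host. allowedOrigins lists the additional origins to accept, which matters behind the reverse proxy where the forwarded host can differ from the public domain. A cross-origin POST from any other domain that tries to invoke an action with a known cookie is still refused.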
Test plan (38 tests)
auth-cookie.test.ts — pure HMAC + verification logic
- signSession then verifySession round-trips.
- Tampered payload → verify rejects.
- Tampered signature → verify rejects.
- Wrong secret → verify rejects.
- Constant-time compare prevents char-by-char timing leak (assert crypto.timingSafeEqual is used).
- Cookie expired (exp <= now) → reject.
- Cookie issued in the future (iat > now + 60) → reject (clock-skew).
- Cookie with stale v (TOKEN_VERSION bumped after issue) → reject.
- Cookie with bad role value ("superadmin") → reject.
- Cookie missing fields → reject.
login-action.test.ts — login flow
- Valid credentials → cookie issued with right shape.
- Wrong password → no cookie, generic error.
- Wrong username → no cookie, generic error, dummy-bcrypt called (timing equivalence).
- password_hash IS NULL user → no cookie, same generic error, dummy-bcrypt called (matches step 6 of the login flow).
- Empty username or password → 400-equivalent (no DB hit).
- Username/password >256 chars → rejected before bcrypt.
- Username case-insensitive (Admin matches admin).
- 11th login attempt within window → 429 (rate-limited).
- After window expiry, attempts succeed.
- Failed login logs warning with username + IP, no password.
- Cookie sets correct attrs (HttpOnly, Secure, SameSite, Path, Max-Age).
middleware.test.ts — gate behavior
- No cookie + page request → 302 to /login?next=<path>.
- No cookie + /api/... (non-allowlisted) → 401.
- Valid cookie + page → next().
- Tampered cookie → 302 to /login.
- Allowlisted (/login, /api/health, manifest, icons) bypasses.
- /api/events and /api/qr/[id] are NOT in allowlist (regression against the audit's Critical findings).
next-param.test.ts — open-redirect prevention
- /dashboard → preserved.
- //evil.com → falls back to /.
- https://evil.com → falls back to /.
- javascript:alert(1) → falls back to /.
- /path?with=query&extra=fine → preserved verbatim.
require-helpers.test.ts — Server-action gates
- requireUser() throws with no session.
- requireUser() returns the user with valid session.
- requireAdmin() throws when role === "user".
- requireAdmin() returns the user when role === "admin".
user-management.test.ts — admin guards
- Self-demote (setUserRoleAction({ userId: self, role: "user" })) → ok:false with clear error.
- Last-admin delete (deleting only admin user) → ok:false with "promote another user first".
Migration risk
getSeededOperator() is the one big touch. The 66 call sites are
mostly Server Actions and queries that read .id and
.defaultTimezone off the returned object — the new shape is a
superset, so the change is mechanical.
To keep churn off the existing test suite (~12 tests mock
@/lib/operator), apps/web/src/lib/operator.ts keeps its export
but reimplements getSeededOperator as a thin pass-through to
getCurrentUser from @/lib/auth. Existing mocks that target
@/lib/operator keep working unchanged. New code uses
getCurrentUser / requireUser / requireAdmin directly; the old
name is kept as a compatibility shim and removed in a follow-up
once all sites are swept.
A DUMMY_HASH constant lives at the top of the login action — it's
a precomputed bcrypt hash of a known throwaway string ("x"),
generated once at build time and committed. We compare against it
on the user-not-found path so timing is identical to the wrong-
password path. Generating a fresh dummy hash per request would
double the bcrypt work and create its own timing signal.
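The dummy-hash path can be sketched with the comparison function injected, so the example runs without bcrypt installed. Everything named here is hypothetical: verifyCredentials, the StoredUser shape, the Compare type, and especially the DUMMY_HASH value, which is a placeholder and must be replaced by a real precomputed bcrypt hash for the timing to match.

```typescript
// Sketch only. In the real action `compare` would be bcrypt.compare and
// DUMMY_HASH a real precomputed bcrypt hash; the value below is a placeholder.
type StoredUser = { id: string; role: "admin" | "user"; passwordHash: string | null };
type Compare = (password: string, hash: string) => Promise<boolean>;

const DUMMY_HASH = "<precomputed-bcrypt-hash-placeholder>";

async function verifyCredentials(
  user: StoredUser | null,
  password: string,
  compare: Compare,
): Promise<StoredUser | null> {
  if (!user || user.passwordHash === null) {
    // Spend the same hashing work as a real check, then fail regardless of
    // the result, so the throwaway string can never sign anyone in.
    await compare(password, DUMMY_HASH);
    return null;
  }
  return (await compare(password, user.passwordHash)) ? user : null;
}
```

Both failure branches perform exactly one compare call, which is what keeps the unknown-user and missing-password paths indistinguishable from a wrong password by timing.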
Out of scope (deferred)
- WebAuthn / passkeys.
- 2FA / TOTP.
- Email-based password recovery (operator restarts container with a new env var OPERATOR_TOKEN_VERSION if all admins lose their passwords; CLI helps the rest).
- Account lockout (rate limit is enough for one operator's threat model).
- SSO / OAuth providers.
- Audit-log surface for "who logged in when". The pino warn line is the minimum; a structured audit table is later work.
- A "remember this device" feature distinct from the 30-day cookie.
Acceptance
- The bot can be exposed at wabot.04080616.xyz and any unauthenticated request to a non-allowlisted path returns 401 (API) or redirects to /login (page).
- A correct username + password issues a 30-day cookie that survives reload, browser restart, and PWA homescreen launches.
- A wrong username, a wrong password, and a missing-password user all produce the same generic "Invalid username or password" error and the same wall-clock duration (timing-equivalent).
- Bumping OPERATOR_TOKEN_VERSION on the host invalidates every active session immediately.
- An attacker tampering with the cookie payload, signature, or issued-at can't pass middleware.
- Eleven login attempts from the same IP within five minutes produce a 429 on the eleventh.
- A user-role session can browse, schedule, and resume reminders but cannot reach /settings/users.
- An admin can't demote or delete their own row, and can't delete the last admin.
- robots.txt returns Disallow: / and the rendered HTML carries <meta name="robots" content="noindex, nofollow">.
- Both containers run as UID 1000, sessions dir is chmod 700.
- .env.development is gone from the repo and .gitignore excludes every .env* except .env.example.
- All 38 tests in the plan pass; existing 471 tests still pass.