yiekheng e7ab6b1325 Add design spec for prod hardening (C1+C5+C6) and aaPanel guide

Bundles three independent prod-side improvements: replace Flask dev
server with gunicorn (C1), drop api-server's host port (C5), fix the
HAL set_security_pin_api bool/dict contract bug + clean up stale
AGENTS.md note (C6). Appendix is a hand-over guide for the aaPanel
operator (C3 basic auth, C4 rate-limit + scanner deflection, C7 host
firewall) including a vhost for heng.04080616.xyz routing to the dev
PC. Auth path locked to G3 (basic auth + iOS/Android keychain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 17:28:45 +08:00


Prod Hardening C1+C5+C6 + aaPanel Guide Design

Date: 2026-05-02
Status: Approved (design)
Sequel to: 2026-05-02-debug-mode-hotfix-design.md, 2026-05-02-local-as-dev-design.md
Related sub-projects (not in this scope): C3 auth, C4 rate-limit + scanner deflection, C7 host firewall. All three live in the aaPanel layer and are covered in the appendix as a hand-over guide rather than as repo changes. B (Next.js webview) and R3 (cm_bot.py scraper resilience) are separate cycles.

Problem

Three independent issues, all under the "production hardening" bucket, that I'm bundling into one cycle because they all touch the prod Flask container surface and review well together:

Flask dev server in production. cm_api and cm_web_view both call app.run(...) as their container entrypoint. The dev server prints a WARNING: This is a development server. line into the container logs on every restart. It runs as a single process, has no graceful reload, no proper signal handling, and is not designed for production load. The earlier debug-mode hotfix (commit c3f02b3) closed the RCE risk but left the dev server itself running.
api-server host port exposure. Base docker-compose.yml has ports: - "3000" for api-server (no host binding), which Docker maps to a random host port (e.g. 0.0.0.0:32768->3000). api-server is only ever reached by web-view over the compose network — the host port serves no production purpose and broadens the LAN-reachable surface unnecessarily.
Stale documentation + a latent contract bug. AGENTS.md (line 94) still says app/cm_bot_hal.py contains hardcoded agent credentials/PIN, but commit 45303d0 already moved them to env vars (_get_required_env('CM_AGENT_ID') etc.). Separately, cm_bot_hal.set_security_pin_api() returns a bool while cm_telegram.py:87 does result['f_username'] — a TypeError currently masked by the surrounding except Exception clause. The spec for sub-project A noted this and worked around it in bot_cli.py; this is the cycle that fixes it at the source.
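The masked TypeError in the third item can be reproduced in isolation. This is a minimal sketch, not the repo's code: set_security_pin_api_today and handle_set_pin are hypothetical stand-ins that only mimic the shapes described above (a bool return, and a consumer that indexes the result inside a broad except).

```python
def set_security_pin_api_today(link):
    """Hypothetical stand-in for the current HAL method: returns a bool on success."""
    return True


def handle_set_pin(link):
    """Mimics the consumer: index the result while a broad except surrounds the call."""
    try:
        result = set_security_pin_api_today(link)
        # Indexing a bool raises TypeError ('bool' object is not subscriptable).
        return f"Done setting Security Pin for {result['f_username']} - {result['t_username']} !"
    except Exception as exc:
        # The broad clause hides the contract bug from the caller.
        return f"swallowed {type(exc).__name__}"


print(handle_set_pin("hypothetical-link"))  # prints: swallowed TypeError
```

The success message is never produced, even though the underlying operation succeeded, which is why the bug stays latent until someone reads the logs.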

The user's reverse proxy (aaPanel) lives on a separate host and reaches the Flask containers over the LAN, so any "edge layer" hardening (TLS, auth, rate limit, scanner deflection) must happen in aaPanel, not in this repo. The appendix below documents what to paste into aaPanel; no repo code implements it.

Goal

Replace app.run with gunicorn in both Flask services for production, hide api-server's host port (only web-view stays LAN-reachable for aaPanel), fix the stale doc and the latent HAL contract bug, and write a one-page aaPanel-side hardening guide so the operator can land C3/C4/C7 in their proxy themselves.

Non-Goals

Adding Caddy/Traefik/nginx as a docker service. aaPanel already proxies; adding a second proxy in compose would just duplicate concerns. (Was C2; dropped.)
Implementing auth, rate limit, or scanner deflection in Python middleware. Same reason — wrong layer; aaPanel is where this belongs.
Writing a host-firewall config script. The aaPanel guide names ufw/iptables rules but doesn't ship a script — host firewall state is too operator-specific.
Changing cm_bot.py's scraper code. That's R3.
Migrating to ASGI / uvicorn / async Flask. The app is sync Flask; gunicorn's sync worker is the right fit.

Architecture

Container entrypoint: gunicorn for prod, `app.run` for dev

The same Docker image runs in both prod (rex/siong via base docker-compose.yml) and dev (via docker-compose.override.yml). The override pattern is already used to swap registry images for local builds — extending it to swap entrypoints is the natural fit.

Surface	Today	After
`docker/api/Dockerfile` `CMD`	`python -m app.cm_api`	`gunicorn --workers 2 --timeout 30 --bind 0.0.0.0:3000 app.cm_api:create_app()`
`docker/web/Dockerfile` `CMD`	`python -m app.cm_web_view`	`gunicorn --workers 2 --timeout 30 --bind 0.0.0.0:8000 app.cm_web_view:app`
`docker-compose.override.yml` (dev)	(no `command:` overrides)	`command: python -m app.cm_api` for api-server; `command: python -m app.cm_web_view` for web-view

This keeps Flask's debugger and auto-reloader available in dev (when CM_DEBUG=true) without changing any runtime semantics in prod beyond replacing the WSGI server.

Why an `app.cm_api:create_app()` factory

cm_api.py currently exposes class CM_API, whose constructor builds a Flask app and registers routes. gunicorn's loader needs either a module-level WSGI application object or a module-level factory it can call to obtain one. Smallest viable change: add a module-level create_app() function that returns CM_API().app. The class stays, so both python -m app.cm_api (whose __main__ block calls CM_API().run()) and gunicorn 'app.cm_api:create_app()' work without duplicated bootstrap.

cm_web_view.py already has a module-level app = Flask(__name__), so gunicorn binds directly to app.cm_web_view:app — no factory needed.
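The factory shape can be sketched as follows. This is an illustration under stated assumptions, not the repo's cm_api.py: a plain WSGI callable stands in for the Flask app so the sketch runs with the standard library only, and the _build_app helper and the JSON body are hypothetical.

```python
from wsgiref.util import setup_testing_defaults


class CM_API:
    """Stand-in for the real class, which builds a Flask app and registers routes."""

    def __init__(self):
        self.app = self._build_app()

    def _build_app(self):
        # Hypothetical helper: the real constructor would create Flask(__name__) here.
        def wsgi_app(environ, start_response):
            body = b'{"status": "ok"}'
            start_response("200 OK", [
                ("Content-Type", "application/json"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return wsgi_app


def create_app():
    """Module-level factory that a WSGI server can call to obtain the app."""
    return CM_API().app


# Exercise the app the way a WSGI server would, without opening a socket.
environ = {}
setup_testing_defaults(environ)
seen = {}
body = b"".join(create_app()(environ, lambda status, headers: seen.update(status=status)))
print(seen["status"], body)
```

When the factory form is passed on a shell command line, the parentheses need quoting, as in gunicorn 'app.cm_api:create_app()'; an exec-form Dockerfile CMD avoids the shell entirely.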

Worker count

Two workers per service is the conservative default for a small mostly-DB-bound app. Goes into gunicorn flags directly, not into env vars — these aren't operationally tuned right now. Tuning becomes a follow-up if load ever justifies it.

Logging

gunicorn writes access + error logs to stdout/stderr by default; PYTHONUNBUFFERED=1 is already set in compose; aaPanel access logs cover the upstream side. No log routing changes needed.

`api-server` host port: drop in base, add `127.0.0.1:3000` in dev override

File	Today	After
`docker-compose.yml` (api-server)	`ports: - "3000"`	(block removed; api-server reachable only via the compose network)
`docker-compose.override.yml` (api-server)	(no ports override)	`ports: - "127.0.0.1:3000:3000"` so dev `curl http://localhost:3000/...` keeps working

In prod, web-view talks to api-server through the docker bridge network at http://api-server:3000. The host port mapping in the base file was incidental, not load-bearing. Removing it makes api-server invisible to the LAN.

web-view's host binding stays as ${CM_WEB_HOST_PORT:-8001}:8000 (no IP prefix → 0.0.0.0), because aaPanel on a different host needs to reach it over the LAN. That's the intentional public-ish surface; the rest of the docker network goes back to private.

HAL contract fix: `set_security_pin_api` returns a dict

Today (app/cm_bot_hal.py:152), the method ends:

result = self.update_user_status_to_done(f_username)
if result == False:
    raise Exception('Failed to update user status to done')

result = self.insert_user_to_table_user(...)
if result == False:
    raise Exception('Failed to insert user to table user')
return result   # <-- returns bool

cm_telegram.py:87 already expects a dict, which is the shape we want:

result = bot.set_security_pin_api(context.args[0])
del bot
await update.message.reply_text(f"Done setting Security Pin for {result['f_username']} - {result['t_username']} !")

Fix the producer, not the consumers. After this change set_security_pin_api returns {"f_username": ..., "t_username": ...} on success, and the existing cm_telegram.py line just works. app/bot_cli.py cmd_set_pin (added in sub-project A) currently re-extracts names locally as a workaround — that workaround is replaced by reading the dict.

The four if result == False: raise lines stay; they're checking the inner step (DB write) returns true, not the outer return shape.
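The intended return shape might look like the sketch below. HalSketch is a hypothetical stand-in for CM_BOT_HAL with the DB steps stubbed to succeed, and the two-argument signature is a simplification for illustration; the real method takes whatever cm_telegram.py passes today and derives the usernames itself.

```python
class HalSketch:
    """Hypothetical stand-in for CM_BOT_HAL; inner DB steps are stubbed."""

    def update_user_status_to_done(self, f_username):
        return True  # stub: pretend the DB write succeeded

    def insert_user_to_table_user(self, f_username, t_username):
        return True  # stub: pretend the DB write succeeded

    def set_security_pin_api(self, f_username, t_username):
        # The inner checks keep their bool contract, as in the spec.
        if self.update_user_status_to_done(f_username) == False:
            raise Exception('Failed to update user status to done')
        if self.insert_user_to_table_user(f_username, t_username) == False:
            raise Exception('Failed to insert user to table user')
        # Only the outer return changes: a dict instead of a bool.
        return {"f_username": f_username, "t_username": t_username}


result = HalSketch().set_security_pin_api("from-user", "to-user")
print(f"Done setting Security Pin for {result['f_username']} - {result['t_username']} !")
```

With this shape the consumer's f-string works unchanged, and failures still surface as exceptions rather than as a falsy return.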

Cleanups

AGENTS.md line 94 is removed. The "hardcoded credentials" claim is no longer true.
No other doc edits in this cycle.

Files Created / Modified

File	Operation	Purpose
`requirements.txt`	Modify	Add `gunicorn==23.0.0` (current stable on Python 3.9).
`app/cm_api.py`	Modify	Add `def create_app(): return CM_API().app` factory.
`app/cm_bot_hal.py`	Modify	`set_security_pin_api` returns `{"f_username", "t_username"}` instead of bool.
`app/bot_cli.py`	Modify	`cmd_set_pin` reads `result["f_username"]` / `result["t_username"]` instead of pre-fetching them via `get_whatsapp_link_username`.
`tests/test_bot_cli.py`	Modify	Update `CmdSetPinTests` mocks to return the dict.
`docker/api/Dockerfile`	Modify	`CMD` swaps to `gunicorn ... app.cm_api:create_app()`.
`docker/web/Dockerfile`	Modify	`CMD` swaps to `gunicorn ... app.cm_web_view:app`.
`docker-compose.yml`	Modify	Remove `ports: - "3000"` from `api-server`.
`docker-compose.override.yml`	Modify	Add `command: python -m app.cm_api` to api-server (preserves Flask dev server in dev); add `command: python -m app.cm_web_view` to web-view; add `ports: - "127.0.0.1:3000:3000"` to api-server.
`AGENTS.md`	Modify	Remove the stale "cm_bot_hal.py contains hardcoded credentials" line.
`docs/aapanel-hardening.md`	Create	Operator-facing nginx snippets for C3 (auth), C4 (rate-limit + scanner deflection), C7 (host firewall). Pasted into aaPanel; no repo code references it.

Verification

Unit tests still green. .venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli -v → OK. Updated CmdSetPinTests exercises the new dict return shape.
Dev: Flask dev server still runs. bash scripts/dev.sh up → web-view log shows * Debug mode: on/off (whichever CM_DEBUG is) and * Running on.... The command: override puts python -m app.cm_web_view back in front of gunicorn.
Prod parity check (compose-only). docker compose -f docker-compose.yml config | grep -E "Listening|gunicorn" || true; docker compose -f docker-compose.yml config | grep -E "^\s+ports:" -A 1 confirms (a) api-server has no host port, (b) web-view still has ${CM_WEB_HOST_PORT:-8001}:8000.
Prod cold start (deploy host). With a published image tag (or a CM_IMAGE_PREFIX=local DOCKER_IMAGE_TAG=dev rebuild + docker compose up -d from base only), web-view logs show [INFO] Starting gunicorn and [INFO] Listening at: http://0.0.0.0:8000. No more WARNING: This is a development server line.
/api/acc/ round-trip. Hit web-view via aaPanel: load the UI, account list renders. The api-server is no longer LAN-reachable on its 3000 port (nmap -p 3000 <host-ip> from another machine returns closed). web-view's 8001/8005 still reachable from aaPanel's host.
HAL contract fix. python -c "from app.cm_bot_hal import CM_BOT_HAL; print(getattr(CM_BOT_HAL.set_security_pin_api, '__doc__'))" (or just read the diff) shows the new return shape. cm_telegram.py:87's result['f_username'] no longer raises in the success path.
Stale doc gone. grep -n "hardcoded" AGENTS.md returns nothing.
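Where nmap is not installed on the checking machine, the reachability part of the round-trip check can be approximated with a short script. This is a sketch only: port_reachable is a hypothetical helper, and unlike nmap it cannot tell a closed port from a filtered one, since both simply fail to connect.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused (closed) and timeouts (filtered or dropped).
        return False
```

Run from another machine on the LAN, port_reachable("<host-ip>", 3000) would be expected to return False after this change, while the web-view port should still return True from the aaPanel host.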

Risk

Medium. Three concerns worth naming:

Dropping api-server's host port could surprise an operator who was relying on curl http://prod-host:32768/acc/ for ad-hoc debugging in prod. Mitigation: it's mentioned in the AGENTS.md updated section in this cycle, and prod debugging through docker exec or via web-view's /api/acc/ proxy still works.
gunicorn worker count and timeout are heuristics, not measurements. Two workers / 30s timeout is fine for current load (a handful of cm99.net calls in flight); it may need tuning if load grows. Captured as "tune later" out-of-scope item.
The HAL return-shape change is a behavior change in the public API of set_security_pin_api. Both call sites are in this repo (cm_telegram.py, app/bot_cli.py) and both are updated in this cycle. No external consumers exist.

Out-of-Scope Follow-Ups

gunicorn config tuning (workers, threads, keep-alive) once we have any production traffic data.
C3 / C4 / C7 — operator pastes the appendix into aaPanel. If one of them turns out to be repo-relevant after the fact (e.g., we want app-level rate limiting too), it can come back as its own cycle.
Authelia (or similar) for passkey-based auth — the upgrade path from G3 (basic auth + keychain) when biometric UX in basic auth becomes annoying. Self-hosted Authelia container, nginx auth_request delegation, WebAuthn enrollment flow. Its own brainstorm cycle when needed.
Tailscale-only access — alternative to public auth: drop the Flask hosts onto a Tailnet, remove the public vhosts. Better phone biometric UX (via Tailscale's app), but loses the "share a public URL" property.
Health endpoints (/healthz) for readiness/liveness probes. A plain 200 from the app's / route works for now; aaPanel doesn't probe; no orchestrator is doing it. YAGNI.
cm_transfer_credit.py and cm_telegram.py — neither runs a Flask server, so gunicorn does not apply. Their restart: unless-stopped plus the existing crash-resume logic in cm_telegram.py:run_polling_forever is the right shape.

Appendix: aaPanel hardening guide (C3 + C4 + C7)

This appendix is the same content that lands at docs/aapanel-hardening.md. The repo cycle does not implement these — they are operator actions in aaPanel.

Threat model recap

aaPanel terminates TLS for https://<rex-domain>, https://<siong-domain>, and https://heng.04080616.xyz (the dev tier — see "Dev vhost" below) and proxies to LAN-reachable web-view ports on the Flask hosts (8001 rex, 8005 siong, 8000 dev). A scanner on the public internet → aaPanel → Flask. Without these mitigations, every /.env /.git/config /.aws/config /.htpasswd /php.php probe round-trips through the proxy to Flask. With them, aaPanel returns 444 immediately and Flask never sees the request.

C3 — Basic auth on the rex/siong/dev vhosts

Goal: the web-view UI requires a password. Anyone hitting https://<domain>/ with no creds gets 401.

Generate an htpasswd file (one per deployment is cleaner):

# On the aaPanel host, as root:
htpasswd -c /www/server/panel/data/htpasswd-rex  rex-operator
htpasswd -c /www/server/panel/data/htpasswd-siong siong-operator
htpasswd -c /www/server/panel/data/htpasswd-dev   dev-operator
chmod 640 /www/server/panel/data/htpasswd-*
chown www:www /www/server/panel/data/htpasswd-*

Add to the rex vhost's server { ... } block (aaPanel: site → settings → "Configuration File"):

auth_basic "rex restricted";
auth_basic_user_file /www/server/panel/data/htpasswd-rex;

Same shape for siong (htpasswd-siong) and dev (htpasswd-dev). Use a different password per deployment — reusing the same one means a leaked dev credential exposes prod. Reload nginx (aaPanel does this automatically on save).

Phone UX note. Basic auth + iOS/Android keychain + Face ID / Touch ID flow: on first login, save the password into the OS keychain when prompted ("Save password to iCloud Keychain" on iOS, "Save to Google Password Manager" on Android). Subsequent visits trigger Face ID / fingerprint to autofill the basic-auth dialog. Caveats:

Safari (iOS): integration is reliable. Face ID prompts almost every visit unless you tick "Remember me on this device" in Safari's password autofill settings.
Chrome (Android): Google Password Manager autofills basic-auth in newer Chrome versions; biometric prompt appears.
In-app browsers (Telegram, WhatsApp link previews): often don't autofill basic-auth and force you to type. If this matters, share https://... URLs and ask people to open in their default browser.

If autofill behavior is choppy, the upgrade path is G2 (Authelia + passkeys), captured as a follow-up rather than handled in this cycle.

C4 — Rate limit + scanner deflection

Scanner deflection — return 444 on known probe paths

In the same vhost, server { ... }:

# Deflect generic web vulnerability scanners. Return 444 (no response,
# closes connection) instead of letting them reach Flask.
location ~* "^/(\.env|\.env\..*|\.git/.*|\.aws/.*|\.dockerenv|\.htpasswd|\.npmrc|.+\.php|i\.php|test\.php|php\.php|wp-(login|admin|content)/)" {
    access_log off;
    return 444;
}

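# Robots: tell well-behaved crawlers to leave us alone.
# default_type sets the response's Content-Type; add_header would add a
# second Content-Type header next to the default one.
location = /robots.txt {
    default_type text/plain;
    return 200 "User-agent: *\nDisallow: /\n";
}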
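Before pasting the deflection pattern into a vhost, it can help to check which paths it catches. The sketch below reuses the same pattern with Python's re module, assuming its behavior matches nginx's PCRE for the constructs used here (literals, alternation, a start anchor, case-insensitive matching); is_deflected is a hypothetical helper name.

```python
import re

# Same pattern as the location block above; re.IGNORECASE mirrors nginx's ~* modifier.
PROBE_RE = re.compile(
    r"^/(\.env|\.env\..*|\.git/.*|\.aws/.*|\.dockerenv|\.htpasswd|\.npmrc"
    r"|.+\.php|i\.php|test\.php|php\.php|wp-(login|admin|content)/)",
    re.IGNORECASE,
)


def is_deflected(path: str) -> bool:
    """True if the request path would hit the 444 location."""
    return PROBE_RE.search(path) is not None


for path in ["/.env", "/.git/config", "/wp-login.php", "/api/acc/", "/", "/.environment"]:
    print(path, is_deflected(path))
```

Two things the check makes visible: any path ending in .php is already covered by the .+\.php alternative, and because the pattern has no end anchor, a path such as /.environment is deflected too. Neither affects the app's own routes as described in this spec.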

Rate limit — cap requests per source IP

In the http { ... } block (one level above server; in aaPanel typically lives in the global nginx config or in a snippet):

# Define a 10MB shared zone, rate 30 requests/sec per source IP.
limit_req_zone $binary_remote_addr zone=cm_general:10m rate=30r/s;

Then inside the rex/siong server { ... }:

# Allow bursts of up to 60 requests above the rate. With nodelay they are
# served immediately; anything beyond the burst is rejected with 429.
limit_req zone=cm_general burst=60 nodelay;
limit_req_status 429;

30 r/s per IP is generous for legitimate UI traffic and tight enough to slow a scanner down to nuisance levels.
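To get a feel for how rate=30r/s with burst=60 nodelay behaves, the accounting can be simulated. This is a simplified model of the leaky-bucket bookkeeping for a single source IP, not nginx's implementation: it tracks one "excess" counter in whole requests and ignores nginx's millisecond granularity; simulate is a hypothetical helper.

```python
def simulate(arrival_times, rate=30.0, burst=60):
    """Return a status code (200 or 429) per request for one source IP.

    arrival_times: request timestamps in seconds, in non-decreasing order.
    """
    excess = 0.0   # requests currently "above the rate"
    last = None    # timestamp of the last accepted request
    statuses = []
    for t in arrival_times:
        if last is None:
            new_excess = 0.0
        else:
            # Drain at `rate` per second since the last accepted request, then add this one.
            new_excess = max(0.0, excess - rate * (t - last) + 1.0)
        if new_excess > burst:
            statuses.append(429)  # rejected; state is left unchanged
        else:
            excess, last = new_excess, t
            statuses.append(200)
    return statuses


flood = simulate([0.0] * 200)                      # 200 requests at the same instant
steady = simulate([i * 0.05 for i in range(200)])  # 20 requests per second
print(flood.count(200), flood.count(429))          # prints: 61 139
print(steady.count(429))                           # prints: 0
```

In this model an instantaneous flood gets the first request plus the 60-request burst through and the rest rejected, while a client that stays under 30 r/s is never limited.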

Dev vhost — `heng.04080616.xyz` → dev PC

The dev tier (sub-project A) runs locally on a dev PC: bash scripts/dev.sh up → web-view on 0.0.0.0:8000. Routing aaPanel to it adds public reach (with auth) so you can hand someone a URL to test against without giving them VPN.

aaPanel vhost for heng.04080616.xyz (in addition to the C3/C4 blocks above):

location / {
    proxy_pass http://<dev-pc-lan-ip>:8000;
    proxy_set_header Host              $host;
    proxy_set_header X-Real-IP         $remote_addr;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_read_timeout 60s;
}

Replace <dev-pc-lan-ip> with the dev PC's address on your LAN.

⚠️ Important: turn CM_DEBUG OFF in .env before letting aaPanel proxy to dev. The dev tier defaults to CM_DEBUG=true (per envs/dev/.env.example), which enables Werkzeug's debugger. With aaPanel proxying publicly, basic auth is the only thing standing between the internet and an interactive Python REPL on the dev PC. The right pattern is:

CM_DEBUG=true only when iterating fully locally (no aaPanel proxy active, no port forward).
CM_DEBUG=false whenever the dev tier is reachable through heng.04080616.xyz.

If you'd rather not flip the flag manually, set CM_DEBUG=false permanently in your dev .env and run bash scripts/bot_cli.sh for the workflows you used to want the debugger for. The Flask in-browser tracebacks aren't worth the RCE surface.

C7 — Host firewall on the Flask docker host(s)

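Restrict the LAN-reachable web-view ports to only aaPanel's IP. Without this, anyone else on the LAN can hit Flask directly and bypass everything in C3 and C4. Apply on each host that runs a Flask stack: rex, siong, and the dev PC. One caveat to check: Docker publishes container ports through its own iptables rules, and traffic to a published port is forwarded rather than delivered to the host's INPUT chain, so ufw and plain INPUT rules may not filter it. Treat the nmap check below as the source of truth; if the ports still show open, the equivalent restrictions belong in Docker's DOCKER-USER chain.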

Replace <aapanel-host-ip> with the address of your aaPanel box.

On rex/siong hosts (ports 8001 / 8005 respectively):

sudo ufw allow from <aapanel-host-ip> to any port 8001 proto tcp comment 'rex web-view ← aaPanel only'
sudo ufw allow from <aapanel-host-ip> to any port 8005 proto tcp comment 'siong web-view ← aaPanel only'
sudo ufw deny 8001/tcp
sudo ufw deny 8005/tcp
sudo ufw reload
sudo ufw status numbered

On the dev PC (port 8000 — match CM_WEB_HOST_PORT from envs/dev/.env):

sudo ufw allow from <aapanel-host-ip> to any port 8000 proto tcp comment 'dev web-view ← aaPanel only'
sudo ufw allow from 127.0.0.1 to any port 8000 proto tcp comment 'dev web-view ← localhost'
sudo ufw deny 8000/tcp
sudo ufw reload

The localhost rule on the dev PC is so you can still load http://localhost:8000 directly while iterating, without going through aaPanel.

Verify from a third machine on the LAN:

nmap -p 8000,8001,8005 <flask-host-ip>
# All three ports should show 'filtered' from anywhere except the aaPanel host
# (and except localhost on the dev PC).

If you don't run ufw and prefer iptables directly, the equivalent rules are:

iptables -A INPUT -p tcp --dport 8001 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8005 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8000 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8000 -s 127.0.0.1       -j ACCEPT
iptables -A INPUT -p tcp --dport 8001 -j DROP
iptables -A INPUT -p tcp --dport 8005 -j DROP
iptables -A INPUT -p tcp --dport 8000 -j DROP

(Persist via iptables-save > /etc/iptables/rules.v4 or your distro's preferred mechanism.)

Verification after applying C3/C4/C7

Curl any UI without creds: curl -i https://<rex-domain>/ → 401 Unauthorized. Same shape for siong and https://heng.04080616.xyz/.
Curl with creds: curl -i -u rex-operator:<password> https://<rex-domain>/api/acc/ → 200 OK with JSON.
Probe a scanner path: curl -i https://<rex-domain>/.env → connection closed (444 → curl shows "Empty reply from server"). Flask logs show no entry for this request.
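Hammer-test rate limit: seq 1 200 | xargs -P 50 -I{} curl -s -o /dev/null -w "%{http_code}\n" -u rex-operator:<password> https://<rex-domain>/ | sort | uniq -c → should see 200s up to the burst window, then 429s. The requests need credentials (without them the non-limited ones return 401 instead of 200) and need to run in parallel, since a sequential loop is unlikely to exceed 30 r/s.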
From a non-aaPanel host on the LAN: nmap -p 8000,8001,8005 <flask-host-ip> → filtered (localhost on dev PC still allowed).
Dev-specific check. On the dev PC, bash scripts/dev.sh logs | grep "Debugger PIN" should return nothing once CM_DEBUG is off. Then curl -i -u dev-operator:<password> https://heng.04080616.xyz/api/acc/ returns the seed accounts.
