Prod Hardening C1+C5+C6 + aaPanel Guide Design
Date: 2026-05-02
Status: Approved (design)
Sequel to: 2026-05-02-debug-mode-hotfix-design.md, 2026-05-02-local-as-dev-design.md
Related sub-projects (not in this scope): C3 auth, C4 rate-limit + scanner deflection, C7 host firewall. All three live in the aaPanel layer and are covered in the appendix as a hand-over guide rather than as repo changes. B (Next.js webview) and R3 (cm_bot.py scraper resilience) are separate cycles.
Problem
Three independent issues, all under the "production hardening" bucket, that I'm bundling into one cycle because they all touch the prod Flask container surface and review well together:
- Flask dev server in production. cm_api and cm_web_view both call app.run(...) as their container entrypoint. Flask prints a WARNING: This is a development server. line into the container logs on every restart. The dev server runs as a single process, has no graceful reload, no proper signal handling, and is not designed for production load. The earlier debug-mode hotfix (commit c3f02b3) closed the RCE risk but left the dev server itself running.
- api-server host port exposure. Base docker-compose.yml has ports: - "3000" for api-server (no host binding), which Docker maps to a random host port (e.g. 0.0.0.0:32768->3000). api-server is only ever reached by web-view over the compose network, so the host port serves no production purpose and broadens the LAN-reachable surface unnecessarily.
- Stale documentation + a latent contract bug. AGENTS.md (line 94) still says app/cm_bot_hal.py contains hardcoded agent credentials/PIN, but commit 45303d0 already moved them to env vars (_get_required_env('CM_AGENT_ID') etc.). Separately, cm_bot_hal.set_security_pin_api() returns a bool while cm_telegram.py:87 does result['f_username'], a TypeError currently masked by the surrounding except Exception clause. The spec for sub-project A noted this and worked around it in bot_cli.py; this is the cycle that fixes it at the source.
The user's reverse proxy (aaPanel) lives on a separate host and reaches the Flask containers over the LAN, so any "edge layer" hardening (TLS, auth, rate limit, scanner deflection) must happen in aaPanel, not in this repo. The appendix below documents what to paste into aaPanel; no repo code implements it.
Goal
Replace app.run with gunicorn in both Flask services for production, hide api-server's host port (only web-view stays LAN-reachable for aaPanel), fix the stale doc and the latent HAL contract bug, and write a one-page aaPanel-side hardening guide so the operator can land C3/C4/C7 in their proxy themselves.
Non-Goals
- Adding Caddy/Traefik/nginx as a docker service. aaPanel already proxies; adding a second proxy in compose would just duplicate concerns. (Was C2; dropped.)
- Implementing auth, rate limit, or scanner deflection in Python middleware. Same reason — wrong layer; aaPanel is where this belongs.
- Writing a host-firewall config script. The aaPanel guide names ufw/iptables rules but doesn't ship a script — host firewall state is too operator-specific.
- Changing cm_bot.py's scraper code. That's R3.
- Migrating to ASGI / uvicorn / async Flask. The app is sync Flask; gunicorn's sync worker is the right fit.
Architecture
Container entrypoint: gunicorn for prod, app.run for dev
The same Docker image runs in both prod (rex/siong via base docker-compose.yml) and dev (via docker-compose.override.yml). The override pattern is already used to swap registry images for local builds — extending it to swap entrypoints is the natural fit.
| Surface | Today | After |
|---|---|---|
| docker/api/Dockerfile CMD | python -m app.cm_api | gunicorn --workers 2 --timeout 30 --bind 0.0.0.0:3000 'app.cm_api:create_app()' |
| docker/web/Dockerfile CMD | python -m app.cm_web_view | gunicorn --workers 2 --timeout 30 --bind 0.0.0.0:8000 app.cm_web_view:app |
| docker-compose.override.yml (dev) | (no command: overrides) | command: python -m app.cm_api for api-server; command: python -m app.cm_web_view for web-view |
This keeps Flask's debugger and auto-reloader available in dev (when CM_DEBUG=true) without changing any runtime semantics in prod beyond replacing the WSGI server.
Why an app.cm_api:create_app() factory
cm_api.py currently exposes class CM_API whose constructor builds a Flask app and registers routes. gunicorn's WSGI loader needs a module-level callable that returns a WSGI app. Smallest viable change: add a create_app() module function that does return CM_API().app. The class stays — both python -m app.cm_api (__main__ block calls CM_API().run()) and gunicorn 'app.cm_api:create_app()' work without duplicated bootstrap.
cm_web_view.py already has a module-level app = Flask(__name__), so gunicorn binds directly to app.cm_web_view:app — no factory needed.
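The factory shape described above can be sketched as follows. This is a minimal illustration, not the repo's code: to stay runnable without Flask installed, the stand-in class stores a plain WSGI callable in .app where the real CM_API holds a Flask instance, and the response body and the call_once helper are invented for the example.

```python
from wsgiref.util import setup_testing_defaults


class CM_API:
    """Stand-in for the real class: builds its app in the constructor."""

    def __init__(self):
        # The real constructor creates a Flask app and registers routes.
        # A bare WSGI callable plays that role here (assumption).
        self.app = self._wsgi_app

    def _wsgi_app(self, environ, start_response):
        body = b'{"status": "ok"}'  # hypothetical response body
        headers = [("Content-Type", "application/json"),
                   ("Content-Length", str(len(body)))]
        start_response("200 OK", headers)
        return [body]


def create_app():
    """Module-level factory, addressable as 'module:create_app()'."""
    return CM_API().app


def call_once(app, path="/"):
    """Drive a WSGI app with one fake request; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Because the class is untouched, a __main__ block can keep calling CM_API().run() while the WSGI server only ever sees the object returned by create_app().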
Worker count
Two workers per service is the conservative default for a small mostly-DB-bound app. Goes into gunicorn flags directly, not into env vars — these aren't operationally tuned right now. Tuning becomes a follow-up if load ever justifies it.
Logging
gunicorn writes its error log to stderr by default; its access log is off unless --access-logfile - is passed. PYTHONUNBUFFERED=1 is already set in compose, and aaPanel access logs cover the upstream side, so no log routing changes are needed.
api-server host port: drop in base, add 127.0.0.1:3000 in dev override
| File | Today | After |
|---|---|---|
| docker-compose.yml (api-server) | ports: - "3000" | (block removed; api-server reachable only via the compose network) |
| docker-compose.override.yml (api-server) | (no ports override) | ports: - "127.0.0.1:3000:3000" so dev curl http://localhost:3000/... keeps working |
In prod, web-view talks to api-server through the docker bridge network at http://api-server:3000. The host port mapping in the base file was incidental, not load-bearing. Removing it makes api-server invisible to the LAN.
web-view's host binding stays as ${CM_WEB_HOST_PORT:-8001}:8000 (no IP prefix → 0.0.0.0), because aaPanel on a different host needs to reach it over the LAN. That's the intentional public-ish surface; the rest of the docker network goes back to private.
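To make the three port forms in this section concrete, here is a small sketch that classifies a short-syntax compose port string by where it ends up listening. The function name describe_port_mapping is hypothetical, and the parsing is deliberately simplified: it expects an already-interpolated value (8001:8000, not ${CM_WEB_HOST_PORT:-8001}:8000) and ignores IPv6 addresses, port ranges, and /protocol suffixes.

```python
def describe_port_mapping(spec):
    """Classify a short-syntax compose port string (simplified)."""
    parts = spec.split(":")
    if len(parts) == 1:
        # Container port only: Docker picks an ephemeral host port
        # and publishes it on all interfaces.
        return {"host_ip": "0.0.0.0", "host_port": None,
                "container_port": parts[0]}
    if len(parts) == 2:
        # host:container with no IP prefix also binds all interfaces.
        return {"host_ip": "0.0.0.0", "host_port": parts[0],
                "container_port": parts[1]}
    if len(parts) == 3:
        # ip:host:container restricts the bind address.
        return {"host_ip": parts[0], "host_port": parts[1],
                "container_port": parts[2]}
    raise ValueError(f"unsupported port spec: {spec!r}")


def is_lan_reachable(spec):
    """True when the mapping listens on more than loopback."""
    return describe_port_mapping(spec)["host_ip"] != "127.0.0.1"
```

Under this model the old base entry "3000" and web-view's "8001:8000" are both LAN-reachable, while the dev override "127.0.0.1:3000:3000" is not.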
HAL contract fix: set_security_pin_api returns a dict
Today (app/cm_bot_hal.py:152), the method ends:
    result = self.update_user_status_to_done(f_username)
    if result == False:
        raise Exception('Failed to update user status to done')
    result = self.insert_user_to_table_user(...)
    if result == False:
        raise Exception('Failed to insert user to table user')
    return result  # <-- returns bool
cm_telegram.py:87 is what's "right":
result = bot.set_security_pin_api(context.args[0])
del bot
await update.message.reply_text(f"Done setting Security Pin for {result['f_username']} - {result['t_username']} !")
Fix the producer, not the consumers. After this change set_security_pin_api returns {"f_username": ..., "t_username": ...} on success, and the existing cm_telegram.py line just works. app/bot_cli.py cmd_set_pin (added in sub-project A) currently re-extracts names locally as a workaround — that workaround is replaced by reading the dict.
The four if result == False: raise lines stay; they're checking the inner step (DB write) returns true, not the outer return shape.
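A minimal sketch of the intended contract, assuming a stand-in class with the DB steps stubbed out. HalSketch, _resolve_usernames, the "alice:bob" input format, and the stub argument lists are all hypothetical; only the method names, the error messages, and the returned keys come from the text above.

```python
class HalSketch:
    """Hypothetical stand-in for the HAL class; DB steps are stubbed."""

    def _resolve_usernames(self, link):
        # Hypothetical helper: the real code derives both names elsewhere.
        f_username, t_username = link.split(":", 1)
        return f_username, t_username

    def update_user_status_to_done(self, f_username):
        return True  # stub for the DB write

    def insert_user_to_table_user(self, f_username, t_username):
        return True  # stub; the real argument list is not shown here

    def set_security_pin_api(self, link):
        f_username, t_username = self._resolve_usernames(link)
        # Inner checks still validate each step's boolean result.
        if self.update_user_status_to_done(f_username) == False:
            raise Exception('Failed to update user status to done')
        if self.insert_user_to_table_user(f_username, t_username) == False:
            raise Exception('Failed to insert user to table user')
        # Outer contract: a dict that callers can index by name.
        return {"f_username": f_username, "t_username": t_username}


def reply_text(result):
    """Builds the same message shape the Telegram handler sends."""
    return (f"Done setting Security Pin for "
            f"{result['f_username']} - {result['t_username']} !")
```

With the old bool return, reply_text(True) raises TypeError because a bool is not subscriptable; with the dict return the message formats cleanly.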
Cleanups
- AGENTS.md line 94 is removed. The "hardcoded credentials" claim is no longer true.
- No other doc edits in this cycle.
Files Created / Modified
| File | Operation | Purpose |
|---|---|---|
| requirements.txt | Modify | Add gunicorn==23.0.0 (current stable on Python 3.9). |
| app/cm_api.py | Modify | Add def create_app(): return CM_API().app factory. |
| app/cm_bot_hal.py | Modify | set_security_pin_api returns {"f_username": ..., "t_username": ...} instead of bool. |
| app/bot_cli.py | Modify | cmd_set_pin reads result["f_username"] / result["t_username"] instead of pre-fetching them via get_whatsapp_link_username. |
| tests/test_bot_cli.py | Modify | Update CmdSetPinTests mocks to return the dict. |
| docker/api/Dockerfile | Modify | CMD swaps to gunicorn ... 'app.cm_api:create_app()'. |
| docker/web/Dockerfile | Modify | CMD swaps to gunicorn ... app.cm_web_view:app. |
| docker-compose.yml | Modify | Remove ports: - "3000" from api-server. |
| docker-compose.override.yml | Modify | Add command: python -m app.cm_api to api-server (preserves Flask dev server in dev); add command: python -m app.cm_web_view to web-view; add ports: - "127.0.0.1:3000:3000" to api-server. |
| AGENTS.md | Modify | Remove the stale "cm_bot_hal.py contains hardcoded credentials" line. |
| docs/aapanel-hardening.md | Create | Operator-facing nginx snippets for C3 (auth), C4 (rate-limit + scanner deflection), C7 (host firewall). Pasted into aaPanel; no repo code references it. |
Verification
- Unit tests still green. .venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli -v → OK. Updated CmdSetPinTests exercises the new dict return shape.
- Dev: Flask dev server still runs. bash scripts/dev.sh up → web-view log shows * Debug mode: on/off (whichever CM_DEBUG is) and * Running on.... The command: override replaces the image's gunicorn CMD with python -m app.cm_web_view.
- Prod parity check (compose-only). docker compose -f docker-compose.yml config | grep -E "Listening|gunicorn" || true; docker compose -f docker-compose.yml config | grep -E "^\s+ports:" -A 1 confirms (a) api-server has no host port, (b) web-view still has ${CM_WEB_HOST_PORT:-8001}:8000.
- Prod cold start (deploy host). With a published image tag (or a CM_IMAGE_PREFIX=local DOCKER_IMAGE_TAG=dev rebuild + docker compose up -d from base only), web-view logs show [INFO] Starting gunicorn and [INFO] Listening at: http://0.0.0.0:8000. No more WARNING: This is a development server line.
- /api/acc/ round-trip. Hit web-view via aaPanel: load the UI, account list renders. The api-server is no longer LAN-reachable on its 3000 port (nmap -p 3000 <host-ip> from another machine returns closed). web-view's 8001/8005 are still reachable from aaPanel's host.
- HAL contract fix. python -c "from app.cm_bot_hal import CM_BOT_HAL; print(getattr(CM_BOT_HAL.set_security_pin_api, '__doc__'))" (or just read the diff) shows the new return shape. cm_telegram.py:87's result['f_username'] no longer raises in the success path.
- Stale doc gone. grep -n "hardcoded" AGENTS.md returns nothing.
Risk
Medium. Three concerns worth naming:
- Dropping api-server's host port could surprise an operator who was relying on curl http://prod-host:32768/acc/ for ad-hoc debugging in prod. Mitigation: it's mentioned in the AGENTS.md updated section in this cycle, and prod debugging through docker exec or via web-view's /api/acc/ proxy still works.
- gunicorn worker count and timeout are heuristics, not measurements. Two workers / 30s timeout is fine for current load (a handful of cm99.net calls in flight); it may need tuning if load grows. Captured as "tune later" out-of-scope item.
- The HAL return-shape change is a behavior change in the public API of set_security_pin_api. Both call sites are in this repo (cm_telegram.py, app/bot_cli.py) and both are updated in this cycle. No external consumers exist.
Out-of-Scope Follow-Ups
- gunicorn config tuning (workers, threads, keep-alive) once we have any production traffic data.
- C3 / C4 / C7 — operator pastes the appendix into aaPanel. If one of them turns out to be repo-relevant after the fact (e.g., we want app-level rate limiting too), it can come back as its own cycle.
- Authelia (or similar) for passkey-based auth: the upgrade path from G3 (basic auth + keychain) when biometric UX in basic auth becomes annoying. Self-hosted Authelia container, nginx auth_request delegation, WebAuthn enrollment flow. Its own brainstorm cycle when needed.
- Tailscale-only access, as an alternative to public auth: drop the Flask hosts onto a Tailnet, remove the public vhosts. Better phone biometric UX (via Tailscale's app), but loses the "share a public URL" property.
- Health endpoints (/healthz) for readiness/liveness probes. A 200 on / served through gunicorn works for now; aaPanel doesn't probe; no orchestrator is doing it. YAGNI.
- cm_transfer_credit.py and cm_telegram.py: neither runs a Flask server, so gunicorn does not apply. Their restart: unless-stopped plus the existing crash-resume logic in cm_telegram.py:run_polling_forever is the right shape.
Appendix: aaPanel hardening guide (C3 + C4 + C7)
This appendix is the same content that lands at docs/aapanel-hardening.md. The repo cycle does not implement these — they are operator actions in aaPanel.
Threat model recap
aaPanel terminates TLS for https://<rex-domain>, https://<siong-domain>, and https://heng.04080616.xyz (the dev tier — see "Dev vhost" below) and proxies to LAN-reachable web-view ports on the Flask hosts (8001 rex, 8005 siong, 8000 dev). A scanner on the public internet → aaPanel → Flask. Without these mitigations, every /.env /.git/config /.aws/config /.htpasswd /php.php probe round-trips through the proxy to Flask. With them, aaPanel returns 444 immediately and Flask never sees the request.
C3 — Basic auth on the rex/siong/dev vhosts
Goal: the web-view UI requires a password. Anyone hitting https://<domain>/ with no creds gets 401.
Generate an htpasswd file (one per deployment is cleaner):
# On the aaPanel host, as root:
htpasswd -c /www/server/panel/data/htpasswd-rex rex-operator
htpasswd -c /www/server/panel/data/htpasswd-siong siong-operator
htpasswd -c /www/server/panel/data/htpasswd-dev dev-operator
chmod 640 /www/server/panel/data/htpasswd-*
chown www:www /www/server/panel/data/htpasswd-*
Add to the rex vhost's server { ... } block (aaPanel: site → settings → "Configuration File"):
auth_basic "rex restricted";
auth_basic_user_file /www/server/panel/data/htpasswd-rex;
Same shape for siong (htpasswd-siong) and dev (htpasswd-dev). Use a different password per deployment — reusing the same one means a leaked dev credential exposes prod. Reload nginx (aaPanel does this automatically on save).
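For context on what a browser or curl -u actually sends once auth_basic is on, here is a short sketch of the Authorization header defined by HTTP Basic auth. The helper name basic_auth_header is made up for illustration, and the credentials are the well-known example pair from the HTTP Basic specification, not anything from this deployment.

```python
import base64


def basic_auth_header(username, password):
    """Build the header a client sends for HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def decode_basic_token(token):
    """Reverse the encoding, showing it is not encryption."""
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password
```

The token is reversible base64, so the password is only protected in transit by the TLS that aaPanel terminates. That is one more reason to keep the LAN-side web-view ports restricted to the aaPanel host.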
Phone UX note. Basic auth + iOS/Android keychain + Face ID / Touch ID flow: on first login, save the password into the OS keychain when prompted ("Save password to iCloud Keychain" on iOS, "Save to Google Password Manager" on Android). Subsequent visits trigger Face ID / fingerprint to autofill the basic-auth dialog. Caveats:
- Safari (iOS): integration is reliable. Face ID prompts almost every visit unless you tick "Remember me on this device" in Safari's password autofill settings.
- Chrome (Android): Google Password Manager autofills basic-auth in newer Chrome versions; biometric prompt appears.
- In-app browsers (Telegram, WhatsApp link previews): often don't autofill basic-auth and force you to type. If this matters, share https://... URLs and ask people to open in their default browser.
If autofill behavior is choppy, the upgrade path is G2 (Authelia + passkeys), captured as a follow-up rather than in this cycle.
C4 — Rate limit + scanner deflection
Scanner deflection — return 444 on known probe paths
In the same vhost, server { ... }:
# Deflect generic web vulnerability scanners. Return 444 (no response,
# closes connection) instead of letting them reach Flask.
location ~* "^/(\.env|\.env\..*|\.git/.*|\.aws/.*|\.dockerenv|\.htpasswd|\.npmrc|.+\.php|i\.php|test\.php|php\.php|wp-(login|admin|content)/)" {
access_log off;
return 444;
}
# Robots: tell well-behaved crawlers to leave us alone.
location = /robots.txt {
default_type text/plain;
return 200 "User-agent: *\nDisallow: /\n";
}
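Before pasting the deflection rule, it can help to check which paths the pattern actually catches. The sketch below replays the same regular expression with Python's re module, assuming it behaves like nginx's PCRE matching for this particular pattern; is_deflected is a hypothetical helper name and the sample paths are illustrative.

```python
import re

# Same alternation as the nginx location; "~*" means case-insensitive.
PROBE_PATTERN = re.compile(
    r"^/(\.env|\.env\..*|\.git/.*|\.aws/.*|\.dockerenv|\.htpasswd|\.npmrc"
    r"|.+\.php|i\.php|test\.php|php\.php|wp-(login|admin|content)/)",
    re.IGNORECASE,
)


def is_deflected(path):
    """True when the request path would hit the 'return 444' location."""
    # search() mirrors an unanchored-at-the-end regex location match.
    return PROBE_PATTERN.search(path) is not None


SCANNER_PATHS = ["/.env", "/.git/config", "/.aws/config",
                 "/.htpasswd", "/php.php", "/wp-admin/", "/wp-login.php"]
APP_PATHS = ["/", "/api/acc/", "/static/app.js"]
```

Two observations fall out of this. The .+\.php branch already covers i.php, test.php, and php.php. And because the pattern has no end anchor, a path that merely starts with a listed prefix (for example /.environment) is deflected as well.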
Rate limit — cap requests per source IP
In the http { ... } block (one level above server; in aaPanel typically lives in the global nginx config or in a snippet):
# Define a 10MB shared zone, rate 30 requests/sec per source IP.
limit_req_zone $binary_remote_addr zone=cm_general:10m rate=30r/s;
Then inside the rex/siong server { ... }:
# Allow short bursts (60 reqs above rate) before throttling.
limit_req zone=cm_general burst=60 nodelay;
limit_req_status 429;
30 r/s per IP is generous for legitimate UI traffic and tight enough to slow a scanner down to nuisance levels.
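To get a feel for how rate=30r/s and burst=60 nodelay interact, here is a simplified model of the leaky-bucket accounting that limit_req performs for one source IP. It is a sketch under stated assumptions, not nginx's code: it uses float seconds instead of nginx's millisecond counters and ignores delayed (non-nodelay) handling; simulate_limit_req is a hypothetical name.

```python
def simulate_limit_req(arrival_times, rate=30.0, burst=60):
    """Return one status (200 or 429) per arrival for a single client.

    arrival_times: ascending timestamps in seconds.
    """
    statuses = []
    excess = 0.0   # requests currently held above the steady rate
    last = None    # time of the last accepted request
    for now in arrival_times:
        if last is None:
            candidate = 0.0
        else:
            # Drain at `rate` per second, then add this request.
            candidate = max(excess - rate * (now - last) + 1.0, 0.0)
        if candidate > burst:
            statuses.append(429)  # rejected; state is left unchanged
        else:
            statuses.append(200)
            excess = candidate
            last = now
    return statuses


# 200 requests arriving at the same instant vs. spread at 30 per second.
flood = simulate_limit_req([0.0] * 200)
steady = simulate_limit_req([i / 30.0 for i in range(200)])
```

In this model the flood yields 61 accepted requests (the first plus 60 of burst) and 139 rejections, while the steady 30 r/s stream is never throttled.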
Dev vhost — heng.04080616.xyz → dev PC
The dev tier (sub-project A) runs locally on a dev PC: bash scripts/dev.sh up → web-view on 0.0.0.0:8000. Routing aaPanel to it adds public reach (with auth) so you can hand someone a URL to test against without giving them VPN.
aaPanel vhost for heng.04080616.xyz (in addition to the C3/C4 blocks above):
location / {
proxy_pass http://<dev-pc-lan-ip>:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
Replace <dev-pc-lan-ip> with the dev PC's address on your LAN.
⚠️ Important: turn CM_DEBUG OFF in .env before letting aaPanel proxy to dev.
The dev tier defaults to CM_DEBUG=true (per envs/dev/.env.example), which enables Werkzeug's debugger. With aaPanel proxying publicly, basic auth is the only thing standing between the internet and an interactive Python REPL on the dev PC. The right pattern is:
- CM_DEBUG=true only when iterating fully locally (no aaPanel proxy active, no port forward).
- CM_DEBUG=false whenever the dev tier is reachable through heng.04080616.xyz.
If you'd rather not flip the flag manually, set CM_DEBUG=false permanently in your dev .env and run bash scripts/bot_cli.sh for the workflows you used to want the debugger for. The Flask in-browser tracebacks aren't worth the RCE surface.
C7 — Host firewall on the Flask docker host(s)
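Restrict the LAN-reachable web-view ports to only aaPanel's IP. Without this, anyone else on the LAN can hit Flask directly and bypass everything in C3 and C4. Apply on each host that runs a Flask stack: rex, siong, and the dev PC. One caveat: ports published by Docker are DNAT-ed and forwarded through Docker's own iptables chains, so rules on the host's INPUT path (which is what ufw manages) may not see that traffic. Run the nmap check below after applying the rules; if the ports still show open, the filtering needs to go into the DOCKER-USER chain instead.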
Replace <aapanel-host-ip> with the address of your aaPanel box.
On rex/siong hosts (ports 8001 / 8005 respectively):
sudo ufw allow from <aapanel-host-ip> to any port 8001 proto tcp comment 'rex web-view ← aaPanel only'
sudo ufw allow from <aapanel-host-ip> to any port 8005 proto tcp comment 'siong web-view ← aaPanel only'
sudo ufw deny 8001/tcp
sudo ufw deny 8005/tcp
sudo ufw reload
sudo ufw status numbered
On the dev PC (port 8000 — match CM_WEB_HOST_PORT from envs/dev/.env):
sudo ufw allow from <aapanel-host-ip> to any port 8000 proto tcp comment 'dev web-view ← aaPanel only'
sudo ufw allow from 127.0.0.1 to any port 8000 proto tcp comment 'dev web-view ← localhost'
sudo ufw deny 8000/tcp
sudo ufw reload
The localhost rule on the dev PC is so you can still load http://localhost:8000 directly while iterating, without going through aaPanel.
Verify from a third machine on the LAN:
nmap -p 8000,8001,8005 <flask-host-ip>
# All three ports should show 'filtered' from anywhere except the aaPanel host
# (and except localhost on the dev PC).
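Where nmap is not installed on the third machine, a rough equivalent of the check can be scripted. This is a sketch only: probe is a hypothetical helper, and its three labels approximate what a TCP connect scan reports (a refused connection reads as closed, a silent timeout as filtered).

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify a TCP port roughly the way a connect scan would."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing listening, or REJECT.
        return "closed"
    except (socket.timeout, TimeoutError):
        # No answer at all: typical of a DROP rule.
        return "filtered"
    except OSError:
        # Routing or name-resolution problems land here.
        return "unreachable"


def probe_all(host, ports=(8000, 8001, 8005), timeout=2.0):
    """Probe the web-view ports named in this guide."""
    return {port: probe(host, port, timeout) for port in ports}
```

Calling probe_all with the Flask host's address from a non-aaPanel machine should report filtered for every port once the deny rules are effective.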
If you don't run ufw and prefer iptables directly, the equivalent rules are:
iptables -A INPUT -p tcp --dport 8001 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8005 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8000 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8000 -s 127.0.0.1 -j ACCEPT
iptables -A INPUT -p tcp --dport 8001 -j DROP
iptables -A INPUT -p tcp --dport 8005 -j DROP
iptables -A INPUT -p tcp --dport 8000 -j DROP
(Persist via iptables-save > /etc/iptables/rules.v4 or your distro's preferred mechanism.)
Verification after applying C3/C4/C7
- Curl any UI without creds: curl -i https://<rex-domain>/ → 401 Unauthorized. Same shape for siong and https://heng.04080616.xyz/.
- Curl with creds: curl -i -u rex-operator:<password> https://<rex-domain>/api/acc/ → 200 OK with JSON.
- Probe a scanner path: curl -i https://<rex-domain>/.env → connection closed (444 → curl shows "Empty reply from server"). Flask logs show no entry for this request.
- Hammer-test rate limit: for i in $(seq 1 200); do curl -s -o /dev/null -u rex-operator:<password> -w "%{http_code}\n" https://<rex-domain>/; done | sort | uniq -c → should see 200s up to the burst window, then 429s. Pass creds so that allowed requests return 200 rather than 401. A sequential loop may stay under 30 r/s; if no 429s appear, run several loops in parallel.
- From a non-aaPanel host on the LAN: nmap -p 8000,8001,8005 <flask-host-ip> → filtered (localhost on dev PC still allowed).
- Dev-specific check. On the dev PC, bash scripts/dev.sh logs | grep "Debugger PIN" should return nothing once CM_DEBUG is off. Then curl -i -u dev-operator:<password> https://heng.04080616.xyz/api/acc/ returns the seed accounts.