cm_bot_v2/docs/superpowers/specs/2026-05-02-prod-hardening-c1-c5-c6-design.md
yiekheng e7ab6b1325 Add design spec for prod hardening (C1+C5+C6) and aaPanel guide
Bundles three independent prod-side improvements: replace Flask dev
server with gunicorn (C1), drop api-server's host port (C5), fix the
HAL set_security_pin_api bool/dict contract bug + clean up stale
AGENTS.md note (C6). Appendix is a hand-over guide for the aaPanel
operator (C3 basic auth, C4 rate-limit + scanner deflection, C7 host
firewall) including a vhost for heng.04080616.xyz routing to the dev
PC. Auth path locked to G3 (basic auth + iOS/Android keychain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:28:45 +08:00

# Prod Hardening C1+C5+C6 + aaPanel Guide Design
**Date:** 2026-05-02
**Status:** Approved (design)
**Sequel to:** [2026-05-02-debug-mode-hotfix-design.md](2026-05-02-debug-mode-hotfix-design.md), [2026-05-02-local-as-dev-design.md](2026-05-02-local-as-dev-design.md)
**Related sub-projects (not in this scope):** **C3** auth, **C4** rate-limit + scanner deflection, **C7** host firewall — all live in the aaPanel layer; covered in the appendix as a hand-over guide rather than as repo changes. **B** (Next.js webview), **R3** (cm_bot.py scraper resilience) — separate cycles.
## Problem
Three independent issues, all under the "production hardening" bucket, that I'm bundling into one cycle because they all touch the prod Flask container surface and review well together:
1. **Flask dev server in production.** `cm_api` and `cm_web_view` both call `app.run(...)` as their container entrypoint. The dev server prints a `WARNING: This is a development server.` line into the container logs on every restart. It runs as a single process, has no graceful reload, no proper signal handling, and is not designed for production load. The earlier debug-mode hotfix (commit `c3f02b3`) closed the RCE risk but left the dev server itself running.
2. **`api-server` host port exposure.** Base `docker-compose.yml` has `ports: - "3000"` for `api-server` (no host binding), which Docker maps to a random host port (e.g. `0.0.0.0:32768->3000`). `api-server` is only ever reached by `web-view` over the compose network — the host port serves no production purpose and broadens the LAN-reachable surface unnecessarily.
3. **Stale documentation + a latent contract bug.** AGENTS.md (line 94) still says `app/cm_bot_hal.py` contains hardcoded agent credentials/PIN, but commit `45303d0` already moved them to env vars (`_get_required_env('CM_AGENT_ID')` etc.). Separately, `cm_bot_hal.set_security_pin_api()` returns a `bool` while `cm_telegram.py:87` does `result['f_username']` — a TypeError currently masked by the surrounding `except Exception` clause. The spec for sub-project A noted this and worked around it in `bot_cli.py`; this is the cycle that fixes it at the source.
The user's reverse proxy (aaPanel) lives on a separate host and reaches the Flask containers over the LAN, so any "edge layer" hardening (TLS, auth, rate limit, scanner deflection) must happen in aaPanel, not in this repo. The appendix below documents what to paste into aaPanel; no repo code implements it.
## Goal
Replace `app.run` with gunicorn in both Flask services for production, hide `api-server`'s host port (only `web-view` stays LAN-reachable for aaPanel), fix the stale doc and the latent HAL contract bug, and write a one-page aaPanel-side hardening guide so the operator can land C3/C4/C7 in their proxy themselves.
## Non-Goals
- Adding Caddy/Traefik/nginx as a docker service. aaPanel already proxies; adding a second proxy in compose would just duplicate concerns. (Was C2; dropped.)
- Implementing auth, rate limit, or scanner deflection in Python middleware. Same reason — wrong layer; aaPanel is where this belongs.
- Writing a host-firewall config script. The aaPanel guide names ufw/iptables rules but doesn't ship a script — host firewall state is too operator-specific.
- Changing `cm_bot.py`'s scraper code. That's R3.
- Migrating to ASGI / uvicorn / async Flask. The app is sync Flask; gunicorn's `sync` worker is the right fit.
## Architecture
### Container entrypoint: gunicorn for prod, `app.run` for dev
The same Docker image runs in both prod (rex/siong via base `docker-compose.yml`) and dev (via `docker-compose.override.yml`). The override pattern is already used to swap registry images for local builds — extending it to swap entrypoints is the natural fit.
| Surface | Today | After |
|---|---|---|
| `docker/api/Dockerfile` `CMD` | `python -m app.cm_api` | `gunicorn --workers 2 --timeout 30 --bind 0.0.0.0:3000 app.cm_api:create_app()` |
| `docker/web/Dockerfile` `CMD` | `python -m app.cm_web_view` | `gunicorn --workers 2 --timeout 30 --bind 0.0.0.0:8000 app.cm_web_view:app` |
| `docker-compose.override.yml` (dev) | (no `command:` overrides) | `command: python -m app.cm_api` for api-server; `command: python -m app.cm_web_view` for web-view |
This keeps Flask's debugger and auto-reloader available in dev (when `CM_DEBUG=true`) without changing any runtime semantics in prod beyond replacing the WSGI server.
#### Why an `app.cm_api:create_app()` factory
`cm_api.py` currently exposes `class CM_API` whose constructor builds a Flask app and registers routes. gunicorn's WSGI loader needs a module-level callable that returns a WSGI app. Smallest viable change: add a `create_app()` module function that does `return CM_API().app`. The class stays — both `python -m app.cm_api` (`__main__` block calls `CM_API().run()`) and `gunicorn 'app.cm_api:create_app()'` work without duplicated bootstrap.
`cm_web_view.py` already has a module-level `app = Flask(__name__)`, so gunicorn binds directly to `app.cm_web_view:app` — no factory needed.
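A minimal sketch of the factory shape described above, assuming nothing about the real `cm_api.py` beyond what this spec states. A bare WSGI callable stands in for the Flask app so the example runs without dependencies; `_wsgi_app` and `call_once` are hypothetical helpers for illustration only.

```python
from wsgiref.util import setup_testing_defaults


class CM_API:
    """Stand-in for the real class: it builds its app in the constructor."""

    def __init__(self):
        # In the real module this would be a Flask app with routes registered.
        # A plain WSGI callable keeps the sketch dependency-free (assumption).
        self.app = self._wsgi_app

    @staticmethod
    def _wsgi_app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]


def create_app():
    """Module-level factory, the target of 'app.cm_api:create_app()'."""
    return CM_API().app


def call_once(app, path="/"):
    """Drive a WSGI app with one fake GET request; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_once(create_app()))
```

The point of the sketch is that the WSGI server only needs a callable it can import and invoke; the class-based bootstrap stays untouched behind the factory.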
#### Worker count
Two workers per service is the conservative default for a small mostly-DB-bound app. Goes into `gunicorn` flags directly, not into env vars — these aren't operationally tuned right now. Tuning becomes a follow-up if load ever justifies it.
#### Logging
gunicorn writes access + error logs to stdout/stderr by default; `PYTHONUNBUFFERED=1` is already set in compose; aaPanel access logs cover the upstream side. No log routing changes needed.
### `api-server` host port: drop in base, add `127.0.0.1:3000` in dev override
| File | Today | After |
|---|---|---|
| `docker-compose.yml` (api-server) | `ports: - "3000"` | (block removed; api-server reachable only via the compose network) |
| `docker-compose.override.yml` (api-server) | (no ports override) | `ports: - "127.0.0.1:3000:3000"` so dev `curl http://localhost:3000/...` keeps working |
In prod, web-view talks to api-server through the docker bridge network at `http://api-server:3000`. The host port mapping in the base file was incidental, not load-bearing. Removing it makes api-server invisible to the LAN.
`web-view`'s host binding stays as `${CM_WEB_HOST_PORT:-8001}:8000` (no IP prefix → `0.0.0.0`), because aaPanel on a different host needs to reach it over the LAN. That's the intentional public-ish surface; the rest of the docker network goes back to private.
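To make the three port forms in this section concrete, here is a simplified sketch of how a compose short-form port string maps to a host binding. `parse_port` and `lan_reachable` are hypothetical helpers, not Docker's parser; the sketch ignores port ranges, protocol suffixes, and IPv6.

```python
def parse_port(spec):
    """Split a short-form port string into (host_ip, host_port, container_port).

    host_port is None when Docker would pick an ephemeral host port.
    Simplified on purpose: no ranges, no '/tcp' suffix, no IPv6.
    """
    parts = spec.split(":")
    if len(parts) == 1:   # "3000": container port only, random host port
        return ("0.0.0.0", None, int(parts[0]))
    if len(parts) == 2:   # "8001:8000": fixed host port on all interfaces
        return ("0.0.0.0", int(parts[0]), int(parts[1]))
    if len(parts) == 3:   # "127.0.0.1:3000:3000": bound to one address
        return (parts[0], int(parts[1]), int(parts[2]))
    raise ValueError(f"unsupported port spec: {spec!r}")


def lan_reachable(spec):
    """True when the mapping listens on a non-loopback address."""
    host_ip, _, _ = parse_port(spec)
    return not host_ip.startswith("127.")


if __name__ == "__main__":
    for spec in ("3000", "8001:8000", "127.0.0.1:3000:3000"):
        print(spec, parse_port(spec), "LAN-reachable:", lan_reachable(spec))
```

Under this model, the base file's `"3000"` is LAN-reachable on a random port, the dev override's `127.0.0.1:3000:3000` is not, and web-view's `8001:8000` stays reachable by design.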
### HAL contract fix: `set_security_pin_api` returns a dict
Today (`app/cm_bot_hal.py:152`), the method ends:
```python
result = self.update_user_status_to_done(f_username)
if result == False:
    raise Exception('Failed to update user status to done')
result = self.insert_user_to_table_user(...)
if result == False:
    raise Exception('Failed to insert user to table user')
return result  # <-- returns bool
```
`cm_telegram.py:87` already assumes the intended dict shape:
```python
result = bot.set_security_pin_api(context.args[0])
del bot
await update.message.reply_text(f"Done setting Security Pin for {result['f_username']} - {result['t_username']} !")
```
Fix the producer, not the consumers. After this change `set_security_pin_api` returns `{"f_username": ..., "t_username": ...}` on success, and the existing `cm_telegram.py` line just works. `app/bot_cli.py` `cmd_set_pin` (added in sub-project A) currently re-extracts names locally as a workaround — that workaround is replaced by reading the dict.
The four `if result == False: raise` lines stay; they're checking the inner step (DB write) returns true, not the outer return shape.
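The mismatch can be reproduced in isolation. The sketch below uses hypothetical stand-in functions (`set_pin_returns_bool`, `set_pin_returns_dict`, `format_reply`) rather than the real HAL method, and only mirrors the reply string quoted above.

```python
def set_pin_returns_bool(f_username, t_username):
    """Old shape: the last DB step's boolean leaks out as the return value."""
    return True


def set_pin_returns_dict(f_username, t_username):
    """New shape: report which usernames were processed."""
    return {"f_username": f_username, "t_username": t_username}


def format_reply(result):
    """Mirrors the f-string the Telegram handler builds from the result."""
    return (
        f"Done setting Security Pin for "
        f"{result['f_username']} - {result['t_username']} !"
    )


if __name__ == "__main__":
    try:
        format_reply(set_pin_returns_bool("alice", "bob"))
    except TypeError as exc:
        # A bool is not subscriptable, so the consumer fails here.
        print("old shape fails:", exc)
    print(format_reply(set_pin_returns_dict("alice", "bob")))
```

With the bool return, the consumer raises `TypeError` on the subscript, which is the error the surrounding `except Exception` currently hides. With the dict return, the same consumer line works unchanged.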
### Cleanups
- `AGENTS.md` line 94 is removed. The "hardcoded credentials" claim is no longer true.
- No other doc edits in this cycle.
## Files Created / Modified
| File | Operation | Purpose |
|---|---|---|
| `requirements.txt` | Modify | Add `gunicorn==23.0.0` (current stable on Python 3.9). |
| `app/cm_api.py` | Modify | Add `def create_app(): return CM_API().app` factory. |
| `app/cm_bot_hal.py` | Modify | `set_security_pin_api` returns `{"f_username", "t_username"}` instead of bool. |
| `app/bot_cli.py` | Modify | `cmd_set_pin` reads `result["f_username"]` / `result["t_username"]` instead of pre-fetching them via `get_whatsapp_link_username`. |
| `tests/test_bot_cli.py` | Modify | Update `CmdSetPinTests` mocks to return the dict. |
| `docker/api/Dockerfile` | Modify | `CMD` swaps to `gunicorn ... app.cm_api:create_app()`. |
| `docker/web/Dockerfile` | Modify | `CMD` swaps to `gunicorn ... app.cm_web_view:app`. |
| `docker-compose.yml` | Modify | Remove `ports: - "3000"` from `api-server`. |
| `docker-compose.override.yml` | Modify | Add `command: python -m app.cm_api` to api-server (preserves Flask dev server in dev); add `command: python -m app.cm_web_view` to web-view; add `ports: - "127.0.0.1:3000:3000"` to api-server. |
| `AGENTS.md` | Modify | Remove the stale "cm_bot_hal.py contains hardcoded credentials" line. |
| `docs/aapanel-hardening.md` | Create | Operator-facing nginx snippets for C3 (auth), C4 (rate-limit + scanner deflection), C7 (host firewall). Pasted into aaPanel; no repo code references it. |
## Verification
1. **Unit tests still green.** `.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli -v` → `OK`. Updated `CmdSetPinTests` exercises the new dict return shape.
2. **Dev: Flask dev server still runs.** `bash scripts/dev.sh up` → web-view log shows `* Debug mode: on/off` (whichever `CM_DEBUG` is) and `* Running on...`. The `command:` override replaces the image's gunicorn `CMD` with `python -m app.cm_web_view`.
3. **Prod parity check (compose-only).** `docker compose -f docker-compose.yml config | grep -E "Listening|gunicorn" || true; docker compose -f docker-compose.yml config | grep -E "^\s+ports:" -A 1` confirms (a) api-server has no host port, (b) web-view still has `${CM_WEB_HOST_PORT:-8001}:8000`.
4. **Prod cold start (deploy host).** With a published image tag (or a `CM_IMAGE_PREFIX=local DOCKER_IMAGE_TAG=dev` rebuild + `docker compose up -d` from base only), web-view logs show `[INFO] Starting gunicorn` and `[INFO] Listening at: http://0.0.0.0:8000`. No more `WARNING: This is a development server` line.
5. **`/api/acc/` round-trip.** Hit web-view via aaPanel: load the UI, account list renders. The api-server is no longer LAN-reachable on its 3000 port (`nmap -p 3000 <host-ip>` from another machine returns closed). web-view's 8001/8005 still reachable from aaPanel's host.
6. **HAL contract fix.** `python -c "from app.cm_bot_hal import CM_BOT_HAL; print(getattr(CM_BOT_HAL.set_security_pin_api, '__doc__'))"` (or just read the diff) shows the new return shape. `cm_telegram.py:87`'s `result['f_username']` no longer raises in the success path.
7. **Stale doc gone.** `grep -n "hardcoded" AGENTS.md` returns nothing.
## Risk
Medium. Three concerns worth naming:
- **Dropping api-server's host port could surprise an operator** who was relying on `curl http://prod-host:32768/acc/` for ad-hoc debugging in prod. Mitigation: it's mentioned in the AGENTS.md updated section in this cycle, and prod debugging through `docker exec` or via web-view's `/api/acc/` proxy still works.
- **gunicorn worker count and timeout are heuristics, not measurements.** Two workers / 30s timeout is fine for current load (a handful of cm99.net calls in flight); it may need tuning if load grows. Captured as "tune later" out-of-scope item.
- **The HAL return-shape change is a behavior change in the public API of `set_security_pin_api`.** Both call sites are in this repo (`cm_telegram.py`, `app/bot_cli.py`) and both are updated in this cycle. No external consumers exist.
## Out-of-Scope Follow-Ups
- **gunicorn config tuning** (workers, threads, keep-alive) once we have any production traffic data.
- **C3 / C4 / C7** — operator pastes the appendix into aaPanel. If one of them turns out to be repo-relevant after the fact (e.g., we want app-level rate limiting too), it can come back as its own cycle.
- **Authelia (or similar) for passkey-based auth** — the upgrade path from G3 (basic auth + keychain) when biometric UX in basic auth becomes annoying. Self-hosted Authelia container, nginx `auth_request` delegation, WebAuthn enrollment flow. Its own brainstorm cycle when needed.
- **Tailscale-only access** — alternative to public auth: drop the Flask hosts onto a Tailnet, remove the public vhosts. Better phone biometric UX (via Tailscale's app), but loses the "share a public URL" property.
- **Health endpoints** (`/healthz`) for readiness/liveness probes. The app's existing 200 on `/` (now served through gunicorn) works for now; aaPanel doesn't probe, and no orchestrator does either. YAGNI.
- **`cm_transfer_credit.py` and `cm_telegram.py`** — neither runs a Flask server, so gunicorn does not apply. Their `restart: unless-stopped` plus the existing crash-resume logic in `cm_telegram.py:run_polling_forever` is the right shape.
---
## Appendix: aaPanel hardening guide (C3 + C4 + C7)
This appendix is the same content that lands at `docs/aapanel-hardening.md`. The repo cycle does not implement these — they are operator actions in aaPanel.
### Threat model recap
aaPanel terminates TLS for `https://<rex-domain>`, `https://<siong-domain>`, and `https://heng.04080616.xyz` (the dev tier; see "Dev vhost" below) and proxies to LAN-reachable web-view ports on the Flask hosts (8001 rex, 8005 siong, 8000 dev). A scanner on the public internet therefore reaches Flask through aaPanel. Without these mitigations, every probe for `/.env`, `/.git/config`, `/.aws/config`, `/.htpasswd`, or `/php.php` round-trips through the proxy to Flask. With them, aaPanel returns 444 immediately and Flask never sees the request.
### C3 — Basic auth on the rex/siong/dev vhosts
Goal: the web-view UI requires a password. Anyone hitting `https://<domain>/` with no creds gets 401.
Generate an htpasswd file (one per deployment is cleaner):
```bash
# On the aaPanel host, as root:
htpasswd -c /www/server/panel/data/htpasswd-rex rex-operator
htpasswd -c /www/server/panel/data/htpasswd-siong siong-operator
htpasswd -c /www/server/panel/data/htpasswd-dev dev-operator
chmod 640 /www/server/panel/data/htpasswd-*
chown www:www /www/server/panel/data/htpasswd-*
```
Add to the rex vhost's `server { ... }` block (aaPanel: site → settings → "Configuration File"):
```nginx
auth_basic "rex restricted";
auth_basic_user_file /www/server/panel/data/htpasswd-rex;
```
Same shape for siong (`htpasswd-siong`) and dev (`htpasswd-dev`). Use a different password per deployment — reusing the same one means a leaked dev credential exposes prod. Reload nginx (aaPanel does this automatically on save).
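For reference, this is what a client sends once basic auth is enabled. The sketch builds the `Authorization` header as defined for HTTP Basic authentication; `basic_auth_header` is a hypothetical helper name, and the credentials are the well-known RFC example rather than anything from this deployment.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Example credentials from the HTTP Basic auth RFC, not real ones.
    print(basic_auth_header("Aladdin", "open sesame"))
```

Base64 is an encoding, not encryption, so the password is recoverable by anyone who can read the request. That is why this setup relies on aaPanel terminating TLS in front of the vhost.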
**Phone UX note.** Basic auth + iOS/Android keychain + Face ID / Touch ID flow: on first login, save the password into the OS keychain when prompted ("Save password to iCloud Keychain" on iOS, "Save to Google Password Manager" on Android). Subsequent visits trigger Face ID / fingerprint to autofill the basic-auth dialog. Caveats:
- **Safari (iOS):** integration is reliable. Face ID prompts almost every visit unless you tick "Remember me on this device" in Safari's password autofill settings.
- **Chrome (Android):** Google Password Manager autofills basic-auth in newer Chrome versions; biometric prompt appears.
- **In-app browsers (Telegram, WhatsApp link previews):** often *don't* autofill basic-auth and force you to type. If this matters, share `https://...` URLs and ask people to open in their default browser.
If autofill behavior is choppy, the upgrade path is G2 (Authelia + passkeys), which is captured as an out-of-scope follow-up and is not part of this cycle.
### C4 — Rate limit + scanner deflection
#### Scanner deflection — return 444 on known probe paths
In the same vhost, `server { ... }`:
```nginx
# Deflect generic web vulnerability scanners. Return 444 (no response,
# closes connection) instead of letting them reach Flask.
location ~* "^/(\.env|\.env\..*|\.git/.*|\.aws/.*|\.dockerenv|\.htpasswd|\.npmrc|.+\.php|i\.php|test\.php|php\.php|wp-(login|admin|content)/)" {
    access_log off;
    return 444;
}
# Robots: tell well-behaved crawlers to leave us alone.
location = /robots.txt {
    # default_type sets the response type without adding a second
    # Content-Type header, which add_header would do.
    default_type text/plain;
    return 200 "User-agent: *\nDisallow: /\n";
}
```
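Before pasting the pattern into a vhost, it can be sanity-checked offline. The sketch below runs the same expression through Python's `re` module, assuming that it behaves like nginx's PCRE for this particular pattern and that `~*` corresponds to `re.IGNORECASE`. `is_deflected` and the sample paths are illustrative only.

```python
import re

# Same pattern as the nginx location above; `~*` maps to re.IGNORECASE.
PROBE_RE = re.compile(
    r"^/(\.env|\.env\..*|\.git/.*|\.aws/.*|\.dockerenv|\.htpasswd|\.npmrc"
    r"|.+\.php|i\.php|test\.php|php\.php|wp-(login|admin|content)/)",
    re.IGNORECASE,
)


def is_deflected(path):
    """True when the request path would hit the 444 location."""
    return PROBE_RE.search(path) is not None


if __name__ == "__main__":
    samples = ["/.env", "/.git/config", "/php.php", "/wp-admin/",
               "/", "/api/acc/", "/static/app.js", "/.environment"]
    for path in samples:
        print(f"{path:20} deflected={is_deflected(path)}")
```

One thing the check surfaces: the pattern is anchored only at the start, so a path such as `/.environment` also matches via the `\.env` alternative. That is harmless here because the app serves no such paths, but it is worth knowing if routes are added later.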
#### Rate limit — cap requests per source IP
In the `http { ... }` block (one level above `server`; in aaPanel typically lives in the global nginx config or in a snippet):
```nginx
# Define a 10MB shared zone, rate 30 requests/sec per source IP.
limit_req_zone $binary_remote_addr zone=cm_general:10m rate=30r/s;
```
Then inside the rex/siong `server { ... }`:
```nginx
# Allow bursts of up to 60 requests above the rate; with nodelay they are
# served immediately, and anything beyond the burst is rejected.
limit_req zone=cm_general burst=60 nodelay;
limit_req_status 429;
```
30 r/s per source IP is generous for legitimate UI traffic and tight enough to slow a scanner down to nuisance levels.
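To see what `rate=30r/s` with `burst=60 nodelay` means for a single client, here is a simplified leaky-bucket model in the spirit of nginx's `limit_req` accounting. It is a sketch for intuition, not nginx's implementation; the `LimitReq` class and its millisecond clock are assumptions made for the example.

```python
class LimitReq:
    """Simplified per-client model of limit_req with nodelay."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec * 1000   # milli-requests drained per second
        self.burst = burst * 1000         # allowed excess, in milli-requests
        self.excess = None                # None until the first request
        self.last_ms = 0

    def allow(self, now_ms):
        """Return True if a request arriving at now_ms is accepted."""
        if self.excess is None:
            self.excess, self.last_ms = 0, now_ms
            return True
        elapsed = max(0, now_ms - self.last_ms)
        # Drain for the elapsed time, then add this request's cost.
        excess = max(0, self.excess - self.rate * elapsed // 1000 + 1000)
        if excess > self.burst:
            return False                  # would map to the 429 response
        self.excess, self.last_ms = excess, now_ms
        return True


if __name__ == "__main__":
    limiter = LimitReq(rate_per_sec=30, burst=60)
    first = sum(limiter.allow(0) for _ in range(200))
    later = sum(limiter.allow(1000) for _ in range(200))
    print("accepted at t=0s:", first)
    print("accepted at t=1s:", later)
```

In this model a client that fires 200 requests at once gets the burst allowance through and the rest rejected, and one second later only about one second's worth of rate has drained back.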
### Dev vhost — `heng.04080616.xyz` → dev PC
The dev tier (sub-project A) runs locally on a dev PC: `bash scripts/dev.sh up` → web-view on `0.0.0.0:8000`. Routing aaPanel to it adds public reach (with auth) so you can hand someone a URL to test against without giving them VPN.
aaPanel vhost for `heng.04080616.xyz` (in addition to the C3/C4 blocks above):
```nginx
location / {
    proxy_pass http://<dev-pc-lan-ip>:8000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_read_timeout 60s;
}
```
Replace `<dev-pc-lan-ip>` with the dev PC's address on your LAN.
**⚠️ Important: turn `CM_DEBUG` OFF in `.env` before letting aaPanel proxy to dev.**
The dev tier defaults to `CM_DEBUG=true` (per `envs/dev/.env.example`), which enables Werkzeug's debugger. With aaPanel proxying publicly, basic auth is the only thing standing between the internet and an interactive Python REPL on the dev PC. The right pattern is:
- `CM_DEBUG=true` only when iterating *fully locally* (no aaPanel proxy active, no port forward).
- `CM_DEBUG=false` whenever the dev tier is reachable through `heng.04080616.xyz`.
If you'd rather not flip the flag manually, set `CM_DEBUG=false` permanently in your dev `.env` and run `bash scripts/bot_cli.sh` for the workflows you used to want the debugger for. The Flask in-browser tracebacks aren't worth the RCE surface.
### C7 — Host firewall on the Flask docker host(s)
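Restrict the LAN-reachable web-view ports to only aaPanel's IP. Without this, anyone else on the LAN can hit Flask directly and bypass everything in C3 and C4. Apply on each host that runs a Flask stack: rex, siong, *and* the dev PC.
One caveat to check: Docker publishes container ports through its own iptables rules (DNAT plus the `FORWARD` chain), so host-level `INPUT` rules, which is what ufw manages, may not see this traffic at all. Docker's documentation points to the `DOCKER-USER` chain for filtering published ports. Treat the rules below as the intended policy, and confirm with the `nmap` check that they actually take effect on your hosts.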
Replace `<aapanel-host-ip>` with the address of your aaPanel box.
On rex/siong hosts (ports 8001 / 8005 respectively):
```bash
sudo ufw allow from <aapanel-host-ip> to any port 8001 proto tcp comment 'rex web-view ← aaPanel only'
sudo ufw allow from <aapanel-host-ip> to any port 8005 proto tcp comment 'siong web-view ← aaPanel only'
sudo ufw deny 8001/tcp
sudo ufw deny 8005/tcp
sudo ufw reload
sudo ufw status numbered
```
On the dev PC (port 8000 — match `CM_WEB_HOST_PORT` from `envs/dev/.env`):
```bash
sudo ufw allow from <aapanel-host-ip> to any port 8000 proto tcp comment 'dev web-view ← aaPanel only'
sudo ufw allow from 127.0.0.1 to any port 8000 proto tcp comment 'dev web-view ← localhost'
sudo ufw deny 8000/tcp
sudo ufw reload
```
The localhost rule on the dev PC is so you can still load `http://localhost:8000` directly while iterating, without going through aaPanel.
Verify from a third machine on the LAN:
```bash
nmap -p 8000,8001,8005 <flask-host-ip>
# All three ports should show 'filtered' from anywhere except the aaPanel host
# (and except localhost on the dev PC).
```
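Where `nmap` is not installed on the third machine, a rough equivalent can be sketched with the standard library. The `probe` function is a hypothetical helper: it treats a refused connection as `closed` and any timeout or unreachable error as `filtered`, which approximates but does not replicate nmap's classification.

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify a TCP port as 'open', 'closed', or 'filtered' (approximate)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing listens, nothing drops.
        return "closed"
    except OSError:
        # Timeouts and unreachable errors: packets are likely being dropped.
        return "filtered"


if __name__ == "__main__":
    # Placeholder address: replace with the Flask host's LAN IP.
    host = "127.0.0.1"
    for port in (8000, 8001, 8005):
        print(port, probe(host, port))
```

Run it from a machine other than the aaPanel host: if any of the three ports reports `open`, the firewall rules are not filtering the published container port.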
If you don't run ufw and prefer iptables directly, the equivalent rules are:
```bash
iptables -A INPUT -p tcp --dport 8001 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8005 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8000 -s <aapanel-host-ip> -j ACCEPT
iptables -A INPUT -p tcp --dport 8000 -s 127.0.0.1 -j ACCEPT
iptables -A INPUT -p tcp --dport 8001 -j DROP
iptables -A INPUT -p tcp --dport 8005 -j DROP
iptables -A INPUT -p tcp --dport 8000 -j DROP
```
(Persist via `iptables-save > /etc/iptables/rules.v4` or your distro's preferred mechanism.)
### Verification after applying C3/C4/C7
1. Curl any UI without creds: `curl -i https://<rex-domain>/` → `401 Unauthorized`. Same shape for siong and `https://heng.04080616.xyz/`.
2. Curl with creds: `curl -i -u rex-operator:<password> https://<rex-domain>/api/acc/` → `200 OK` with JSON.
3. Probe a scanner path: `curl -i https://<rex-domain>/.env` → connection closed (444 → curl shows "Empty reply from server"). Flask logs show no entry for this request.
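4. Hammer-test rate limit: `for i in $(seq 1 200); do curl -s -o /dev/null -w "%{http_code}\n" -u rex-operator:<password> https://<rex-domain>/; done | sort | uniq -c` → should see `200`s up to the burst window, then `429`s. Without `-u`, the non-limited responses are `401`s instead of `200`s. A sequential loop may not exceed 30 r/s; if no `429`s appear, run several loops in parallel.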
5. From a non-aaPanel host on the LAN: `nmap -p 8000,8001,8005 <flask-host-ip>` → `filtered` (localhost on dev PC still allowed).
6. **Dev-specific check.** On the dev PC, `bash scripts/dev.sh logs | grep "Debugger PIN"` should return nothing once `CM_DEBUG` is off. Then `curl -i -u dev-operator:<password> https://heng.04080616.xyz/api/acc/` returns the seed accounts.