diff --git a/docs/superpowers/specs/2026-05-02-r3-scraper-resilience-design.md b/docs/superpowers/specs/2026-05-02-r3-scraper-resilience-design.md new file mode 100644 index 0000000..3a088dd --- /dev/null +++ b/docs/superpowers/specs/2026-05-02-r3-scraper-resilience-design.md @@ -0,0 +1,139 @@ +# R3: cm_bot.py Scraper Resilience Design + +**Date:** 2026-05-02 +**Status:** Approved (design) +**Scope:** Tiny one-shot. Self-contained inside `app/cm_bot.py` plus tests. +**Out of scope:** Caching Struts CSRF tokens — see "Why no token caching" below. Replacing the scraper with an HTTP-API client (cm99.net doesn't expose one — see Playwright probe earlier in this session). + +## Problem + +`app/cm_bot.py` integrates with cm99.net by parsing HTML responses with BeautifulSoup. When the site re-skins or returns an unexpected page (login expired, session timeout, server error), the existing code fails with cryptic errors like `TypeError: 'NoneType' object is not subscriptable` because lines such as `soup.find('input', {'name': 'token'})['value']` index `None`. The actual response body that caused the failure is lost — there is commented-out HTML-dump code (`with open('credit-{now}.html', 'wb') as f: f.write(response.content)`) hinting that the operator has done this manually before, but it is not active. + +The user's broader question earlier in this session was whether to migrate cm99.net integration to a real API; we confirmed it can't because cm99.net is a 2017-era Java/Struts2 monolith that returns server-rendered HTML for everything. So we make the scraper robust instead. + +## Goal + +Two narrow improvements: + +1. **Extract-or-raise helper.** A `_find_input_value(soup, name, context)` instance method that replaces the bare `soup.find('input', {'name': name})['value']` pattern. On miss, it dumps the failing response to `logs/scraper-failures/-.html` and raises a `ScraperError` with the context and what it was looking for. Existing exception handlers in `cm_bot_hal.py` continue to catch and report. +2. **HTML dump on parse failure.** A `_dump_html(context, content)` instance method that writes the raw response bytes to `logs/scraper-failures/`. Called from the helper above, plus from one already-attempted-but-commented site (`get_user_credit`'s text-parsing block) where the failure mode is more involved than a missing input. + +## Non-Goals + +- Caching the Struts CSRF token across operations. The tokens are anti-CSRF tokens that the Struts2 framework typically regenerates per form-load. Caching one and re-using it for a second POST risks "Invalid token" errors that are silent unless something checks the response body. Without confirmed evidence that this Struts deployment treats tokens as session-scoped (the user has not characterized this), caching is a correctness risk for a perf win we don't measurably need. Skipped intentionally. +- Replacing the bot's `requests`-based transport. Already validated as the right shape (no API exists). +- Touching `cm_telegram.py:87`'s `result['f_username']` line — that was already fixed in commit `231ae69` (the HAL now returns a dict). +- Adding a retry loop on transient HTTP errors. The HAL has its own retry logic (DB) and the bot's exception model is "raise → telegram bot or transfer worker reports". Adding scraper retries would change behavior in ways the operator hasn't asked for. + +## Architecture + +### `ScraperError` exception class + +A small subclass of `Exception` so callers can distinguish scraper-side problems from network errors. Lives at the top of `app/cm_bot.py`. + +```python +class ScraperError(Exception): + """A cm99.net response did not contain the field we expected. + + The raw response is saved to logs/scraper-failures/ before this is + raised; the message identifies which method failed and what was + being looked for. + """ +``` + +### `_dump_html` instance method + +Writes the response content to `logs/scraper-failures/-.html`. Creates the directory if it doesn't exist. Prints a single line to stdout so the operator notices in the container logs. + +```python +def _dump_html(self, context: str, content: bytes) -> str: + ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + out_dir = os.path.join("logs", "scraper-failures") + os.makedirs(out_dir, exist_ok=True) + path = os.path.join(out_dir, f"{context}-{ts}.html") + with open(path, "wb") as f: + f.write(content if isinstance(content, (bytes, bytearray)) else content.encode("utf-8", "replace")) + print(f"[scraper-failure] dumped {context} response to {path}") + return path +``` + +The return value is the path so the helper above can include it in the exception message. + +### `_find_input_value` instance method + +The dominant pattern in cm_bot.py is `` extraction. This helper covers it. + +```python +def _find_input_value(self, soup, name: str, *, context: str, raw: bytes) -> str: + el = soup.find("input", {"name": name}) + if el is None or "value" not in el.attrs: + path = self._dump_html(context, raw) + raise ScraperError( + f"{context}: input[name={name!r}] missing or has no value attribute " + f"(response saved to {path})" + ) + return el["value"] +``` + +The caller passes `raw=response.content` so the dumped file is the actual bytes received, not the BeautifulSoup-stringified version. + +### Call sites that adopt the helper + +Five existing methods in `app/cm_bot.py` use the bare `soup.find('input', {'name': 'token'})['value']` pattern. Each gets converted in one mechanical edit: + +| Method | Before | After | +|---|---|---| +| `get_register_form_token` (line ~344) | `soup.find('input', {'name': 'token'})['value']` | `self._find_input_value(soup, 'token', context='register_form_token', raw=response.content)` | +| `get_security_pin_form_token` (line ~357) | same | `self._find_input_value(soup, 'token', context='security_pin_form_token', raw=response.content)` | +| `get_transfer_token` (line ~463) | same | `self._find_input_value(soup, 'token', context='transfer_token', raw=response.content)` | +| `transfer_credit` (line ~434-436) | three `soup.find('input', ...)` lines for `name`, `token`, `toUserId` | three calls to `_find_input_value` with distinct `context` values | +| `get_register_link` (line ~402-405) | `soup.find('form', {'id': 'qrCodeForm'}).find('a')['href']` | dedicated handling: dump on miss, raise informative ScraperError (this one is not an input lookup) | + +The `get_user_credit` method's text-parsing path (`soup.find('table', {...}).find(text=...).parent.parent.find_all('td')[2].text`) is structurally different — it doesn't fit the input-value helper. Its existing `try/except` block already returns 0 on failure. We extend it: in the `except`, dump the response (unconditional, not commented) and `print` the path so the operator sees it. We do not change the return contract (still returns 0 on failure) because callers depend on that. + +`get_register_link` (the QR code form) is the other non-input case. We add an explicit `if soup.find(...) is None: self._dump_html(...); raise ScraperError(...)` block — short, no new helper. + +### Why no token caching + +Struts2 anti-CSRF tokens (the `struts.token.name`/`token` pair) are typically regenerated per form-load: load form → token A → submit → token A is invalidated → load form → token B. Re-using token A for a second submit will fail with "Invalid Token". The current code re-fetches per operation precisely because of this contract. Caching might work if this Struts deployment treats the token as session-scoped, but verifying that requires the operator to test it manually against cm99.net, which: + +- Costs real cm99.net traffic +- Is not in this scope +- Saves at most ~100ms per operation, which is not the bot's bottleneck + +If a future cycle wants to optimize bot throughput, it can characterize the token contract first. For now: skip. + +## Files Created / Modified + +| File | Operation | +|---|---| +| `app/cm_bot.py` | Modify — add `ScraperError`, `_dump_html`, `_find_input_value`; convert five call sites + the `get_register_link` and `get_user_credit` failure paths. | +| `tests/test_cm_bot_scraper.py` | Create — unit tests for the three new helpers. | +| `.gitignore` | Already covers `logs/` (existing entry). No change needed. | + +No new dependencies. No new tests for existing call sites — those are integration-level and would require live cm99.net or HTML fixtures we don't currently maintain. + +## Verification + +1. **Unit tests pass.** `.venv/bin/python -m unittest tests.test_cm_bot_scraper -v` exercises: + - `_find_input_value` returns `value` when input exists. + - `_find_input_value` raises `ScraperError` when input is missing, and the error message contains the saved path. + - `_find_input_value` raises `ScraperError` when input has no `value` attribute. + - `_dump_html` creates the directory if missing and writes the bytes verbatim. + - `_dump_html` returns the path it wrote to. +2. **Whole suite still green.** `.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v` → `OK`. +3. **Real-call smoke (deferred to operator).** Trigger a bot operation against cm99.net. If anything fails, an HTML file appears under `logs/scraper-failures/-.html` and the container logs print the path. Existing happy paths are unchanged. + +## Risk + +Low. + +- **The new helpers' behavior is byte-equivalent to the existing code on the happy path** (input exists with `value`). The `value` is returned exactly the same way. +- **Failure mode changes** from `TypeError: 'NoneType' object is not subscriptable` to `ScraperError: : input[name='token'] missing or has no value (response saved to logs/...)`. Existing exception handlers in `cm_bot_hal.py` (`except Exception as e: ...`, `except: ...`) catch both equally. +- **Disk usage:** failing HTML responses go into `logs/scraper-failures/`. cm99.net pages are ~10–30KB each. At a few failures per day this is negligible. If cm99.net misbehaves badly (every request fails) the operator has bigger problems anyway and will see the spam in container logs. + +## Out-of-Scope Follow-Ups + +- **Struts token caching after characterizing the contract.** Not now. +- **Periodic cleanup of `logs/scraper-failures/`** (e.g., a `find logs/ -name '*.html' -mtime +14 -delete` cron). YAGNI; revisit if disk usage becomes a concern. +- **HTML fixture-based regression tests** for the bot operations themselves. Useful long-term; large enough scope to be its own cycle.