cm_bot_v2/docs/superpowers/specs/2026-05-02-r3-scraper-resilience-design.md

140 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# R3: cm_bot.py Scraper Resilience Design
**Date:** 2026-05-02
**Status:** Approved (design)
**Scope:** Tiny one-shot. Self-contained inside `app/cm_bot.py` plus tests.
**Out of scope:** Caching Struts CSRF tokens — see "Why no token caching" below. Replacing the scraper with an HTTP-API client (cm99.net doesn't expose one — see Playwright probe earlier in this session).
## Problem
`app/cm_bot.py` integrates with cm99.net by parsing HTML responses with BeautifulSoup. When the site re-skins or returns an unexpected page (login expired, session timeout, server error), the existing code fails with cryptic errors like `TypeError: 'NoneType' object is not subscriptable` because lines such as `soup.find('input', {'name': 'token'})['value']` index `None`. The actual response body that caused the failure is lost — there is commented-out HTML-dump code (`with open('credit-{now}.html', 'wb') as f: f.write(response.content)`) hinting that the operator has done this manually before, but it is not active.
The user's broader question earlier in this session was whether to migrate cm99.net integration to a real API; we confirmed it can't because cm99.net is a 2017-era Java/Struts2 monolith that returns server-rendered HTML for everything. So we make the scraper robust instead.
## Goal
Two narrow improvements:
1. **Extract-or-raise helper.** A `_find_input_value(soup, name, context)` instance method that replaces the bare `soup.find('input', {'name': name})['value']` pattern. On miss, it dumps the failing response to `logs/scraper-failures/<context>-<timestamp>.html` and raises a `ScraperError` with the context and what it was looking for. Existing exception handlers in `cm_bot_hal.py` continue to catch and report.
2. **HTML dump on parse failure.** A `_dump_html(context, content)` instance method that writes the raw response bytes to `logs/scraper-failures/`. Called from the helper above, plus from one already-attempted-but-commented site (`get_user_credit`'s text-parsing block) where the failure mode is more involved than a missing input.
## Non-Goals
- Caching the Struts CSRF token across operations. The tokens are anti-CSRF tokens that the Struts2 framework typically regenerates per form-load. Caching one and re-using it for a second POST risks "Invalid token" errors that are silent unless something checks the response body. Without confirmed evidence that this Struts deployment treats tokens as session-scoped (the user has not characterized this), caching is a correctness risk for a perf win we don't measurably need. Skipped intentionally.
- Replacing the bot's `requests`-based transport. Already validated as the right shape (no API exists).
- Touching `cm_telegram.py:87`'s `result['f_username']` line — that was already fixed in commit `231ae69` (the HAL now returns a dict).
- Adding a retry loop on transient HTTP errors. The HAL has its own retry logic (DB) and the bot's exception model is "raise → telegram bot or transfer worker reports". Adding scraper retries would change behavior in ways the operator hasn't asked for.
## Architecture
### `ScraperError` exception class
A small subclass of `Exception` so callers can distinguish scraper-side problems from network errors. Lives at the top of `app/cm_bot.py`.
```python
class ScraperError(Exception):
"""A cm99.net response did not contain the field we expected.
The raw response is saved to logs/scraper-failures/ before this is
raised; the message identifies which method failed and what was
being looked for.
"""
```
### `_dump_html` instance method
Writes the response content to `logs/scraper-failures/<context>-<YYYYMMDD-HHMMSS>.html`. Creates the directory if it doesn't exist. Prints a single line to stdout so the operator notices in the container logs.
```python
def _dump_html(self, context: str, content: bytes) -> str:
ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
out_dir = os.path.join("logs", "scraper-failures")
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, f"{context}-{ts}.html")
with open(path, "wb") as f:
f.write(content if isinstance(content, (bytes, bytearray)) else content.encode("utf-8", "replace"))
print(f"[scraper-failure] dumped {context} response to {path}")
return path
```
The return value is the path so the helper above can include it in the exception message.
### `_find_input_value` instance method
The dominant pattern in cm_bot.py is `<input name="X" value="...">` extraction. This helper covers it.
```python
def _find_input_value(self, soup, name: str, *, context: str, raw: bytes) -> str:
el = soup.find("input", {"name": name})
if el is None or "value" not in el.attrs:
path = self._dump_html(context, raw)
raise ScraperError(
f"{context}: input[name={name!r}] missing or has no value attribute "
f"(response saved to {path})"
)
return el["value"]
```
The caller passes `raw=response.content` so the dumped file is the actual bytes received, not the BeautifulSoup-stringified version.
### Call sites that adopt the helper
Five existing methods in `app/cm_bot.py` use the bare `soup.find('input', {'name': 'token'})['value']` pattern. Each gets converted in one mechanical edit:
| Method | Before | After |
|---|---|---|
| `get_register_form_token` (line ~344) | `soup.find('input', {'name': 'token'})['value']` | `self._find_input_value(soup, 'token', context='register_form_token', raw=response.content)` |
| `get_security_pin_form_token` (line ~357) | same | `self._find_input_value(soup, 'token', context='security_pin_form_token', raw=response.content)` |
| `get_transfer_token` (line ~463) | same | `self._find_input_value(soup, 'token', context='transfer_token', raw=response.content)` |
| `transfer_credit` (line ~434-436) | three `soup.find('input', ...)` lines for `name`, `token`, `toUserId` | three calls to `_find_input_value` with distinct `context` values |
| `get_register_link` (line ~402-405) | `soup.find('form', {'id': 'qrCodeForm'}).find('a')['href']` | dedicated handling: dump on miss, raise informative ScraperError (this one is not an input lookup) |
The `get_user_credit` method's text-parsing path (`soup.find('table', {...}).find(text=...).parent.parent.find_all('td')[2].text`) is structurally different — it doesn't fit the input-value helper. Its existing `try/except` block already returns 0 on failure. We extend it: in the `except`, dump the response (unconditional, not commented) and `print` the path so the operator sees it. We do not change the return contract (still returns 0 on failure) because callers depend on that.
`get_register_link` (the QR code form) is the other non-input case. We add an explicit `if soup.find(...) is None: self._dump_html(...); raise ScraperError(...)` block — short, no new helper.
### Why no token caching
Struts2 anti-CSRF tokens (the `struts.token.name`/`token` pair) are typically regenerated per form-load: load form → token A → submit → token A is invalidated → load form → token B. Re-using token A for a second submit will fail with "Invalid Token". The current code re-fetches per operation precisely because of this contract. Caching might work if this Struts deployment treats the token as session-scoped, but verifying that requires the operator to test it manually against cm99.net, which:
- Costs real cm99.net traffic
- Is not in this scope
- Saves at most ~100ms per operation, which is not the bot's bottleneck
If a future cycle wants to optimize bot throughput, it can characterize the token contract first. For now: skip.
## Files Created / Modified
| File | Operation |
|---|---|
| `app/cm_bot.py` | Modify — add `ScraperError`, `_dump_html`, `_find_input_value`; convert five call sites + the `get_register_link` and `get_user_credit` failure paths. |
| `tests/test_cm_bot_scraper.py` | Create — unit tests for the three new helpers. |
| `.gitignore` | Already covers `logs/` (existing entry). No change needed. |
No new dependencies. No new tests for existing call sites — those are integration-level and would require live cm99.net or HTML fixtures we don't currently maintain.
## Verification
1. **Unit tests pass.** `.venv/bin/python -m unittest tests.test_cm_bot_scraper -v` exercises:
- `_find_input_value` returns `value` when input exists.
- `_find_input_value` raises `ScraperError` when input is missing, and the error message contains the saved path.
- `_find_input_value` raises `ScraperError` when input has no `value` attribute.
- `_dump_html` creates the directory if missing and writes the bytes verbatim.
- `_dump_html` returns the path it wrote to.
2. **Whole suite still green.** `.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v``OK`.
3. **Real-call smoke (deferred to operator).** Trigger a bot operation against cm99.net. If anything fails, an HTML file appears under `logs/scraper-failures/<context>-<timestamp>.html` and the container logs print the path. Existing happy paths are unchanged.
## Risk
Low.
- **The new helpers' behavior is byte-equivalent to the existing code on the happy path** (input exists with `value`). The `value` is returned exactly the same way.
- **Failure mode changes** from `TypeError: 'NoneType' object is not subscriptable` to `ScraperError: <context>: input[name='token'] missing or has no value (response saved to logs/...)`. Existing exception handlers in `cm_bot_hal.py` (`except Exception as e: ...`, `except: ...`) catch both equally.
- **Disk usage:** failing HTML responses go into `logs/scraper-failures/`. cm99.net pages are ~1030KB each. At a few failures per day this is negligible. If cm99.net misbehaves badly (every request fails) the operator has bigger problems anyway and will see the spam in container logs.
## Out-of-Scope Follow-Ups
- **Struts token caching after characterizing the contract.** Not now.
- **Periodic cleanup of `logs/scraper-failures/`** (e.g., a `find logs/ -name '*.html' -mtime +14 -delete` cron). YAGNI; revisit if disk usage becomes a concern.
- **HTML fixture-based regression tests** for the bot operations themselves. Useful long-term; large enough scope to be its own cycle.