10 KiB
R3: cm_bot.py Scraper Resilience Design
Date: 2026-05-02
Status: Approved (design)
Scope: Tiny one-shot. Self-contained inside app/cm_bot.py plus tests.
Out of scope: Caching Struts CSRF tokens — see "Why no token caching" below. Replacing the scraper with an HTTP-API client (cm99.net doesn't expose one — see Playwright probe earlier in this session).
Problem
app/cm_bot.py integrates with cm99.net by parsing HTML responses with BeautifulSoup. When the site re-skins or returns an unexpected page (login expired, session timeout, server error), the existing code fails with cryptic errors like TypeError: 'NoneType' object is not subscriptable because lines such as soup.find('input', {'name': 'token'})['value'] index None. The actual response body that caused the failure is lost — there is commented-out HTML-dump code (with open('credit-{now}.html', 'wb') as f: f.write(response.content)) hinting that the operator has done this manually before, but it is not active.
The user's broader question earlier in this session was whether to migrate cm99.net integration to a real API; we confirmed it can't because cm99.net is a 2017-era Java/Struts2 monolith that returns server-rendered HTML for everything. So we make the scraper robust instead.
Goal
Two narrow improvements:
- Extract-or-raise helper. A
_find_input_value(soup, name, context)instance method that replaces the baresoup.find('input', {'name': name})['value']pattern. On miss, it dumps the failing response tologs/scraper-failures/<context>-<timestamp>.htmland raises aScraperErrorwith the context and what it was looking for. Existing exception handlers incm_bot_hal.pycontinue to catch and report. - HTML dump on parse failure. A
_dump_html(context, content)instance method that writes the raw response bytes tologs/scraper-failures/. Called from the helper above, plus from one already-attempted-but-commented site (get_user_credit's text-parsing block) where the failure mode is more involved than a missing input.
Non-Goals
- Caching the Struts CSRF token across operations. The tokens are anti-CSRF tokens that the Struts2 framework typically regenerates per form-load. Caching one and re-using it for a second POST risks "Invalid token" errors that are silent unless something checks the response body. Without confirmed evidence that this Struts deployment treats tokens as session-scoped (the user has not characterized this), caching is a correctness risk for a perf win we don't measurably need. Skipped intentionally.
- Replacing the bot's
requests-based transport. Already validated as the right shape (no API exists). - Touching
cm_telegram.py:87'sresult['f_username']line — that was already fixed in commit231ae69(the HAL now returns a dict). - Adding a retry loop on transient HTTP errors. The HAL has its own retry logic (DB) and the bot's exception model is "raise → telegram bot or transfer worker reports". Adding scraper retries would change behavior in ways the operator hasn't asked for.
Architecture
ScraperError exception class
A small subclass of Exception so callers can distinguish scraper-side problems from network errors. Lives at the top of app/cm_bot.py.
class ScraperError(Exception):
"""A cm99.net response did not contain the field we expected.
The raw response is saved to logs/scraper-failures/ before this is
raised; the message identifies which method failed and what was
being looked for.
"""
_dump_html instance method
Writes the response content to logs/scraper-failures/<context>-<YYYYMMDD-HHMMSS>.html. Creates the directory if it doesn't exist. Prints a single line to stdout so the operator notices in the container logs.
def _dump_html(self, context: str, content: bytes) -> str:
ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
out_dir = os.path.join("logs", "scraper-failures")
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, f"{context}-{ts}.html")
with open(path, "wb") as f:
f.write(content if isinstance(content, (bytes, bytearray)) else content.encode("utf-8", "replace"))
print(f"[scraper-failure] dumped {context} response to {path}")
return path
The return value is the path so the helper above can include it in the exception message.
_find_input_value instance method
The dominant pattern in cm_bot.py is <input name="X" value="..."> extraction. This helper covers it.
def _find_input_value(self, soup, name: str, *, context: str, raw: bytes) -> str:
el = soup.find("input", {"name": name})
if el is None or "value" not in el.attrs:
path = self._dump_html(context, raw)
raise ScraperError(
f"{context}: input[name={name!r}] missing or has no value attribute "
f"(response saved to {path})"
)
return el["value"]
The caller passes raw=response.content so the dumped file is the actual bytes received, not the BeautifulSoup-stringified version.
Call sites that adopt the helper
Five existing methods in app/cm_bot.py use the bare soup.find('input', {'name': 'token'})['value'] pattern. Each gets converted in one mechanical edit:
| Method | Before | After |
|---|---|---|
get_register_form_token (line ~344) |
soup.find('input', {'name': 'token'})['value'] |
self._find_input_value(soup, 'token', context='register_form_token', raw=response.content) |
get_security_pin_form_token (line ~357) |
same | self._find_input_value(soup, 'token', context='security_pin_form_token', raw=response.content) |
get_transfer_token (line ~463) |
same | self._find_input_value(soup, 'token', context='transfer_token', raw=response.content) |
transfer_credit (line ~434-436) |
three soup.find('input', ...) lines for name, token, toUserId |
three calls to _find_input_value with distinct context values |
get_register_link (line ~402-405) |
soup.find('form', {'id': 'qrCodeForm'}).find('a')['href'] |
dedicated handling: dump on miss, raise informative ScraperError (this one is not an input lookup) |
The get_user_credit method's text-parsing path (soup.find('table', {...}).find(text=...).parent.parent.find_all('td')[2].text) is structurally different — it doesn't fit the input-value helper. Its existing try/except block already returns 0 on failure. We extend it: in the except, dump the response (unconditional, not commented) and print the path so the operator sees it. We do not change the return contract (still returns 0 on failure) because callers depend on that.
get_register_link (the QR code form) is the other non-input case. We add an explicit if soup.find(...) is None: self._dump_html(...); raise ScraperError(...) block — short, no new helper.
Why no token caching
Struts2 anti-CSRF tokens (the struts.token.name/token pair) are typically regenerated per form-load: load form → token A → submit → token A is invalidated → load form → token B. Re-using token A for a second submit will fail with "Invalid Token". The current code re-fetches per operation precisely because of this contract. Caching might work if this Struts deployment treats the token as session-scoped, but verifying that requires the operator to test it manually against cm99.net, which:
- Costs real cm99.net traffic
- Is not in this scope
- Saves at most ~100ms per operation, which is not the bot's bottleneck
If a future cycle wants to optimize bot throughput, it can characterize the token contract first. For now: skip.
Files Created / Modified
| File | Operation |
|---|---|
app/cm_bot.py |
Modify — add ScraperError, _dump_html, _find_input_value; convert five call sites + the get_register_link and get_user_credit failure paths. |
tests/test_cm_bot_scraper.py |
Create — unit tests for the three new helpers. |
.gitignore |
Already covers logs/ (existing entry). No change needed. |
No new dependencies. No new tests for existing call sites — those are integration-level and would require live cm99.net or HTML fixtures we don't currently maintain.
Verification
- Unit tests pass.
.venv/bin/python -m unittest tests.test_cm_bot_scraper -vexercises:_find_input_valuereturnsvaluewhen input exists._find_input_valueraisesScraperErrorwhen input is missing, and the error message contains the saved path._find_input_valueraisesScraperErrorwhen input has novalueattribute._dump_htmlcreates the directory if missing and writes the bytes verbatim._dump_htmlreturns the path it wrote to.
- Whole suite still green.
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v→OK. - Real-call smoke (deferred to operator). Trigger a bot operation against cm99.net. If anything fails, an HTML file appears under
logs/scraper-failures/<context>-<timestamp>.htmland the container logs print the path. Existing happy paths are unchanged.
Risk
Low.
- The new helpers' behavior is byte-equivalent to the existing code on the happy path (input exists with
value). Thevalueis returned exactly the same way. - Failure mode changes from
TypeError: 'NoneType' object is not subscriptabletoScraperError: <context>: input[name='token'] missing or has no value (response saved to logs/...). Existing exception handlers incm_bot_hal.py(except Exception as e: ...,except: ...) catch both equally. - Disk usage: failing HTML responses go into
logs/scraper-failures/. cm99.net pages are ~10–30KB each. At a few failures per day this is negligible. If cm99.net misbehaves badly (every request fails) the operator has bigger problems anyway and will see the spam in container logs.
Out-of-Scope Follow-Ups
- Struts token caching after characterizing the contract. Not now.
- Periodic cleanup of
logs/scraper-failures/(e.g., afind logs/ -name '*.html' -mtime +14 -deletecron). YAGNI; revisit if disk usage becomes a concern. - HTML fixture-based regression tests for the bot operations themselves. Useful long-term; large enough scope to be its own cycle.