# R3: Scraper Resilience Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Replace the bare `soup.find(...)['value']` pattern in `app/cm_bot.py` with a helper that raises a typed `ScraperError` and dumps the failing HTML to `logs/scraper-failures/` for postmortem.
**Architecture:** Add a module-level `ScraperError` exception and two `CM_BOT` methods, `_dump_html` and `_find_input_value`; convert the six existing call sites that use the `<input ... value="...">` pattern (four matched by `name="token"`, two matched by `id`); extend `get_register_link` and `get_user_credit` failure paths to dump HTML. Tests live in a new `tests/test_cm_bot_scraper.py`.
**Tech Stack:** Python 3.9 (containers) / 3.12 (local venv), `unittest` + `unittest.mock` (stdlib), `BeautifulSoup` (existing dep). No new dependencies.
**Spec:** [docs/superpowers/specs/2026-05-02-r3-scraper-resilience-design.md](../specs/2026-05-02-r3-scraper-resilience-design.md)
---
## File Map
| File | Operation | Purpose |
|---|---|---|
| `tests/test_cm_bot_scraper.py` | Create | Unit tests for `ScraperError`, `_dump_html`, `_find_input_value`. |
| `app/cm_bot.py` | Modify | Add `ScraperError`, helpers; convert six input-value extractions (four `'token'`, two `id`-based); extend `get_register_link` and `get_user_credit`. |
The helpers are instance methods of `CM_BOT` for consistency with the rest of the class, even though neither `_dump_html` nor `_find_input_value` uses any instance state. This keeps the API uniform with everything else in `CM_BOT`.
---
## Task 1: Add `ScraperError`, `_dump_html`, `_find_input_value` (TDD)
**Files:**
- Create: `tests/test_cm_bot_scraper.py`
- Modify: `app/cm_bot.py`
- [ ] **Step 1: Write the failing tests**
Create `tests/test_cm_bot_scraper.py`:
```python
"""Tests for the cm_bot scraper resilience helpers.

The CM_BOT class currently uses bare `soup.find(...)['value']` calls
that throw cryptic TypeErrors when cm99.net returns an unexpected
response. R3 introduces three pieces:

  - ScraperError: typed exception so callers can distinguish scraper
    failures from network errors.
  - _dump_html(context, content): writes the failing response to
    logs/scraper-failures/<context>-<ts>.html and returns the path.
  - _find_input_value(soup, name, *, context, raw): the dominant
    extraction pattern. Returns the value on success, dumps + raises
    ScraperError on miss.

These tests do NOT exercise the live cm99.net integration. They use
small inline HTML fixtures and write dumps into a per-test temporary
directory so the tests stay hermetic.
"""
import io
import os
import shutil
import tempfile
import unittest
from unittest import mock

from bs4 import BeautifulSoup

from app.cm_bot import CM_BOT, ScraperError


class ScraperHelpersTests(unittest.TestCase):
    def setUp(self):
        # CM_BOT.__init__ reads CM_BOT_BASE_URL from the env (raises
        # otherwise). Patch it here rather than with a class decorator:
        # a class-level mock.patch.dict only wraps test_* methods, so it
        # would not be active while setUp constructs the bot. Nothing
        # here actually touches the network.
        env = mock.patch.dict(
            os.environ, {"CM_BOT_BASE_URL": "https://example.invalid"}
        )
        env.start()
        self.addCleanup(env.stop)
        # Each test gets a fresh tmpdir so the dump helper writes
        # somewhere predictable. We chdir into it for the duration of
        # the test because _dump_html writes to a relative
        # logs/scraper-failures path.
        self._old_cwd = os.getcwd()
        self._tmp = tempfile.mkdtemp(prefix="r3-test-")
        os.chdir(self._tmp)
        self.bot = CM_BOT()

    def tearDown(self):
        os.chdir(self._old_cwd)
        shutil.rmtree(self._tmp, ignore_errors=True)

    # ---- _dump_html ----
    def test_dump_html_creates_dir_and_writes_bytes(self):
        path = self.bot._dump_html("ctx-test", b"<html>hi</html>")
        self.assertTrue(os.path.isfile(path), f"file should exist: {path}")
        with open(path, "rb") as f:
            self.assertEqual(f.read(), b"<html>hi</html>")
        # The directory was created.
        self.assertTrue(path.startswith(os.path.join("logs", "scraper-failures")))

    def test_dump_html_accepts_str_content(self):
        path = self.bot._dump_html("ctx-test", "<html>hi</html>")
        with open(path, "rb") as f:
            self.assertEqual(f.read(), b"<html>hi</html>")

    def test_dump_html_includes_context_and_timestamp_in_filename(self):
        path = self.bot._dump_html("register_form_token", b"x")
        basename = os.path.basename(path)
        self.assertTrue(basename.startswith("register_form_token-"), basename)
        self.assertTrue(basename.endswith(".html"), basename)

    # ---- _find_input_value ----
    def test_find_input_value_returns_value_when_present(self):
        html = '<form><input name="token" value="abc123"></form>'
        soup = BeautifulSoup(html, "html.parser")
        result = self.bot._find_input_value(
            soup, "token", context="happy_path", raw=html.encode()
        )
        self.assertEqual(result, "abc123")

    def test_find_input_value_raises_and_dumps_when_missing(self):
        html = '<form><input name="other" value="x"></form>'
        soup = BeautifulSoup(html, "html.parser")
        with self.assertRaises(ScraperError) as cm:
            self.bot._find_input_value(
                soup, "token", context="missing_input", raw=html.encode()
            )
        msg = str(cm.exception)
        self.assertIn("missing_input", msg)
        self.assertIn("token", msg)
        # The path mentioned in the message must actually exist.
        # The path appears in parentheses at the end: "(response saved to <path>)"
        # We check by listing the dump dir.
        dumped = os.listdir(os.path.join("logs", "scraper-failures"))
        self.assertEqual(len(dumped), 1, f"expected one dump, got {dumped}")
        self.assertTrue(dumped[0].startswith("missing_input-"))

    def test_find_input_value_raises_when_input_has_no_value_attr(self):
        html = '<form><input name="token"></form>'
        soup = BeautifulSoup(html, "html.parser")
        with self.assertRaises(ScraperError):
            self.bot._find_input_value(
                soup, "token", context="no_value_attr", raw=html.encode()
            )

    def test_find_input_value_does_not_dump_on_success(self):
        html = '<form><input name="token" value="abc"></form>'
        soup = BeautifulSoup(html, "html.parser")
        self.bot._find_input_value(
            soup, "token", context="should_not_dump", raw=html.encode()
        )
        # logs/scraper-failures may not even exist on the happy path.
        self.assertFalse(
            os.path.isdir(os.path.join("logs", "scraper-failures")),
            "happy path should not create the failure dir",
        )


if __name__ == "__main__":
    unittest.main()
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10
```
Expected: `ImportError: cannot import name 'ScraperError' from 'app.cm_bot'` (or similar). The whole class is missing.
- [ ] **Step 3: Add `ScraperError`, `_dump_html`, `_find_input_value` to `app/cm_bot.py`**
In `app/cm_bot.py`, the top of the file currently has:
```python
import datetime
import requests, re
from bs4 import BeautifulSoup
import os
```
Add `ScraperError` immediately after the imports (before `class CM_BOT:`):
```python
class ScraperError(Exception):
    """A cm99.net response did not contain the field we expected.

    The raw response is saved to logs/scraper-failures/ before this is
    raised; the message identifies which method failed and what was
    being looked for.
    """
```
Then add the two helper methods inside `class CM_BOT:`. A natural placement is right after `_setup_headers` and before `get_register_data` (around line 204):
```python
    def _dump_html(self, context: str, content) -> str:
        """Save a failing cm99.net response to logs/scraper-failures/.

        Returns the path written to so callers can include it in error
        messages.
        """
        ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        out_dir = os.path.join("logs", "scraper-failures")
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, f"{context}-{ts}.html")
        if isinstance(content, (bytes, bytearray)):
            data = bytes(content)
        else:
            data = str(content).encode("utf-8", "replace")
        with open(path, "wb") as f:
            f.write(data)
        print(f"[scraper-failure] dumped {context} response to {path}")
        return path

    def _find_input_value(self, soup, name: str, *, context: str, raw) -> str:
        """Extract <input name=NAME value=...>'s value or raise ScraperError.

        Saves the raw response to logs/scraper-failures/ before raising
        so the operator can postmortem.
        """
        el = soup.find("input", {"name": name})
        if el is None or "value" not in el.attrs:
            path = self._dump_html(context, raw)
            raise ScraperError(
                f"{context}: input[name={name!r}] missing or has no value attribute "
                f"(response saved to {path})"
            )
        return el["value"]
```
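The byte normalisation inside `_dump_html` is what lets the helper accept either `response.content` (bytes) or a decoded string. A minimal standalone sketch of just that step, assuming a hypothetical free function `to_bytes` that mirrors the branch above; it is an illustration, not part of `app/cm_bot.py`:

```python
def to_bytes(content) -> bytes:
    """Hypothetical mirror of the normalisation step in _dump_html."""
    if isinstance(content, (bytes, bytearray)):
        return bytes(content)
    # errors="replace" keeps the dump from failing on characters that
    # UTF-8 cannot encode, such as lone surrogates.
    return str(content).encode("utf-8", "replace")
```

The `"replace"` error handler matters because the dump runs on an already-failing path; an encoding error there would mask the original scraper failure.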
- [ ] **Step 4: Run tests to verify they pass**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10
```
Expected: 7 tests, `OK`.
- [ ] **Step 5: Confirm prior tests still pass**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8
```
Expected: combined `OK`. Total: 2 (debug) + 28 (bot_cli) + 7 (scraper) = 37 tests passing.
- [ ] **Step 6: Commit**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
git add tests/test_cm_bot_scraper.py app/cm_bot.py && \
git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
commit -m "feat(scraper): add ScraperError + _dump_html + _find_input_value helpers"
```
---
## Task 2: Convert the six input-value extractions to use the helper
**Files:**
- Modify: `app/cm_bot.py` (`get_register_form_token`, `get_security_pin_form_token`, `get_transfer_token`, `transfer_credit` — three lines inside this method)
The dominant pattern in cm_bot.py is `soup.find('input', {'name': 'token'})['value']`. Replacing each call site is mechanical: keep the request, change the extraction.
- [ ] **Step 1: Convert `get_register_form_token`**
Find (around line 344-354):
```python
    def get_register_form_token(self):
        try:
            response = self.session.post(
                f'{self.base_url}/cm/loadUserAccount',
                headers=self.get_register_form_headers
            )
            soup = BeautifulSoup(response.content, 'html.parser')
            return soup.find('input', {'name' : "token"})['value']
        except requests.exceptions.RequestException as e:
            print(f"Error getting register form: {e}")
            return None
```
Replace the `soup.find(...)['value']` line with the helper:
```python
    def get_register_form_token(self):
        try:
            response = self.session.post(
                f'{self.base_url}/cm/loadUserAccount',
                headers=self.get_register_form_headers
            )
            soup = BeautifulSoup(response.content, 'html.parser')
            return self._find_input_value(
                soup, "token",
                context="register_form_token",
                raw=response.content,
            )
        except requests.exceptions.RequestException as e:
            print(f"Error getting register form: {e}")
            return None
```
The `except requests.exceptions.RequestException` only catches network errors. `ScraperError` (which inherits from `Exception`) propagates up to whatever `cm_bot_hal.py` is catching, which is `except Exception as e` — same as before, just with a useful message instead of a TypeError.
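The propagation behaviour described above can be checked without touching the network. This is a minimal sketch under stated assumptions: `FakeRequestError` stands in for `requests.exceptions.RequestException`, and `extract_token` and `call_like_hal` are hypothetical simplifications of `get_register_form_token` and its caller; none of these names exist in the codebase.

```python
class ScraperError(Exception):
    """Stand-in for the ScraperError added in Task 1."""


class FakeRequestError(Exception):
    """Hypothetical stand-in for requests.exceptions.RequestException."""


def extract_token(fields: dict) -> str:
    """Hypothetical simplification of get_register_form_token."""
    try:
        if "token" not in fields:
            # Mirrors the parse-miss path of _find_input_value.
            raise ScraperError("register_form_token: input[name='token'] missing")
        return fields["token"]
    except FakeRequestError as e:
        # Only network-style errors are handled at this level.
        print(f"Error getting register form: {e}")
        return None


def call_like_hal(fields: dict) -> str:
    """Hypothetical caller with a broad `except Exception` handler."""
    try:
        return extract_token(fields)
    except Exception as e:
        return f"caught {type(e).__name__}: {e}"
```

Because `ScraperError` is not a subclass of the network error type, the inner handler never sees it; the outer handler receives a typed exception with a descriptive message.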
- [ ] **Step 2: Convert `get_security_pin_form_token`**
Find (around line 357-360):
```python
    def get_security_pin_form_token(self):
        response = self.session.get(f'{self.base_url}/cm/setSecurityPin')
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.find('input', {'name' : "token"})['value']
```
Replace with:
```python
    def get_security_pin_form_token(self):
        response = self.session.get(f'{self.base_url}/cm/setSecurityPin')
        soup = BeautifulSoup(response.content, 'html.parser')
        return self._find_input_value(
            soup, "token",
            context="security_pin_form_token",
            raw=response.content,
        )
```
- [ ] **Step 3: Convert `get_transfer_token`**
Find (around line 463-466):
```python
    def get_transfer_token(self):
        response = self.session.get(f'{self.base_url}/cm/transfer')
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.find('input', {'name' : "token"})['value']
```
Replace with:
```python
    def get_transfer_token(self):
        response = self.session.get(f'{self.base_url}/cm/transfer')
        soup = BeautifulSoup(response.content, 'html.parser')
        return self._find_input_value(
            soup, "token",
            context="transfer_token",
            raw=response.content,
        )
```
- [ ] **Step 4: Convert the three extractions inside `transfer_credit`**
Find (around line 426-446):
```python
    def transfer_credit(self, t_username: str, t_password: str, amount: float):
        token = self.get_transfer_token()
        transfer_search_data = self.get_transfer_search_data(token, t_username)
        response = self.session.post(
            f'{self.base_url}/cm/searchTransferUser',
            data=transfer_search_data,
            headers=self.transfer_search_headers
        )
        soup = BeautifulSoup(response.content, 'html.parser')
        name = soup.find('input', {'id': "name"})['value']
        token = soup.find('input', {'name': "token"})['value']
        toUserId = soup.find('input', {'id': "toUserId"})['value']
```
This block uses two different finders: `{'name': X}` for `token`, and `{'id': X}` for `name` and `toUserId`. The `_find_input_value` helper as written only handles `{'name': X}`. We have two options:
**Option A — extend the helper.** Add an optional `by` parameter (`'name'` or `'id'`).
**Option B — keep `_find_input_value` narrow, write inline checks for the `id`-based ones.**
We pick Option A. It's a one-parameter widening with a default of `"name"`, so existing call sites are unchanged.
In `app/cm_bot.py`, update the helper signature:
```python
    def _find_input_value(self, soup, ident: str, *, context: str, raw, by: str = "name") -> str:
        """Extract <input {by}=IDENT value=...>'s value or raise ScraperError.

        `by` selects between matching <input name=...> (default) and
        <input id=...>. Saves the raw response to logs/scraper-failures/
        before raising so the operator can postmortem.
        """
        el = soup.find("input", {by: ident})
        if el is None or "value" not in el.attrs:
            path = self._dump_html(context, raw)
            raise ScraperError(
                f"{context}: input[{by}={ident!r}] missing or has no value attribute "
                f"(response saved to {path})"
            )
        return el["value"]
```
The first lookup parameter is now called `ident` instead of `name`. Add a test for the `by="id"` path by appending the following to `tests/test_cm_bot_scraper.py` inside `ScraperHelpersTests`:
```python
    def test_find_input_value_supports_by_id(self):
        html = '<form><input id="toUserId" value="42"></form>'
        soup = BeautifulSoup(html, "html.parser")
        result = self.bot._find_input_value(
            soup, "toUserId", context="by_id", raw=html.encode(), by="id",
        )
        self.assertEqual(result, "42")
```
The four existing `_find_input_value` tests keep working without changes because they pass `"token"` positionally; the rename from `name` to `ident` would only break a caller that used the `name=` keyword.
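The effect of renaming a positional parameter can be seen in isolation. The sketch below uses two hypothetical stub functions, `find_v1` and `find_v2`, that copy only the signatures from Task 1 and Task 2; their bodies are placeholders, not the real helper.

```python
def find_v1(soup, name, *, context, raw):
    """Task 1 signature shape (body is a placeholder)."""
    return ("name", name, context)


def find_v2(soup, ident, *, context, raw, by="name"):
    """Task 2 signature shape: lookup parameter renamed, `by` added."""
    return (by, ident, context)


# Positional call, as the Task 1 tests write it: works with both versions.
positional_v1 = find_v1(None, "token", context="happy_path", raw=b"")
positional_v2 = find_v2(None, "token", context="happy_path", raw=b"")

# A keyword call using the old parameter name breaks after the rename.
try:
    find_v2(None, name="token", context="happy_path", raw=b"")
    keyword_call_ok = True
except TypeError:
    keyword_call_ok = False
```

Since every call site in this plan passes the lookup key positionally, the rename is safe here.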
Now replace the body of `transfer_credit`:
```python
    def transfer_credit(self, t_username: str, t_password: str, amount: float):
        token = self.get_transfer_token()
        transfer_search_data = self.get_transfer_search_data(token, t_username)
        response = self.session.post(
            f'{self.base_url}/cm/searchTransferUser',
            data=transfer_search_data,
            headers=self.transfer_search_headers
        )
        soup = BeautifulSoup(response.content, 'html.parser')
        name = self._find_input_value(
            soup, "name", context="transfer_search_name", raw=response.content, by="id",
        )
        token = self._find_input_value(
            soup, "token", context="transfer_search_token", raw=response.content,
        )
        toUserId = self._find_input_value(
            soup, "toUserId", context="transfer_search_toUserId", raw=response.content, by="id",
        )
        transfer_data = self.get_transfer_data(token, t_username, name, toUserId, amount, t_password)
        response = self.session.post(
            f'{self.base_url}/cm/saveTransfer',
            data=transfer_data,
            headers=self.transfer_credit_headers
        )
        return True if re.search(r'Successfully saved the record\.', response.text) else False
```
The rest of `transfer_credit` (the second POST and the success-string check) stays identical. The commented-out `# with open('transfer_credit.html', ...)` block at the end can be deleted as part of this edit (the dump now happens automatically on a parse miss).
- [ ] **Step 5: Run tests to verify everything still passes**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10
```
Expected: 8 tests, `OK` (seven original + one new for `by="id"`).
- [ ] **Step 6: Confirm full suite still green**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8
```
Expected: total 38 tests, `OK`.
- [ ] **Step 7: Commit**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
git add tests/test_cm_bot_scraper.py app/cm_bot.py && \
git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
commit -m "refactor(scraper): convert input-value extractions to helper"
```
---
## Task 3: Make `get_register_link` and `get_user_credit` failure paths informative
**Files:**
- Modify: `app/cm_bot.py` (`get_register_link`, `get_user_credit`)
These two methods don't fit the input-value helper. `get_register_link` extracts an `<a href="...">` from a specific form; `get_user_credit` does multi-step text-content navigation through a table. We add explicit dump+raise / dump+log behavior at each.
- [ ] **Step 1: Update `get_register_link`**
Find (around line 402-406):
```python
    def get_register_link(self):
        response = self.session.get(f"{self.base_url}/cm/showQrCode")
        soup = BeautifulSoup(response.content, 'html.parser')
        soup = soup.find('form', {'id': 'qrCodeForm'})
        return soup.find('a')['href']
```
Replace with:
```python
    def get_register_link(self):
        response = self.session.get(f"{self.base_url}/cm/showQrCode")
        soup = BeautifulSoup(response.content, 'html.parser')
        form = soup.find('form', {'id': 'qrCodeForm'})
        if form is None:
            path = self._dump_html("register_link_form", response.content)
            raise ScraperError(
                f"register_link: form#qrCodeForm not found "
                f"(response saved to {path})"
            )
        anchor = form.find('a')
        if anchor is None or 'href' not in anchor.attrs:
            path = self._dump_html("register_link_anchor", response.content)
            raise ScraperError(
                f"register_link: <a href> inside form#qrCodeForm not found "
                f"(response saved to {path})"
            )
        return anchor['href']
```
- [ ] **Step 2: Update `get_user_credit`'s except block**
Find (around line 448-461):
```python
    def get_user_credit(self):
        response = self.session.post(
            f'{self.base_url}/cm/userProfile',
            headers=self.get_user_credit_headers
        )
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            return round(float(soup.find('table', {'class': 'generalContent'}).find(text=re.compile('Credit Available')).parent.parent.find_all('td')[2].text.replace(",","")), 2)
        except:
            print(f"Error getting credit.")
            now = datetime.datetime.now().strftime('%Y%m%d_%H%M')
            # with open(f'credit-{now}.html', 'wb') as f:
            #     f.write(response.content)
            return 0
```
Replace the `except:` block so it actively dumps the HTML (uncomment the previously-commented dump and route it through the helper):
```python
    def get_user_credit(self):
        response = self.session.post(
            f'{self.base_url}/cm/userProfile',
            headers=self.get_user_credit_headers
        )
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            return round(float(soup.find('table', {'class': 'generalContent'}).find(text=re.compile('Credit Available')).parent.parent.find_all('td')[2].text.replace(",","")), 2)
        except Exception as exc:
            self._dump_html("get_user_credit", response.content)
            print(f"Error getting credit: {exc}")
            return 0
```
Three changes inside the `except`: catch `Exception as exc` (was bare `except`), call `_dump_html` (was a commented-out `with open(...)`), and drop the now-unused `now = datetime.datetime.now()...` line. Narrowing the bare `except` to `Exception as exc` is intentional: the bare form also caught `KeyboardInterrupt` and `SystemExit`, which a credit read should not swallow.
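The difference between the two handler forms can be verified directly. In this sketch all three function names are hypothetical; the point is only that a bare `except` swallows `KeyboardInterrupt` and `SystemExit`, while `except Exception` lets them through.

```python
def swallow_with_bare_except(exc):
    try:
        raise exc
    except:  # noqa: E722 - deliberately bare, to show what it catches
        return "swallowed"


def swallow_with_except_exception(exc):
    try:
        raise exc
    except Exception:
        return "swallowed"


def outcome(handler, exc):
    """Report whether `handler` swallowed `exc` or let it propagate."""
    try:
        return handler(exc)
    except BaseException as escaped:
        return f"propagated {type(escaped).__name__}"
```

Ordinary parse failures such as `AttributeError` (from calling `.find` on `None`) are still caught by `except Exception`, so the dump-and-return-0 path is unaffected.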
The function still returns `0` on failure to preserve the existing contract (callers in `cm_bot_hal.py:transfer_credit_api` check `amount <= 0.01` and short-circuit). We do not change that.
- [ ] **Step 3: Run all tests**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8
```
Expected: 38 tests, `OK`. (No new tests in this task: the changed methods are integration-level and would need live cm99.net or HTML fixtures to exercise. The two methods' happy paths are unchanged; their failure paths are dump+raise/log, and the dump step is independently exercised by Task 1's helper tests.)
- [ ] **Step 4: Commit**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
git add app/cm_bot.py && \
git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
commit -m "refactor(scraper): make get_register_link and get_user_credit dump on failure"
```
---
## Task 4: Final verification
**Files:** none modified.
- [ ] **Step 1: All tests green**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8
```
Expected: 38 tests, `OK`.
- [ ] **Step 2: Sanity-grep for the old pattern**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
grep -n "soup.find('input'.*\['value'\]" app/cm_bot.py && echo "STILL THERE" || echo "OK: no bare input-value extractions"
```
Expected: `OK: no bare input-value extractions`.
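To see what the grep does and does not flag, the same pattern can be rendered in Python and run against sample lines taken from this plan. This is an illustrative sketch; `OLD_PATTERN`, `samples`, and `matches` are names introduced here, and the Python form escapes the parenthesis, which grep's basic regex syntax treats as a literal.

```python
import re

# Python rendering of the grep pattern used in the sanity check.
OLD_PATTERN = re.compile(r"soup.find\('input'.*\['value'\]")

samples = {
    "old_name": """return soup.find('input', {'name' : "token"})['value']""",
    "old_id": """toUserId = soup.find('input', {'id': "toUserId"})['value']""",
    "new_call": """return self._find_input_value(soup, "token", context="transfer_token", raw=response.content)""",
    "helper_body": """el = soup.find("input", {by: ident})""",
}

# True means the line would be reported by the sanity grep.
matches = {label: bool(OLD_PATTERN.search(line)) for label, line in samples.items()}
```

The helper's own `soup.find("input", ...)` line is not flagged because it uses double quotes and checks for the attribute before indexing, so the grep only reports leftover bare extractions.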
- [ ] **Step 3: ScraperError is exported from `app.cm_bot`**
```bash
cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -c "
from app.cm_bot import CM_BOT, ScraperError
assert issubclass(ScraperError, Exception)
assert hasattr(CM_BOT, '_dump_html')
assert hasattr(CM_BOT, '_find_input_value')
print('ScraperError + helpers OK')
"
```
Expected: `ScraperError + helpers OK`.
- [ ] **Step 4: Real-call smoke (deferred to operator)**
Trigger an actual bot operation against cm99.net (e.g., from the dev tier with real agent creds: `bash scripts/bot_cli.sh credit <username> <password>`). On success: behavior unchanged. On a parse failure that previously would have TypeError'd: a `ScraperError` propagates with a clear message and a file appears under `logs/scraper-failures/<context>-<timestamp>.html`.
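When checking for dumps after the smoke run, a small listing helper can save some manual `ls` work. The function below is a hypothetical convenience, not part of the plan's deliverables; it assumes the relative `logs/scraper-failures` directory used by `_dump_html` and sorts by file modification time.

```python
from pathlib import Path


def latest_dumps(root="logs/scraper-failures", limit=5):
    """Return up to `limit` dump paths, newest first (hypothetical helper)."""
    directory = Path(root)
    if not directory.is_dir():
        # The directory is only created on the first scraper failure.
        return []
    files = sorted(
        directory.glob("*.html"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return files[:limit]
```

An empty result after a successful run is the expected outcome, since the happy path never creates the directory.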
---
## Spec Coverage Check (self-review)
| Spec requirement | Task |
|---|---|
| `ScraperError` class | Task 1 |
| `_dump_html` instance method | Task 1 |
| `_find_input_value` instance method (name-based lookup) | Task 1 |
| `_find_input_value` extension to support `by="id"` for `transfer_credit` | Task 2 |
| Convert `get_register_form_token` | Task 2 step 1 |
| Convert `get_security_pin_form_token` | Task 2 step 2 |
| Convert `get_transfer_token` | Task 2 step 3 |
| Convert three extractions inside `transfer_credit` (`name`, `token`, `toUserId`) | Task 2 step 4 |
| `get_register_link` failure path dumps + raises | Task 3 step 1 |
| `get_user_credit` failure path dumps + logs (returns 0 unchanged) | Task 3 step 2 |
| Unit tests in `tests/test_cm_bot_scraper.py` | Task 1 + Task 2 |
| `logs/` already gitignored, no .gitignore change | (existing — verified pre-flight) |
| No CSRF token caching | (intentionally not in plan) |
No gaps. No placeholders. `ScraperError`, `_dump_html`, `_find_input_value` names consistent across tasks. `by` parameter introduced in Task 2 with a default that preserves Task 1's API contract.