diff --git a/docs/superpowers/plans/2026-05-02-r3-scraper-resilience.md b/docs/superpowers/plans/2026-05-02-r3-scraper-resilience.md
new file mode 100644
index 0000000..a6c6d6e
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-02-r3-scraper-resilience.md
@@ -0,0 +1,639 @@
+# R3: Scraper Resilience Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Replace the bare `soup.find(...)['value']` pattern in `app/cm_bot.py` with a helper that raises a typed `ScraperError` and dumps the failing HTML to `logs/scraper-failures/` for postmortem.
+
+**Architecture:** Add a module-level `ScraperError` exception and two `CM_BOT` methods, `_dump_html` and `_find_input_value`; convert the six existing call sites that use the `soup.find('input', ...)['value']` pattern; extend the `get_register_link` and `get_user_credit` failure paths to dump HTML. Tests live in a new `tests/test_cm_bot_scraper.py`.
+
+**Tech Stack:** Python 3.9 (containers) / 3.12 (local venv), `unittest` + `unittest.mock` (stdlib), `BeautifulSoup` (existing dep). No new dependencies.
+
+**Spec:** [docs/superpowers/specs/2026-05-02-r3-scraper-resilience-design.md](../specs/2026-05-02-r3-scraper-resilience-design.md)
+
+---
+
+## File Map
+
+| File | Operation | Purpose |
+|---|---|---|
+| `tests/test_cm_bot_scraper.py` | Create | Unit tests for `ScraperError`, `_dump_html`, `_find_input_value`. |
+| `app/cm_bot.py` | Modify | Add `ScraperError` and the helpers; convert six input-value extractions; extend `get_register_link` and `get_user_credit`. |
+
+The helpers are added to the `CM_BOT` class as instance methods, even though `_dump_html` and `_find_input_value` don't need any instance state. This keeps the API uniform with everything else in `CM_BOT`.
+
+---
+
+## Task 1: Add `ScraperError`, `_dump_html`, `_find_input_value` (TDD)
+
+**Files:**
+- Create: `tests/test_cm_bot_scraper.py`
+- Modify: `app/cm_bot.py`
+
+- [ ] **Step 1: Write the failing tests**
+
+Create `tests/test_cm_bot_scraper.py`:
+
+```python
+"""Tests for the cm_bot scraper resilience helpers.
+
+The CM_BOT class currently uses bare `soup.find(...)['value']` calls
+that throw cryptic TypeErrors when cm99.net returns an unexpected
+response. R3 introduces three pieces:
+  - ScraperError: typed exception so callers can distinguish scraper
+    failures from network errors.
+  - _dump_html(context, content): writes the failing response to
+    logs/scraper-failures/<context>-<timestamp>.html and returns the path.
+  - _find_input_value(soup, name, *, context, raw): the dominant
+    extraction pattern. Returns the value on success, dumps + raises
+    ScraperError on miss.
+
+These tests do NOT exercise the live cm99.net integration. They use
+small inline HTML fixtures and a per-test temp directory so the
+tests stay hermetic.
+"""
+
+import os
+import shutil
+import tempfile
+import unittest
+from unittest import mock
+
+from bs4 import BeautifulSoup
+
+from app.cm_bot import CM_BOT, ScraperError
+
+
+class ScraperHelpersTests(unittest.TestCase):
+    def setUp(self):
+        # CM_BOT.__init__ reads CM_BOT_BASE_URL from the env (raises
+        # otherwise). Set a placeholder so the class is instantiable;
+        # nothing here touches the network. The patch is started here
+        # rather than applied as a class decorator because
+        # mock.patch.dict on a class wraps only the test_* methods,
+        # and CM_BOT() is constructed in setUp.
+        env = mock.patch.dict(
+            os.environ, {"CM_BOT_BASE_URL": "https://example.invalid"}
+        )
+        env.start()
+        self.addCleanup(env.stop)
+        # Each test gets a fresh tmpdir so the dump helper writes
+        # somewhere predictable. We chdir into it for the duration of
+        # the test because _dump_html writes to a relative
+        # logs/scraper-failures path.
+        self._old_cwd = os.getcwd()
+        self._tmp = tempfile.mkdtemp(prefix="r3-test-")
+        os.chdir(self._tmp)
+        self.bot = CM_BOT()
+
+    def tearDown(self):
+        os.chdir(self._old_cwd)
+        shutil.rmtree(self._tmp, ignore_errors=True)
+
+    # ---- _dump_html ----
+
+    def test_dump_html_creates_dir_and_writes_bytes(self):
+        path = self.bot._dump_html("ctx-test", b"hi")
+        self.assertTrue(os.path.isfile(path), f"file should exist: {path}")
+        with open(path, "rb") as f:
+            self.assertEqual(f.read(), b"hi")
+        # The file lives under the (newly created) dump directory.
+        self.assertTrue(path.startswith(os.path.join("logs", "scraper-failures")))
+
+    def test_dump_html_accepts_str_content(self):
+        path = self.bot._dump_html("ctx-test", "hi")
+        with open(path, "rb") as f:
+            self.assertEqual(f.read(), b"hi")
+
+    def test_dump_html_includes_context_and_timestamp_in_filename(self):
+        path = self.bot._dump_html("register_form_token", b"x")
+        basename = os.path.basename(path)
+        self.assertTrue(basename.startswith("register_form_token-"), basename)
+        self.assertTrue(basename.endswith(".html"), basename)
+
+    # ---- _find_input_value ----
+
+    def test_find_input_value_returns_value_when_present(self):
+        html = '<form><input name="token" value="abc123"></form>'
+        soup = BeautifulSoup(html, "html.parser")
+        result = self.bot._find_input_value(
+            soup, "token", context="happy_path", raw=html.encode()
+        )
+        self.assertEqual(result, "abc123")
+
+    def test_find_input_value_raises_and_dumps_when_missing(self):
+        html = '<form><input name="other" value="x"></form>'
+        soup = BeautifulSoup(html, "html.parser")
+        with self.assertRaises(ScraperError) as cm:
+            self.bot._find_input_value(
+                soup, "token", context="missing_input", raw=html.encode()
+            )
+        msg = str(cm.exception)
+        self.assertIn("missing_input", msg)
+        self.assertIn("token", msg)
+        # The path mentioned in the message must actually exist.
+        # It appears in parentheses at the end: "(response saved to <path>)".
+        # We check by listing the dump dir.
+        dumped = os.listdir(os.path.join("logs", "scraper-failures"))
+        self.assertEqual(len(dumped), 1, f"expected one dump, got {dumped}")
+        self.assertTrue(dumped[0].startswith("missing_input-"))
+
+    def test_find_input_value_raises_when_input_has_no_value_attr(self):
+        html = '<form><input name="token"></form>'
+        soup = BeautifulSoup(html, "html.parser")
+        with self.assertRaises(ScraperError):
+            self.bot._find_input_value(
+                soup, "token", context="no_value_attr", raw=html.encode()
+            )
+
+    def test_find_input_value_does_not_dump_on_success(self):
+        html = '<form><input name="token" value="abc123"></form>'
+        soup = BeautifulSoup(html, "html.parser")
+        self.bot._find_input_value(
+            soup, "token", context="should_not_dump", raw=html.encode()
+        )
+        # logs/scraper-failures may not even exist on the happy path.
+        self.assertFalse(
+            os.path.isdir(os.path.join("logs", "scraper-failures")),
+            "happy path should not create the failure dir",
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10
+```
+
+Expected: `ImportError: cannot import name 'ScraperError' from 'app.cm_bot'` (or similar). The whole class is missing.
+
+- [ ] **Step 3: Add `ScraperError`, `_dump_html`, `_find_input_value` to `app/cm_bot.py`**
+
+In `app/cm_bot.py`, the top of the file currently has:
+
+```python
+import datetime
+import requests, re
+from bs4 import BeautifulSoup
+import os
+```
+
+Add `ScraperError` immediately after the imports (before `class CM_BOT:`):
+
+```python
+class ScraperError(Exception):
+    """A cm99.net response did not contain the field we expected.
+
+    The raw response is saved to logs/scraper-failures/ before this is
+    raised; the message identifies which method failed and what was
+    being looked for.
+    """
+```
+
+Then add the two helper methods inside `class CM_BOT:`. A natural placement is right after `_setup_headers` and before `get_register_data` (around line 204):
+
+```python
+    def _dump_html(self, context: str, content) -> str:
+        """Save a failing cm99.net response to logs/scraper-failures/.
+
+        Returns the path written to so callers can include it in error
+        messages.
+        """
+        ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
+        out_dir = os.path.join("logs", "scraper-failures")
+        os.makedirs(out_dir, exist_ok=True)
+        path = os.path.join(out_dir, f"{context}-{ts}.html")
+        # Accept both response.content (bytes) and response.text (str).
+        if isinstance(content, (bytes, bytearray)):
+            data = bytes(content)
+        else:
+            data = str(content).encode("utf-8", "replace")
+        with open(path, "wb") as f:
+            f.write(data)
+        print(f"[scraper-failure] dumped {context} response to {path}")
+        return path
+
+    def _find_input_value(self, soup, name: str, *, context: str, raw) -> str:
+        """Extract <input name=...>'s value or raise ScraperError.
+
+        Saves the raw response to logs/scraper-failures/ before raising
+        so the operator can postmortem.
+        """
+        el = soup.find("input", {"name": name})
+        if el is None or "value" not in el.attrs:
+            path = self._dump_html(context, raw)
+            raise ScraperError(
+                f"{context}: input[name={name!r}] missing or has no value attribute "
+                f"(response saved to {path})"
+            )
+        return el["value"]
+```
+
+- [ ] **Step 4: Run tests to verify they pass**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10
+```
+
+Expected: 7 tests, `OK`.
+
+- [ ] **Step 5: Confirm prior tests still pass**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8
+```
+
+Expected: combined `OK`. Total: 2 (debug) + 28 (bot_cli) + 7 (scraper) = 37 tests passing.
+
+- [ ] **Step 6: Commit**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+git add tests/test_cm_bot_scraper.py app/cm_bot.py && \
+git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
+  commit -m "feat(scraper): add ScraperError + _dump_html + _find_input_value helpers"
+```
+
+---
+
+## Task 2: Convert the six `soup.find('input', ...)['value']` extractions to use the helper
+
+**Files:**
+- Modify: `app/cm_bot.py` (`get_register_form_token`, `get_security_pin_form_token`, `get_transfer_token`, and three lines inside `transfer_credit`)
+
+The dominant pattern in `cm_bot.py` is `soup.find('input', {'name': 'token'})['value']`. Replacing each call site is mechanical: keep the request, change the extraction.
+
+- [ ] **Step 1: Convert `get_register_form_token`**
+
+Find (around line 344-354):
+
+```python
+    def get_register_form_token(self):
+        try:
+            response = self.session.post(
+                f'{self.base_url}/cm/loadUserAccount',
+                headers=self.get_register_form_headers
+            )
+            soup = BeautifulSoup(response.content, 'html.parser')
+            return soup.find('input', {'name' : "token"})['value']
+        except requests.exceptions.RequestException as e:
+            print(f"Error getting register form: {e}")
+            return None
+```
+
+Replace the `soup.find(...)['value']` line with the helper:
+
+```python
+    def get_register_form_token(self):
+        try:
+            response = self.session.post(
+                f'{self.base_url}/cm/loadUserAccount',
+                headers=self.get_register_form_headers
+            )
+            soup = BeautifulSoup(response.content, 'html.parser')
+            return self._find_input_value(
+                soup, "token",
+                context="register_form_token",
+                raw=response.content,
+            )
+        except requests.exceptions.RequestException as e:
+            print(f"Error getting register form: {e}")
+            return None
+```
+
+The `except requests.exceptions.RequestException` clause only catches network errors. `ScraperError` inherits directly from `Exception`, so it propagates up to the `except Exception as e` handler in `cm_bot_hal.py`. That is the same path the old `TypeError` took, just with a useful message.
+
+- [ ] **Step 2: Convert `get_security_pin_form_token`**
+
+Find (around line 357-360):
+
+```python
+    def get_security_pin_form_token(self):
+        response = self.session.get(f'{self.base_url}/cm/setSecurityPin')
+        soup = BeautifulSoup(response.content, 'html.parser')
+        return soup.find('input', {'name' : "token"})['value']
+```
+
+Replace with:
+
+```python
+    def get_security_pin_form_token(self):
+        response = self.session.get(f'{self.base_url}/cm/setSecurityPin')
+        soup = BeautifulSoup(response.content, 'html.parser')
+        return self._find_input_value(
+            soup, "token",
+            context="security_pin_form_token",
+            raw=response.content,
+        )
+```
+
+- [ ] **Step 3: Convert `get_transfer_token`**
+
+Find (around line 463-466):
+
+```python
+    def get_transfer_token(self):
+        response = self.session.get(f'{self.base_url}/cm/transfer')
+        soup = BeautifulSoup(response.content, 'html.parser')
+        return soup.find('input', {'name' : "token"})['value']
+```
+
+Replace with:
+
+```python
+    def get_transfer_token(self):
+        response = self.session.get(f'{self.base_url}/cm/transfer')
+        soup = BeautifulSoup(response.content, 'html.parser')
+        return self._find_input_value(
+            soup, "token",
+            context="transfer_token",
+            raw=response.content,
+        )
+```
+
+- [ ] **Step 4: Convert the three extractions inside `transfer_credit`**
+
+Find (around line 426-446):
+
+```python
+    def transfer_credit(self, t_username: str, t_password: str, amount: float):
+        token = self.get_transfer_token()
+        transfer_search_data = self.get_transfer_search_data(token, t_username)
+        response = self.session.post(
+            f'{self.base_url}/cm/searchTransferUser',
+            data=transfer_search_data,
+            headers=self.transfer_search_headers
+        )
+        soup = BeautifulSoup(response.content, 'html.parser')
+        name = soup.find('input', {'id': "name"})['value']
+        token = soup.find('input', {'name': "token"})['value']
+        toUserId = soup.find('input', {'id': "toUserId"})['value']
+```
+
+This block uses two different finders: `{'name': X}` for `token`, and `{'id': X}` for `name` and `toUserId`. The `_find_input_value` helper as written only handles `{'name': X}`. We have two options:
+
+**Option A: extend the helper.** Add an optional `by` parameter (`'name'` or `'id'`).
+**Option B: keep `_find_input_value` narrow** and write inline checks for the `id`-based lookups.
+
+We pick Option A. It's a one-parameter widening with a default of `"name"`, so existing call sites are unchanged.
+
+In `app/cm_bot.py`, update the helper signature:
+
+```python
+    def _find_input_value(self, soup, ident: str, *, context: str, raw, by: str = "name") -> str:
+        """Extract an <input>'s value or raise ScraperError.
+
+        `by` selects between matching <input name=...> (default) and
+        <input id=...>. Saves the raw response to logs/scraper-failures/
+        before raising so the operator can postmortem.
+        """
+        el = soup.find("input", {by: ident})
+        if el is None or "value" not in el.attrs:
+            path = self._dump_html(context, raw)
+            raise ScraperError(
+                f"{context}: input[{by}={ident!r}] missing or has no value attribute "
+                f"(response saved to {path})"
+            )
+        return el["value"]
+```
+
+The helper's second parameter is now called `ident` instead of `name`. No existing test needs to change for this. Add a test for the `by="id"` path by appending to `tests/test_cm_bot_scraper.py` inside `ScraperHelpersTests`:
+
+```python
+    def test_find_input_value_supports_by_id(self):
+        html = '<form><input id="toUserId" value="42"></form>'
+        soup = BeautifulSoup(html, "html.parser")
+        result = self.bot._find_input_value(
+            soup, "toUserId", context="by_id", raw=html.encode(), by="id",
+        )
+        self.assertEqual(result, "42")
+```
+
+The four existing `_find_input_value` tests keep working because they pass `"token"` positionally, so the `name` → `ident` rename does not affect them.
+
+Now replace the body of `transfer_credit`:
+
+```python
+    def transfer_credit(self, t_username: str, t_password: str, amount: float):
+        token = self.get_transfer_token()
+        transfer_search_data = self.get_transfer_search_data(token, t_username)
+        response = self.session.post(
+            f'{self.base_url}/cm/searchTransferUser',
+            data=transfer_search_data,
+            headers=self.transfer_search_headers
+        )
+        soup = BeautifulSoup(response.content, 'html.parser')
+        name = self._find_input_value(
+            soup, "name", context="transfer_search_name", raw=response.content, by="id",
+        )
+        token = self._find_input_value(
+            soup, "token", context="transfer_search_token", raw=response.content,
+        )
+        toUserId = self._find_input_value(
+            soup, "toUserId", context="transfer_search_toUserId", raw=response.content, by="id",
+        )
+        transfer_data = self.get_transfer_data(token, t_username, name, toUserId, amount, t_password)
+        response = self.session.post(
+            f'{self.base_url}/cm/saveTransfer',
+            data=transfer_data,
+            headers=self.transfer_credit_headers
+        )
+        return True if re.search(r'Successfully saved the record\.', response.text) else False
+```
+
+The rest of `transfer_credit` (the second POST and the success-string check) stays identical. The commented-out `# with open('transfer_credit.html', ...)` block at the end can be deleted as part of this edit, since the dump now happens automatically on a parse miss.
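
One caveat on the `name` → `ident` rename in this step: it is only safe for callers that pass the argument positionally. The stand-alone sketch below illustrates this with two hypothetical functions, `find_v1` and `find_v2`, that mirror only the signatures of the helper before and after the change. They are not the real `CM_BOT` methods.

```python
def find_v1(soup, name, *, context, raw=None):
    # Hypothetical stand-in for the Task 1 signature.
    return name


def find_v2(soup, ident, *, context, raw=None, by="name"):
    # Hypothetical stand-in for the Task 2 signature (parameter renamed).
    return ident


# A positional caller behaves the same before and after the rename.
before = find_v1(None, "token", context="demo")
after = find_v2(None, "token", context="demo")

# A caller that used the old keyword would break after the rename.
try:
    find_v2(None, name="token", context="demo")
    keyword_error = None
except TypeError as exc:
    keyword_error = str(exc)
```

If any call site outside the tests passed `name=` as a keyword, it would need updating in the same commit. The plan's call sites all pass the identifier positionally.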
+
+- [ ] **Step 5: Run tests to verify everything still passes**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10
+```
+
+Expected: 8 tests, `OK` (seven original + one new for `by="id"`).
+
+- [ ] **Step 6: Confirm full suite still green**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8
+```
+
+Expected: total 38 tests, `OK`.
+
+- [ ] **Step 7: Commit**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+git add tests/test_cm_bot_scraper.py app/cm_bot.py && \
+git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
+  commit -m "refactor(scraper): convert input-value extractions to helper"
+```
+
+---
+
+## Task 3: Make `get_register_link` and `get_user_credit` failure paths informative
+
+**Files:**
+- Modify: `app/cm_bot.py` (`get_register_link`, `get_user_credit`)
+
+These two methods don't fit the input-value helper. `get_register_link` extracts an `<a href>` from a specific form; `get_user_credit` does multi-step text-content navigation through a table. We add explicit dump+raise behavior to the first and dump+log behavior to the second.
+
+- [ ] **Step 1: Update `get_register_link`**
+
+Find (around line 402-406):
+
+```python
+    def get_register_link(self):
+        response = self.session.get(f"{self.base_url}/cm/showQrCode")
+        soup = BeautifulSoup(response.content, 'html.parser')
+        soup = soup.find('form', {'id': 'qrCodeForm'})
+        return soup.find('a')['href']
+```
+
+Replace with:
+
+```python
+    def get_register_link(self):
+        response = self.session.get(f"{self.base_url}/cm/showQrCode")
+        soup = BeautifulSoup(response.content, 'html.parser')
+        form = soup.find('form', {'id': 'qrCodeForm'})
+        if form is None:
+            path = self._dump_html("register_link_form", response.content)
+            raise ScraperError(
+                f"register_link: form#qrCodeForm not found "
+                f"(response saved to {path})"
+            )
+        anchor = form.find('a')
+        if anchor is None or 'href' not in anchor.attrs:
+            path = self._dump_html("register_link_anchor", response.content)
+            raise ScraperError(
+                f"register_link: <a href> inside form#qrCodeForm not found "
+                f"(response saved to {path})"
+            )
+        return anchor['href']
+```
+
+- [ ] **Step 2: Update `get_user_credit`'s except block**
+
+Find (around line 448-461):
+
+```python
+    def get_user_credit(self):
+        response = self.session.post(
+            f'{self.base_url}/cm/userProfile',
+            headers=self.get_user_credit_headers
+        )
+        soup = BeautifulSoup(response.content, 'html.parser')
+        try:
+            return round(float(soup.find('table', {'class': 'generalContent'}).find(text=re.compile('Credit Available')).parent.parent.find_all('td')[2].text.replace(",","")), 2)
+        except:
+            print(f"Error getting credit.")
+            now = datetime.datetime.now().strftime('%Y%m%d_%H%M')
+            # with open(f'credit-{now}.html', 'wb') as f:
+            #     f.write(response.content)
+            return 0
+```
+
+Replace the `except:` block so it actively dumps the HTML (restore the previously commented-out dump and route it through the helper):
+
+```python
+    def get_user_credit(self):
+        response = self.session.post(
+            f'{self.base_url}/cm/userProfile',
+            headers=self.get_user_credit_headers
+        )
+        soup = BeautifulSoup(response.content, 'html.parser')
+        try:
+            return round(float(soup.find('table', {'class': 'generalContent'}).find(text=re.compile('Credit Available')).parent.parent.find_all('td')[2].text.replace(",","")), 2)
+        except Exception as exc:
+            self._dump_html("get_user_credit", response.content)
+            print(f"Error getting credit: {exc}")
+            return 0
+```
+
+Three changes inside the `except`: catch `Exception as exc` (was a bare `except`), call `_dump_html` (was a commented-out `with open(...)`), and drop the now-unused `now = datetime.datetime.now()...` line. Narrowing the bare `except` to `Exception as exc` is intentional: the bare form also caught `KeyboardInterrupt` and `SystemExit`, which a credit read should not swallow.
+
+The function still returns `0` on failure to preserve the existing contract (callers in `cm_bot_hal.py:transfer_credit_api` check `amount <= 0.01` and short-circuit). We do not change that.
+
+- [ ] **Step 3: Run all tests**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8
+```
+
+Expected: 38 tests, `OK`. (No new tests in this task: the changed methods are integration-level and would need live cm99.net or HTML fixtures to exercise. The two methods' happy paths are unchanged; their failure paths rely on `_dump_html`, which Task 1's helper tests exercise independently.)
+
+- [ ] **Step 4: Commit**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+git add app/cm_bot.py && \
+git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
+  commit -m "refactor(scraper): make get_register_link and get_user_credit dump on failure"
+```
+
+---
+
+## Task 4: Final verification
+
+**Files:** none modified.
+
+- [ ] **Step 1: All tests green**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8
+```
+
+Expected: 38 tests, `OK`.
+
+- [ ] **Step 2: Sanity-grep for the old pattern**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+grep -n "soup.find('input'.*\['value'\]" app/cm_bot.py && echo "STILL THERE" || echo "OK: no bare input-value extractions"
+```
+
+Expected: `OK: no bare input-value extractions`.
+
+- [ ] **Step 3: ScraperError is exported from `app.cm_bot`**
+
+```bash
+cd /home/yiekheng/projects/cm_bot_v2 && \
+.venv/bin/python -c "
+from app.cm_bot import CM_BOT, ScraperError
+assert issubclass(ScraperError, Exception)
+assert hasattr(CM_BOT, '_dump_html')
+assert hasattr(CM_BOT, '_find_input_value')
+print('ScraperError + helpers OK')
+"
+```
+
+Expected: `ScraperError + helpers OK`.
+
+- [ ] **Step 4: Real-call smoke (deferred to operator)**
+
+Trigger an actual bot operation against cm99.net (e.g., from the dev tier with real agent creds: `bash scripts/bot_cli.sh credit `). On success, behavior is unchanged. On a parse failure that previously would have raised a `TypeError`, a `ScraperError` propagates with a clear message and a file appears under `logs/scraper-failures/<context>-<timestamp>.html`.
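
Before the live smoke check, the failure path can be rehearsed offline. The sketch below is a stand-alone approximation, not the code in `app/cm_bot.py`: it uses the stdlib `html.parser` in place of BeautifulSoup so that it runs without dependencies. The module-level names `dump_html`, `find_input_value`, and `_InputFinder` are hypothetical stand-ins for the `CM_BOT` methods, and the "Session expired" page is an invented fixture.

```python
import datetime
import os
import tempfile
from html.parser import HTMLParser


class ScraperError(Exception):
    """Stand-in for the ScraperError this plan adds to app/cm_bot.py."""


class _InputFinder(HTMLParser):
    """Remembers the attributes of the first matching <input> tag."""

    def __init__(self, by, ident):
        super().__init__()
        self.by = by
        self.ident = ident
        self.found = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.found is None and tag == "input" and attr_map.get(self.by) == self.ident:
            self.found = attr_map


def dump_html(out_dir, context, content):
    # Same naming scheme as the plan: <context>-<timestamp>.html
    ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{context}-{ts}.html")
    data = content if isinstance(content, bytes) else str(content).encode("utf-8", "replace")
    with open(path, "wb") as f:
        f.write(data)
    return path


def find_input_value(html, ident, *, context, out_dir, by="name"):
    finder = _InputFinder(by, ident)
    finder.feed(html)
    # html.parser reports a bare `value` attribute as None, so treat it as missing too.
    if finder.found is None or finder.found.get("value") is None:
        path = dump_html(out_dir, context, html)
        raise ScraperError(
            f"{context}: input[{by}={ident!r}] missing or has no value attribute "
            f"(response saved to {path})"
        )
    return finder.found["value"]


# Rehearsal: one well-formed page, one unexpected page (invented fixture).
out_dir = os.path.join(tempfile.mkdtemp(prefix="r3-smoke-"), "logs", "scraper-failures")
ok = find_input_value(
    '<form><input name="token" value="abc123"></form>',
    "token", context="smoke_ok", out_dir=out_dir,
)
failure = None
try:
    find_input_value(
        "<html><body>Session expired</body></html>",
        "token", context="smoke_miss", out_dir=out_dir,
    )
except ScraperError as exc:
    failure = str(exc)
dumped = os.listdir(out_dir)
```

After the rehearsal, `failure` holds the message an operator would see in the logs and `dumped` lists the single file written for the unexpected page. The successful lookup writes nothing.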
+
+---
+
+## Spec Coverage Check (self-review)
+
+| Spec requirement | Task |
+|---|---|
+| `ScraperError` class | Task 1 |
+| `_dump_html` instance method | Task 1 |
+| `_find_input_value` instance method, default `by="name"` | Task 1 |
+| `_find_input_value` extension to support `by="id"` for `transfer_credit` | Task 2 |
+| Convert `get_register_form_token` | Task 2 step 1 |
+| Convert `get_security_pin_form_token` | Task 2 step 2 |
+| Convert `get_transfer_token` | Task 2 step 3 |
+| Convert three extractions inside `transfer_credit` (`name`, `token`, `toUserId`) | Task 2 step 4 |
+| `get_register_link` failure path dumps + raises | Task 3 step 1 |
+| `get_user_credit` failure path dumps + logs (returns 0 unchanged) | Task 3 step 2 |
+| Unit tests in `tests/test_cm_bot_scraper.py` | Task 1 + Task 2 |
+| `logs/` already gitignored, no .gitignore change | (existing, verified pre-flight) |
+| No CSRF token caching | (intentionally not in plan) |
+
+No gaps. No placeholders. The `ScraperError`, `_dump_html`, and `_find_input_value` names are consistent across tasks. The `by` parameter is introduced in Task 2 with a default that preserves Task 1's API contract.