yiekheng 9ec0d2ade4 Add implementation plan for R3 (scraper resilience)

2026-05-02 17:52:58 +08:00


R3: Scraper Resilience Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace the bare soup.find(...)['value'] pattern in app/cm_bot.py with a helper that raises a typed ScraperError and dumps the failing HTML to logs/scraper-failures/ for postmortem.

Architecture: Add a module-level ScraperError exception and two helper methods, _dump_html and _find_input_value, on the CM_BOT class; convert the six existing extractions that use the <input ... value="..."> pattern (four matched by name="token", two matched by id); extend get_register_link and get_user_credit failure paths to dump HTML. Tests live in a new tests/test_cm_bot_scraper.py.

Tech Stack: Python 3.9 (containers) / 3.12 (local venv), unittest + unittest.mock (stdlib), BeautifulSoup (existing dep). No new dependencies.

Spec: docs/superpowers/specs/2026-05-02-r3-scraper-resilience-design.md
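
To make the Goal concrete, here is a minimal sketch of how the bare pattern fails. The HTML snippets are invented for illustration (real cm99.net responses are not shown in this plan), and bare_extract and describe_failure are hypothetical names, not functions in app/cm_bot.py.

```python
from bs4 import BeautifulSoup


def bare_extract(html: str) -> str:
    """The pattern this plan replaces: no check between find() and ['value']."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("input", {"name": "token"})["value"]


def describe_failure(html: str) -> str:
    """Run bare_extract and report the exception type it raises, if any."""
    try:
        return "ok: " + bare_extract(html)
    except Exception as exc:
        return type(exc).__name__


# Invented fixtures: a normal form, a page without the input, and an
# input that has no value attribute.
happy = describe_failure('<input name="token" value="abc123">')
no_input = describe_failure("<html><body>Session expired</body></html>")
no_value = describe_failure('<input name="token">')

print(happy, no_input, no_value)
```

Neither exception says which method was running or what the server actually returned, which is the gap the typed error and the HTML dump are meant to close.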

File Map

File	Operation	Purpose
`tests/test_cm_bot_scraper.py`	Create	Unit tests for `ScraperError`, `_dump_html`, `_find_input_value`.
`app/cm_bot.py`	Modify	Add `ScraperError`, helpers; convert six `<input>` value extractions (four by `name="token"`, two by `id`); extend `get_register_link` and `get_user_credit`.

The helpers are added to CM_BOT as instance methods for consistency with the existing class-based methods, even though neither _dump_html nor _find_input_value needs any instance state. Sticking to instance methods keeps the API uniform with everything else in CM_BOT.

Task 1: Add `ScraperError`, `_dump_html`, `_find_input_value` (TDD)

Files:

Create: tests/test_cm_bot_scraper.py
Modify: app/cm_bot.py
Step 1: Write the failing tests

Create tests/test_cm_bot_scraper.py:

"""Tests for the cm_bot scraper resilience helpers.

The CM_BOT class currently uses bare `soup.find(...)['value']` calls
that throw cryptic TypeErrors when cm99.net returns an unexpected
response. R3 introduces three pieces:
  - ScraperError: typed exception so callers can distinguish scraper
    failures from network errors.
  - _dump_html(context, content): writes the failing response to
    logs/scraper-failures/<context>-<ts>.html and returns the path.
  - _find_input_value(soup, name, *, context, raw): the dominant
    extraction pattern. Returns the value on success, dumps + raises
    ScraperError on miss.

These tests do NOT exercise the live cm99.net integration. They use
small inline HTML fixtures and patch filesystem side effects so the
tests stay hermetic.
"""

import os
import shutil
import tempfile
import unittest
from unittest import mock

from bs4 import BeautifulSoup

from app.cm_bot import CM_BOT, ScraperError


# CM_BOT.__init__ reads CM_BOT_BASE_URL from the env (raises otherwise).
# Set a placeholder so the class is instantiable in tests; nothing here
# actually touches the network.
_TEST_ENV = {"CM_BOT_BASE_URL": "https://example.invalid"}


class ScraperHelpersTests(unittest.TestCase):
    def setUp(self):
        # Patch the env here rather than with a class decorator:
        # mock.patch.dict on a class wraps only test_* methods, so
        # CM_BOT() below would otherwise run outside the patch.
        env_patch = mock.patch.dict(os.environ, _TEST_ENV)
        env_patch.start()
        self.addCleanup(env_patch.stop)
        # Each test gets a fresh tmpdir so the dump helper writes
        # somewhere predictable. We chdir into it for the duration of
        # the test because _dump_html writes to a relative
        # logs/scraper-failures path.
        self._old_cwd = os.getcwd()
        self._tmp = tempfile.mkdtemp(prefix="r3-test-")
        os.chdir(self._tmp)
        self.bot = CM_BOT()

    def tearDown(self):
        os.chdir(self._old_cwd)
        shutil.rmtree(self._tmp, ignore_errors=True)

    # ---- _dump_html ----

    def test_dump_html_creates_dir_and_writes_bytes(self):
        path = self.bot._dump_html("ctx-test", b"<html>hi</html>")
        self.assertTrue(os.path.isfile(path), f"file should exist: {path}")
        with open(path, "rb") as f:
            self.assertEqual(f.read(), b"<html>hi</html>")
        # The directory was created.
        self.assertTrue(path.startswith(os.path.join("logs", "scraper-failures")))

    def test_dump_html_accepts_str_content(self):
        path = self.bot._dump_html("ctx-test", "<html>hi</html>")
        with open(path, "rb") as f:
            self.assertEqual(f.read(), b"<html>hi</html>")

    def test_dump_html_includes_context_and_timestamp_in_filename(self):
        path = self.bot._dump_html("register_form_token", b"x")
        basename = os.path.basename(path)
        self.assertTrue(basename.startswith("register_form_token-"), basename)
        self.assertTrue(basename.endswith(".html"), basename)

    # ---- _find_input_value ----

    def test_find_input_value_returns_value_when_present(self):
        html = '<form><input name="token" value="abc123"></form>'
        soup = BeautifulSoup(html, "html.parser")
        result = self.bot._find_input_value(
            soup, "token", context="happy_path", raw=html.encode()
        )
        self.assertEqual(result, "abc123")

    def test_find_input_value_raises_and_dumps_when_missing(self):
        html = '<form><input name="other" value="x"></form>'
        soup = BeautifulSoup(html, "html.parser")
        with self.assertRaises(ScraperError) as cm:
            self.bot._find_input_value(
                soup, "token", context="missing_input", raw=html.encode()
            )
        msg = str(cm.exception)
        self.assertIn("missing_input", msg)
        self.assertIn("token", msg)
        # The path mentioned in the message must actually exist.
        # The path appears in parentheses at the end: "(response saved to <path>)"
        # We check by listing the dump dir.
        dumped = os.listdir(os.path.join("logs", "scraper-failures"))
        self.assertEqual(len(dumped), 1, f"expected one dump, got {dumped}")
        self.assertTrue(dumped[0].startswith("missing_input-"))

    def test_find_input_value_raises_when_input_has_no_value_attr(self):
        html = '<form><input name="token"></form>'
        soup = BeautifulSoup(html, "html.parser")
        with self.assertRaises(ScraperError):
            self.bot._find_input_value(
                soup, "token", context="no_value_attr", raw=html.encode()
            )

    def test_find_input_value_does_not_dump_on_success(self):
        html = '<form><input name="token" value="abc"></form>'
        soup = BeautifulSoup(html, "html.parser")
        self.bot._find_input_value(
            soup, "token", context="should_not_dump", raw=html.encode()
        )
        # logs/scraper-failures may not even exist on the happy path.
        self.assertFalse(
            os.path.isdir(os.path.join("logs", "scraper-failures")),
            "happy path should not create the failure dir",
        )


if __name__ == "__main__":
    unittest.main()
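
One detail about environment patching is worth checking before relying on it. When mock.patch.dict decorates a class, it wraps only methods whose names start with patch.TEST_PREFIX ("test" by default), so setUp runs outside the patch. The sketch below demonstrates this with a made-up variable name (R3_DEMO_PLACEHOLDER_VAR) and a throwaway test case; it does not touch CM_BOT.

```python
import os
import unittest
from unittest import mock

KEY = "R3_DEMO_PLACEHOLDER_VAR"  # hypothetical variable, for the demo only
os.environ.pop(KEY, None)        # make sure it is unset before we start
seen = {}


@mock.patch.dict(os.environ, {KEY: "patched"})
class Demo(unittest.TestCase):
    def setUp(self):
        # Not wrapped by the class decorator: sees the unpatched env.
        seen["setUp"] = os.environ.get(KEY)

    def test_env(self):
        # Wrapped by the class decorator: sees the patched value.
        seen["test"] = os.environ.get(KEY)


result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(Demo).run(result)
print(seen)
```

This matters for any setUp that constructs CM_BOT(), because the constructor reads CM_BOT_BASE_URL. Starting the patcher inside setUp and registering its stop with addCleanup is one way to cover both setUp and the test body.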

Step 2: Run tests to verify they fail

cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10

Expected: ImportError: cannot import name 'ScraperError' from 'app.cm_bot' (or similar). Neither the exception class nor the helpers exist yet.

Step 3: Add ScraperError, _dump_html, _find_input_value to app/cm_bot.py

In app/cm_bot.py, the top of the file currently has:

import datetime
import requests, re
from bs4 import BeautifulSoup
import os

Add ScraperError immediately after the imports (before class CM_BOT:):

class ScraperError(Exception):
    """A cm99.net response did not contain the field we expected.

    The raw response is saved to logs/scraper-failures/ before this is
    raised; the message identifies which method failed and what was
    being looked for.
    """

Then add the two helper methods inside class CM_BOT:. A natural placement is right after _setup_headers and before get_register_data (around line 204):

    def _dump_html(self, context: str, content) -> str:
        """Save a failing cm99.net response to logs/scraper-failures/.

        Returns the path written to so callers can include it in error
        messages.
        """
        ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        out_dir = os.path.join("logs", "scraper-failures")
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, f"{context}-{ts}.html")
        if isinstance(content, (bytes, bytearray)):
            data = bytes(content)
        else:
            data = str(content).encode("utf-8", "replace")
        with open(path, "wb") as f:
            f.write(data)
        print(f"[scraper-failure] dumped {context} response to {path}")
        return path

    def _find_input_value(self, soup, name: str, *, context: str, raw) -> str:
        """Extract <input name=NAME value=...>'s value or raise ScraperError.

        Saves the raw response to logs/scraper-failures/ before raising
        so the operator can postmortem.
        """
        el = soup.find("input", {"name": name})
        if el is None or "value" not in el.attrs:
            path = self._dump_html(context, raw)
            raise ScraperError(
                f"{context}: input[name={name!r}] missing or has no value attribute "
                f"(response saved to {path})"
            )
        return el["value"]
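
The content normalisation in _dump_html can be checked in isolation. The sketch below copies that branch into a standalone function (to_bytes is a hypothetical name used only here) and shows what each input type turns into before it is written to disk.

```python
def to_bytes(content) -> bytes:
    """Standalone copy of the bytes/str branch in _dump_html."""
    if isinstance(content, (bytes, bytearray)):
        return bytes(content)
    # Anything else is stringified; unencodable characters become "?".
    return str(content).encode("utf-8", "replace")


# response.content style input: written through unchanged.
raw = to_bytes(b"<html>hi</html>")
# bytearray input is copied into an immutable bytes object.
from_bytearray = to_bytes(bytearray(b"abc"))
# response.text style input: encoded as UTF-8.
from_str = to_bytes("<p>caf\u00e9</p>")
# A lone surrogate cannot be encoded; "replace" substitutes "?" instead
# of raising, so the dump itself does not fail while handling a failure.
lone_surrogate = to_bytes("bad\ud800char")

print(raw, from_bytearray, from_str, lone_surrogate)
```

Passing response.content rather than response.text, as the call sites in this plan do, keeps the dump byte-for-byte identical to what the server sent.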

Step 4: Run tests to verify they pass

cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10

Expected: 7 tests, OK (three for _dump_html, four for _find_input_value).

Step 5: Confirm prior tests still pass

cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8

Expected: combined OK. Total: 2 (debug) + 28 (bot_cli) + 7 (scraper) = 37 tests passing.

Step 6: Commit

cd /home/yiekheng/projects/cm_bot_v2 && \
git add tests/test_cm_bot_scraper.py app/cm_bot.py && \
git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
  commit -m "feat(scraper): add ScraperError + _dump_html + _find_input_value helpers"

Task 2: Convert the six `<input>` value extractions to use the helper

Files:

Modify: app/cm_bot.py (get_register_form_token, get_security_pin_form_token, get_transfer_token, transfer_credit — three lines inside this method)

The dominant pattern in cm_bot.py is soup.find('input', {'name': 'token'})['value']. Replacing each call site is mechanical: keep the request, change the extraction.

Step 1: Convert get_register_form_token

Find (around line 344-354):

    def get_register_form_token(self):
        try:
            response = self.session.post(
                f'{self.base_url}/cm/loadUserAccount', 
                headers=self.get_register_form_headers
            )
            soup = BeautifulSoup(response.content, 'html.parser')
            return soup.find('input', {'name' : "token"})['value']
        except requests.exceptions.RequestException as e:
            print(f"Error getting register form: {e}")
            return None

Replace the soup.find(...)['value'] line with the helper:

    def get_register_form_token(self):
        try:
            response = self.session.post(
                f'{self.base_url}/cm/loadUserAccount', 
                headers=self.get_register_form_headers
            )
            soup = BeautifulSoup(response.content, 'html.parser')
            return self._find_input_value(
                soup, "token",
                context="register_form_token",
                raw=response.content,
            )
        except requests.exceptions.RequestException as e:
            print(f"Error getting register form: {e}")
            return None

The except requests.exceptions.RequestException clause only catches network errors. ScraperError inherits directly from Exception, so it is not caught here and propagates to the caller. cm_bot_hal.py catches it with except Exception as e, exactly as it caught the old TypeError, but now receives a useful message.
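
The propagation path can be sketched without the network. NetworkError below is a hypothetical stand-in for requests.exceptions.RequestException, and get_token and caller are simplified stand-ins for get_register_form_token and a cm_bot_hal.py-style caller; only the exception hierarchy matters here.

```python
class NetworkError(Exception):
    """Stand-in for requests.exceptions.RequestException (hypothetical)."""


class ScraperError(Exception):
    """Same shape as the plan's ScraperError: a plain Exception subclass."""


def get_token(fail_with):
    """Mimics get_register_form_token: only network errors are handled."""
    try:
        if fail_with is not None:
            raise fail_with
        return "abc123"
    except NetworkError as e:
        print(f"Error getting register form: {e}")
        return None


def caller(fail_with):
    """Mimics a caller that wraps the call in a broad except Exception."""
    try:
        return ("ok", get_token(fail_with))
    except Exception as e:
        return ("caught", str(e))


network = caller(NetworkError("timeout"))
scraper = caller(ScraperError("register_form_token: input[name='token'] missing"))
print(network, scraper)
```

A network error is still turned into None inside the method, while a parse miss reaches the caller with its message intact.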

Step 2: Convert get_security_pin_form_token

Find (around line 357-360):

    def get_security_pin_form_token(self):
        response = self.session.get(f'{self.base_url}/cm/setSecurityPin')
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.find('input', {'name' : "token"})['value']

Replace with:

    def get_security_pin_form_token(self):
        response = self.session.get(f'{self.base_url}/cm/setSecurityPin')
        soup = BeautifulSoup(response.content, 'html.parser')
        return self._find_input_value(
            soup, "token",
            context="security_pin_form_token",
            raw=response.content,
        )

Step 3: Convert get_transfer_token

Find (around line 463-466):

    def get_transfer_token(self):
        response = self.session.get(f'{self.base_url}/cm/transfer')
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.find('input', {'name' : "token"})['value']

Replace with:

    def get_transfer_token(self):
        response = self.session.get(f'{self.base_url}/cm/transfer')
        soup = BeautifulSoup(response.content, 'html.parser')
        return self._find_input_value(
            soup, "token",
            context="transfer_token",
            raw=response.content,
        )

Step 4: Convert the three extractions inside transfer_credit

Find (around line 426-446):

    def transfer_credit(self, t_username: str, t_password: str, amount: float):
        token = self.get_transfer_token()
        transfer_search_data = self.get_transfer_search_data(token, t_username)
        response = self.session.post(
            f'{self.base_url}/cm/searchTransferUser',
            data=transfer_search_data,
            headers=self.transfer_search_headers
        )
        soup = BeautifulSoup(response.content, 'html.parser')
        name = soup.find('input', {'id': "name"})['value']
        token = soup.find('input', {'name': "token"})['value']
        toUserId = soup.find('input', {'id': "toUserId"})['value']

This block uses two different finders: {'name': X} for token, and {'id': X} for name and toUserId. The _find_input_value helper as written only handles {'name': X}. We have two options:

Option A: extend the helper. Add an optional by parameter ('name' or 'id').
Option B: keep _find_input_value narrow and write inline checks for the id-based lookups.

We pick Option A. It's a one-parameter widening with a default of "name", so existing call sites are unchanged.

In app/cm_bot.py, update the helper signature:

    def _find_input_value(self, soup, ident: str, *, context: str, raw, by: str = "name") -> str:
        """Extract <input {by}=IDENT value=...>'s value or raise ScraperError.

        `by` selects between matching <input name=...> (default) and
        <input id=...>. Saves the raw response to logs/scraper-failures/
        before raising so the operator can postmortem.
        """
        el = soup.find("input", {by: ident})
        if el is None or "value" not in el.attrs:
            path = self._dump_html(context, raw)
            raise ScraperError(
                f"{context}: input[{by}={ident!r}] missing or has no value attribute "
                f"(response saved to {path})"
            )
        return el["value"]

The name parameter is now called ident. The existing tests pass it positionally, so none of them needs updating. Add a test for the by="id" path by appending it to tests/test_cm_bot_scraper.py inside ScraperHelpersTests:

    def test_find_input_value_supports_by_id(self):
        html = '<form><input id="toUserId" value="42"></form>'
        soup = BeautifulSoup(html, "html.parser")
        result = self.bot._find_input_value(
            soup, "toUserId", context="by_id", raw=html.encode(), by="id",
        )
        self.assertEqual(result, "42")

The four existing _find_input_value tests keep working because they pass the identifier positionally, so renaming the parameter from name to ident does not affect them.
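
Why the rename is safe for positional callers, and where it would not be, can be shown with two stub signatures. find_v1 and find_v2 are hypothetical stand-ins that mirror only the parameter lists of the helper before and after the change.

```python
def find_v1(soup, name, *, context, raw):
    """Parameter list of the Task 1 helper (body replaced by a stub)."""
    return (name, context)


def find_v2(soup, ident, *, context, raw, by="name"):
    """Parameter list after the Task 2 widening (body replaced by a stub)."""
    return (ident, context, by)


# Positional call, as used by the tests and call sites in this plan:
# works with both signatures, and by falls back to its default.
positional = find_v2(None, "token", context="demo", raw=b"")

# A keyword call using the old parameter name would break after the rename.
try:
    find_v2(None, name="token", context="demo", raw=b"")
    keyword_result = "ok"
except TypeError:
    keyword_result = "TypeError"

print(positional, keyword_result)
```

No call in this plan passes name= as a keyword, so the rename stays invisible to existing code.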

Now replace the body of transfer_credit:

    def transfer_credit(self, t_username: str, t_password: str, amount: float):
        token = self.get_transfer_token()
        transfer_search_data = self.get_transfer_search_data(token, t_username)
        response = self.session.post(
            f'{self.base_url}/cm/searchTransferUser',
            data=transfer_search_data,
            headers=self.transfer_search_headers
        )
        soup = BeautifulSoup(response.content, 'html.parser')
        name = self._find_input_value(
            soup, "name", context="transfer_search_name", raw=response.content, by="id",
        )
        token = self._find_input_value(
            soup, "token", context="transfer_search_token", raw=response.content,
        )
        toUserId = self._find_input_value(
            soup, "toUserId", context="transfer_search_toUserId", raw=response.content, by="id",
        )
        transfer_data = self.get_transfer_data(token, t_username, name, toUserId, amount, t_password)
        response = self.session.post(
            f'{self.base_url}/cm/saveTransfer',
            data=transfer_data,
            headers=self.transfer_credit_headers
        )
        return True if re.search(r'Successfully saved the record\.', response.text) else False

The rest of transfer_credit (the second POST and the success-string check) stays identical. The commented-out # with open('transfer_credit.html', ...) block at the end can be deleted as part of this edit (the dump now happens automatically on a parse miss).

Step 5: Run tests to verify everything still passes

cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_cm_bot_scraper -v 2>&1 | tail -10

Expected: 8 tests, OK (seven original + one new for by="id").

Step 6: Confirm full suite still green

cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8

Expected: total 38 tests, OK.

Step 7: Commit

cd /home/yiekheng/projects/cm_bot_v2 && \
git add tests/test_cm_bot_scraper.py app/cm_bot.py && \
git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
  commit -m "refactor(scraper): convert input-value extractions to helper"

Task 3: Make `get_register_link` and `get_user_credit` failure paths informative

Files:

Modify: app/cm_bot.py (get_register_link, get_user_credit)

These two methods don't fit the input-value helper. get_register_link extracts an <a href="..."> from a specific form; get_user_credit does multi-step text-content navigation through a table. We add explicit dump+raise / dump+log behavior at each.

Step 1: Update get_register_link

Find (around line 402-406):

    def get_register_link(self):
        response = self.session.get(f"{self.base_url}/cm/showQrCode")
        soup = BeautifulSoup(response.content, 'html.parser')
        soup = soup.find('form', {'id': 'qrCodeForm'})
        return soup.find('a')['href']

Replace with:

    def get_register_link(self):
        response = self.session.get(f"{self.base_url}/cm/showQrCode")
        soup = BeautifulSoup(response.content, 'html.parser')
        form = soup.find('form', {'id': 'qrCodeForm'})
        if form is None:
            path = self._dump_html("register_link_form", response.content)
            raise ScraperError(
                f"register_link: form#qrCodeForm not found "
                f"(response saved to {path})"
            )
        anchor = form.find('a')
        if anchor is None or 'href' not in anchor.attrs:
            path = self._dump_html("register_link_anchor", response.content)
            raise ScraperError(
                f"register_link: <a href> inside form#qrCodeForm not found "
                f"(response saved to {path})"
            )
        return anchor['href']

Step 2: Update get_user_credit's except block

Find (around line 448-461):

    def get_user_credit(self):
        response = self.session.post(
            f'{self.base_url}/cm/userProfile',
            headers=self.get_user_credit_headers
        )
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            return round(float(soup.find('table', {'class': 'generalContent'}).find(text=re.compile('Credit Available')).parent.parent.find_all('td')[2].text.replace(",","")), 2)
        except:
            print(f"Error getting credit.")
            now = datetime.datetime.now().strftime('%Y%m%d_%H%M')
            # with open(f'credit-{now}.html', 'wb') as f:
            #     f.write(response.content)
            return 0

Replace the except: block so it actively dumps the HTML (uncomment the previously-commented dump and route it through the helper):

    def get_user_credit(self):
        response = self.session.post(
            f'{self.base_url}/cm/userProfile',
            headers=self.get_user_credit_headers
        )
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            return round(float(soup.find('table', {'class': 'generalContent'}).find(text=re.compile('Credit Available')).parent.parent.find_all('td')[2].text.replace(",","")), 2)
        except Exception as exc:
            self._dump_html("get_user_credit", response.content)
            print(f"Error getting credit: {exc}")
            return 0

Three changes inside the except: catch Exception as exc (was a bare except), call _dump_html (was a commented-out with open(...)), and drop the now-unused now = datetime.datetime.now()... line. Narrowing the bare except to Exception is intentional: the bare except also caught KeyboardInterrupt and SystemExit, which a credit read should not swallow.
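
The difference between the two except forms can be checked directly. The functions below are hypothetical stand-ins for the old and new failure paths of get_user_credit; interrupt simulates Ctrl-C arriving during the credit read.

```python
def swallow_bare(action):
    """Old behaviour: a bare except catches every BaseException."""
    try:
        action()
    except:  # noqa: E722 - deliberately bare to show the old behaviour
        return 0


def swallow_exception(action):
    """New behaviour: only Exception subclasses are turned into 0."""
    try:
        action()
    except Exception:
        return 0


def interrupt():
    raise KeyboardInterrupt


def parse_miss():
    raise AttributeError("'NoneType' object has no attribute 'find'")


# The bare except hides the interrupt behind a credit of 0.
bare_on_interrupt = swallow_bare(interrupt)

# except Exception lets the interrupt through to the caller.
try:
    swallow_exception(interrupt)
    narrowed_on_interrupt = "swallowed"
except KeyboardInterrupt:
    narrowed_on_interrupt = "propagated"

# Ordinary parse failures are still converted to 0 as before.
narrowed_on_parse_miss = swallow_exception(parse_miss)

print(bare_on_interrupt, narrowed_on_interrupt, narrowed_on_parse_miss)
```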

The function still returns 0 on failure to preserve the existing contract (callers in cm_bot_hal.py:transfer_credit_api check amount <= 0.01 and short-circuit). We do not change that.

Step 3: Run all tests

cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8

Expected: 38 tests, OK. (No new tests in this task: the changed methods are integration-level and would need live cm99.net or HTML fixtures to exercise. The two methods' happy paths are unchanged; their failure paths are dump+raise/log, and the dump itself is independently exercised by Task 1's helper tests.)
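
If fixture-based coverage is wanted later, it could look roughly like the sketch below. This is not part of the plan: extract_register_link is a hypothetical standalone copy of the parsing half of get_register_link (without the dump), ScraperError is redefined locally so the sketch runs without importing app.cm_bot, and the three HTML fixtures are invented.

```python
from bs4 import BeautifulSoup


class ScraperError(Exception):
    """Local stand-in so the sketch runs without importing app.cm_bot."""


def extract_register_link(html: bytes) -> str:
    """Parsing half of get_register_link, minus the HTML dump."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form", {"id": "qrCodeForm"})
    if form is None:
        raise ScraperError("register_link: form#qrCodeForm not found")
    anchor = form.find("a")
    if anchor is None or "href" not in anchor.attrs:
        raise ScraperError(
            "register_link: <a href> inside form#qrCodeForm not found"
        )
    return anchor["href"]


# Invented fixtures; real cm99.net markup is not shown in this plan.
GOOD = b'<form id="qrCodeForm"><a href="https://example.invalid/r/1">x</a></form>'
NO_FORM = b"<html><body>login required</body></html>"
NO_HREF = b'<form id="qrCodeForm"><a>x</a></form>'


def outcome(html: bytes) -> str:
    """Return the link, or a string describing the ScraperError raised."""
    try:
        return extract_register_link(html)
    except ScraperError as exc:
        return f"ScraperError: {exc}"


print(outcome(GOOD))
print(outcome(NO_FORM))
print(outcome(NO_HREF))
```

Testing the real method this way would also need self.session.get mocked to return an object whose content attribute is the fixture bytes.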

Step 4: Commit

cd /home/yiekheng/projects/cm_bot_v2 && \
git add app/cm_bot.py && \
git -c user.name='yiekheng' -c user.email='yiekheng@04080616.xyz' \
  commit -m "refactor(scraper): make get_register_link and get_user_credit dump on failure"

Task 4: Final verification

Files: none modified.

Step 1: All tests green

cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -m unittest tests.test_debug_enabled tests.test_bot_cli tests.test_cm_bot_scraper -v 2>&1 | tail -8

Expected: 38 tests, OK.

Step 2: Sanity-grep for the old pattern

cd /home/yiekheng/projects/cm_bot_v2 && \
grep -n "soup.find('input'.*\['value'\]" app/cm_bot.py && echo "STILL THERE" || echo "OK: no bare input-value extractions"

Expected: OK: no bare input-value extractions.

Step 3: ScraperError is exported from app.cm_bot

cd /home/yiekheng/projects/cm_bot_v2 && \
.venv/bin/python -c "
from app.cm_bot import CM_BOT, ScraperError
assert issubclass(ScraperError, Exception)
assert hasattr(CM_BOT, '_dump_html')
assert hasattr(CM_BOT, '_find_input_value')
print('ScraperError + helpers OK')
"

Expected: ScraperError + helpers OK.

Step 4: Real-call smoke (deferred to operator)

Trigger an actual bot operation against cm99.net (e.g., from the dev tier with real agent creds: bash scripts/bot_cli.sh credit <username> <password>). On success: behavior unchanged. On a parse failure that previously would have TypeError'd: a ScraperError propagates with a clear message and a file appears under logs/scraper-failures/<context>-<timestamp>.html.

Spec Coverage Check (self-review)

Spec requirement	Task
`ScraperError` class	Task 1
`_dump_html` instance method	Task 1
`_find_input_value` instance method, default `by="name"`	Task 1
`_find_input_value` extension to support `by="id"` for `transfer_credit`	Task 2
Convert `get_register_form_token`	Task 2 step 1
Convert `get_security_pin_form_token`	Task 2 step 2
Convert `get_transfer_token`	Task 2 step 3
Convert three extractions inside `transfer_credit` (`name`, `token`, `toUserId`)	Task 2 step 4
`get_register_link` failure path dumps + raises	Task 3 step 1
`get_user_credit` failure path dumps + logs (returns 0 unchanged)	Task 3 step 2
Unit tests in `tests/test_cm_bot_scraper.py`	Task 1 + Task 2
`logs/` already gitignored, no .gitignore change	(existing — verified pre-flight)
No CSRF token caching	(intentionally not in plan)

No gaps. No placeholders. ScraperError, _dump_html, _find_input_value names consistent across tasks. by parameter introduced in Task 2 with a default that preserves Task 1's API contract.
