yiekheng 9cb9b8c7fd Lazy config, cross-platform support, session recovery, doc accuracy
Code:
- Defer boto3 client and DATABASE_URL reads to first use via
  _ensure_config(). Missing .env now prints a friendly "Missing env
  vars" list and exits instead of KeyError on import.
- Auto-detect Chrome binary from CHROME_CANDIDATES (macOS/Linux/Windows
  paths). Friendly error listing tried paths if none found.
- Guard termios/tty imports; EscListener becomes a no-op on Windows.
- hide_chrome() is a no-op on non-macOS (osascript only works on Darwin).
- with_browser catches target-closed/disconnected errors, resets the
  session singleton, and retries once before raising.

Docs:
- Fix claim that page.goto is never used — manga listing uses
  page.goto, only reader pages use window.location.href.
- Correct AppleScript command (full tell-application form).
- Clarify "Check missing pages" flow — re-upload is inline; dim-only
  fix reads bytes from R2 without re-upload.
- Add CREATE TABLE statements for Manga/Chapter/Page so schema contract
  is explicit.
- Add "Where to change what" table mapping tasks to code locations.
- Document lazy config, cross-platform constraints, and anti-patterns
  (headless, thread parallelism).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:32:20 +08:00


# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Single-file interactive toolkit (`manga.py`) that downloads manga from m.happymh.com, stores images in Cloudflare R2 as WebP, and writes metadata to PostgreSQL. Runs as an arrow-key TUI backed by a persistent Chrome session.
## Commands
```bash
pip install -r requirements.txt # playwright, boto3, psycopg2-binary, Pillow, python-dotenv, simple-term-menu
python manga.py # launch the TUI (no CLI args)
```
No tests, no lint config, no build step. Requires Google Chrome or Chromium installed. The script auto-detects from `CHROME_CANDIDATES` (macOS/Linux/Windows paths). R2 and DB credentials load lazily — see `.env` section below.
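A minimal sketch of the auto-detection logic (the exact candidate paths in `manga.py` may differ; these are illustrative):

```python
import os
import sys

# Illustrative candidate list -- the real paths live in manga.py's CHROME_CANDIDATES.
CHROME_CANDIDATES = [
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",  # macOS
    "/usr/bin/google-chrome",                                        # Linux
    "/usr/bin/chromium-browser",                                     # Linux (Chromium)
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",        # Windows
]

def find_chrome() -> str:
    """Return the first existing candidate; exit listing tried paths if none found."""
    for path in CHROME_CANDIDATES:
        if os.path.exists(path):
            return path
    print("Chrome not found. Tried:")
    for path in CHROME_CANDIDATES:
        print(f"  {path}")
    sys.exit(1)
```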
## Architecture
### Anti-bot: real Chrome + CDP + persistent profile
Cloudflare fingerprints both the TLS handshake and the browser process. The anti-detection chain matters — changing any link breaks downloads:
1. **`subprocess.Popen(CHROME_PATH, ...)`** launches the user's real Chrome binary, not Playwright's Chromium. This gives a genuine TLS fingerprint.
2. **`connect_over_cdp`** attaches Playwright to Chrome via DevTools Protocol. Playwright never *launches* Chrome — only sends CDP commands to a separately-running process.
3. **Persistent `--user-data-dir=.browser-data`** preserves `cf_clearance` cookies between runs. After the user solves Cloudflare once (Setup menu), subsequent runs skip the challenge.
4. **Single session (`_session_singleton`)** — Chrome is lazy-started on first operation and reused across all commands in one `python manga.py` run. Closed only on Quit. `with_browser(func)` catches "target closed" / "disconnected" errors, resets the singleton, and retries once.
5. **`hide_chrome()`** runs `osascript -e 'tell application "System Events" to set visible of process "Google Chrome" to false'` after launch so the window doesn't steal focus. No-op on non-macOS.
**Do not switch to headless mode.** Tried — Cloudflare blocks it because the fingerprint differs from real Chrome. **Do not parallelize chapter work across threads** with Playwright's sync API — each thread would need its own event loop and crashes with "no running event loop".
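The launch half of the chain above might look roughly like this. `connect_over_cdp` is the real Playwright API; the flag values, port, and `CHROME_PATH` here are assumptions for illustration:

```python
import subprocess

CHROME_PATH = "/usr/bin/google-chrome"  # assumption: resolved from CHROME_CANDIDATES
CDP_PORT = 9222                         # assumption: any free port works

def chrome_argv(path: str = CHROME_PATH, port: int = CDP_PORT) -> list:
    """Command line for the user's real Chrome: persistent profile + open CDP port."""
    return [
        path,
        f"--remote-debugging-port={port}",  # lets Playwright attach via CDP
        "--user-data-dir=.browser-data",    # persists cf_clearance between runs
    ]

def launch_and_attach():
    """Launch the real Chrome binary, then attach -- Playwright never launches it."""
    from playwright.sync_api import sync_playwright
    proc = subprocess.Popen(chrome_argv())
    pw = sync_playwright().start()
    browser = pw.chromium.connect_over_cdp(f"http://localhost:{CDP_PORT}")
    return proc, pw, browser
```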
### Cloudflare handling
`wait_for_cloudflare(session)` polls `page.title()` and `page.url` for the "Just a moment" / `/challenge` markers. Recovery is manual: the user is shown the browser window and solves CAPTCHA. The Setup menu (`cmd_setup`) is the dedicated flow for this. During sync/check-missing, if the reading API returns 403, the script prints "CF blocked — run Setup" and stops.
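The polling shape can be sketched as follows, assuming only that the page object exposes `.title()` and `.url` as Playwright does (the real function takes the session and has its own timeouts):

```python
import time

def wait_for_cloudflare_sketch(page, timeout: float = 120.0, poll: float = 1.0) -> bool:
    """Poll title/URL for Cloudflare challenge markers; True once they clear."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        blocked = "just a moment" in page.title().lower() or "/challenge" in page.url
        if not blocked:
            return True
        time.sleep(poll)  # the user is solving the CAPTCHA in the visible window
    return False
```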
### Navigation: `page.goto` vs JS assignment
- **Manga listing page** (`/manga/<slug>`) uses `page.goto(..., wait_until="commit")`. Works because Cloudflare on this route is lenient.
- **Reader page** (`/mangaread/<slug>/<id>`) uses `page.evaluate("window.location.href = '...'")` — bypasses CF's detection of CDP `Page.navigate` for the stricter reader route.
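The two navigation styles side by side, as a sketch (URL shapes taken from the routes above):

```python
def goto_listing(page, slug: str):
    """Listing route tolerates CDP navigation, so page.goto is fine."""
    page.goto(f"https://m.happymh.com/manga/{slug}", wait_until="commit")

def goto_reader(page, slug: str, chapter_id: str):
    """Reader route: assign location from inside the page so CF never
    sees a CDP Page.navigate command."""
    url = f"https://m.happymh.com/mangaread/{slug}/{chapter_id}"
    page.evaluate(f"window.location.href = {url!r}")
```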
### Image pipeline (happymh)
Per chapter (in `_try_get_chapter_images`):
1. Register a response listener that matches `/apis/manga/reading` **AND** `cid=<chapter_id>` in the URL **AND** validates `data.id` in the response body matches. Drops pre-fetched neighbouring chapters.
2. Navigate the reader URL via `window.location.href` assignment.
3. DOM-count sanity check: `[class*="imgContainer"]` total minus `[class*="imgNext"]` gives the current chapter's actual page count. Trim captured list if it includes next-chapter previews.
4. `fetch_image_bytes(page, img)` runs `fetch(url)` via `page.evaluate` inside a `page.expect_response(...)` block. The body is read via CDP (`response.body()`) — zero base64 overhead. Fallback strips the `?q=50` query if the original URL fails.
5. `fetch_all_pages(page, images, max_attempts=3)` retries each failed page up to 3 times with 2s backoff between rounds. Returns `{page_num: bytes}` for successful fetches.
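Step 5's retry-in-rounds shape can be sketched generically. Here `fetch` is any callable `url -> bytes | None`; the real version wraps `fetch_image_bytes` and a Playwright page:

```python
import time

def fetch_all_pages_sketch(fetch, page_urls, max_attempts: int = 3, backoff: float = 2.0):
    """Retry failed pages in rounds with backoff; return {page_num: bytes} for successes."""
    results = {}
    pending = dict(enumerate(page_urls, start=1))  # {page_num: url}
    for attempt in range(max_attempts):
        if not pending:
            break
        if attempt:
            time.sleep(backoff)  # pause between retry rounds, not between pages
        for num, url in list(pending.items()):
            body = fetch(url)
            if body:
                results[num] = body
                del pending[num]
    return results
```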
### R2 + DB write ordering
**Page rows are inserted into the DB only after the R2 upload succeeds.** This prevents orphan DB records pointing to missing R2 objects. Every `INSERT INTO "Page"` includes `width` and `height` read from the JPEG/WebP bytes via PIL (`Image.open(...).width`).
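A sketch of the ordering contract, with the R2 upload and DB insert abstracted as callables (the column order and helper names here are illustrative, not the script's exact SQL):

```python
from io import BytesIO

def store_page(upload, insert_row, key: str, body: bytes, chapter_id: int, num: int):
    """Upload to R2 first; insert the Page row only if the upload succeeded."""
    from PIL import Image  # dims come from the bytes themselves, via Pillow
    upload(key, body)  # raises on failure -> no DB row is ever written
    with Image.open(BytesIO(body)) as img:
        width, height = img.width, img.height
    insert_row(chapter_id, num, key, width, height)
```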
### Storage layouts
```
# Local (download command)
manga-content/<slug>/detail.json # title, author, genres, description, mg-cover URL
manga-content/<slug>/cover.jpg # captured from page load traffic
manga-content/<slug>/<N> <chapter>/<page>.jpg
# R2 (upload / sync)
manga/<slug>/cover.webp
manga/<slug>/chapters/<N>/<page>.webp
```
Chapter order is the API's ascending index (1-based). Chapter names can repeat (announcements, extras) so the DB `Chapter.number` column uses this index, not parsed chapter titles.
### Menu actions
- **Setup** (`cmd_setup`) → brings Chrome to front, user solves CF, validates `cf_clearance` cookie.
- **Download** (`cmd_download`) → picks URL from `manga.json`, optional chapter multi-select; saves JPGs locally.
- **Upload** (`cmd_upload` → `upload_manga_to_r2`) → converts local JPGs → WebP, uploads to R2, writes DB rows.
- **Sync** (`cmd_sync`) → combined download+upload via RAM (no local files), refreshes `Manga` row metadata, only inserts chapters missing from DB.
- **R2 / DB management** submenu (`tui_r2_manage`):
  - **Status** — single-pass R2 object count grouped by slug, plus DB row counts
  - **Edit manga info** (`tui_edit_manga`) — title/description/genre/status/coverUrl
  - **Delete specific manga** — R2 prefix + cascade DB delete
  - **Delete specific chapter** (`tui_delete_chapter`) — multi-select or "All chapters"
  - **Check missing pages** (`tui_check_missing_pages`) — for each chapter: if site page count ≠ R2 count, re-upload **inline** (browser still on that reader page); if counts match but DB `width`/`height` are NULL or 0, fix by reading WebP bytes from R2 (no re-upload)
  - **Clear ALL (R2 + DB)**
  - **Recompress manga** (`r2_recompress`) — re-encodes every WebP under `manga/<slug>/` at quality=65, overwrites in place
### WebP encoding
`_to_webp_bytes(img, quality=WEBP_QUALITY, method=6)` — `WEBP_QUALITY` is 75, and method=6 is the slowest/smallest preset. Covers use quality 80 via `make_cover` (crops to 400×560 aspect, then resizes). Resize-during-encode was explicitly removed — page originals' dimensions are preserved.
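A plausible shape for the encoder, assuming Pillow (the parameter names match Pillow's WebP writer; the real helper may differ in detail):

```python
from io import BytesIO
from PIL import Image

WEBP_QUALITY = 75  # pages; covers use 80 in make_cover

def to_webp_bytes(img: Image.Image, quality: int = WEBP_QUALITY) -> bytes:
    """Re-encode a PIL image as WebP. method=6 is the slowest/smallest preset.
    Dimensions are deliberately left untouched (no resize during encode)."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="WEBP", quality=quality, method=6)
    return buf.getvalue()
```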
### ESC to stop
`EscListener` puts stdin in cbreak mode (POSIX `termios`+`tty`) and runs a daemon thread listening for `\x1b`. Download/Upload/Sync check `esc.stop.is_set()` between chapters and cleanly exit. Restores terminal mode on `__exit__`. No-op on Windows (no termios) and when stdin isn't a TTY.
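A minimal sketch of the same pattern — cbreak mode plus a daemon reader thread, degrading to a no-op without `termios` or a TTY:

```python
import sys
import threading

try:
    import termios
    import tty  # POSIX only
except ImportError:  # Windows: listener becomes a no-op
    termios = tty = None

class EscListenerSketch:
    """Daemon thread that sets a flag on ESC; no-op without termios or a TTY."""

    def __init__(self):
        self.stop = threading.Event()
        self._fd = None
        self._saved = None

    def __enter__(self):
        if termios and sys.stdin.isatty():
            self._fd = sys.stdin.fileno()
            self._saved = termios.tcgetattr(self._fd)
            tty.setcbreak(self._fd)  # deliver keys immediately, without Enter
            threading.Thread(target=self._watch, daemon=True).start()
        return self

    def _watch(self):
        while not self.stop.is_set():
            if sys.stdin.read(1) == "\x1b":
                self.stop.set()

    def __exit__(self, *exc):
        if self._saved is not None:  # restore the terminal on exit
            termios.tcsetattr(self._fd, termios.TCSADRAIN, self._saved)
```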
### Lazy config loading
`_ensure_config()` is called at the start of each R2/DB helper. It reads required env vars and constructs the boto3 client on first use. If env vars are missing, it prints the missing list and `sys.exit(1)` — no KeyError traceback on import. `s3`, `BUCKET`, `PUBLIC_URL`, `DATABASE_URL` are module globals set by that call.
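The guard could be sketched like this (any name beyond those listed in the text is an assumption):

```python
import os
import sys

REQUIRED_VARS = ["R2_ACCOUNT_ID", "R2_ACCESS_KEY", "R2_SECRET_KEY",
                 "R2_BUCKET", "R2_PUBLIC_URL", "DATABASE_URL"]
_configured = False

def missing_vars(env=os.environ):
    """Names of required vars that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

def ensure_config():
    """Validate env and build clients on first use; later calls return immediately."""
    global _configured
    if _configured:
        return
    missing = missing_vars()
    if missing:
        print("Missing env vars: " + ", ".join(missing))
        sys.exit(1)  # friendly exit instead of a KeyError traceback
    # ... construct the boto3 client; set s3/BUCKET/PUBLIC_URL/DATABASE_URL globals ...
    _configured = True
```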
## Environment variables (.env)
```
R2_ACCOUNT_ID= # cloudflare account id
R2_ACCESS_KEY=
R2_SECRET_KEY=
R2_BUCKET=
R2_PUBLIC_URL= # e.g. https://pub-xxx.r2.dev (trailing slash stripped)
DATABASE_URL= # postgresql://user:pass@host:port/dbname
```
Missing any of these produces a friendly error on first R2/DB operation, not on import.
## DB schema expectations
The script reads/writes but does **not** create tables. Create them externally:
```sql
CREATE TABLE "Manga" (
id SERIAL PRIMARY KEY,
slug TEXT UNIQUE NOT NULL,
title TEXT NOT NULL,
description TEXT,
"coverUrl" TEXT,
genre TEXT, -- comma-joined list of all genres
status TEXT NOT NULL, -- PUBLISHED | DRAFT | HIDDEN
"createdAt" TIMESTAMPTZ NOT NULL,
"updatedAt" TIMESTAMPTZ NOT NULL
);
CREATE TABLE "Chapter" (
id SERIAL PRIMARY KEY,
"mangaId" INTEGER NOT NULL REFERENCES "Manga"(id),
number INTEGER NOT NULL, -- 1-based index from the API order
title TEXT NOT NULL,
UNIQUE ("mangaId", number)
);
CREATE TABLE "Page" (
id SERIAL PRIMARY KEY,
"chapterId" INTEGER NOT NULL REFERENCES "Chapter"(id),
number INTEGER NOT NULL, -- 1-based page number
"imageUrl" TEXT NOT NULL,
width INTEGER,
height INTEGER,
UNIQUE ("chapterId", number)
);
```
Column identifiers are camelCase and double-quoted — matching Prisma's default naming.
## Where to change what
| Task | Location |
|---|---|
| Add a new site | Extract happymh-specific bits: `fetch_chapters_via_api`, `fetch_chapters_from_dom`, `fetch_metadata`, `_try_get_chapter_images`, the `/mcover/` cover capture in `load_manga_page`, the reader URL shape. Keep Chrome/R2/DB/TUI as common. |
| New menu item | Add to `show_menu` list in `main` and dispatch in the `if idx == N:` ladder. For R2/DB ops, add to `tui_r2_manage`. |
| Tweak CF detection | `wait_for_cloudflare` / `_wait_for_cf_on_page` — edit the title/URL heuristics carefully; both ops check the same signals. |
| Change image quality | `WEBP_QUALITY` at top of file; cover quality is hard-coded 80 in `make_cover`. |
| Add a new Page-table column | Update all three `INSERT INTO "Page"` sites (`upload_manga_to_r2`, `cmd_sync`, `tui_check_missing_pages` re-upload branch) and the `SELECT ... FROM "Page"` in the dim-check query. |
| Change parallelism | `UPLOAD_WORKERS` for R2 uploads; do **not** introduce chapter-level threading (sync Playwright breaks). |
## Future: multi-site support
Current code is happymh-specific (selectors, API paths, URL patterns). To generalise, a site module would implement `fetch_chapters(page, slug)`, `get_chapter_images(page, slug, chapter_id)`, and `fetch_metadata(page)`, keeping the Chrome/R2/DB/TUI layer common.