yiekheng 9cb9b8c7fd Lazy config, cross-platform support, session recovery, doc accuracy
Code:
- Defer boto3 client and DATABASE_URL reads to first use via
  _ensure_config(). Missing .env now prints a friendly "Missing env
  vars" list and exits instead of KeyError on import.
- Auto-detect Chrome binary from CHROME_CANDIDATES (macOS/Linux/Windows
  paths). Friendly error listing tried paths if none found.
- Guard termios/tty imports; EscListener becomes a no-op on Windows.
- hide_chrome() is a no-op on non-macOS (osascript only works on Darwin).
- with_browser catches target-closed/disconnected errors, resets the
  session singleton, and retries once before raising.

Docs:
- Fix claim that page.goto is never used — manga listing uses
  page.goto, only reader pages use window.location.href.
- Correct AppleScript command (full tell-application form).
- Clarify "Check missing pages" flow — re-upload is inline; dim-only
  fix reads bytes from R2 without re-upload.
- Add CREATE TABLE statements for Manga/Chapter/Page so schema contract
  is explicit.
- Add "Where to change what" table mapping tasks to code locations.
- Document lazy config, cross-platform constraints, and anti-patterns
  (headless, thread parallelism).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:32:20 +08:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Single-file interactive toolkit (manga.py) that downloads manga from m.happymh.com, stores images in Cloudflare R2 as WebP, and writes metadata to PostgreSQL. Runs as an arrow-key TUI backed by a persistent Chrome session.

Commands

pip install -r requirements.txt   # playwright, boto3, psycopg2-binary, Pillow, python-dotenv, simple-term-menu
python manga.py                   # launch the TUI (no CLI args)

No tests, no lint config, no build step. Requires Google Chrome or Chromium installed. The script auto-detects from CHROME_CANDIDATES (macOS/Linux/Windows paths). R2 and DB credentials load lazily — see .env section below.

Architecture

Anti-bot: real Chrome + CDP + persistent profile

Cloudflare fingerprints both the TLS handshake and the browser process. The anti-detection chain matters — changing any link breaks downloads:

  1. subprocess.Popen(CHROME_PATH, ...) launches the user's real Chrome binary, not Playwright's Chromium. This gives a genuine TLS fingerprint.
  2. connect_over_cdp attaches Playwright to Chrome via DevTools Protocol. Playwright never launches Chrome — only sends CDP commands to a separately-running process.
  3. Persistent --user-data-dir=.browser-data preserves cf_clearance cookies between runs. After the user solves Cloudflare once (Setup menu), subsequent runs skip the challenge.
  4. Single session (_session_singleton) — Chrome is lazy-started on first operation and reused across all commands in one python manga.py run. Closed only on Quit. with_browser(func) catches "target closed" / "disconnected" errors, resets the singleton, and retries once.
  5. hide_chrome() runs osascript -e 'tell application "System Events" to set visible of process "Google Chrome" to false' after launch so the window doesn't steal focus. No-op on non-macOS.

Do not switch to headless mode. Tried — Cloudflare blocks it because the fingerprint differs from real Chrome. Do not parallelize chapter work across threads with Playwright's sync API — each thread would need its own event loop and crashes with "no running event loop".
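A minimal sketch of the launch-then-attach chain (steps 1–3): the command vector is built separately so it can be inspected, and the Playwright import is deferred so nothing heavy happens until attach time. `CDP_PORT` and the helper names are illustrative, not the script's actual identifiers.

```python
import subprocess

CDP_PORT = 9222  # assumption: any free port passed to --remote-debugging-port

def build_chrome_cmd(chrome_path, user_data_dir=".browser-data"):
    """Argv for launching the user's real Chrome with a persistent
    profile (step 3) and an open CDP port."""
    return [
        chrome_path,
        f"--remote-debugging-port={CDP_PORT}",
        f"--user-data-dir={user_data_dir}",
    ]

def launch_and_attach(chrome_path):
    """Step 2: Playwright never launches Chrome, it only attaches to
    the separately running process over CDP."""
    proc = subprocess.Popen(build_chrome_cmd(chrome_path))
    from playwright.sync_api import sync_playwright  # deferred import
    pw = sync_playwright().start()
    browser = pw.chromium.connect_over_cdp(f"http://localhost:{CDP_PORT}")
    return proc, pw, browser
```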

Cloudflare handling

wait_for_cloudflare(session) polls page.title() and page.url for the "Just a moment" / /challenge markers. Recovery is manual: the user is shown the browser window and solves CAPTCHA. The Setup menu (cmd_setup) is the dedicated flow for this. During sync/check-missing, if the reading API returns 403, the script prints "CF blocked — run Setup" and stops.
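The polling described above reduces to a small predicate plus a loop; this is a sketch of that shape, not the script's exact heuristics. `page.title()` is a method and `page.url` a property, matching Playwright's sync API.

```python
import time

def looks_like_cf_challenge(title: str, url: str) -> bool:
    """The two markers the doc says wait_for_cloudflare polls for."""
    return "just a moment" in title.lower() or "/challenge" in url

def wait_for_cloudflare_sketch(page, poll_s=1.0, timeout_s=300):
    """Poll until the challenge markers clear (the user solves the
    CAPTCHA manually in the visible Chrome window)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not looks_like_cf_challenge(page.title(), page.url):
            return True
        time.sleep(poll_s)
    return False
```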

Navigation: page.goto vs JS assignment

  • Manga listing page (/manga/<slug>) uses page.goto(..., wait_until="commit"). Works because Cloudflare on this route is lenient.
  • Reader page (/mangaread/<slug>/<id>) uses page.evaluate("window.location.href = '...'") — bypasses CF's detection of CDP Page.navigate for the stricter reader route.
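The two navigation styles, side by side. URL shapes follow the route patterns quoted above; the helper names are hypothetical.

```python
def goto_listing(page, slug):
    """Lenient route: plain CDP navigation is fine here."""
    page.goto(f"https://m.happymh.com/manga/{slug}", wait_until="commit")

def goto_reader(page, slug, chapter_id):
    """Strict route: in-page JS assignment avoids CDP Page.navigate,
    which CF detects on the reader route."""
    page.evaluate(
        f"window.location.href = 'https://m.happymh.com/mangaread/{slug}/{chapter_id}'"
    )
```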

Image pipeline (happymh)

Per chapter (in _try_get_chapter_images):

  1. Register a response listener that matches /apis/manga/reading AND cid=<chapter_id> in the URL AND validates data.id in the response body matches. Drops pre-fetched neighbouring chapters.
  2. Navigate the reader URL via window.location.href assignment.
  3. DOM-count sanity check: [class*="imgContainer"] total minus [class*="imgNext"] gives the current chapter's actual page count. Trim captured list if it includes next-chapter previews.
  4. fetch_image_bytes(page, img) runs fetch(url) via page.evaluate inside a page.expect_response(...) block. The body is read via CDP (response.body()) — zero base64 overhead. Fallback strips the ?q=50 query if the original URL fails.
  5. fetch_all_pages(page, images, max_attempts=3) retries each failed page up to 3 times with 2s backoff between rounds. Returns {page_num: bytes} for successful fetches.
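Steps 3 and 5 are pure logic once the fetch itself is abstracted away; a sketch with an injected `fetch_one` callable (returns bytes, or `None` on failure) in place of the real CDP fetch:

```python
import time

def trim_prefetch(images, container_count, img_next_count):
    """Step 3: DOM count minus next-chapter previews gives the real
    page count; drop any extra captured URLs."""
    actual_pages = container_count - img_next_count
    return images[:actual_pages]

def fetch_all_pages_sketch(images, fetch_one, max_attempts=3, backoff_s=2):
    """Step 5: retry failed pages per round, with backoff between
    rounds; returns {page_num: bytes} for successes only."""
    got, pending = {}, list(enumerate(images, start=1))
    for _ in range(max_attempts):
        still_failed = []
        for num, url in pending:
            data = fetch_one(url)
            if data is None:
                still_failed.append((num, url))
            else:
                got[num] = data
        pending = still_failed
        if not pending:
            break
        time.sleep(backoff_s)
    return got
```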

R2 + DB write ordering

Page rows are inserted into the DB only after the R2 upload succeeds. This prevents orphan DB records pointing to missing R2 objects. Every INSERT INTO "Page" includes width and height read from the JPEG/WebP bytes via PIL (Image.open(...).width).
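The ordering invariant can be captured in one hypothetical helper: the DB write is only reachable after the upload returns. (In the real script, `width`/`height` come from PIL's `Image.open` on the bytes; here they are passed in.)

```python
def store_page(upload_to_r2, insert_page_row, key, img_bytes, width, height):
    """Upload first; insert the Page row only if the upload succeeded,
    so a failed upload never leaves an orphan DB record."""
    upload_to_r2(key, img_bytes)         # raises on failure
    insert_page_row(key, width, height)  # reached only after success
```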

Storage layouts

# Local (download command)
manga-content/<slug>/detail.json       # title, author, genres, description, mg-cover URL
manga-content/<slug>/cover.jpg         # captured from page load traffic
manga-content/<slug>/<N> <chapter>/<page>.jpg

# R2 (upload / sync)
manga/<slug>/cover.webp
manga/<slug>/chapters/<N>/<page>.webp

Chapter order is the API's ascending index (1-based). Chapter names can repeat (announcements, extras) so the DB Chapter.number column uses this index, not parsed chapter titles.

Menu actions

  • Setup (cmd_setup) → brings Chrome to front, user solves CF, validates cf_clearance cookie.
  • Download (cmd_download) → picks URL from manga.json, optional chapter multi-select; saves JPGs locally.
  • Upload (cmd_upload → upload_manga_to_r2) → converts local JPGs → WebP, uploads to R2, writes DB rows.

  • Sync (cmd_sync) → combined download+upload via RAM (no local files), refreshes Manga row metadata, only inserts chapters missing from DB.
  • R2 / DB management submenu (tui_r2_manage):
    • Status — single-pass R2 object count grouped by slug, plus DB row counts
    • Edit manga info (tui_edit_manga) — title/description/genre/status/coverUrl
    • Delete specific manga — R2 prefix + cascade DB delete
    • Delete specific chapter (tui_delete_chapter) — multi-select or "All chapters"
    • Check missing pages (tui_check_missing_pages) — for each chapter: if site page count ≠ R2 count, re-upload inline (browser still on that reader page); if counts match but DB width/height are NULL or 0, fix by reading WebP bytes from R2 (no re-upload)
    • Clear ALL (R2 + DB)
    • Recompress manga (r2_recompress) — re-encodes every WebP under manga/<slug>/ at quality=65, overwrites in place

WebP encoding

_to_webp_bytes(img, quality=WEBP_QUALITY=75, method=6) — method=6 is the slowest/smallest preset. Covers use quality 80 via make_cover (crops to 400×560 aspect, then resizes). Resize-during-encode was explicitly removed — page originals' dimensions are preserved.
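The cover's crop-to-aspect step is plain arithmetic; a stdlib-only sketch of a centered crop box for the 400×560 aspect (the exact crop placement in `make_cover` is assumed, not confirmed):

```python
def cover_crop_box(w, h, target_w=400, target_h=560):
    """Centered crop box matching the target aspect ratio; the real
    make_cover would then resize the cropped region."""
    target = target_w / target_h
    if w / h > target:  # too wide: trim left/right
        new_w = round(h * target)
        left = (w - new_w) // 2
        return (left, 0, left + new_w, h)
    new_h = round(w / target)  # too tall: trim top/bottom
    top = (h - new_h) // 2
    return (0, top, w, top + new_h)
```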

ESC to stop

EscListener puts stdin in cbreak mode (POSIX termios+tty) and runs a daemon thread listening for \x1b. Download/Upload/Sync check esc.stop.is_set() between chapters and cleanly exit. Restores terminal mode on __exit__. No-op on Windows (no termios) and when stdin isn't a TTY.
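A sketch of that degrade-gracefully shape: the listener only touches the terminal when termios is importable and stdin is a real TTY, so the same class is a no-op on Windows and under pipes. Class and attribute names mirror the description above but are not the script's exact code.

```python
import sys
import threading

class EscListenerSketch:
    """Cbreak-mode ESC watcher; no-op without termios or a TTY."""

    def __init__(self):
        self.stop = threading.Event()
        try:
            import termios, tty  # POSIX only; absent on Windows
            self._termios, self._tty = termios, tty
        except ImportError:
            self._termios = None

    def __enter__(self):
        if self._termios and sys.stdin.isatty():
            fd = sys.stdin.fileno()
            self._saved = self._termios.tcgetattr(fd)
            self._tty.setcbreak(fd)  # deliver keys without Enter
            threading.Thread(target=self._watch, daemon=True).start()
        return self

    def _watch(self):
        while not self.stop.is_set():
            if sys.stdin.read(1) == "\x1b":  # ESC
                self.stop.set()

    def __exit__(self, *exc):
        if self._termios and sys.stdin.isatty():
            self._termios.tcsetattr(
                sys.stdin.fileno(), self._termios.TCSADRAIN, self._saved)
```

Long-running loops would then check `esc.stop.is_set()` between chapters, exactly as the commands above do.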

Lazy config loading

_ensure_config() is called at the start of each R2/DB helper. It reads required env vars and constructs the boto3 client on first use. If env vars are missing, it prints the missing list and sys.exit(1) — no KeyError traceback on import. s3, BUCKET, PUBLIC_URL, DATABASE_URL are module globals set by that call.
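The check itself is simple enough to sketch with the stdlib; variable names match the .env section below, the function names are illustrative:

```python
import os
import sys

REQUIRED_VARS = [
    "R2_ACCOUNT_ID", "R2_ACCESS_KEY", "R2_SECRET_KEY",
    "R2_BUCKET", "R2_PUBLIC_URL", "DATABASE_URL",
]

def missing_env(env=None):
    """Names of required vars that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_VARS if not env.get(k)]

def ensure_config_sketch():
    """Called at the top of each R2/DB helper, never at import time."""
    missing = missing_env()
    if missing:
        print("Missing env vars: " + ", ".join(missing))
        sys.exit(1)
    # a first successful call would construct the boto3 client here and
    # set the s3 / BUCKET / PUBLIC_URL / DATABASE_URL module globals
```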

Environment variables (.env)

R2_ACCOUNT_ID=        # cloudflare account id
R2_ACCESS_KEY=
R2_SECRET_KEY=
R2_BUCKET=
R2_PUBLIC_URL=        # e.g. https://pub-xxx.r2.dev (trailing slash stripped)
DATABASE_URL=         # postgresql://user:pass@host:port/dbname

Missing any of these produces a friendly error on first R2/DB operation, not on import.

DB schema expectations

The script reads/writes but does not create tables. Create them externally:

CREATE TABLE "Manga" (
  id            SERIAL PRIMARY KEY,
  slug          TEXT UNIQUE NOT NULL,
  title         TEXT NOT NULL,
  description   TEXT,
  "coverUrl"    TEXT,
  genre         TEXT,                        -- comma-joined list of all genres
  status        TEXT NOT NULL,               -- PUBLISHED | DRAFT | HIDDEN
  "createdAt"   TIMESTAMPTZ NOT NULL,
  "updatedAt"   TIMESTAMPTZ NOT NULL
);

CREATE TABLE "Chapter" (
  id          SERIAL PRIMARY KEY,
  "mangaId"   INTEGER NOT NULL REFERENCES "Manga"(id),
  number      INTEGER NOT NULL,              -- 1-based index from the API order
  title       TEXT NOT NULL,
  UNIQUE ("mangaId", number)
);

CREATE TABLE "Page" (
  id          SERIAL PRIMARY KEY,
  "chapterId" INTEGER NOT NULL REFERENCES "Chapter"(id),
  number      INTEGER NOT NULL,              -- 1-based page number
  "imageUrl"  TEXT NOT NULL,
  width       INTEGER,
  height      INTEGER,
  UNIQUE ("chapterId", number)
);

Column identifiers are camelCase with double quotes — matches Prisma default naming.

Where to change what

  • Add a new site → extract the happymh-specific bits (fetch_chapters_via_api, fetch_chapters_from_dom, fetch_metadata, _try_get_chapter_images, the /mcover/ cover capture in load_manga_page, the reader URL shape); keep Chrome/R2/DB/TUI as common code.
  • New menu item → add to the show_menu list in main and dispatch in the if idx == N: ladder. For R2/DB ops, add to tui_r2_manage instead.
  • Tweak CF detection → wait_for_cloudflare / _wait_for_cf_on_page; edit the title/URL heuristics carefully, since both functions check the same signals.
  • Change image quality → WEBP_QUALITY at the top of the file; cover quality is hard-coded to 80 in make_cover.
  • Add a new Page-table column → update all three INSERT INTO "Page" sites (upload_manga_to_r2, cmd_sync, the tui_check_missing_pages re-upload branch) and the SELECT ... FROM "Page" in the dim-check query.
  • Change parallelism → UPLOAD_WORKERS for R2 uploads; do not introduce chapter-level threading (sync Playwright breaks).

Future: multi-site support

Current code is happymh-specific (selectors, API paths, URL patterns). To generalise, a site module would implement fetch_chapters(page, slug), get_chapter_images(page, slug, chapter_id), and fetch_metadata(page), keeping the Chrome/R2/DB/TUI layer common.
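The interface above can be pinned down as a structural type, so a future site module conforms by shape rather than by inheritance. A sketch using `typing.Protocol`; method names come from the note above, everything else is hypothetical:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class MangaSite(Protocol):
    """What a per-site module would implement; Chrome/R2/DB/TUI stay common."""
    def fetch_chapters(self, page, slug): ...
    def get_chapter_images(self, page, slug, chapter_id): ...
    def fetch_metadata(self, page): ...

class HappymhSite:
    """Stub showing structural conformance only (no real logic)."""
    def fetch_chapters(self, page, slug):
        return []
    def get_chapter_images(self, page, slug, chapter_id):
        return []
    def fetch_metadata(self, page):
        return {}
```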