# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

Single-file interactive toolkit (`manga.py`) that downloads manga from m.happymh.com, stores images in Cloudflare R2 as WebP, and writes metadata to PostgreSQL. Runs as an arrow-key TUI backed by a persistent Chrome session.
## Commands

```bash
pip install -r requirements.txt   # playwright, boto3, psycopg2-binary, Pillow, python-dotenv, simple-term-menu
python manga.py                   # launch the TUI (no CLI args)
```

No tests, no lint config, no build step. Requires Google Chrome or Chromium installed; the script auto-detects the binary from `CHROME_CANDIDATES` (macOS/Linux/Windows paths). R2 and DB credentials load lazily — see the `.env` section below.
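The candidate scan can be sketched as a first-match walk over known install paths. The paths below are illustrative; the actual `CHROME_CANDIDATES` list in `manga.py` is authoritative.

```python
from pathlib import Path

# Illustrative candidates -- the real list in manga.py covers more paths.
CHROME_CANDIDATES = [
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",  # macOS
    "/usr/bin/google-chrome",                                        # Linux
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",        # Windows
]

def find_chrome(candidates=CHROME_CANDIDATES):
    """Return the first existing binary; raise with the tried paths otherwise."""
    for path in candidates:
        if Path(path).is_file():
            return path
    tried = "\n  ".join(candidates)
    raise FileNotFoundError(f"No Chrome/Chromium found. Tried:\n  {tried}")
```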
## Architecture

### Anti-bot: real Chrome + CDP + persistent profile

Cloudflare fingerprints both the TLS handshake and the browser process. The anti-detection chain matters — changing any link breaks downloads:
- `subprocess.Popen(CHROME_PATH, ...)` launches the user's real Chrome binary, not Playwright's Chromium. This gives a genuine TLS fingerprint.
- `connect_over_cdp` attaches Playwright to Chrome via the DevTools Protocol. Playwright never launches Chrome — it only sends CDP commands to a separately running process.
- Persistent `--user-data-dir=.browser-data` preserves `cf_clearance` cookies between runs. After the user solves Cloudflare once (Setup menu), subsequent runs skip the challenge.
- Single session (`_session_singleton`) — Chrome is lazy-started on first operation and reused across all commands in one `python manga.py` run. Closed only on Quit.
- `with_browser(func)` catches "target closed" / "disconnected" errors, resets the singleton, and retries once.
- `hide_chrome()` runs `osascript -e 'tell application "System Events" to set visible of process "Google Chrome" to false'` after launch so the window doesn't steal focus. No-op on non-macOS.
**Do not switch to headless mode.** Tried it — Cloudflare blocks headless because its fingerprint differs from real Chrome's. **Do not parallelize chapter work across threads** with Playwright's sync API — each thread would need its own event loop, and the calls crash with "no running event loop".
### Cloudflare handling

`wait_for_cloudflare(session)` polls `page.title()` and `page.url` for the "Just a moment" / `/challenge` markers. Recovery is manual: the user is shown the browser window and solves the CAPTCHA. The Setup menu (`cmd_setup`) is the dedicated flow for this. During sync/check-missing, if the reading API returns 403, the script prints "CF blocked — run Setup" and stops.
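The polling loop is roughly this shape. A sketch: the timeout and poll interval are made-up values, and only the "Just a moment" / `/challenge` markers come from the code described above.

```python
import time

CF_TITLE_MARKER = "just a moment"
CF_URL_MARKER = "/challenge"

def looks_like_challenge(title, url):
    """True while the Cloudflare interstitial is showing."""
    return CF_TITLE_MARKER in title.lower() or CF_URL_MARKER in url

def wait_for_cloudflare(page, timeout=120.0, poll=1.0):
    """Poll until the challenge markers disappear (the user solves it manually).
    Returns False if the challenge is still up when the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not looks_like_challenge(page.title(), page.url):
            return True
        time.sleep(poll)
    return False
```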
### Navigation: page.goto vs JS assignment

- Manga listing page (`/manga/<slug>`) uses `page.goto(..., wait_until="commit")`. Works because Cloudflare on this route is lenient.
- Reader page (`/mangaread/<slug>/<id>`) uses `page.evaluate("window.location.href = '...'")` — bypasses CF's detection of CDP `Page.navigate` on the stricter reader route.
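A hypothetical helper capturing the split — the real code inlines these calls; `navigate` is not a function in `manga.py`:

```python
def navigate(page, url, reader=False):
    """Listing pages go through page.goto; reader pages use a JS assignment
    so Cloudflare never sees a CDP Page.navigate on the strict route."""
    if reader:
        page.evaluate(f"window.location.href = {url!r}")
    else:
        page.goto(url, wait_until="commit")
```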
### Image pipeline (happymh)

Per chapter (in `_try_get_chapter_images`):
- Register a response listener that matches `/apis/manga/reading` AND `cid=<chapter_id>` in the URL, AND validates that `data.id` in the response body matches. Drops pre-fetched neighbouring chapters.
- Navigate to the reader URL via `window.location.href` assignment.
- DOM-count sanity check: `[class*="imgContainer"]` total minus `[class*="imgNext"]` gives the current chapter's actual page count. Trim the captured list if it includes next-chapter previews.
- `fetch_image_bytes(page, img)` runs `fetch(url)` via `page.evaluate` inside a `page.expect_response(...)` block. The body is read via CDP (`response.body()`) — zero base64 overhead. Fallback strips the `?q=50` query if the original URL fails.
- `fetch_all_pages(page, images, max_attempts=3)` retries each failed page up to 3 times with a 2s backoff between rounds. Returns `{page_num: bytes}` for successful fetches.
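The round-based retry in `fetch_all_pages` can be sketched with the per-image fetcher injected. The signature here is simplified — the real function drives `fetch_image_bytes` through the page object:

```python
import time

def fetch_all_pages(images, fetch_one, max_attempts=3, backoff=2.0):
    """Retry failed pages in rounds; return {page_num: bytes} for successes."""
    results, pending = {}, dict(enumerate(images, start=1))
    for attempt in range(max_attempts):
        if not pending:
            break
        if attempt:
            time.sleep(backoff)  # pause between rounds, not between pages
        failed = {}
        for page_num, img in pending.items():
            try:
                results[page_num] = fetch_one(img)
            except Exception:
                failed[page_num] = img  # queue for the next round
        pending = failed
    return results
```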
### R2 + DB write ordering

Page rows are inserted into the DB only after the R2 upload succeeds. This prevents orphan DB records pointing to missing R2 objects. Every `INSERT INTO "Page"` includes `width` and `height`, read from the JPEG/WebP bytes via PIL (`Image.open(...).width`).
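The ordering invariant, with the R2 upload and DB insert as injected stand-ins — a sketch, not the actual helper names in `manga.py`:

```python
def store_page(page_num, img_bytes, upload, insert_row, read_dims):
    """Upload to R2 first; only on success write the Page row (with width/height).
    If the upload raises, no DB row is created, so no orphan records exist."""
    key = upload(page_num, img_bytes)      # raises on failure -> no insert happens
    width, height = read_dims(img_bytes)   # PIL: Image.open(...).width / .height
    insert_row(page_num, key, width, height)
    return key
```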
### Storage layouts

```text
# Local (download command)
manga-content/<slug>/detail.json        # title, author, genres, description, mg-cover URL
manga-content/<slug>/cover.jpg          # captured from page load traffic
manga-content/<slug>/<N> <chapter>/<page>.jpg

# R2 (upload / sync)
manga/<slug>/cover.webp
manga/<slug>/chapters/<N>/<page>.webp
```
Chapter order is the API's ascending index (1-based). Chapter names can repeat (announcements, extras), so the DB `Chapter.number` column uses this index, not parsed chapter titles.
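In code terms the numbering is just the 1-based enumeration of the API list (a sketch; `chapter_rows` is not a real helper in `manga.py`):

```python
def chapter_rows(api_chapters):
    """Number chapters by ascending API position (1-based). Duplicate titles
    ('Announcement', 'Extra') stay distinct because number is the key."""
    return [(number, ch["title"]) for number, ch in enumerate(api_chapters, start=1)]
```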
## Menu actions

- Setup (`cmd_setup`) → brings Chrome to front, user solves CF, validates the `cf_clearance` cookie.
- Download (`cmd_download`) → picks a URL from `manga.json`, optional chapter multi-select; saves JPGs locally.
- Upload (`cmd_upload` → `upload_manga_to_r2`) → converts local JPGs to WebP, uploads to R2, writes DB rows.
- Sync (`cmd_sync`) → combined download+upload via RAM (no local files), refreshes `Manga` row metadata, only inserts chapters missing from the DB.
- R2 / DB management submenu (`tui_r2_manage`):
  - Status — single-pass R2 object count grouped by slug, plus DB row counts
  - Edit manga info (`tui_edit_manga`) — title/description/genre/status/coverUrl
  - Delete specific manga — R2 prefix + cascade DB delete
  - Delete specific chapter (`tui_delete_chapter`) — multi-select or "All chapters"
  - Check missing pages (`tui_check_missing_pages`) — for each chapter: if the site page count ≠ the R2 count, re-upload inline (the browser is still on that reader page); if the counts match but DB `width`/`height` are NULL or 0, fix by reading the WebP bytes from R2 (no re-upload)
  - Clear ALL (R2 + DB)
  - Recompress manga (`r2_recompress`) — re-encodes every WebP under `manga/<slug>/` at quality=65, overwrites in place
## WebP encoding

`_to_webp_bytes(img, quality=WEBP_QUALITY, method=6)` — `WEBP_QUALITY` is 75, and `method=6` is the slowest/smallest encoder preset. Covers use quality 80 via `make_cover` (crops to a 400×560 aspect, then resizes). Resize-during-encode was explicitly removed — page originals keep their dimensions.
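The crop-to-aspect step can be sketched as centered 5:7 box arithmetic. This is an assumption about positioning — the real `make_cover` may place the crop differently:

```python
def cover_crop_box(width, height, target_w=400, target_h=560):
    """Centered crop box matching the 400x560 (5:7) cover aspect ratio,
    as a PIL-style (left, top, right, bottom) tuple."""
    target_ratio = target_w / target_h
    if width / height > target_ratio:        # too wide: trim left/right
        new_w = round(height * target_ratio)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    new_h = round(width / target_ratio)      # too tall: trim top/bottom
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)
```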
## ESC to stop

`EscListener` puts stdin in cbreak mode (POSIX `termios` + `tty`) and runs a daemon thread listening for `\x1b`. Download/Upload/Sync check `esc.stop.is_set()` between chapters and exit cleanly. The terminal mode is restored on `__exit__`. No-op on Windows (no `termios`) and when stdin isn't a TTY.
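A minimal sketch of the guarded listener, simplified from the real `EscListener`:

```python
import sys
import threading

try:
    import termios, tty          # POSIX only
except ImportError:              # Windows: listener degrades to a no-op
    termios = tty = None

class EscListener:
    """Daemon thread that sets `stop` when ESC is pressed.
    No-op when termios is unavailable or stdin is not a TTY."""
    def __init__(self):
        self.stop = threading.Event()
        self._active = termios is not None and sys.stdin.isatty()
        self._saved = None

    def __enter__(self):
        if self._active:
            fd = sys.stdin.fileno()
            self._saved = termios.tcgetattr(fd)
            tty.setcbreak(fd)    # keystrokes arrive without Enter
            threading.Thread(target=self._watch, daemon=True).start()
        return self

    def _watch(self):
        # Daemon thread: dies with the process even if blocked on read().
        while not self.stop.is_set():
            if sys.stdin.read(1) == "\x1b":
                self.stop.set()

    def __exit__(self, *exc):
        if self._active and self._saved is not None:
            termios.tcsetattr(sys.stdin.fileno(), termios.TCSADRAIN, self._saved)
```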
## Lazy config loading

`_ensure_config()` is called at the start of each R2/DB helper. It reads the required env vars and constructs the boto3 client on first use. If env vars are missing, it prints the missing list and calls `sys.exit(1)` — no `KeyError` traceback on import. `s3`, `BUCKET`, `PUBLIC_URL`, and `DATABASE_URL` are module globals set by that call.
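The missing-vars check can be sketched as a pure function. Simplified: the real `_ensure_config` also builds the boto3 client and sets the module globals.

```python
import os
import sys

REQUIRED_VARS = ("R2_ACCOUNT_ID", "R2_ACCESS_KEY", "R2_SECRET_KEY",
                 "R2_BUCKET", "R2_PUBLIC_URL", "DATABASE_URL")

def missing_vars(env=os.environ, required=REQUIRED_VARS):
    """Names that are absent or empty, in declaration order."""
    return [name for name in required if not env.get(name)]

def ensure_config(env=os.environ):
    missing = missing_vars(env)
    if missing:
        print("Missing env vars: " + ", ".join(missing))
        sys.exit(1)
    # ...construct the boto3 client, set module globals, etc.
```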
## Environment variables (.env)

```ini
R2_ACCOUNT_ID=     # cloudflare account id
R2_ACCESS_KEY=
R2_SECRET_KEY=
R2_BUCKET=
R2_PUBLIC_URL=     # e.g. https://pub-xxx.r2.dev (trailing slash stripped)
DATABASE_URL=      # postgresql://user:pass@host:port/dbname
```
Missing any of these produces a friendly error on first R2/DB operation, not on import.
## DB schema expectations

The script reads/writes but does not create tables. Create them externally:
```sql
CREATE TABLE "Manga" (
  id SERIAL PRIMARY KEY,
  slug TEXT UNIQUE NOT NULL,
  title TEXT NOT NULL,
  description TEXT,
  "coverUrl" TEXT,
  genre TEXT,              -- comma-joined list of all genres
  status TEXT NOT NULL,    -- PUBLISHED | DRAFT | HIDDEN
  "createdAt" TIMESTAMPTZ NOT NULL,
  "updatedAt" TIMESTAMPTZ NOT NULL
);

CREATE TABLE "Chapter" (
  id SERIAL PRIMARY KEY,
  "mangaId" INTEGER NOT NULL REFERENCES "Manga"(id),
  number INTEGER NOT NULL,   -- 1-based index from the API order
  title TEXT NOT NULL,
  UNIQUE ("mangaId", number)
);

CREATE TABLE "Page" (
  id SERIAL PRIMARY KEY,
  "chapterId" INTEGER NOT NULL REFERENCES "Chapter"(id),
  number INTEGER NOT NULL,   -- 1-based page number
  "imageUrl" TEXT NOT NULL,
  width INTEGER,
  height INTEGER,
  UNIQUE ("chapterId", number)
);
```
Column identifiers are camelCase and double-quoted — this matches Prisma's default naming.
## Where to change what

| Task | Location |
|---|---|
| Add a new site | Extract the happymh-specific bits: `fetch_chapters_via_api`, `fetch_chapters_from_dom`, `fetch_metadata`, `_try_get_chapter_images`, the `/mcover/` cover capture in `load_manga_page`, the reader URL shape. Keep Chrome/R2/DB/TUI as common. |
| New menu item | Add to the `show_menu` list in `main` and dispatch in the `if idx == N:` ladder. For R2/DB ops, add to `tui_r2_manage`. |
| Tweak CF detection | `wait_for_cloudflare` / `_wait_for_cf_on_page` — edit the title/URL heuristics carefully; both ops check the same signals. |
| Change image quality | `WEBP_QUALITY` at the top of the file; cover quality is hard-coded to 80 in `make_cover`. |
| Add a new Page-table column | Update all three `INSERT INTO "Page"` sites (`upload_manga_to_r2`, `cmd_sync`, the `tui_check_missing_pages` re-upload branch) and the `SELECT ... FROM "Page"` in the dim-check query. |
| Change parallelism | `UPLOAD_WORKERS` for R2 uploads; do not introduce chapter-level threading (sync Playwright breaks). |
## Future: multi-site support

Current code is happymh-specific (selectors, API paths, URL patterns). To generalise, a site module would implement `fetch_chapters(page, slug)`, `get_chapter_images(page, slug, chapter_id)`, and `fetch_metadata(page)`, keeping the Chrome/R2/DB/TUI layer common.
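One way to pin that contract down is a `typing.Protocol` — hypothetical; nothing in `manga.py` defines this yet, and `HappymhSite` below is a stub:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class MangaSite(Protocol):
    """Per-site interface; method names follow the sketch above."""
    def fetch_chapters(self, page, slug): ...
    def get_chapter_images(self, page, slug, chapter_id): ...
    def fetch_metadata(self, page): ...

class HappymhSite:
    """Stub showing the existing happymh code refactored behind the protocol."""
    def fetch_chapters(self, page, slug):
        return []        # would call fetch_chapters_via_api / _from_dom
    def get_chapter_images(self, page, slug, chapter_id):
        return []        # would call _try_get_chapter_images
    def fetch_metadata(self, page):
        return {}        # would call fetch_metadata
```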