Lazy config, cross-platform support, session recovery, doc accuracy

Code:
- Defer boto3 client and DATABASE_URL reads to first use via
  _ensure_config(). Missing .env now prints a friendly "Missing env
  vars" list and exits instead of KeyError on import.
- Auto-detect Chrome binary from CHROME_CANDIDATES (macOS/Linux/Windows
  paths). Friendly error listing tried paths if none found.
- Guard termios/tty imports; EscListener becomes a no-op on Windows.
- hide_chrome() is a no-op on non-macOS (osascript only works on Darwin).
- with_browser catches target-closed/disconnected errors, resets the
  session singleton, and retries once before raising.

Docs:
- Fix claim that page.goto is never used — manga listing uses
  page.goto, only reader pages use window.location.href.
- Correct AppleScript command (full tell-application form).
- Clarify "Check missing pages" flow — re-upload is inline; dim-only
  fix reads bytes from R2 without re-upload.
- Add CREATE TABLE statements for Manga/Chapter/Page so schema contract
  is explicit.
- Add "Where to change what" table mapping tasks to code locations.
- Document lazy config, cross-platform constraints, and anti-patterns
  (headless, thread parallelism).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yiekheng 2026-04-12 18:32:20 +08:00
parent 051b2e191f
commit 9cb9b8c7fd
2 changed files with 226 additions and 61 deletions

CLAUDE.md (173 lines changed)

@@ -4,64 +4,157 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

## Project Overview

Single-file interactive toolkit (`manga.py`) that downloads manga from m.happymh.com, stores images in Cloudflare R2 as WebP, and writes metadata to PostgreSQL. Runs as an arrow-key TUI backed by a persistent Chrome session.

## Commands
```bash
pip install -r requirements.txt # playwright, boto3, psycopg2-binary, Pillow, python-dotenv, simple-term-menu
python manga.py # launch the TUI (no CLI args)
```
No tests, no lint config, no build step. Requires Google Chrome or Chromium installed. The script auto-detects from `CHROME_CANDIDATES` (macOS/Linux/Windows paths). R2 and DB credentials load lazily — see `.env` section below.
## Architecture

### Anti-bot: real Chrome + CDP + persistent profile

Cloudflare fingerprints both the TLS handshake and the browser process. The anti-detection chain matters — changing any link breaks downloads:

1. **`subprocess.Popen(CHROME_PATH, ...)`** launches the user's real Chrome binary, not Playwright's Chromium. This gives a genuine TLS fingerprint.
2. **`connect_over_cdp`** attaches Playwright to Chrome via the DevTools Protocol. Playwright never *launches* Chrome — it only sends CDP commands to a separately running process.
3. **Persistent `--user-data-dir=.browser-data`** preserves `cf_clearance` cookies between runs. After the user solves Cloudflare once (Setup menu), subsequent runs skip the challenge.
4. **Single session (`_session_singleton`)** — Chrome is lazy-started on the first operation and reused across all commands in one `python manga.py` run; it is closed only on Quit. `with_browser(func)` catches "target closed" / "disconnected" errors, resets the singleton, and retries once.
5. **`hide_chrome()`** runs `osascript -e 'tell application "System Events" to set visible of process "Google Chrome" to false'` after launch so the window doesn't steal focus. No-op on non-macOS.

**Do not switch to headless mode.** This was tried — Cloudflare blocks it because the fingerprint differs from real Chrome. **Do not parallelize chapter work across threads** with Playwright's sync API — each thread would need its own event loop and crashes with "no running event loop".
### Cloudflare handling
`wait_for_cloudflare(session)` polls `page.title()` and `page.url` for the "Just a moment" / `/challenge` markers. Recovery is manual: the user is shown the browser window and solves CAPTCHA. The Setup menu (`cmd_setup`) is the dedicated flow for this. During sync/check-missing, if the reading API returns 403, the script prints "CF blocked — run Setup" and stops.
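The title/URL heuristic described above can be sketched as a small predicate. This is an illustration only — the marker strings are assumptions taken from this doc, not the real `wait_for_cloudflare` source:

```python
def looks_like_cf_challenge(title: str, url: str) -> bool:
    # "Just a moment..." interstitial title or a /challenge URL marks a pending check.
    return "just a moment" in title.lower() or "/challenge" in url
```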
### Navigation: `page.goto` vs JS assignment
- **Manga listing page** (`/manga/<slug>`) uses `page.goto(..., wait_until="commit")`. Works because Cloudflare on this route is lenient.
- **Reader page** (`/mangaread/<slug>/<id>`) uses `page.evaluate("window.location.href = '...'")` — bypasses CF's detection of CDP `Page.navigate` for the stricter reader route.
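The per-route split above amounts to a dispatch like this (a hypothetical helper for illustration — the real script inlines these calls):

```python
def navigate(page, url: str) -> None:
    if "/mangaread/" in url:
        # Reader route: JS assignment, not CDP Page.navigate.
        page.evaluate(f"window.location.href = {url!r}")
    else:
        # Listing route: plain goto, returning as soon as navigation commits.
        page.goto(url, wait_until="commit")
```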
### Image pipeline (happymh)
Per chapter (in `_try_get_chapter_images`):
1. Register a response listener that matches `/apis/manga/reading` **AND** `cid=<chapter_id>` in the URL **AND** validates `data.id` in the response body matches. Drops pre-fetched neighbouring chapters.
2. Navigate the reader URL via `window.location.href` assignment.
3. DOM-count sanity check: `[class*="imgContainer"]` total minus `[class*="imgNext"]` gives the current chapter's actual page count. Trim captured list if it includes next-chapter previews.
4. `fetch_image_bytes(page, img)` runs `fetch(url)` via `page.evaluate` inside a `page.expect_response(...)` block. The body is read via CDP (`response.body()`) — zero base64 overhead. Fallback strips the `?q=50` query if the original URL fails.
5. `fetch_all_pages(page, images, max_attempts=3)` retries each failed page up to 3 times with 2s backoff between rounds. Returns `{page_num: bytes}` for successful fetches.
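The retry loop in step 5 reduces to something like the sketch below. The `fetch` callable is injected here for illustration; the real function takes `(page, images)` and drives the browser:

```python
import time

def fetch_all_pages(fetch, images, max_attempts=3, backoff=2.0):
    """Return {page_num: bytes} for every page `fetch` eventually succeeds on."""
    done = {}
    pending = list(images)  # [(page_num, url), ...]
    for attempt in range(max_attempts):
        failed = []
        for num, url in pending:
            try:
                done[num] = fetch(url)
            except Exception:
                failed.append((num, url))
        pending = failed
        if not pending:
            break
        if attempt < max_attempts - 1:
            time.sleep(backoff)  # pause between retry rounds
    return done
```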
### R2 + DB write ordering
**Page rows are inserted into the DB only after the R2 upload succeeds.** This prevents orphan DB records pointing to missing R2 objects. Every `INSERT INTO "Page"` includes `width` and `height` read from the JPEG/WebP bytes via PIL (`Image.open(...).width`).
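A sketch of that ordering contract, with hypothetical `s3_put`/`cur` parameters standing in for the boto3 client and psycopg2 cursor (the real code is spread across the upload paths):

```python
import io
from PIL import Image

def insert_page_after_upload(s3_put, cur, chapter_id, page_num, key, img_bytes, public_url):
    # Read dimensions first (PIL), upload second, insert the DB row LAST —
    # if the upload raises, no orphan "Page" row is ever written.
    with Image.open(io.BytesIO(img_bytes)) as im:
        width, height = im.size
    s3_put(key, img_bytes)  # raises on failure -> no DB write happens
    cur.execute(
        'INSERT INTO "Page" ("chapterId", number, "imageUrl", width, height) '
        "VALUES (%s, %s, %s, %s, %s)",
        (chapter_id, page_num, f"{public_url}/{key}", width, height),
    )
```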
### Storage layouts
```
# Local (download command)
manga-content/<slug>/detail.json              # title, author, genres, description, mg-cover URL
manga-content/<slug>/cover.jpg                # captured from page load traffic
manga-content/<slug>/<N> <chapter>/<page>.jpg # chapter folder (ordered by API sequence)

# R2 (upload / sync)
manga/<slug>/cover.webp
manga/<slug>/chapters/<N>/<page>.webp
```
Chapter order is the API's ascending index (1-based). Chapter names can repeat (announcements, extras), so the DB `Chapter.number` column uses this index, not parsed chapter titles.
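Sketched as a hypothetical helper (not in the script), the numbering rule is just 1-based enumeration over the API order:

```python
def chapter_rows(api_chapters):
    # number = 1-based API index; duplicate titles ("extra", announcements) stay distinct.
    return [{"number": i, "title": ch["title"]}
            for i, ch in enumerate(api_chapters, start=1)]
```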
### Menu actions
- **Setup** (`cmd_setup`) → brings Chrome to front, user solves CF, validates `cf_clearance` cookie.
- **Download** (`cmd_download`) → picks URL from `manga.json`, optional chapter multi-select; saves JPGs locally.
- **Upload** (`cmd_upload` → `upload_manga_to_r2`) → converts local JPGs → WebP, uploads to R2, writes DB rows.
- **Sync** (`cmd_sync`) → combined download+upload via RAM (no local files), refreshes `Manga` row metadata, only inserts chapters missing from DB.
- **R2 / DB management** submenu (`tui_r2_manage`):
- **Status** — single-pass R2 object count grouped by slug, plus DB row counts
- **Edit manga info** (`tui_edit_manga`) — title/description/genre/status/coverUrl
- **Delete specific manga** — R2 prefix + cascade DB delete
- **Delete specific chapter** (`tui_delete_chapter`) — multi-select or "All chapters"
- **Check missing pages** (`tui_check_missing_pages`) — for each chapter: if site page count ≠ R2 count, re-upload **inline** (browser still on that reader page); if counts match but DB `width`/`height` are NULL or 0, fix by reading WebP bytes from R2 (no re-upload)
- **Clear ALL (R2 + DB)**
- **Recompress manga** (`r2_recompress`) — re-encodes every WebP under `manga/<slug>/` at quality=65, overwrites in place
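The "Check missing pages" branching described above reduces to a simple decision (simplified sketch; function and return-value names are hypothetical):

```python
def missing_pages_action(site_count: int, r2_count: int, dims_ok: bool) -> str:
    if site_count != r2_count:
        return "reupload-inline"   # browser is already on that reader page
    if not dims_ok:
        return "fix-dims-from-r2"  # read WebP bytes from R2; no re-upload
    return "ok"
```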
### WebP encoding
`_to_webp_bytes(img, quality=WEBP_QUALITY=75, method=6)` — method=6 is the slowest/smallest preset. Covers use quality 80 via `make_cover` (crops to 400×560 aspect, then resizes). Resize-during-encode was explicitly removed — page originals' dimensions are preserved.
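A self-contained sketch of the encoder described above (assumes Pillow built with WebP support; the real function is `_to_webp_bytes`):

```python
import io
from PIL import Image

def to_webp_bytes(img: Image.Image, quality: int = 75, method: int = 6) -> bytes:
    # method=6 is the slowest/smallest WebP preset; no resize — dimensions preserved.
    buf = io.BytesIO()
    img.save(buf, "WEBP", quality=quality, method=method)
    return buf.getvalue()
```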
### ESC to stop
`EscListener` puts stdin in cbreak mode (POSIX `termios`+`tty`) and runs a daemon thread listening for `\x1b`. Download/Upload/Sync check `esc.stop.is_set()` between chapters and cleanly exit. Restores terminal mode on `__exit__`. No-op on Windows (no termios) and when stdin isn't a TTY.
### Lazy config loading
`_ensure_config()` is called at the start of each R2/DB helper. It reads required env vars and constructs the boto3 client on first use. If env vars are missing, it prints the missing list and `sys.exit(1)` — no KeyError traceback on import. `s3`, `BUCKET`, `PUBLIC_URL`, `DATABASE_URL` are module globals set by that call.
## Environment variables (.env)
```
R2_ACCOUNT_ID=   # Cloudflare account id
R2_ACCESS_KEY=
R2_SECRET_KEY=
R2_BUCKET=
R2_PUBLIC_URL=   # e.g. https://pub-xxx.r2.dev (trailing slash stripped)
DATABASE_URL=    # postgresql://user:pass@host:port/dbname
```
Missing any of these produces a friendly error on the first R2/DB operation, not on import.

## DB schema expectations

The script reads and writes these tables but does **not** create them. Create them externally:
```sql
CREATE TABLE "Manga" (
  id          SERIAL PRIMARY KEY,
  slug        TEXT UNIQUE NOT NULL,
  title       TEXT NOT NULL,
  description TEXT,
  "coverUrl"  TEXT,
  genre       TEXT,               -- comma-joined list of all genres
  status      TEXT NOT NULL,      -- PUBLISHED | DRAFT | HIDDEN
  "createdAt" TIMESTAMPTZ NOT NULL,
  "updatedAt" TIMESTAMPTZ NOT NULL
);

CREATE TABLE "Chapter" (
  id        SERIAL PRIMARY KEY,
  "mangaId" INTEGER NOT NULL REFERENCES "Manga"(id),
  number    INTEGER NOT NULL,     -- 1-based index from the API order
  title     TEXT NOT NULL,
  UNIQUE ("mangaId", number)
);

CREATE TABLE "Page" (
  id          SERIAL PRIMARY KEY,
  "chapterId" INTEGER NOT NULL REFERENCES "Chapter"(id),
  number      INTEGER NOT NULL,   -- 1-based page number
  "imageUrl"  TEXT NOT NULL,
  width       INTEGER,
  height      INTEGER,
  UNIQUE ("chapterId", number)
);
```
Column identifiers are camelCase with double quotes — matches Prisma default naming.
## Where to change what
| Task | Location |
|---|---|
| Add a new site | Extract happymh-specific bits: `fetch_chapters_via_api`, `fetch_chapters_from_dom`, `fetch_metadata`, `_try_get_chapter_images`, the `/mcover/` cover capture in `load_manga_page`, the reader URL shape. Keep Chrome/R2/DB/TUI as common. |
| New menu item | Add to `show_menu` list in `main` and dispatch in the `if idx == N:` ladder. For R2/DB ops, add to `tui_r2_manage`. |
| Tweak CF detection | `wait_for_cloudflare` / `_wait_for_cf_on_page` — edit the title/URL heuristics carefully; both call sites check the same signals. |
| Change image quality | `WEBP_QUALITY` at top of file; cover quality is hard-coded 80 in `make_cover`. |
| Add a new Page-table column | Update all three `INSERT INTO "Page"` sites (`upload_manga_to_r2`, `cmd_sync`, `tui_check_missing_pages` re-upload branch) and the `SELECT ... FROM "Page"` in the dim-check query. |
| Change parallelism | `UPLOAD_WORKERS` for R2 uploads; do **not** introduce chapter-level threading (sync Playwright breaks). |
## Future: multi-site support
Current code is happymh-specific (selectors, API paths, URL patterns). To generalise, a site module would implement `fetch_chapters(page, slug)`, `get_chapter_images(page, slug, chapter_id)`, and `fetch_metadata(page)`, keeping the Chrome/R2/DB/TUI layer common.
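That interface could be pinned down with a `typing.Protocol` — a sketch only, with signatures assumed from the three functions named above:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class MangaSite(Protocol):
    # Hypothetical site-module contract; `page` is a Playwright page in practice.
    def fetch_chapters(self, page: Any, slug: str) -> list: ...
    def get_chapter_images(self, page: Any, slug: str, chapter_id: str) -> list: ...
    def fetch_metadata(self, page: Any) -> dict: ...
```

`@runtime_checkable` lets `isinstance` verify a candidate module/class structurally, which keeps the Chrome/R2/DB/TUI layer decoupled from any one site.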

manga.py (102 lines changed)

@@ -8,15 +8,26 @@ Usage:
import io
import json
import os
import platform
import re
import select
import sys
import time
import socket
import subprocess
import threading

IS_MACOS = platform.system() == "Darwin"

# POSIX-only TTY modules; EscListener is a no-op on Windows.
try:
    import termios
    import tty
    _HAS_TERMIOS = True
except ImportError:
    termios = None
    tty = None
    _HAS_TERMIOS = False

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.parse import urlparse
@@ -40,19 +51,58 @@ BROWSER_DATA = ROOT_DIR / ".browser-data"
CDP_PORT = 9333
REQUEST_DELAY = 1.5
UPLOAD_WORKERS = 8

CHROME_CANDIDATES = [
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",  # macOS
    "/usr/bin/google-chrome",                                        # Linux
    "/usr/bin/google-chrome-stable",
    "/usr/bin/chromium",
    "/usr/bin/chromium-browser",
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",        # Windows
    r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
]


def _find_chrome():
    for p in CHROME_CANDIDATES:
        if Path(p).exists():
            return p
    return None


CHROME_PATH = _find_chrome()

# R2/DB config loaded lazily so missing .env gives a friendly error, not a KeyError on import.
_REQUIRED_ENV = ("R2_ACCOUNT_ID", "R2_ACCESS_KEY", "R2_SECRET_KEY", "R2_BUCKET", "R2_PUBLIC_URL", "DATABASE_URL")
s3 = None
BUCKET = None
PUBLIC_URL = None
DATABASE_URL = None
_config_loaded = False


def _ensure_config():
    global s3, BUCKET, PUBLIC_URL, DATABASE_URL, _config_loaded
    if _config_loaded:
        return
    missing = [k for k in _REQUIRED_ENV if not os.environ.get(k)]
    if missing:
        print("Missing env vars (check .env):")
        for k in missing:
            print(f"  {k}")
        sys.exit(1)
    s3 = boto3.client(
        "s3",
        endpoint_url=f"https://{os.environ['R2_ACCOUNT_ID']}.r2.cloudflarestorage.com",
        aws_access_key_id=os.environ["R2_ACCESS_KEY"],
        aws_secret_access_key=os.environ["R2_SECRET_KEY"],
        region_name="auto",
    )
    BUCKET = os.environ["R2_BUCKET"]
    PUBLIC_URL = os.environ["R2_PUBLIC_URL"].rstrip("/")
    DATABASE_URL = os.environ["DATABASE_URL"]
    _config_loaded = True
# ── ESC listener ───────────────────────────────────────────

@@ -68,7 +118,7 @@ class EscListener:
        self._fd = None

    def __enter__(self):
        if not _HAS_TERMIOS or not sys.stdin.isatty():
            return self
        self._fd = sys.stdin.fileno()
        try:
@@ -105,7 +155,9 @@ class EscListener:

def hide_chrome():
    """Hide Chrome window (macOS only; no-op elsewhere)."""
    if not IS_MACOS:
        return
    try:
        subprocess.Popen(
            ["osascript", "-e",
@@ -124,8 +176,11 @@ def is_port_open(port):

def launch_chrome(start_url=None):
    if is_port_open(CDP_PORT):
        return None
    if not CHROME_PATH or not Path(CHROME_PATH).exists():
        print("  Chrome not found. Install Google Chrome or Chromium.")
        print("  Searched:")
        for p in CHROME_CANDIDATES:
            print(f"    {p}")
        return None
    cmd = [
        CHROME_PATH,
@@ -198,8 +253,18 @@ def close_session():

def with_browser(func):
    """Run func(session) using the persistent Chrome session.
    If the session crashed (target closed etc.), reset and retry once."""
    session = get_session()
    try:
        return func(session)
    except Exception as e:
        msg = str(e).lower()
        if "target" in msg or "browser" in msg or "closed" in msg or "disconnected" in msg:
            print("  Browser session lost, restarting...")
            close_session()
            return func(get_session())
        raise


# ── Cloudflare ─────────────────────────────────────────────
@@ -674,11 +739,13 @@ def make_cover(source, width=400, height=560):

def upload_to_r2(key, data, content_type="image/webp"):
    _ensure_config()
    s3.put_object(Bucket=BUCKET, Key=key, Body=data, ContentType=content_type)
    return f"{PUBLIC_URL}/{key}"


def r2_key_exists(key):
    _ensure_config()
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
@@ -687,6 +754,7 @@ def r2_key_exists(key):

def get_db():
    _ensure_config()
    conn = psycopg2.connect(DATABASE_URL)
    conn.set_client_encoding("UTF8")
    return conn
@@ -1242,6 +1310,7 @@ def cmd_sync(manga_url=None):

def r2_list_prefixes():
    """List manga slugs in R2 by scanning top-level prefixes under manga/."""
    _ensure_config()
    slugs = set()
    paginator = s3.get_paginator("list_objects_v2")
    for pg in paginator.paginate(Bucket=BUCKET, Prefix="manga/", Delimiter="/"):
@@ -1255,6 +1324,7 @@ def r2_list_prefixes():

def r2_count_by_prefix(prefix):
    """Count objects under a prefix."""
    _ensure_config()
    total = 0
    for pg in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        total += len(pg.get("Contents", []))
@@ -1263,6 +1333,7 @@ def r2_count_by_prefix(prefix):

def r2_delete_prefix(prefix):
    """Delete all objects under a prefix."""
    _ensure_config()
    total = 0
    batches = []
    for pg in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
@@ -1284,6 +1355,7 @@ def r2_delete_prefix(prefix):

def r2_recompress(slug, quality=65):
    """Download all webp images for a manga, re-encode at lower quality, re-upload."""
    _ensure_config()
    prefix = f"manga/{slug}/"
    keys = []
    for pg in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
@@ -1865,7 +1937,7 @@ def tui_r2_manage():
            break
        elif idx == 0:
            _ensure_config()
            slug_counts = {}
            total = 0
            for pg in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):