Squashed 'manga-dl/' content from commit 9cb9b8c
git-subtree-dir: manga-dl git-subtree-split: 9cb9b8c7fdbc3622146c162c9e9ec5e7e3c518a6
This commit is contained in: commit 6218daeff4
.gitignore  (vendored, Normal file, +6)
@@ -0,0 +1,6 @@
.env
__pycache__/
manga-content/
.browser-data/
cookies.txt
.DS_Store
CLAUDE.md  (Normal file, +160)
@@ -0,0 +1,160 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Single-file interactive toolkit (`manga.py`) that downloads manga from m.happymh.com, stores images in Cloudflare R2 as WebP, and writes metadata to PostgreSQL. Runs as an arrow-key TUI backed by a persistent Chrome session.

## Commands

```bash
pip install -r requirements.txt  # playwright, boto3, psycopg2-binary, Pillow, python-dotenv, simple-term-menu
python manga.py                  # launch the TUI (no CLI args)
```

No tests, no lint config, no build step. Requires Google Chrome or Chromium installed. The script auto-detects the binary from `CHROME_CANDIDATES` (macOS/Linux/Windows paths). R2 and DB credentials load lazily — see the `.env` section below.

## Architecture

### Anti-bot: real Chrome + CDP + persistent profile

Cloudflare fingerprints both the TLS handshake and the browser process. The anti-detection chain matters — changing any link breaks downloads:

1. **`subprocess.Popen(CHROME_PATH, ...)`** launches the user's real Chrome binary, not Playwright's Chromium. This gives a genuine TLS fingerprint.
2. **`connect_over_cdp`** attaches Playwright to Chrome via the DevTools Protocol. Playwright never *launches* Chrome — it only sends CDP commands to a separately running process.
3. **Persistent `--user-data-dir=.browser-data`** preserves `cf_clearance` cookies between runs. After the user solves Cloudflare once (Setup menu), subsequent runs skip the challenge.
4. **Single session (`_session_singleton`)** — Chrome is lazy-started on the first operation and reused across all commands in one `python manga.py` run. Closed only on Quit. `with_browser(func)` catches "target closed" / "disconnected" errors, resets the singleton, and retries once.
5. **`hide_chrome()`** runs `osascript -e 'tell application "System Events" to set visible of process "Google Chrome" to false'` after launch so the window doesn't steal focus. No-op on non-macOS.

**Do not switch to headless mode.** It was tried — Cloudflare blocks it because the fingerprint differs from real Chrome. **Do not parallelize chapter work across threads** with Playwright's sync API — each thread would need its own event loop, and the code crashes with "no running event loop".
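Links 1 and 2 of the chain above can be sketched as follows (a minimal sketch, not the script's exact code; the debugging port, the `sleep`, and the helper names are illustrative):

```python
import subprocess
import time

def pick_chrome_path(candidates, exists):
    """Return the first existing Chrome binary from a candidate list
    (the script's CHROME_CANDIDATES plays this role)."""
    for path in candidates:
        if exists(path):
            return path
    return None

def launch_and_attach(chrome_path, port=9222, profile_dir=".browser-data"):
    """Launch the real Chrome binary, then attach Playwright over CDP.

    Chrome is a separately running process; Playwright only speaks the
    DevTools Protocol to it, so the TLS fingerprint stays genuine.
    """
    subprocess.Popen([
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",  # persists cf_clearance between runs
    ])
    time.sleep(2)  # give Chrome a moment to open the CDP socket
    from playwright.sync_api import sync_playwright  # imported lazily
    pw = sync_playwright().start()
    browser = pw.chromium.connect_over_cdp(f"http://127.0.0.1:{port}")
    return pw, browser
```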
### Cloudflare handling

`wait_for_cloudflare(session)` polls `page.title()` and `page.url` for the "Just a moment" / `/challenge` markers. Recovery is manual: the user is shown the browser window and solves the CAPTCHA. The Setup menu (`cmd_setup`) is the dedicated flow for this. During sync/check-missing, if the reading API returns 403, the script prints "CF blocked — run Setup" and stops.
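The polling heuristic can be sketched like this (an illustrative sketch; `looks_like_challenge` and `wait_until_clear` are stand-in names, and the real markers in `wait_for_cloudflare` may differ slightly):

```python
import time

def looks_like_challenge(title: str, url: str) -> bool:
    """Heuristic: does the current page look like a Cloudflare challenge?"""
    return "just a moment" in title.lower() or "/challenge" in url

def wait_until_clear(page, timeout_s=120, poll_s=1.0):
    """Poll title/URL until the challenge markers disappear
    (i.e. until the user has solved the CAPTCHA manually)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not looks_like_challenge(page.title(), page.url):
            return True
        time.sleep(poll_s)
    return False
```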
### Navigation: `page.goto` vs JS assignment

- **Manga listing page** (`/manga/<slug>`) uses `page.goto(..., wait_until="commit")`. Works because Cloudflare on this route is lenient.
- **Reader page** (`/mangaread/<slug>/<id>`) uses `page.evaluate("window.location.href = '...'")` — bypasses CF's detection of CDP `Page.navigate` for the stricter reader route.

### Image pipeline (happymh)

Per chapter (in `_try_get_chapter_images`):

1. Register a response listener that matches `/apis/manga/reading` **and** `cid=<chapter_id>` in the URL **and** validates that `data.id` in the response body matches. This drops pre-fetched neighbouring chapters.
2. Navigate the reader URL via `window.location.href` assignment.
3. DOM-count sanity check: `[class*="imgContainer"]` total minus `[class*="imgNext"]` gives the current chapter's actual page count. Trim the captured list if it includes next-chapter previews.
4. `fetch_image_bytes(page, img)` runs `fetch(url)` via `page.evaluate` inside a `page.expect_response(...)` block. The body is read via CDP (`response.body()`) — zero base64 overhead. The fallback strips the `?q=50` query if the original URL fails.
5. `fetch_all_pages(page, images, max_attempts=3)` retries each failed page up to 3 times with 2 s backoff between rounds. Returns `{page_num: bytes}` for successful fetches.
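Step 4 and its fallback can be sketched as follows (a sketch under stated assumptions: `strip_quality_query` and `fetch_image_bytes_sketch` are illustrative names, and the real matching logic may be more precise):

```python
def strip_quality_query(url: str) -> str:
    """Fallback: drop the query string (e.g. ?q=50) to request the original image."""
    return url.split("?", 1)[0]

def fetch_image_bytes_sketch(page, url: str) -> bytes:
    """Fetch an image inside the page context, reading the body over CDP.

    page.expect_response captures the matching network response, and
    response.body() returns raw bytes, so nothing is base64-encoded in JS.
    """
    for candidate in (url, strip_quality_query(url)):
        try:
            with page.expect_response(
                lambda r: r.url == candidate, timeout=15_000
            ) as resp_info:
                page.evaluate("u => fetch(u)", candidate)
            return resp_info.value.body()
        except Exception:
            continue  # retry with the query-stripped URL
    raise RuntimeError(f"failed to fetch {url}")
```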
### R2 + DB write ordering

**Page rows are inserted into the DB only after the R2 upload succeeds.** This prevents orphan DB records pointing to missing R2 objects. Every `INSERT INTO "Page"` includes `width` and `height` read from the JPEG/WebP bytes via PIL (`Image.open(...).width`).
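The ordering can be sketched like this (a sketch, not the script's exact code: `r2_key` and `upload_page_then_insert` are illustrative names; the key layout and the `"Page"` columns follow the sections above):

```python
def r2_key(slug: str, chapter_num: int, page_num: int) -> str:
    """R2 object key for a chapter page, per the storage layout below."""
    return f"manga/{slug}/chapters/{chapter_num}/{page_num}.webp"

def upload_page_then_insert(s3, cur, bucket, public_url, chapter_id,
                            slug, chapter_num, page_num, webp: bytes):
    """Upload to R2 first; insert the DB row only if the upload succeeded."""
    import io
    from PIL import Image  # Pillow, per requirements.txt

    w, h = Image.open(io.BytesIO(webp)).size
    key = r2_key(slug, chapter_num, page_num)
    s3.put_object(Bucket=bucket, Key=key, Body=webp, ContentType="image/webp")
    # Only reached when put_object did not raise -> no orphan DB rows.
    cur.execute(
        'INSERT INTO "Page" ("chapterId", number, "imageUrl", width, height) '
        "VALUES (%s, %s, %s, %s, %s)",
        (chapter_id, page_num, f"{public_url}/{key}", w, h),
    )
```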
### Storage layouts

```
# Local (download command)
manga-content/<slug>/detail.json       # title, author, genres, description, mg-cover URL
manga-content/<slug>/cover.jpg         # captured from page load traffic
manga-content/<slug>/<N> <chapter>/<page>.jpg

# R2 (upload / sync)
manga/<slug>/cover.webp
manga/<slug>/chapters/<N>/<page>.webp
```

Chapter order is the API's ascending index (1-based). Chapter names can repeat (announcements, extras), so the DB `Chapter.number` column uses this index, not parsed chapter titles.

### Menu actions

- **Setup** (`cmd_setup`) → brings Chrome to front, user solves CF, validates the `cf_clearance` cookie.
- **Download** (`cmd_download`) → picks a URL from `manga.json`, optional chapter multi-select; saves JPGs locally.
- **Upload** (`cmd_upload` → `upload_manga_to_r2`) → converts local JPGs → WebP, uploads to R2, writes DB rows.
- **Sync** (`cmd_sync`) → combined download+upload via RAM (no local files), refreshes `Manga` row metadata, only inserts chapters missing from the DB.
- **R2 / DB management** submenu (`tui_r2_manage`):
  - **Status** — single-pass R2 object count grouped by slug, plus DB row counts
  - **Edit manga info** (`tui_edit_manga`) — title/description/genre/status/coverUrl
  - **Delete specific manga** — R2 prefix + cascade DB delete
  - **Delete specific chapter** (`tui_delete_chapter`) — multi-select or "All chapters"
  - **Check missing pages** (`tui_check_missing_pages`) — for each chapter: if the site page count ≠ the R2 count, re-upload **inline** (the browser is still on that reader page); if the counts match but DB `width`/`height` are NULL or 0, fix by reading WebP bytes from R2 (no re-upload)
  - **Clear ALL (R2 + DB)**
  - **Recompress manga** (`r2_recompress`) — re-encodes every WebP under `manga/<slug>/` at quality=65, overwrites in place

### WebP encoding

`_to_webp_bytes(img, quality=WEBP_QUALITY=75, method=6)` — method=6 is the slowest/smallest preset. Covers use quality 80 via `make_cover` (crops to 400×560 aspect, then resizes). Resize-during-encode was explicitly removed — page originals' dimensions are preserved.
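The encoder and the crop math behind `make_cover` can be sketched as follows (a sketch: `_to_webp_bytes_sketch` mirrors the documented signature, and `cover_crop_box` is an illustrative helper for the 400×560 centered crop, not necessarily the script's own function):

```python
import io

WEBP_QUALITY = 75  # module-level default, per the section above

def _to_webp_bytes_sketch(img, quality=WEBP_QUALITY, method=6):
    """Encode a PIL image as WebP; method=6 is slowest but smallest."""
    buf = io.BytesIO()
    img.save(buf, format="WEBP", quality=quality, method=method)
    return buf.getvalue()

def cover_crop_box(w: int, h: int, cw: int = 400, ch: int = 560):
    """Centered crop box with a 400x560 aspect ratio, using integer math."""
    if w * ch > h * cw:           # too wide: trim the sides
        new_w = h * cw // ch
        left = (w - new_w) // 2
        return (left, 0, left + new_w, h)
    new_h = w * ch // cw          # too tall: trim top and bottom
    top = (h - new_h) // 2
    return (0, top, w, top + new_h)
```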
### ESC to stop

`EscListener` puts stdin in cbreak mode (POSIX `termios`+`tty`) and runs a daemon thread listening for `\x1b`. Download/Upload/Sync check `esc.stop.is_set()` between chapters and cleanly exit. The terminal mode is restored on `__exit__`. No-op on Windows (no termios) and when stdin isn't a TTY.
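The listener pattern can be sketched like this (an illustrative sketch of the same idea; the real `EscListener` may differ in details):

```python
import sys
import threading

class EscListenerSketch:
    """Daemon thread that sets `stop` when ESC (0x1b) is read from stdin.

    cbreak mode only applies on POSIX TTYs; elsewhere this is a no-op.
    """
    def __init__(self):
        self.stop = threading.Event()
        self._saved = None

    def __enter__(self):
        try:
            import termios, tty
        except ImportError:
            return self  # Windows: no termios, stay a no-op
        if not sys.stdin.isatty():
            return self  # piped stdin: stay a no-op
        fd = sys.stdin.fileno()
        self._saved = (termios, fd, termios.tcgetattr(fd))
        tty.setcbreak(fd)  # deliver keypresses without waiting for Enter
        threading.Thread(target=self._listen, daemon=True).start()
        return self

    def _listen(self):
        while not self.stop.is_set():
            if sys.stdin.read(1) == "\x1b":
                self.stop.set()

    def __exit__(self, *exc):
        if self._saved:
            termios, fd, attrs = self._saved
            termios.tcsetattr(fd, termios.TCSADRAIN, attrs)
        return False
```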
### Lazy config loading

`_ensure_config()` is called at the start of each R2/DB helper. It reads the required env vars and constructs the boto3 client on first use. If env vars are missing, it prints the missing list and calls `sys.exit(1)` — no KeyError traceback on import. `s3`, `BUCKET`, `PUBLIC_URL`, `DATABASE_URL` are module globals set by that call.
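The lazy-validation shape can be sketched like this (a sketch: `missing_vars` and `ensure_config_sketch` are illustrative names; the required keys follow the `.env` section below):

```python
import os
import sys

REQUIRED = ["R2_ACCOUNT_ID", "R2_ACCESS_KEY", "R2_SECRET_KEY",
            "R2_BUCKET", "R2_PUBLIC_URL", "DATABASE_URL"]

def missing_vars(env, required=REQUIRED):
    """Which required keys are absent or empty in the given mapping?"""
    return [k for k in required if not env.get(k)]

_configured = False

def ensure_config_sketch():
    """Validate env vars once, on first R2/DB use; exit cleanly if any are missing."""
    global _configured
    if _configured:
        return
    missing = missing_vars(os.environ)
    if missing:
        print("Missing env vars:", ", ".join(missing))
        sys.exit(1)  # friendly error instead of a KeyError traceback
    # ...construct the boto3 client and set the module globals here...
    _configured = True
```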
## Environment variables (.env)

```
R2_ACCOUNT_ID=      # cloudflare account id
R2_ACCESS_KEY=
R2_SECRET_KEY=
R2_BUCKET=
R2_PUBLIC_URL=      # e.g. https://pub-xxx.r2.dev (trailing slash stripped)
DATABASE_URL=       # postgresql://user:pass@host:port/dbname
```

Missing any of these produces a friendly error on the first R2/DB operation, not on import.

## DB schema expectations

The script reads/writes but does **not** create tables. Create them externally:

```sql
CREATE TABLE "Manga" (
  id          SERIAL PRIMARY KEY,
  slug        TEXT UNIQUE NOT NULL,
  title       TEXT NOT NULL,
  description TEXT,
  "coverUrl"  TEXT,
  genre       TEXT,               -- comma-joined list of all genres
  status      TEXT NOT NULL,      -- PUBLISHED | DRAFT | HIDDEN
  "createdAt" TIMESTAMPTZ NOT NULL,
  "updatedAt" TIMESTAMPTZ NOT NULL
);

CREATE TABLE "Chapter" (
  id        SERIAL PRIMARY KEY,
  "mangaId" INTEGER NOT NULL REFERENCES "Manga"(id),
  number    INTEGER NOT NULL,     -- 1-based index from the API order
  title     TEXT NOT NULL,
  UNIQUE ("mangaId", number)
);

CREATE TABLE "Page" (
  id          SERIAL PRIMARY KEY,
  "chapterId" INTEGER NOT NULL REFERENCES "Chapter"(id),
  number      INTEGER NOT NULL,   -- 1-based page number
  "imageUrl"  TEXT NOT NULL,
  width       INTEGER,
  height      INTEGER,
  UNIQUE ("chapterId", number)
);
```

Column identifiers are camelCase with double quotes — this matches Prisma's default naming.

## Where to change what

| Task | Location |
|---|---|
| Add a new site | Extract the happymh-specific bits: `fetch_chapters_via_api`, `fetch_chapters_from_dom`, `fetch_metadata`, `_try_get_chapter_images`, the `/mcover/` cover capture in `load_manga_page`, the reader URL shape. Keep Chrome/R2/DB/TUI as common. |
| New menu item | Add to the `show_menu` list in `main` and dispatch in the `if idx == N:` ladder. For R2/DB ops, add to `tui_r2_manage`. |
| Tweak CF detection | `wait_for_cloudflare` / `_wait_for_cf_on_page` — edit the title/URL heuristics carefully; both ops check the same signals. |
| Change image quality | `WEBP_QUALITY` at the top of the file; cover quality is hard-coded to 80 in `make_cover`. |
| Add a new Page-table column | Update all three `INSERT INTO "Page"` sites (`upload_manga_to_r2`, `cmd_sync`, the `tui_check_missing_pages` re-upload branch) and the `SELECT ... FROM "Page"` in the dim-check query. |
| Change parallelism | `UPLOAD_WORKERS` for R2 uploads; do **not** introduce chapter-level threading (sync Playwright breaks). |

## Future: multi-site support

The current code is happymh-specific (selectors, API paths, URL patterns). To generalise, a site module would implement `fetch_chapters(page, slug)`, `get_chapter_images(page, slug, chapter_id)`, and `fetch_metadata(page)`, keeping the Chrome/R2/DB/TUI layer common.
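One hedged way to express that interface (a sketch using `typing.Protocol`; `MangaSite` is an illustrative name, and the method signatures follow the functions listed above):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class MangaSite(Protocol):
    """Interface a per-site module would implement.

    The Chrome/R2/DB/TUI layer stays common; only these three hooks
    carry site-specific selectors, API paths, and URL patterns.
    """
    def fetch_metadata(self, page) -> dict: ...
    def fetch_chapters(self, page, slug: str) -> list: ...
    def get_chapter_images(self, page, slug: str, chapter_id: str) -> list: ...
```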
manga.json  (Normal file, +6)
@@ -0,0 +1,6 @@
[
  "https://m.happymh.com/manga/fangkainagenvwu",
  "https://m.happymh.com/manga/jueduijiangan",
  "https://m.happymh.com/manga/xingjiandashi",
  "https://m.happymh.com/manga/moutianchengweimoshen"
]
requirements.txt  (Normal file, +6)
@@ -0,0 +1,6 @@
playwright
boto3
psycopg2-binary
Pillow
python-dotenv
simple-term-menu