Comments scraping + Substack-styled rendering + macOS driver fixes + README doc updates#46

Open
sanket-k wants to merge 4 commits into timf34:main from sanket-k:mac-comment-refinements

@sanket-k

Summary

This PR adds three related improvements to the scraper, plus a README overhaul:

1. Comment scraping (--comments)

  • New --comments flag fetches and caches each post's comment thread.
    Public threads are free; paid-only threads require --premium.
  • New --comments-sort flag (best | most_recent_first).
  • Comments render into the per-post HTML page, and comment_count is
    exposed to the index UI (with a new "Sort by Comments" option).
  • Ships backfill_comment_counts.py to repair older data files.
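
As a rough illustration of how the comment flags fit together, the sketch below declares them with argparse. It is not the parser from substack_scraper.py: the function name build_parser, the help strings, and the default sort order ("best") are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the comment-related flags."""
    parser = argparse.ArgumentParser(description="Comment flags (illustrative subset)")
    # --comments opts in to fetching and caching each post's comment thread.
    parser.add_argument("--comments", action="store_true",
                        help="fetch and cache comment threads")
    # --comments-sort accepts the two orderings named in this PR.
    parser.add_argument("--comments-sort",
                        choices=["best", "most_recent_first"],
                        default="best",  # assumed default, not confirmed by the PR
                        help="ordering of fetched comments")
    # --premium is needed for paid-only threads.
    parser.add_argument("--premium", action="store_true",
                        help="log in to reach paid-only content")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--comments", "--comments-sort", "most_recent_first"]
    )
    print(args.comments, args.comments_sort, args.premium)
    # prints: True most_recent_first False
```

Restricting --comments-sort with choices means an unsupported ordering is rejected at parse time rather than surfacing later as a failed request.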

2. macOS driver support & reliability

  • Correctly locates Chrome/Edge under /Applications and resolves the
    right driver platform (mac-arm64 vs mac-x64, mac64_m1).
  • Driver recovery: recreates crashed sessions and periodically restarts
    to shed leaked state; suppresses noise via the quiet flag.
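
The two ideas above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: mac_driver_platform, RecoveringDriver, and restart_every are hypothetical names, pairing mac64_m1 (and mac64) with Edge is an assumption, and the broad except stands in for whatever driver-specific exception the scraper actually catches.

```python
import platform
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def mac_driver_platform(browser: str, machine: Optional[str] = None) -> str:
    """Pick a macOS driver platform label from the CPU architecture."""
    machine = (machine or platform.machine()).lower()
    is_arm = machine in ("arm64", "aarch64")
    if browser == "edge":
        # Assumed Edge-driver labels; only mac64_m1 is named in the PR.
        return "mac64_m1" if is_arm else "mac64"
    return "mac-arm64" if is_arm else "mac-x64"


class RecoveringDriver:
    """Recreates the driver after a failure and restarts it every N calls."""

    def __init__(self, factory: Callable[[], object], restart_every: int = 50):
        self._factory = factory            # builds a fresh driver session
        self._restart_every = restart_every
        self._calls = 0
        self._driver = factory()

    def _recreate(self) -> None:
        quit_fn = getattr(self._driver, "quit", None)
        if callable(quit_fn):
            try:
                quit_fn()
            except Exception:
                pass  # a crashed session may fail to quit cleanly
        self._driver = self._factory()
        self._calls = 0

    def run(self, action: Callable[[object], T]) -> T:
        # Periodic restart to shed state leaked by a long-lived session.
        if self._calls >= self._restart_every:
            self._recreate()
        self._calls += 1
        try:
            return action(self._driver)
        except Exception:
            # Treat the failure as a dead session: rebuild once and retry.
            self._recreate()
            self._calls += 1
            return action(self._driver)


if __name__ == "__main__":
    print(mac_driver_platform("chrome", "arm64"))   # prints: mac-arm64
    print(mac_driver_platform("chrome", "x86_64"))  # prints: mac-x64
```

Passing a factory rather than a driver instance is what makes recovery possible: the wrapper can build a replacement session without knowing which browser or options were used.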

3. Substack-styled HTML rendering

  • Rewrites per-post HTML to match Substack's default look: Spectral serif
    body (19px/1.6), orange (#ff6719) links, left-aligned text, 728px column,
    and a centered cover/title/subtitle/byline header built from structured
    metadata rather than inlined markdown.
  • New render_posts.py standalone regenerate script (no network required).
  • New --render-only / --render-all CLI flags.

4. README.md overhaul

  • Adds a Features overview, System Architecture (mermaid), Installation,
    and Usage examples for comments and sorting.
  • Adds a Workflow sequence (mermaid), a full CLI Reference table,
    Browser & Driver Support, and Backfilling Comment Counts.
  • Adds Substack-style Rendering docs, Output Layout, Online Version, and
    Viewing Markdown Files in Browser.
  • Corrects the stale "Microsoft Edge" browser note.

Changes

  • substack_scraper.py — comment helpers, structured render path, driver fixes
  • render_posts.py — new network-free regenerate script
  • backfill_comment_counts.py — new data repair utility
  • assets/css/essay-styles.css, assets/css/style.css, author_template.html — restyled
  • assets/js/populate-essays.js — "Sort by Comments" + comment_count handling
  • tests/test_substack_scraper.py — 13+ new tests (comments, rendering, regenerate)
  • README.md — CLI reference table, architecture/workflow mermaid diagrams

Testing

  • All new and existing tests pass: pytest tests/
  • Manually verified scraping + rendering on macOS (Apple Silicon).

Backward compatibility

The structured render path falls back to the flat (meta=None) path for legacy
compatibility, so existing data files keep working.
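
A minimal sketch of the fallback idea, assuming a renderer that takes the post body plus optional metadata. The names build_header and render_document, the metadata keys, and the markup are all hypothetical; the helpers added in substack_scraper.py may have different signatures and output.

```python
import html
from typing import Optional


def build_header(meta: dict) -> str:
    """Render a header block from structured metadata (assumed keys)."""
    parts = []
    cover = meta.get("cover_url")
    if cover:
        parts.append('<img class="cover" src="' + html.escape(cover) + '" alt="">')
    parts.append("<h1>" + html.escape(meta.get("title", "")) + "</h1>")
    subtitle = meta.get("subtitle")
    if subtitle:
        parts.append("<h2>" + html.escape(subtitle) + "</h2>")
    byline = meta.get("byline")
    if byline:
        parts.append('<p class="byline">' + html.escape(byline) + "</p>")
    return '<header class="post-header">' + "".join(parts) + "</header>"


def render_document(body_html: str, meta: Optional[dict] = None) -> str:
    """Use the structured path when metadata exists, else the flat path."""
    if meta is None:
        # Flat path: the body already carries its own title lines,
        # so it is wrapped unchanged.
        return "<article>" + body_html + "</article>"
    return "<article>" + build_header(meta) + body_html + "</article>"


if __name__ == "__main__":
    print(render_document("<p>Hello</p>"))
    # prints: <article><p>Hello</p></article>
    print(render_document("<p>Hello</p>", {"title": "A & B"}))
```

Keeping meta optional means callers written against the older flat output need no changes, which is the compatibility property described above.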

sanket-k added 4 commits June 19, 2026 14:32
- Add --comments flag to fetch and cache each post's comment thread
  (public threads free; paid-only threads via --premium). Renders threads
  into the per-post HTML page and exposes comment_count to the index.
- Add --comments-sort (best | most_recent_first).
- Fix macOS driver detection: locate Chrome/Edge under /Applications and
  resolve correct driver platform (mac-arm64 vs mac-x64, mac64_m1).
- Add driver recovery: recreate crashed sessions and periodic restarts to
  shed leaked state; suppress noise via quiet flag.
- Add Sort by Comments to the index UI + comment section CSS.
- Add backfill_comment_counts.py to repair older data files.
- Expand the test suite to cover comment helpers and HTML rendering.
- Document --comments / --comments-sort and the caching/rendering model.
- Add full CLI reference table covering all flags.
- Correct the browser requirement (Chrome default + macOS/Apple Silicon
  detection) instead of the stale "Microsoft Edge" note.
- Document backfill_comment_counts.py and the output directory layout.
- Add mermaid diagrams: system architecture, scrape workflow sequence,
  driver fallback, and on-disk outputs.
Rewrite per-post HTML to the classic default Substack look: Spectral serif
body (19px/1.6), orange (#ff6719) links, left-aligned text, white background,
728px column, and a centered cover/title/subtitle/byline header built from
structured metadata instead of inlined markdown.

- Rewrite assets/css/essay-styles.css with :root theme tokens and header rules
- Align author index (assets/css/style.css, author_template.html) to palette
- Add build_post_header, split_metadata_and_body, build_post_document, and
  render_post_to_html_file helpers in substack_scraper.py
- Route scrape_posts through the structured render path; keep flat (meta=None)
  path intact for legacy compatibility
- Add render_posts.py standalone regenerate script (no network) plus
  --render-only/--render-all CLI flags
- Add 13 tests covering header rendering, metadata splitting, structured
  rendering, and end-to-end regenerate
- Document rendering and new flags in README