test: add Tutorial 24 drift guard (staggered-vs-collapsed power claims) #549
PR Review Report

Overall Assessment
✅ Looks good
No unmitigated P0/P1 findings. The PR is test-only for estimator behavior and does not introduce methodology changes. One P2 release-metadata regression should be cleaned up, but it is not a statistical correctness blocker.

Executive Summary

Methodology
No P0/P1 findings.
Affected methods:
Registry cross-check:
Severity: None

Code Quality
Finding: Version metadata regression

Performance
No findings.
Severity: None

Maintainability
No additional findings beyond the P2 version-metadata regression above.
Severity: None

Tech Debt
No findings.
Severity: None

Security
No findings.
Severity: None

Documentation/Tests
No blocker findings.
Severity: P3 informational
Pins the two load-bearing quantitative claims in docs/tutorials/24_staggered_vs_collapsed_power.ipynb against estimator-default / simulation drift, closing the deferred Testing/Docs TODO row (branch staggered-analysis-2x2):

1. Monotonic dilution fast -> slow: the collapsed-2x2 reports a monotonically shrinking share of the truth (93.5% / 80.9% / 61.8%) and its CI coverage of the effect-on-treated collapses, while CS stays near nominal. Pinned deterministically (estimands are means of the noise-free true_effect column) so it runs in every CI leg.

2. CS-vs-2x2 MDE crossover / near-parity at slow rollout: the 2x2's MDE climbs (~0.37 -> ~0.60) while CS's barely moves (~0.55) so the power gap closes to parity. Pinned as robust orderings (the exact reversal is simulation-sensitive, per the prose).

Structure mirrors the T25 split: deterministic structural pins + a rendered-surface quote cross-check + a notebook-kwargs sync guard run unmarked; the Monte Carlo sweeps (coverage collapse, MDE crossover, flat-vs-growing estimand targeting) are @pytest.mark.slow so they stay off the pure-Python budget and run in the Rust legs (-m '') at full count.

Removes the resolved row from TODO.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
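The difference between an exact pin and a "robust ordering" pin, as described in the commit message above, might look roughly like the sketch below. It is not the PR's test code: the numbers are the approximate values quoted in the message, while the scenario labels ("fast", "medium", "slow"), the helper names, the margins, and the mapping of 0.37 to the fast scenario are illustrative assumptions.

```python
"""Sketch: pin an ordering with a margin instead of an exact simulated value."""

# Approximate values quoted in the commit message; the dict keys are
# hypothetical labels, not names taken from the tutorial or the test file.
SHARE_OF_TRUTH = {"fast": 0.935, "medium": 0.809, "slow": 0.618}
MDE_2X2 = {"fast": 0.37, "slow": 0.60}
MDE_CS = {"fast": 0.55, "slow": 0.55}  # "barely moves", so one value is reused


def is_strictly_decreasing(values, min_gap=0.0):
    """True if each value is below its predecessor by more than min_gap."""
    return all(a - b > min_gap for a, b in zip(values, values[1:]))


def power_gap(mde_2x2, mde_cs):
    """CS MDE minus 2x2 MDE: positive means the 2x2 detects smaller effects."""
    return mde_cs - mde_2x2


# Claim 1: the collapsed 2x2's share of the truth shrinks from fast to slow.
shares = [SHARE_OF_TRUTH[k] for k in ("fast", "medium", "slow")]
assert is_strictly_decreasing(shares, min_gap=0.05)

# Claim 2: the 2x2's power advantage is clear at fast rollout and shrinks
# toward parity at slow rollout. Only the ordering is asserted, not the sign
# of the slow-rollout gap, because the exact reversal is simulation-sensitive.
gap_fast = power_gap(MDE_2X2["fast"], MDE_CS["fast"])
gap_slow = power_gap(MDE_2X2["slow"], MDE_CS["slow"])
assert gap_fast > 0.1
assert abs(gap_slow) < abs(gap_fast)
```

Asserting an ordering with a margin tolerates small shifts in simulated values (for example from a changed default or seed) while still failing if the qualitative claim in the prose stops holding.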
a38113b to 805e8c0
🔁 AI review rerun (requested by @igerber)
Head SHA:

PR Review Report

Overall Assessment
✅ Looks good
No unmitigated P0/P1 findings. This PR is test-only and removes a TODO row that the new drift guard directly addresses.

Executive Summary

Methodology
No findings.
Severity: None

Code Quality
No findings.
Severity: None

Performance
No findings.
Severity: None

Maintainability
No findings.
Severity: None

Tech Debt
No findings.
Severity: None

Security
No findings.
Severity: None

Documentation/Tests
P3 informational.
Severity: P3
Summary
TODO.md.

Because nbsphinx_execute = "never", the committed notebook outputs are what RTD renders, so the prose can silently drift from the live library. These asserts re-derive the load-bearing numbers from the same public generator (generate_staggered_data) + estimators the tutorial uses and check them against the committed surface.

Methodology references
DifferenceInDifferences (collapsed 2×2) and CallawaySantAnna(control_group="never_treated") via tutorial-drift checks; no estimator/math/source changes.

Validation
tests/test_t24_staggered_vs_collapsed_power_drift.py (9 tests). Structure mirrors the T25 split:
E2/E1, exact); a rendered-surface quote cross-check (19 committed numbers); a notebook-kwargs sync guard.
@pytest.mark.slow, Rust legs -m '' only, off the ~1h pure-Python budget): dilution coverage collapse; MDE crossover; flat-vs-growing estimand targeting. These assert robust orderings with wide margins, calibrated against real reduced-sim runs (not flaky exact pins).
-m '', 6.6s); black --check and ruff clean. Local codex review: ✅ no P0/P1/P2 findings.

Security / privacy
🤖 Generated with Claude Code