test: add Tutorial 24 drift guard (staggered-vs-collapsed power claims) by igerber · Pull Request #549 · igerber/diff-diff

igerber · 2026-06-25T12:28:34Z

Summary

Add a drift guard for Tutorial 24 (staggered-rollout vs. collapsed-2×2 power decision guide), pinning its two load-bearing claims against estimator-default / simulation drift:
1. Monotonic dilution, fast → slow — the collapsed 2×2 reports a monotonically shrinking share of the truth (93.5% / 80.9% / 61.8%) and its CI coverage of the effect-on-treated collapses, while Callaway–Sant'Anna stays near nominal.
2. CS-vs-2×2 MDE crossover / near-parity at slow rollout — the 2×2's MDE climbs (~0.37 → ~0.60) while CS's barely moves (~0.55), so the power gap closes to parity.
Remove the now-resolved Testing/Docs row from TODO.md.

Because nbsphinx_execute = "never", the committed notebook outputs are what RTD renders, so the prose can silently drift from the live library. These asserts re-derive the load-bearing numbers from the same public generator (generate_staggered_data) + estimators the tutorial uses and check them against the committed surface.

Methodology references

Method name(s): N/A — test-only. Exercises existing DifferenceInDifferences (collapsed 2×2) and CallawaySantAnna(control_group="never_treated") via tutorial-drift checks; no estimator/math/source changes.
Paper / source link(s): Callaway & Sant'Anna (2021) — the tutorial's subject; no new methodology introduced.
Intentional deviations from the source: None.

Validation

Tests added: tests/test_t24_staggered_vs_collapsed_power_drift.py (9 tests). Structure mirrors the T25 split:
- Deterministic (unmarked, run in every CI leg incl. pure-Python): panel composition; §1 estimand gap; monotonic dilution (structural E2/E1, exact); a rendered-surface quote cross-check (19 committed numbers); a notebook-kwargs sync guard.
- Monte Carlo (@pytest.mark.slow, Rust legs -m '' only — off the ~1h pure-Python budget): dilution coverage collapse; MDE crossover; flat-vs-growing estimand targeting. These assert robust orderings with wide margins, calibrated against real reduced-sim runs (not flaky exact pins).
Re-derivation reproduces the notebook's committed numbers exactly (E1=2.61, E2=2.10, 2×2=2.17, CS=2.67, 81%; dilution 93.5 / 80.9 / 61.8%).
All 9 tests pass locally (-m '', 6.6s); black --check and ruff clean. Local codex review: ✅ no P0/P1/P2 findings.

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

github-actions · 2026-06-25T12:31:00Z

PR Review Report

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The PR is test-only for estimator behavior and does not introduce methodology changes. One P2 release-metadata regression should be cleaned up, but it is not a statistical correctness blocker.

Executive Summary

The affected methods are DifferenceInDifferences collapsed 2×2 and CallawaySantAnna(control_group="never_treated").
No estimator math, weighting, variance/SE, identification assumptions, or defaults are changed.
The new drift tests align with the registry’s DiD and CS targets and use public generate_staggered_data.
The removed TODO row is plausibly addressed by the new T24 drift guard.
P2: version metadata is downgraded from 3.5.3 to 3.5.2 across package, Rust, citation, and guide files.
I could not execute tests in this sandbox because pytest/numpy are not installed.

Methodology

No P0/P1 findings.

Affected methods:

DifferenceInDifferences, used for the collapsed 2×2 fit at tests/test_t24_staggered_vs_collapsed_power_drift.py:L93-L102.
CallawaySantAnna(control_group="never_treated"), used at tests/test_t24_staggered_vs_collapsed_power_drift.py:L105-L115.

Registry cross-check:

DiD interaction ATT and optional clustered SEs match docs/methodology/REGISTRY.md:L60-L70.
CS never-treated comparison, group-time ATT target, aggregation, and IF inference match docs/methodology/REGISTRY.md:L443-L499.
generate_staggered_data exposes first_treat, treated, treat, and true_effect, matching the test’s structural estimand calculations at diff_diff/prep_dgp.py:L179-L189 and diff_diff/prep_dgp.py:L299-L320.

Severity: None
Impact: No undocumented methodology deviation found.
Concrete fix: None.

Code Quality

Finding: Version metadata regression
Severity: P2
Impact: This test-only PR changes public/package metadata back to 3.5.2 while the diff shows the base was 3.5.3. A merged package would report the older version via Python metadata, diff_diff.__version__, Rust crate metadata, citation metadata, and bundled LLM guide text. See pyproject.toml:L5-L8, diff_diff/__init__.py:L301-L302, rust/Cargo.toml:L1-L3, CITATION.cff:L10-L11, and diff_diff/guides/llms-full.txt:L1-L5.
Concrete fix: Revert the version/date/link downgrades, or update all release metadata consistently according to the project’s release process. If 3.5.3 is already released on the base branch, restore the 3.5.3 values and changelog compare link.

Performance

No findings.

Severity: None
Impact: The heavier Monte Carlo tests are marked slow, consistent with pyproject.toml:L112-L116.
Concrete fix: None.

Maintainability

No additional findings beyond the P2 version-metadata regression above.

Severity: None
Impact: The new helper structure is localized to one drift-test file and reuses existing tutorial drift helpers.
Concrete fix: None.

Tech Debt

No findings.

Severity: None
Impact: Removing the T24 TODO row is acceptable because this PR adds a targeted drift guard for the named tutorial claims at tests/test_t24_staggered_vs_collapsed_power_drift.py:L1-L35.
Concrete fix: None.

Security

No findings.

Severity: None
Impact: No secrets, subprocess execution, network access, or unsafe deserialization introduced. Notebook JSON parsing is local-file only.
Concrete fix: None.

Documentation/Tests

No blocker findings.

Severity: P3 informational
Impact: I could not run the tests in this sandbox because pytest and numpy are unavailable, so this review is static plus direct notebook-surface inspection. The rendered quote strings checked by the test are present in the committed notebook outputs/prose.
Concrete fix: Ensure CI runs tests/test_t24_staggered_vs_collapsed_power_drift.py in both default and slow-test legs as intended.

Pins the two load-bearing quantitative claims in docs/tutorials/24_staggered_vs_collapsed_power.ipynb against estimator-default / simulation drift, closing the deferred Testing/Docs TODO row (branch staggered-analysis-2x2): 1. Monotonic dilution fast -> slow: the collapsed-2x2 reports a monotonically shrinking share of the truth (93.5% / 80.9% / 61.8%) and its CI coverage of the effect-on-treated collapses, while CS stays near nominal. Pinned deterministically (estimands are means of the noise-free true_effect column) so it runs in every CI leg. 2. CS-vs-2x2 MDE crossover / near-parity at slow rollout: the 2x2's MDE climbs (~0.37 -> ~0.60) while CS's barely moves (~0.55) so the power gap closes to parity. Pinned as robust orderings (the exact reversal is simulation-sensitive, per the prose). Structure mirrors the T25 split: deterministic structural pins + a rendered-surface quote cross-check + a notebook-kwargs sync guard run unmarked; the Monte Carlo sweeps (coverage collapse, MDE crossover, flat-vs-growing estimand targeting) are @pytest.mark.slow so they stay off the pure-Python budget and run in the Rust legs (-m '') at full count. Removes the resolved row from TODO.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-25T12:34:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 805e8c0f2cded5666ed457095e86188af65a158f

PR Review Report

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. This PR is test-only and removes a TODO row that the new drift guard directly addresses.

Executive Summary

Affected methods exercised by the tests: DifferenceInDifferences collapsed 2x2 and CallawaySantAnna(control_group="never_treated").
No estimator math, weighting, variance/SE, assumptions, or defaults are changed.
The test’s DGP field usage matches generate_staggered_data: treated is observation-level treatment and treat is unit-level ever-treated.
Registry cross-check found no undocumented methodology deviation.
Prior P2 metadata regression is resolved; version surfaces currently remain 3.5.3.
I could not run pytest because pytest is not installed in this sandbox.

Methodology

No findings.

Severity: None
Impact: The new tests exercise existing behavior only. The DiD call at tests/test_t24_staggered_vs_collapsed_power_drift.py:L93-L102 matches the registry’s 2x2 interaction ATT with optional clustered SEs. The CS call at tests/test_t24_staggered_vs_collapsed_power_drift.py:L105-L115 matches the registry’s never-treated comparison and overall aggregation contract.
Concrete fix: None.

Code Quality

No findings.

Severity: None
Impact: The prior version-metadata downgrade finding is no longer present in this re-review diff; current metadata remains 3.5.3 in pyproject.toml, diff_diff/__init__.py, rust/Cargo.toml, and CITATION.cff. No new inline inference anti-patterns or public-parameter propagation issues were introduced.
Concrete fix: None.

Performance

No findings.

Severity: None
Impact: Monte Carlo checks are marked @pytest.mark.slow at tests/test_t24_staggered_vs_collapsed_power_drift.py:L295, L325, and L352, consistent with the project’s default slow-test exclusion.
Concrete fix: None.

Maintainability

No findings.

Severity: None
Impact: The test helper structure is local, readable, and reuses existing tutorial drift helpers at tests/test_t24_staggered_vs_collapsed_power_drift.py:L50.
Concrete fix: None.

Tech Debt

No findings.

Severity: None
Impact: Removing the Tutorial 24 TODO row is justified by the new coverage for the named drift claims at tests/test_t24_staggered_vs_collapsed_power_drift.py:L1-L35.
Concrete fix: None.

Security

No findings.

Severity: None
Impact: No secrets, network access, subprocess execution, or unsafe deserialization paths are introduced. Notebook parsing is local JSON reading through existing helpers.
Concrete fix: None.

Documentation/Tests

P3 informational.

Severity: P3
Impact: I could not execute the test file because pytest is unavailable in this sandbox. Static checks confirmed the notebook contains the pinned rendered values and locked config strings referenced by test_rendered_surface_quotes() and test_notebook_kwargs_match().
Concrete fix: Ensure CI runs tests/test_t24_staggered_vs_collapsed_power_drift.py in both default and slow-test legs as intended.

igerber force-pushed the test/tutorial24-drift-test branch from a38113b to 805e8c0 Compare June 25, 2026 12:32

igerber added the ready-for-ci Triggers CI test workflows label Jun 25, 2026

igerber merged commit a5bbceb into main Jun 25, 2026
25 of 26 checks passed

igerber deleted the test/tutorial24-drift-test branch June 25, 2026 14:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add Tutorial 24 drift guard (staggered-vs-collapsed power claims)#549

test: add Tutorial 24 drift guard (staggered-vs-collapsed power claims)#549
igerber merged 1 commit into
mainfrom
test/tutorial24-drift-test

igerber commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Jun 25, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented Jun 25, 2026

PR Review Report

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented Jun 25, 2026

PR Review Report

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant