Skip to content

test: add Tutorial 24 drift guard (staggered-vs-collapsed power claims)#549

Merged
igerber merged 1 commit into
mainfrom
test/tutorial24-drift-test
Jun 25, 2026
Merged

test: add Tutorial 24 drift guard (staggered-vs-collapsed power claims)#549
igerber merged 1 commit into
mainfrom
test/tutorial24-drift-test

Conversation

@igerber

@igerber igerber commented Jun 25, 2026

Copy link
Copy Markdown
Owner

Summary

  • Add a drift guard for Tutorial 24 (staggered-rollout vs. collapsed-2×2 power decision guide), pinning its two load-bearing claims against estimator-default / simulation drift:
    1. Monotonic dilution, fast → slow — the collapsed 2×2 reports a monotonically shrinking share of the truth (93.5% / 80.9% / 61.8%) and its CI coverage of the effect-on-treated collapses, while Callaway–Sant'Anna stays near nominal.
    2. CS-vs-2×2 MDE crossover / near-parity at slow rollout — the 2×2's MDE climbs (~0.37 → ~0.60) while CS's barely moves (~0.55), so the power gap closes to parity.
  • Remove the now-resolved Testing/Docs row from TODO.md.

Because nbsphinx_execute = "never", the committed notebook outputs are what RTD renders, so the prose can silently drift from the live library. These asserts re-derive the load-bearing numbers from the same public generator (generate_staggered_data) + estimators the tutorial uses and check them against the committed surface.

Methodology references

  • Method name(s): N/A — test-only. Exercises existing DifferenceInDifferences (collapsed 2×2) and CallawaySantAnna(control_group="never_treated") via tutorial-drift checks; no estimator/math/source changes.
  • Paper / source link(s): Callaway & Sant'Anna (2021) — the tutorial's subject; no new methodology introduced.
  • Intentional deviations from the source: None.

Validation

  • Tests added: tests/test_t24_staggered_vs_collapsed_power_drift.py (9 tests). Structure mirrors the T25 split:
    • Deterministic (unmarked, run in every CI leg incl. pure-Python): panel composition; §1 estimand gap; monotonic dilution (structural E2/E1, exact); a rendered-surface quote cross-check (19 committed numbers); a notebook-kwargs sync guard.
    • Monte Carlo (@pytest.mark.slow, Rust legs -m '' only — off the ~1h pure-Python budget): dilution coverage collapse; MDE crossover; flat-vs-growing estimand targeting. These assert robust orderings with wide margins, calibrated against real reduced-sim runs (not flaky exact pins).
  • Re-derivation reproduces the notebook's committed numbers exactly (E1=2.61, E2=2.10, 2×2=2.17, CS=2.67, 81%; dilution 93.5 / 80.9 / 61.8%).
  • All 9 tests pass locally (-m '', 6.6s); black --check and ruff clean. Local codex review: ✅ no P0/P1/P2 findings.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

@github-actions

Copy link
Copy Markdown

PR Review Report

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The PR is test-only for estimator behavior and does not introduce methodology changes. One P2 release-metadata regression should be cleaned up, but it is not a statistical correctness blocker.

Executive Summary

  • The affected methods are DifferenceInDifferences collapsed 2×2 and CallawaySantAnna(control_group="never_treated").
  • No estimator math, weighting, variance/SE, identification assumptions, or defaults are changed.
  • The new drift tests align with the registry’s DiD and CS targets and use public generate_staggered_data.
  • The removed TODO row is plausibly addressed by the new T24 drift guard.
  • P2: version metadata is downgraded from 3.5.3 to 3.5.2 across package, Rust, citation, and guide files.
  • I could not execute tests in this sandbox because pytest/numpy are not installed.

Methodology

No P0/P1 findings.

Affected methods:

  • DifferenceInDifferences, used for the collapsed 2×2 fit at tests/test_t24_staggered_vs_collapsed_power_drift.py:L93-L102.
  • CallawaySantAnna(control_group="never_treated"), used at tests/test_t24_staggered_vs_collapsed_power_drift.py:L105-L115.

Registry cross-check:

  • DiD interaction ATT and optional clustered SEs match docs/methodology/REGISTRY.md:L60-L70.
  • CS never-treated comparison, group-time ATT target, aggregation, and IF inference match docs/methodology/REGISTRY.md:L443-L499.
  • generate_staggered_data exposes first_treat, treated, treat, and true_effect, matching the test’s structural estimand calculations at diff_diff/prep_dgp.py:L179-L189 and diff_diff/prep_dgp.py:L299-L320.

Severity: None
Impact: No undocumented methodology deviation found.
Concrete fix: None.

Code Quality

Finding: Version metadata regression
Severity: P2
Impact: This test-only PR changes public/package metadata back to 3.5.2 while the diff shows the base was 3.5.3. A merged package would report the older version via Python metadata, diff_diff.__version__, Rust crate metadata, citation metadata, and bundled LLM guide text. See pyproject.toml:L5-L8, diff_diff/__init__.py:L301-L302, rust/Cargo.toml:L1-L3, CITATION.cff:L10-L11, and diff_diff/guides/llms-full.txt:L1-L5.
Concrete fix: Revert the version/date/link downgrades, or update all release metadata consistently according to the project’s release process. If 3.5.3 is already released on the base branch, restore the 3.5.3 values and changelog compare link.

Performance

No findings.

Severity: None
Impact: The heavier Monte Carlo tests are marked slow, consistent with pyproject.toml:L112-L116.
Concrete fix: None.

Maintainability

No additional findings beyond the P2 version-metadata regression above.

Severity: None
Impact: The new helper structure is localized to one drift-test file and reuses existing tutorial drift helpers.
Concrete fix: None.

Tech Debt

No findings.

Severity: None
Impact: Removing the T24 TODO row is acceptable because this PR adds a targeted drift guard for the named tutorial claims at tests/test_t24_staggered_vs_collapsed_power_drift.py:L1-L35.
Concrete fix: None.

Security

No findings.

Severity: None
Impact: No secrets, subprocess execution, network access, or unsafe deserialization introduced. Notebook JSON parsing is local-file only.
Concrete fix: None.

Documentation/Tests

No blocker findings.

Severity: P3 informational
Impact: I could not run the tests in this sandbox because pytest and numpy are unavailable, so this review is static plus direct notebook-surface inspection. The rendered quote strings checked by the test are present in the committed notebook outputs/prose.
Concrete fix: Ensure CI runs tests/test_t24_staggered_vs_collapsed_power_drift.py in both default and slow-test legs as intended.

Pins the two load-bearing quantitative claims in
docs/tutorials/24_staggered_vs_collapsed_power.ipynb against
estimator-default / simulation drift, closing the deferred Testing/Docs
TODO row (branch staggered-analysis-2x2):

1. Monotonic dilution fast -> slow: the collapsed-2x2 reports a
   monotonically shrinking share of the truth (93.5% / 80.9% / 61.8%) and
   its CI coverage of the effect-on-treated collapses, while CS stays near
   nominal. Pinned deterministically (estimands are means of the noise-free
   true_effect column) so it runs in every CI leg.
2. CS-vs-2x2 MDE crossover / near-parity at slow rollout: the 2x2's MDE
   climbs (~0.37 -> ~0.60) while CS's barely moves (~0.55) so the power gap
   closes to parity. Pinned as robust orderings (the exact reversal is
   simulation-sensitive, per the prose).

Structure mirrors the T25 split: deterministic structural pins + a
rendered-surface quote cross-check + a notebook-kwargs sync guard run
unmarked; the Monte Carlo sweeps (coverage collapse, MDE crossover,
flat-vs-growing estimand targeting) are @pytest.mark.slow so they stay off
the pure-Python budget and run in the Rust legs (-m '') at full count.

Removes the resolved row from TODO.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the test/tutorial24-drift-test branch from a38113b to 805e8c0 Compare June 25, 2026 12:32
@github-actions

Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 805e8c0f2cded5666ed457095e86188af65a158f


PR Review Report

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. This PR is test-only and removes a TODO row that the new drift guard directly addresses.

Executive Summary

  • Affected methods exercised by the tests: DifferenceInDifferences collapsed 2x2 and CallawaySantAnna(control_group="never_treated").
  • No estimator math, weighting, variance/SE, assumptions, or defaults are changed.
  • The test’s DGP field usage matches generate_staggered_data: treated is observation-level treatment and treat is unit-level ever-treated.
  • Registry cross-check found no undocumented methodology deviation.
  • Prior P2 metadata regression is resolved; version surfaces currently remain 3.5.3.
  • I could not run pytest because pytest is not installed in this sandbox.

Methodology

No findings.

Severity: None
Impact: The new tests exercise existing behavior only. The DiD call at tests/test_t24_staggered_vs_collapsed_power_drift.py:L93-L102 matches the registry’s 2x2 interaction ATT with optional clustered SEs. The CS call at tests/test_t24_staggered_vs_collapsed_power_drift.py:L105-L115 matches the registry’s never-treated comparison and overall aggregation contract.
Concrete fix: None.

Code Quality

No findings.

Severity: None
Impact: The prior version-metadata downgrade finding is no longer present in this re-review diff; current metadata remains 3.5.3 in pyproject.toml, diff_diff/__init__.py, rust/Cargo.toml, and CITATION.cff. No new inline inference anti-patterns or public-parameter propagation issues were introduced.
Concrete fix: None.

Performance

No findings.

Severity: None
Impact: Monte Carlo checks are marked @pytest.mark.slow at tests/test_t24_staggered_vs_collapsed_power_drift.py:L295, L325, and L352, consistent with the project’s default slow-test exclusion.
Concrete fix: None.

Maintainability

No findings.

Severity: None
Impact: The test helper structure is local, readable, and reuses existing tutorial drift helpers at tests/test_t24_staggered_vs_collapsed_power_drift.py:L50.
Concrete fix: None.

Tech Debt

No findings.

Severity: None
Impact: Removing the Tutorial 24 TODO row is justified by the new coverage for the named drift claims at tests/test_t24_staggered_vs_collapsed_power_drift.py:L1-L35.
Concrete fix: None.

Security

No findings.

Severity: None
Impact: No secrets, network access, subprocess execution, or unsafe deserialization paths are introduced. Notebook parsing is local JSON reading through existing helpers.
Concrete fix: None.

Documentation/Tests

P3 informational.

Severity: P3
Impact: I could not execute the test file because pytest is unavailable in this sandbox. Static checks confirmed the notebook contains the pinned rendered values and locked config strings referenced by test_rendered_surface_quotes() and test_notebook_kwargs_match().
Concrete fix: Ensure CI runs tests/test_t24_staggered_vs_collapsed_power_drift.py in both default and slow-test legs as intended.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Jun 25, 2026
@igerber igerber merged commit a5bbceb into main Jun 25, 2026
25 of 26 checks passed
@igerber igerber deleted the test/tutorial24-drift-test branch June 25, 2026 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant