
feat(knowledge): reclassify run hardening — concurrency, configurable model, outage snooze#211

Merged
mkreyman merged 3 commits into master from reclassify-concurrent-batches on Jun 30, 2026


@mkreyman

Owner

Tuning/hardening the reclassification engine for the real 77k run (found via prod verification).

1. Concurrent batch classification

process_batch now classifies a batch concurrently via Task.async_stream (:knowledge_reclassify_max_concurrency, default 10; 30s per-call timeout, on_timeout: :kill_task). Writes stay serial on the worker process so the small BYPASSRLS admin pool is never hit by N concurrent updates. Turns a ~16h sequential 77k pass into ~1–2h.
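The shape described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the module name `ReclassifySketch`, the stub `classify/1`, the `write_fun` callback, and the returned counts map are all hypothetical. Only `Task.async_stream/3` and its `:max_concurrency`, `:timeout`, and `:on_timeout` options are real standard-library API.

```elixir
defmodule ReclassifySketch do
  # Hypothetical stand-in for the real classifier call (a network request in practice).
  def classify(%{body: ""}), do: {:error, :empty}
  def classify(%{body: _body}), do: {:ok, :insight}

  # Classify a batch concurrently, then apply writes one at a time on the
  # calling process. Returns a map of counts (an assumed return shape).
  def process_batch(articles, write_fun, opts \\ []) do
    max_concurrency = Keyword.get(opts, :max_concurrency, 10)
    timeout = Keyword.get(opts, :timeout, 30_000)
    classify = Keyword.get(opts, :classify, &classify/1)

    articles
    |> Task.async_stream(classify,
      max_concurrency: max_concurrency,
      timeout: timeout,
      on_timeout: :kill_task
    )
    # Results come back in input order by default, so they can be paired
    # with the articles that produced them.
    |> Enum.zip(articles)
    |> Enum.reduce(%{ok: 0, error: 0}, fn
      {{:ok, {:ok, category}}, article}, acc ->
        # The reduce runs in the caller, so writes are serial.
        write_fun.(article, category)
        %{acc | ok: acc.ok + 1}

      {{:ok, {:error, _reason}}, _article}, acc ->
        %{acc | error: acc.error + 1}

      {{:exit, _reason}, _article}, acc ->
        # A killed (timed-out) task shows up as {:exit, :timeout}; it is
        # counted as an error and the article is left untouched.
        %{acc | error: acc.error + 1}
    end)
  end
end
```

With the default concurrency of 10, an ideal 10x speedup takes a 16h pass to about 1.6h, which is consistent with the ~1–2h estimate.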

2. Configurable classifier model

:knowledge_classifier_model (ANTHROPIC_CLASSIFIER_MODEL in prod) overrides the classifier model independently of content extraction — so a one-time high-accuracy reclassification can use a stronger model (e.g. Sonnet) without changing extraction. Falls back to the shared provider model (Haiku 4.5).
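The fallback could look something like the sketch below. The config keys `:knowledge_classifier_model` and `:anthropic_provider` come from this PR; the app name `:my_app`, the module name, the assumption that `:anthropic_provider` is a keyword list with a `:model` entry, and the placeholder model strings in the test data are all hypothetical.

```elixir
defmodule ClassifierModelSketch do
  # In prod the override would presumably be wired in config/runtime.exs, e.g.
  #   config :my_app, knowledge_classifier_model: System.get_env("ANTHROPIC_CLASSIFIER_MODEL")
  # (:my_app is an assumed application name.)

  # Returns the classifier-specific model when set, otherwise the shared one.
  def classifier_model(app \\ :my_app) do
    case Application.get_env(app, :knowledge_classifier_model) do
      model when is_binary(model) and model != "" -> model
      # nil or empty string: fall back to the shared provider model.
      _unset -> shared_provider_model(app)
    end
  end

  # Assumes the shared provider config is a keyword list with a :model key.
  defp shared_provider_model(app) do
    app
    |> Application.get_env(:anthropic_provider, [])
    |> Keyword.get(:model)
  end
end
```

Treating an empty string the same as an unset value is a deliberate choice in this sketch, since an environment variable that is exported but blank would otherwise select no model at all.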

3. Outage resilience (snooze, never skip)

If a batch comes back with an error rate ≥ :knowledge_reclassify_snooze_error_rate (default 0.5), which indicates the classifier upstream is unreachable, perform returns {:snooze, seconds} instead of advancing the cursor. Oban re-runs the same job (same cursor) later without consuming an attempt or logging an audit event, so an outage pauses the migration and it resumes cleanly with no skipped articles. (The migration runs on the Fly host, so an outage at the operator's location never touches it; this covers the Anthropic/egress side.)
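The snooze decision reduces to a small pure function, sketched here under stated assumptions: the module name, the `%{ok: _, error: _}` counts map, the `:advance` return value, and the 300-second snooze interval are hypothetical. The 0.5 default threshold and the `{:snooze, seconds}` return shape (which an Oban worker's `perform/1` may return) are from the PR text.

```elixir
defmodule SnoozeSketch do
  @default_rate 0.5
  # Assumed interval; the PR does not state the real value.
  @default_snooze_seconds 300

  # Decide whether the worker should advance the cursor or snooze the job.
  def decide(%{ok: ok, error: error}, opts \\ []) do
    threshold = Keyword.get(opts, :snooze_error_rate, @default_rate)
    seconds = Keyword.get(opts, :snooze_seconds, @default_snooze_seconds)
    total = ok + error

    cond do
      # An empty batch has no error rate; nothing to snooze on.
      total == 0 -> :advance
      # Error rate at or above the threshold: treat as an upstream outage.
      error / total >= threshold -> {:snooze, seconds}
      # A few genuine per-article errors: proceed as normal.
      true -> :advance
    end
  end
end
```

For the two test cases in this PR: a 3-of-3 error batch has rate 1.0 and snoozes, while a 1-of-3 error batch has rate of about 0.33, below 0.5, and proceeds.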

Tests

Concurrent path (Mox $callers covers the spawned tasks); 100%-error batch snoozes and writes/audits nothing; 1-of-3 error batch (below the rate) proceeds and reclassifies the good ones. Full local gate green (3007 tests, credo, dialyzer 0).

mkreyman added 3 commits June 29, 2026 18:02
…ifier model

Tuning the reclassification engine for the real 77k run:

- process_batch now classifies a batch CONCURRENTLY via Task.async_stream
  (max_concurrency :knowledge_reclassify_max_concurrency, default 10; 30s
  per-call timeout, on_timeout: :kill_task -> counted as a processed error and
  left for a later idempotent run). Writes stay SERIAL on the worker process so
  the small BYPASSRLS admin pool is never hit by N concurrent updates. At ~10x
  the per-batch throughput this turns a ~16h sequential 77k pass into ~1-2h.

- The classifier model is now independently configurable via
  :knowledge_classifier_model (ANTHROPIC_CLASSIFIER_MODEL in prod), falling back
  to the shared :anthropic_provider model (Haiku 4.5), so a one-time high-accuracy
  reclassification can use a stronger model WITHOUT changing content extraction.

Mox's $callers mechanism covers the Task.async_stream-spawned classify calls, so
the existing reclassify tests pass unchanged. Full local gate green.
… (no skipped articles)

Before: if the classifier upstream (Anthropic, or the host's egress) was
unreachable, every classify in a batch errored, the batch still 'succeeded', the
cursor ADVANCED, and those articles were silently SKIPPED — a permanent gap until
a full re-run.

Now: when a non-empty batch comes back with an error rate >= the configured
threshold (:knowledge_reclassify_snooze_error_rate, default 0.5), perform returns
{:snooze, seconds} instead of advancing. Oban re-runs the SAME job (same cursor)
later WITHOUT consuming a max_attempts slot and WITHOUT logging an audit event, so
an outage simply PAUSES the migration and it resumes cleanly when connectivity
returns — nothing skipped, no audit spam. A few genuine per-article errors (below
the rate) still proceed and are left unchanged (recoverable by an idempotent
re-run).

Note: the migration executes on the Fly host, so an outage at the operator's
location never touches it once kicked off; this covers the Anthropic/egress-side case.

Tests: a 100%-error batch snoozes and writes/audits nothing; a 1-of-3 error batch
(below the rate) proceeds, reclassifying the two good articles and leaving the
errored one alone. Full local gate green.
… playbook-lean

A prod sample dry-run showed the classifier pulling principles/essays into
'playbook' (9 of 15 sampled proposals). Tightened the discriminators:

- Category definitions (shared by the classifier AND extractor prompts):
  playbook now requires EXPLICIT, ORDERED STEPS (not a tip/principle/shape);
  pattern is the recurring shape vs a procedure; insight is the 'why'/principle
  not a how-to; idea is a thing to build or try.
- Classifier prompt gains a most-specific-fit tie-breaker that explicitly steers
  principles -> insight, shapes -> pattern, build/try -> idea, and tells it not to
  over-use playbook.

To be re-validated against the prod sample before the full 77k commit.
mkreyman merged commit 0694142 into master on Jun 30, 2026
9 checks passed
mkreyman deleted the reclassify-concurrent-batches branch on June 30, 2026 00:19