feat(knowledge): reclassify run hardening — concurrency, configurable model, outage snooze by mkreyman · Pull Request #211 · mkreyman/loopctl

mkreyman · 2026-06-30T00:09:46Z

Tuning/hardening the reclassification engine for the real 77k run (found via prod verification).

1. Concurrent batch classification

process_batch now classifies a batch concurrently via Task.async_stream (:knowledge_reclassify_max_concurrency, default 10; 30s per-call timeout, on_timeout: :kill_task). Writes stay serial on the worker process so the small BYPASSRLS admin pool is never hit by N concurrent updates. Turns a ~16h sequential 77k pass into ~1–2h.
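The concurrent-classify, serial-write shape described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the module name, `classify_fun`, `write_fun`, and the option keys are hypothetical stand-ins, and the real implementation reads its limits from application config.

```elixir
defmodule ReclassifySketch do
  @moduledoc false

  # classify_fun: hypothetical 1-arity fun returning {:ok, category} | {:error, reason}
  # write_fun:    hypothetical 2-arity fun performing the DB update for one article
  def process_batch(articles, classify_fun, write_fun, opts \\ []) do
    max_concurrency = Keyword.get(opts, :max_concurrency, 10)
    timeout = Keyword.get(opts, :timeout, 30_000)

    # Classification fans out across up to max_concurrency tasks. A task that
    # exceeds the timeout is killed and surfaces as {:exit, :timeout}.
    results =
      articles
      |> Task.async_stream(classify_fun,
        max_concurrency: max_concurrency,
        timeout: timeout,
        on_timeout: :kill_task
      )
      |> Enum.to_list()

    # Results come back in input order (the default), so they can be paired
    # with their articles. Writes run here, one at a time, on the caller.
    articles
    |> Enum.zip(results)
    |> Enum.reduce(%{ok: 0, error: 0}, fn
      {article, {:ok, {:ok, category}}}, acc ->
        write_fun.(article, category)
        %{acc | ok: acc.ok + 1}

      {_article, _error_or_exit}, acc ->
        # Classifier errors and timeouts are counted and left for a later run.
        %{acc | error: acc.error + 1}
    end)
  end
end
```

Because only the classifier calls run inside the tasks, the number of simultaneous database writers stays at one regardless of `max_concurrency`.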

2. Configurable classifier model

:knowledge_classifier_model (ANTHROPIC_CLASSIFIER_MODEL in prod) overrides the classifier model independently of content extraction — so a one-time high-accuracy reclassification can use a stronger model (e.g. Sonnet) without changing extraction. Falls back to the shared provider model (Haiku 4.5).
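The fallback rule can be illustrated with a small pure function. This is a sketch under assumptions: the module and function names are hypothetical, and treating an empty string as "unset" (as an unset environment variable might be passed through) is an illustrative choice rather than something the PR states.

```elixir
defmodule ClassifierModelSketch do
  @moduledoc false

  # override:     value of :knowledge_classifier_model (may be nil or "")
  # shared_model: the shared provider model used by content extraction
  def resolve(override, shared_model) when override in [nil, ""], do: shared_model
  def resolve(override, _shared_model) when is_binary(override), do: override
end
```

With this shape, extraction keeps reading the shared model directly, so setting the override changes only the classifier.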

3. Outage resilience (snooze, never skip)

If a non-empty batch comes back with an error rate ≥ :knowledge_reclassify_snooze_error_rate (default 0.5), i.e. the classifier upstream is unreachable, perform returns {:snooze, seconds} instead of advancing. Oban re-runs the same job (same cursor) later without consuming an attempt or logging an audit event, so an outage pauses the migration and it resumes cleanly when connectivity returns, with no skipped articles. A batch with a few genuine per-article errors (below the rate) still proceeds, and the errored articles are left unchanged. (The migration runs on the Fly host, so an outage at the operator's location never touches it; this covers the Anthropic/egress side.)
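The snooze decision amounts to a threshold check on the batch's error rate. A minimal sketch, assuming a hypothetical `decide/2` helper, a hypothetical `:advance` return value, and an arbitrary 300-second snooze interval (the PR does not state the interval); only the `{:snooze, seconds}` tuple is an Oban return value named in the description:

```elixir
defmodule SnoozeSketch do
  @moduledoc false

  # counts: %{ok: n, error: m} for one processed batch
  def decide(%{ok: ok, error: error}, opts \\ []) do
    threshold = Keyword.get(opts, :snooze_error_rate, 0.5)
    snooze_seconds = Keyword.get(opts, :snooze_seconds, 300)
    total = ok + error

    cond do
      # An empty batch has no error rate; nothing to pause for.
      total == 0 -> :advance
      # At or above the threshold: treat as an outage and retry the same cursor.
      error / total >= threshold -> {:snooze, snooze_seconds}
      # Below the threshold: proceed; errored articles stay unchanged.
      true -> :advance
    end
  end
end
```

A worker's perform callback would return the `{:snooze, seconds}` tuple as-is and move the cursor forward only in the `:advance` case.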

Tests

Concurrent path (Mox $callers covers the spawned tasks); 100%-error batch snoozes and writes/audits nothing; 1-of-3 error batch (below the rate) proceeds and reclassifies the good ones. Full local gate green (3007 tests, credo, dialyzer 0).

…ifier model

Tuning the reclassification engine for the real 77k run:

- process_batch now classifies a batch CONCURRENTLY via Task.async_stream (max_concurrency :knowledge_reclassify_max_concurrency, default 10; 30s per-call timeout, on_timeout: :kill_task -> counted as a processed error and left for a later idempotent run). Writes stay SERIAL on the worker process so the small BYPASSRLS admin pool is never hit by N concurrent updates. At ~10x the per-batch throughput this turns a ~16h sequential 77k pass into ~1-2h.
- The classifier model is now independently configurable via :knowledge_classifier_model (ANTHROPIC_CLASSIFIER_MODEL in prod), falling back to the shared :anthropic_provider model (Haiku 4.5), so a one-time high-accuracy reclassification can use a stronger model WITHOUT changing content extraction.

Mox's $callers mechanism covers the Task.async_stream-spawned classify calls, so the existing reclassify tests pass unchanged. Full local gate green.

… (no skipped articles)

Before: if the classifier upstream (Anthropic, or the host's egress) was unreachable, every classify in a batch errored, the batch still 'succeeded', the cursor ADVANCED, and those articles were silently SKIPPED, leaving a permanent gap until a full re-run.

Now: when a non-empty batch comes back with an error rate >= the configured threshold (:knowledge_reclassify_snooze_error_rate, default 0.5), perform returns {:snooze, seconds} instead of advancing. Oban re-runs the SAME job (same cursor) later WITHOUT consuming a max_attempts slot and WITHOUT logging an audit event, so an outage simply PAUSES the migration and it resumes cleanly when connectivity returns: nothing skipped, no audit spam. A few genuine per-article errors (below the rate) still proceed and are left unchanged (recoverable by an idempotent re-run).

Note: the migration executes on the Fly host, so an outage at the operator's location never touches it once kicked; this covers the Anthropic/egress-side case.

Tests: a 100%-error batch snoozes and writes/audits nothing; a 1-of-3 error batch (below the rate) proceeds, reclassifying the two good articles and leaving the errored one alone. Full local gate green.

… playbook-lean

A prod sample dry-run showed the classifier pulling principles/essays into 'playbook' (9 of 15 sampled proposals). Tightened the discriminators:

- Categories definitions (shared by the classifier AND extractor prompts): playbook now requires EXPLICIT, ORDERED STEPS (not a tip/principle/shape); pattern is the recurring shape vs a procedure; insight is the 'why'/principle not a how-to; idea is a thing to build or try.
- Classifier prompt gains a most-specific-fit tie-breaker that explicitly steers principles -> insight, shapes -> pattern, build/try -> idea, and tells it not to over-use playbook.

To be re-validated against the prod sample before the full 77k commit.

mkreyman added 3 commits June 29, 2026 18:02

mkreyman merged commit 0694142 into master Jun 30, 2026
9 checks passed

mkreyman deleted the reclassify-concurrent-batches branch June 30, 2026 00:19

mkreyman merged 3 commits into master from reclassify-concurrent-batches
