feat(knowledge): nightly KnowledgeLintWorker — scheduled lint + orphan self-heal#204

Merged
mkreyman merged 2 commits into master from knowledge-lint-worker on Jun 29, 2026

@mkreyman

Owner

What

Adds Loopctl.Workers.KnowledgeLintWorker — the scheduled "nightly refinement" loop for the knowledge wiki. The lint engine (Knowledge.lint/2) already existed but was only ever invoked on-demand via the API. This wires it onto a cron and takes a safe, automated repair action on the one finding that can be auto-fixed: orphans.

Both the Karpathy llm-wiki pattern (lint-and-act) and the Dan Martell second-brain workflow (nightly Claude refinement) independently converge on this loop. The platform had the engine; it lacked the orchestration.

Why now

The second-brain corpus is at ~77k articles. Lint runs clean except for ~1.9% orphan articles (zero inbound/outbound links). Nothing was scheduled to act on them, and entropy (orphans, drift) is invisible under brute-force search.

How it works

  • :knowledge queue, max_attempts: 3, unique on {worker, args} within 60s.
  • all_tenants fan-out mirroring ComputeSthWorker — lint is inherently per-tenant, so each active tenant lints in its own job (independent retries, no cross-tenant coupling). Suspended/inactive tenants are skipped.
  • Per tenant: run Knowledge.lint/2, then re-enqueue the proven ArticleLinkingWorker for each orphan. Re-linking re-runs the pgvector similarity pass against the current corpus — so an article orphaned in January can find neighbors ingested months later. Deterministic, and makes no embedding-API calls (orphans missing an embedding simply no-op in the linking worker; backfilling those is a separate concern).
  • Surfaces all findings via an immutable knowledge.lint_completed audit event carrying the full lint summary. Contradictions / coverage gaps / broken sources / stale counts need human judgment, so they are made observable in the change feed rather than auto-repaired.
  • Scale-bounded: orphan re-link enqueues capped by :knowledge_lint_max_orphan_relink (default 500). When the true orphan count exceeds the cap, the gap is logged, never silently dropped — the remainder retries next run.
  • Scheduled nightly at 04:00, after the 02:00–03:00 maintenance window.
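
The scale bound described above can be sketched in plain Elixir. This is only an illustration under stated assumptions: the module name `OrphanRelinkCap`, the function `cap/2`, and the returned map shape are hypothetical and do not come from the PR. Only the default of 500 and the "log the gap, never silently drop it" behaviour are taken from the description. How the worker actually reads `:knowledge_lint_max_orphan_relink` and enqueues `ArticleLinkingWorker` jobs is not shown here.

```elixir
defmodule OrphanRelinkCap do
  @moduledoc """
  Hypothetical helper (not from the PR): splits a list of orphan article IDs
  into the batch to re-link now and a count deferred to the next run.
  """

  require Logger

  # Mirrors the documented default of :knowledge_lint_max_orphan_relink.
  @default_cap 500

  def cap(orphan_ids, max \\ @default_cap)
      when is_list(orphan_ids) and is_integer(max) and max >= 0 do
    # Take at most `max` orphans; the rest are left for the next nightly run.
    {batch, rest} = Enum.split(orphan_ids, max)
    deferred = length(rest)

    # Log the gap so the overflow is visible rather than silently dropped.
    if deferred > 0 do
      Logger.warning(
        "orphan re-link capped: enqueuing #{length(batch)}, deferring #{deferred} to the next run"
      )
    end

    %{to_enqueue: batch, deferred: deferred}
  end
end

# Usage: three orphans with a cap of two leaves one for the next run.
%{to_enqueue: [1, 2], deferred: 1} = OrphanRelinkCap.cap([1, 2, 3], 2)
```

Returning the deferred count alongside the batch would let a caller include it in the lint summary as well as in the log, although whether the real worker does so is not stated in this PR.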

Tests

test/loopctl/workers/knowledge_lint_worker_test.exs (async, 5 cases):

  • per-tenant lint emits the knowledge.lint_completed audit event with the summary
  • orphan re-linking outcome (two similar orphans → a relates_to link is created; orphans_relinked == 2)
  • empty tenant (0 articles, 0 relinks)
  • all_tenants fan-out lints both active tenants and skips a suspended one
  • tenant isolation (lint summary counts only the caller tenant's articles)

Verification

Full pre-commit gate green locally: compile (--warnings-as-errors), mix format, credo --strict, dialyzer (0 errors), and the full suite (2977 tests, 0 failures).

Notes / follow-ups (not in this PR)

  • Orphans that lack an embedding are not re-linkable here; an embedding-backfill pass is a separate gap.
  • A genuine missing capability surfaced during review: ripple-on-ingest (updating neighbor article bodies when a new source extends/contradicts them) — ingest currently only creates ArticleLink edges. Tracked separately.

…als orphans

The knowledge-wiki lint engine (Knowledge.lint/2) already existed but was only
ever invoked on-demand via the API. Add the scheduled "nightly refinement" loop
that both the Karpathy llm-wiki pattern and the Dan Martell second-brain workflow
converge on: run lint on a cron and take a safe automated action on the one
finding that can be auto-repaired.

- KnowledgeLintWorker (:knowledge queue) with an all_tenants fan-out mirroring
  ComputeSthWorker -- lint is per-tenant, so each tenant lints in its own job.
- Per tenant: run Knowledge.lint/2, then re-enqueue ArticleLinkingWorker for each
  orphan (published article with zero links). Re-linking re-runs the proven
  pgvector similarity pass against the CURRENT corpus, so an article orphaned in
  January can find neighbors ingested months later. Deterministic, no embedding-
  API cost.
- Emit an immutable knowledge.lint_completed audit event carrying the full lint
  summary so contradictions / coverage gaps / broken sources / stale counts are
  observable in the change feed (those need human judgment, so they are surfaced,
  not auto-repaired).
- Scale-bounded: orphan re-link enqueues capped by :knowledge_lint_max_orphan_relink
  (default 500); when the true orphan count exceeds the cap the gap is logged,
  never silently dropped.
- Scheduled nightly at 04:00 (after the 02:00-03:00 maintenance window).

Tests cover per-tenant lint + audit, orphan re-linking outcome, empty-tenant,
all_tenants fan-out (skips suspended tenants), and tenant isolation.
@mkreyman mkreyman merged commit 4ca782e into master Jun 29, 2026
8 checks passed
@mkreyman mkreyman deleted the knowledge-lint-worker branch June 29, 2026 21:24