feat(knowledge): expand category taxonomy + single source of truth (P3 stage 1) #206

Merged: mkreyman merged 1 commit into master from knowledge-taxonomy-expansion on Jun 29, 2026
@mkreyman


What

P3 stage 1 of the second-brain taxonomy work: expand the category set and centralize it as a single source of truth.

The corpus only had 5 categories (pattern/convention/decision/finding/reference). This adds the richer taxonomy you approved and — just as importantly — kills the 6-way duplication of the category list that was silently drifting.

Taxonomy

New module Loopctl.Knowledge.Categories is canonical:

  • active/0: pattern, decision, finding, reference, playbook, insight, entity, idea, quote, question (the 4 kept + 6 new). What new content is classified as.
  • retired/0: convention. Still DB-valid (the category column is a plain string and ~13% of the corpus is convention; dropping the enum value before the reclassification backfill would make those rows fail to load), but not offered to new content.
  • definition/1 + prompt_fragment/0: per-category definitions injected into the LLM extraction prompts so the model classifies into the full taxonomy correctly (e.g. playbook = the steps vs pattern = the shape).
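
For orientation, here is a minimal sketch of what a module with this shape could look like. The function names and the category lists come from the description above; everything else is an assumption, in particular that categories are atoms (as Ecto.Enum values usually are) and the wording of each definition, which is placeholder text rather than the text shipped in the PR.

```elixir
defmodule Loopctl.Knowledge.Categories do
  @moduledoc """
  Sketch of a canonical category list. Function names follow the PR text;
  bodies and definition wording are assumptions.
  """

  # Categories offered to new content (the 4 kept + 6 new).
  @active [
    :pattern, :decision, :finding, :reference, :playbook,
    :insight, :entity, :idea, :quote, :question
  ]

  # Still valid for existing rows, never offered to new content.
  @retired [:convention]

  # Placeholder wording: only the playbook/pattern contrast is taken from the PR.
  @definitions %{
    pattern: "the shape of a recurring solution",
    decision: "a choice that was made, with its rationale",
    finding: "something observed or measured",
    reference: "a pointer to external material",
    playbook: "the ordered steps to carry something out",
    insight: "a non-obvious takeaway",
    entity: "a named person, system, or organization",
    idea: "a proposal not yet acted on",
    quote: "a verbatim statement worth keeping",
    question: "an open question"
  }

  def active, do: @active
  def retired, do: @retired

  # Everything the database may contain: active first, then retired.
  def all, do: @active ++ @retired

  # String form, for gating raw LLM output before it is cast to an atom.
  def all_strings, do: Enum.map(all(), &Atom.to_string/1)

  # Definitions exist only for active categories.
  def definition(category) when category in @active do
    Map.fetch!(@definitions, category)
  end

  # One line per active category; retired categories are never offered.
  def prompt_fragment do
    Enum.map_join(@active, "\n", fn category ->
      "- #{category}: #{definition(category)}"
    end)
  end
end
```

Keeping retired/0 separate from active/0 lets the prompt offer only the active set while the validation gate still accepts legacy values.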

Wired through (drift eliminated)

| Before (own copy of the list) | After |
| --- | --- |
| Article.@category_values | Categories.all() (Ecto.Enum can't drift) |
| ClaudeContentExtractor prompt + @valid_categories | prompt rebuilt from Categories at compile time; gate → Categories.all_strings() |
| LlmExtractor prompt + @valid_categories | same |
| ReviewKnowledgeWorker.@valid_categories | Categories.all() |
| ContentIngestionWorker.@valid_categories | Categories.all() |
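
As an illustration of the "After" column, the sketch below shows a consumer filling its module attribute from the canonical module at compile time instead of carrying its own literal list. CategoriesStub and ExtractorGateSketch are hypothetical names used to keep the example self-contained; they are not the modules in the PR.

```elixir
# Hypothetical stand-in for Loopctl.Knowledge.Categories.
defmodule CategoriesStub do
  def all do
    [:pattern, :decision, :finding, :reference, :playbook,
     :insight, :entity, :idea, :quote, :question, :convention]
  end

  def all_strings, do: Enum.map(all(), &Atom.to_string/1)
end

# Hypothetical consumer: the attribute is evaluated once, when this module
# compiles, so it cannot diverge from the canonical list.
defmodule ExtractorGateSketch do
  @valid_categories CategoriesStub.all_strings()

  # Lenient gate: accepts active and retired categories, rejects anything else.
  def valid_category?(value) when is_binary(value), do: value in @valid_categories
  def valid_category?(_other), do: false
end
```

With this shape, adding a category to the canonical module and recompiling is enough to update every gate.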

scale_seed.ex is intentionally left alone — it's synthetic scale-test data whose category round-robin feeds index-selectivity gates; changing its list length is orthogonal to the product taxonomy and would risk scale-gate flakiness.

No DB migration: category is a :string column, so new enum values need no schema change.

Tests

  • New categories_test.exs: composition (all = active ++ retired, disjoint), the expanded active set, string helpers, definitions cover every active category, prompt fragment includes all active / excludes retired, and an enum-vs-canonical drift guard (Ecto.Enum.values(Article, :category) == Categories.all()).
  • article_test.exs: changeset accepts every new category.
  • knowledge_test.exs + knowledge_stats_controller_test.exs: the dense by_category stats maps grew from 5→11 keys (correct — stats derives the dense map from the enum). Updated the expectations to derive from Categories rather than re-hardcode, so a future taxonomy change can't silently break them again.
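
To make the 5→11 change concrete, here is a rough sketch of how a dense by_category map can be derived from the category list, so that every category gets a key even when its count is zero. DenseStatsSketch and its by_category/1 are hypothetical; the real stats code and its input shape are not shown in this PR.

```elixir
defmodule DenseStatsSketch do
  # Hypothetical stand-in for Categories.all/0 (10 active + 1 retired).
  @all [:pattern, :decision, :finding, :reference, :playbook,
        :insight, :entity, :idea, :quote, :question, :convention]

  def all, do: @all

  # Takes sparse counts (only categories that have rows) and returns a map
  # with one key per known category, defaulting missing ones to 0.
  def by_category(sparse_counts) when is_map(sparse_counts) do
    Map.new(@all, fn category ->
      {category, Map.get(sparse_counts, category, 0)}
    end)
  end
end
```

A test that builds its expected map from the same list, rather than from a hardcoded literal, keeps passing when the taxonomy grows.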

Full local gate green: format, credo --strict, dialyzer (0 errors), 2987 tests / 0 failures.

Next P3 stages (separate PRs)

  1. CategoryClassifier behaviour + config-DI (Claude-backed) + Mox.
  2. Resumable, cost-ceilinged, dry-run-first backfill worker over the 77k corpus: write-on-confident-change, always reclassify convention; emits a proposed-distribution report before committing.
  3. Cross-repo: synth.py / publish.py category enum (claude-config) + user CLAUDE.md category list.
  4. After the backfill drains convention to 0, a follow-up drops it from retired/0 (and therefore from all/0 and the enum).

…ce of truth

P3 stage 1 of the second-brain taxonomy work. Adds the richer category set and,
crucially, ends the 6-way duplication of the category list that was silently
drifting (Article schema, both LLM extractor prompts + their @valid_categories
guards, and two ingestion workers each carried their own copy).

New module Loopctl.Knowledge.Categories is the canonical source:
- active/0  -> pattern, decision, finding, reference, playbook, insight, entity,
  idea, quote, question (the 4 kept + 6 new). What new content is classified as.
- retired/0 -> convention. Still DB-valid (the category column is a plain string
  and ~13% of the corpus is convention; dropping the enum value before the 77k
  reclassification backfill would make those rows fail to load), but not offered
  to new content.
- definition/1 + prompt_fragment/0 -> per-category definitions injected into the
  extraction prompts so the model classifies into the full taxonomy correctly
  (e.g. playbook = the steps vs pattern = the shape).

Wired through:
- Article.@category_values  -> Categories.all() (Ecto.Enum now can't drift)
- ClaudeContentExtractor / LlmExtractor: prompt rebuilt from Categories at compile
  time; @valid_categories -> Categories.all_strings() (lenient gate; prompt only
  offers active categories)
- ReviewKnowledgeWorker / ContentIngestionWorker: @valid_categories -> Categories.all()

scale_seed.ex is intentionally left alone: it is synthetic scale-test data whose
category round-robin feeds index-selectivity gates; changing its list length is
orthogonal to the product taxonomy and risks scale-gate flakiness.

No DB migration: category is a :string column, so new enum values need no schema
change.

Tests: Categories unit (composition, string helpers, definitions, prompt fragment,
and an enum-vs-canonical drift guard via Ecto.Enum.values/2) + Article changeset
accepts every new category. Full local gate green (format, credo --strict,
dialyzer 0 errors).

Next P3 stages (separate PRs): CategoryClassifier behaviour + DI; resumable,
cost-ceilinged, dry-run-first backfill worker over the 77k corpus
(write-on-confident-change, always reclassify convention); cross-repo synth.py /
publish.py + user CLAUDE.md category list.

mkreyman merged commit 41fdcbd into master on Jun 29, 2026
9 checks passed
mkreyman deleted the knowledge-taxonomy-expansion branch on June 29, 2026 21:46