Forge is a lean software delivery framework for AI agents — from inception to production.
Where other agent frameworks hand you a coding assistant, Forge gives you an autonomous engineering organization: seven specialized agents, a shared delivery state machine, and a strict ATDD-first process built on battle-tested XP and Lean delivery practices.
Your agents don't just write code. They run discovery workshops, write INVEST-compliant user stories, make architecture decisions, implement vertical slices test-first, desk check each acceptance criterion, manage feature flags, and ship — continuously, to production, without you babysitting them.
Most agent frameworks start at "write code". Forge starts at "what are we building and why".
Every story in Forge traces back to a customer pain identified in an empathy map. Every acceptance criterion is testable through the UI alone — no backdoors, no database queries, no internal system access. Every feature ships behind a feature flag on trunk. Nothing goes to production that a QA agent and a PO agent haven't independently verified against the original acceptance criteria.
This is how high-performing software teams actually work. Forge encodes it as agent skills.
| Agent | Owns |
|---|---|
| po-agent | Inception, story writing, backlog management, story acceptance |
| ux-agent | Empathy mapping, UX specs, frontend acceptance criteria |
| architect-agent | Architecture Decision Records, service boundaries, tech debt |
| developer-agent | ATDD loops, TDD inner loops, contract tests, feature flag code |
| qa-agent | Acceptance test authoring, desk checks, regression suite |
| devops-agent | CI/CD, environments, Unleash feature flags, deployments |
| secops-agent | Threat modeling, security ACs, SAST/DAST pipeline gates |
Each agent has a defined role, a default skill set, and a strict boundary. The developer agent doesn't make architecture decisions. The architect agent doesn't write production code. Roles are enforced, not suggested.
The po-agent and ux-agent facilitate a structured discovery process:
- Lean Canvas — problem, solution, unique value proposition
- Empathy Mapping — understand the customer before writing a single requirement
- Trade-off Sliders — team-wide prioritization: quality vs. cost vs. UX vs. security (no ties allowed)
- Event Storming — interactive back-and-forth conversation mapping domain events, commands, policies, and UI elements. UI stickies become user stories. Policies and commands become acceptance criteria. The final phase produces CONTEXT.md — the project's ubiquitous language.
- Iteration Mapping — dependency graph → topological sort → iteration layers → Linear Projects + Cycles, automatically
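The "no ties allowed" rule for the trade-off sliders can be illustrated with a small check. This is only a sketch: the dictionary shape and the function name `validate_sliders` are assumptions made for illustration, not the actual format of project.constraints.yaml.

```python
def validate_sliders(sliders):
    """Validate slider ranks and return dimensions in priority order.

    `sliders` maps a dimension name to its rank, where 1 is the highest
    priority (hypothetical shape). Ranks must be exactly 1..N, so two
    dimensions can never share a rank.
    """
    ranks = sorted(sliders.values())
    if ranks != list(range(1, len(sliders) + 1)):
        raise ValueError("ranks must be 1..N with no ties")
    # Order dimension names from highest to lowest priority.
    return [name for name, _ in sorted(sliders.items(), key=lambda kv: kv[1])]


priorities = validate_sliders({"quality": 1, "security": 2, "ux": 3, "cost": 4})
```

Rejecting ties at validation time forces the team to make the prioritization call once, rather than leaving each agent to guess later.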
Every story passes a four-gate review before it's playable:
- PO drafts — persona, customer value, rough acceptance criteria
- UX gate — is this valuable? traces to empathy map pain point?
- Developer gate — is this technically feasible? cost signal? estimable?
- QA gate — is every AC testable through the UI alone, as an outside customer would test it?
Stories that fail any gate go back to the PO. Stories that pass land in their iteration's Linear Project as ready-for-dev.
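The gate sequence above can be sketched as an ordered list of checks. The `Story` fields and the boolean flags below are hypothetical stand-ins for the judgments each agent makes; Forge's real gates are agent skills, not lambdas.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Story:
    title: str
    persona: str
    pain_point: Optional[str]  # hypothetical link to an empathy-map pain
    acceptance_criteria: List[str] = field(default_factory=list)
    feasible: bool = True      # hypothetical verdict from the developer gate
    ui_testable: bool = True   # hypothetical verdict from the QA gate


# Gates run in the order listed above; each returns True on pass.
GATES: List[Tuple[str, Callable[[Story], bool]]] = [
    ("po-draft", lambda s: bool(s.persona and s.acceptance_criteria)),
    ("ux", lambda s: s.pain_point is not None),
    ("developer", lambda s: s.feasible),
    ("qa", lambda s: s.ui_testable),
]


def review(story: Story) -> Tuple[str, Optional[str]]:
    """Return (next_state, failed_gate); any failure sends the story back to the PO."""
    for name, gate in GATES:
        if not gate(story):
            return "back-to-po", name
    return "ready-for-dev", None
```

A story with no empathy-map pain point stops at the UX gate and never reaches the developer or QA gates.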
All stories follow the INVEST principle and Mike Cohn's User Stories Applied — no technical detail in stories. Backend shape, API design, database schema all live in ADRs, not stories.
Before story 1 starts:
- architect-agent writes ADRs for iteration 1 stories
- devops-agent provisions CI/CD pipeline, test environment, and Unleash feature flag server
- qa-agent authors the acceptance test scaffold — first dummy Acceptance Test must be green on CI
- secops-agent wires SAST/DAST and secret scanning into the pipeline
Iteration 1 cannot start until the acceptance test scaffold is green on CI and a test environment is live.
ATDD is a collaborative practice where acceptance criteria are expressed as executable tests before any implementation begins. The outer Acceptance Test is the sole definition of done — everything else serves it.
For each story, the developer agent runs a strict process:
Outer Acceptance Test → RED
For each sub-slice in the AC:
FE inner loop: component test RED → GREEN → REFACTOR
BE inner loop: CDC contract test RED → GREEN → REFACTOR
Sub-slice complete ✓ (never batch FE loops then BE loops)
Outer Acceptance Test → GREEN
Desk check with QA (local first, then deployed environment)
One sub-slice at a time. One AC at a time. No implementation code before the outer Acceptance Test is RED and visible. No moving to the next sub-slice until the current one is fully green.
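The ordering rules for one acceptance criterion can be expressed as a small guard object. This is a minimal sketch under stated assumptions: the class `AtddSession` and its method names are hypothetical and only model the sequencing, not the tests themselves.

```python
class AtddSession:
    """Tracks one AC and refuses steps taken out of order."""

    def __init__(self, sub_slices):
        self.pending = list(sub_slices)
        self.outer = None  # None until first run, then "RED" or "GREEN"

    def run_outer_test(self, passed):
        self.outer = "GREEN" if passed else "RED"

    def complete_sub_slice(self, name, fe_green, be_green):
        # No implementation work before the outer test has been seen RED.
        if self.outer != "RED":
            raise RuntimeError("outer Acceptance Test must be RED first")
        # Sub-slices are worked strictly one at a time, in order.
        if not self.pending or self.pending[0] != name:
            raise RuntimeError("finish the current sub-slice first")
        # Both inner loops must be green before moving on.
        if not (fe_green and be_green):
            raise RuntimeError("current sub-slice is not fully green")
        self.pending.pop(0)

    def ready_for_desk_check(self):
        return not self.pending and self.outer == "GREEN"
```

Calling `complete_sub_slice` before `run_outer_test(passed=False)` raises, which mirrors the rule that implementation is forbidden until the outer test is RED and visible.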
Linear is the delivery state machine. There are two boards:
Board 1 — Iteration Map (Linear Roadmap/Projects): One Project per iteration. Stories are assigned to iterations based on a topological sort of their dependency graph — stories with no dependencies go into Iteration 1, stories depending on Iteration 1 go into Iteration 2, and so on. This board is the backlog. Stories with no iteration assignment are unscoped.
Board 2 — Delivery Board (Linear Cycle): The active iteration's stories move through the delivery state machine:
in-analysis → story refinement (all four gates)
ready-for-dev → any developer agent can pull (first available wins)
in-dev → ATDD loop + desk check per AC
ready-for-qa → story deployed to test environment; QA agent pulls
in-qa → QA full regression suite
ready-for-acceptance → PO agent pulls
in-acceptance → PO smoke tests all ACs
ready-to-deploy → HUMAN approves flag flip
(done) → feature flag on, Linear card closed
Bug cards found in in-qa or in-acceptance go directly to ready-for-dev — no refinement needed.
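The stages above form a linear state machine with one human gate and one bug shortcut. A minimal sketch, assuming the stage names are used verbatim; the function names `next_stage` and `file_bug` are hypothetical.

```python
# Forward order of the delivery stages, as listed above.
STAGES = [
    "in-analysis", "ready-for-dev", "in-dev", "ready-for-qa", "in-qa",
    "ready-for-acceptance", "in-acceptance", "ready-to-deploy", "done",
]
# Stages in which a bug card may be raised.
BUG_SOURCES = {"in-qa", "in-acceptance"}


def next_stage(current, human_approved=False):
    """Return the stage that follows `current`."""
    if current == "done":
        raise ValueError("story is already done")
    if current == "ready-to-deploy" and not human_approved:
        raise PermissionError("a human must approve the flag flip")
    return STAGES[STAGES.index(current) + 1]


def file_bug(found_in):
    """Bug cards skip refinement and land directly in ready-for-dev."""
    if found_in not in BUG_SOURCES:
        raise ValueError("bugs are raised from in-qa or in-acceptance")
    return "ready-for-dev"
```

Keeping the human approval as an explicit argument makes the only non-agent transition visible in the code path.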
Stories are pulled, not assigned. When a developer agent starts a session with no active story, it queries Linear for the oldest story in ready-for-dev and atomically moves it to in-dev with self-assignment in a single API call. If two agents race, only one wins — the other pulls the next available story. No orchestrator needed.
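The pull-and-claim behavior can be simulated in memory. This sketch does not call Linear: the `Board` class is a hypothetical stand-in, and the lock plays the role of whatever atomicity the single API call provides.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Card:
    id: str
    created: int  # smaller means older
    state: str = "ready-for-dev"
    assignee: Optional[str] = None


class Board:
    """In-memory stand-in for the delivery board (not the Linear API)."""

    def __init__(self, cards):
        self._cards = list(cards)
        self._lock = threading.Lock()

    def claim_oldest(self, agent):
        # State change and self-assignment happen under one lock, so two
        # racing agents can never claim the same card.
        with self._lock:
            ready = [c for c in self._cards if c.state == "ready-for-dev"]
            if not ready:
                return None
            card = min(ready, key=lambda c: c.created)
            card.state, card.assignee = "in-dev", agent
            return card
```

When two agents call `claim_oldest` concurrently, one receives the oldest card and the other receives the next one, with no coordinator involved.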
Iteration completion. After completing a story, every agent checks: "are all stories in this iteration done?" If yes, the po-agent posts a completion notice on the Linear milestone and idles. A human PO reviews, then triggers the next iteration by activating the next Cycle. Iterations are variable duration — they end when done, not on a fixed date.
Forge replaces velocity-based sprint planning with dependency-driven iteration scoping:
| Human team concept | Forge equivalent |
|---|---|
| Story points | AC count (1 AC ≈ 1 ATDD loop) |
| Team velocity | Max concurrent agents (human sets this) |
| Fixed-length sprint | Variable duration — iteration ends when all stories reach done |
| Planning poker | Architect agent flags stories with >5 ACs as split candidates |
| Sprint planning | Topological sort of dependency graph → iteration layers |
Parallel execution within an iteration is determined by story independence — stories in the same layer with no inter-story dependencies run simultaneously with separate developer agents.
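Dependency-driven scoping amounts to a layered topological sort. The sketch below is one way to compute the layers (a Kahn-style pass), assuming dependencies are given as a plain dictionary; the function name `iteration_layers` and the story ids are hypothetical.

```python
def iteration_layers(deps):
    """Group stories into iteration layers.

    `deps` maps a story id to the set of story ids it depends on.
    Layer 1 holds stories with no dependencies; each later layer holds
    stories whose dependencies all sit in earlier layers.
    """
    remaining = {story: set(needs) for story, needs in deps.items()}
    # Stories that appear only as dependencies still need a layer.
    for needs in deps.values():
        for story in needs:
            remaining.setdefault(story, set())

    layers, done = [], set()
    while remaining:
        layer = sorted(s for s, needs in remaining.items() if needs <= done)
        if not layer:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        layers.append(layer)
        done.update(layer)
        for story in layer:
            del remaining[story]
    return layers


layers = iteration_layers({
    "signup": set(),
    "browse": set(),
    "login": {"signup"},
    "profile": {"login"},
    "checkout": {"login", "browse"},
})
# layers == [["browse", "signup"], ["login"], ["checkout", "profile"]]
```

Stories inside one layer have no dependencies on each other, which is what allows them to run in parallel with separate developer agents.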
No feature branches. Every push goes to production if CI is green — but unfinished stories are feature-flagged off via Unleash (self-hosted, open source). When a story is accepted, the human PO approves the flag flip and the feature is live.
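Shipping unfinished work on trunk relies on every new code path being guarded by a flag. The sketch below uses a hypothetical in-memory `FlagStore` rather than the Unleash client, and the flag name is invented for illustration.

```python
class FlagStore:
    """Hypothetical in-memory flag store; a stand-in for a flag server."""

    def __init__(self):
        self._enabled = set()

    def is_enabled(self, name):
        return name in self._enabled

    def enable(self, name, approved_by_human):
        # The flag flip is the one step that requires human approval.
        if not approved_by_human:
            raise PermissionError("flag flip requires human approval")
        self._enabled.add(name)


FLAG = "express-checkout"  # hypothetical flag name


def checkout_page(flags):
    # The new path is deployed to production but stays dark until the flip.
    if flags.is_enabled(FLAG):
        return "express checkout"
    return "standard checkout"
```

Until `enable` is called with approval, production keeps serving the existing path even though the new code is already deployed.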
"With a ubiquitous language, conversations among developers and expressions of the code are all derived from the same domain model." — Eric Evans, Domain-Driven Design
Agents dropped into a project with no shared vocabulary use 20 words where 1 will do. Forge solves this by generating CONTEXT.md as the final phase of every event storming session.
CONTEXT.md defines:
- Domain terms — canonical names for every entity, event, command, and policy discovered during event storming, with explicit "avoid" aliases
- Bounded context boundaries — which terms belong to which service/domain
- Agent communication protocol — the shared vocabulary of the delivery process itself (e.g. "outer Acceptance Test" not "E2E test", "desk check" not "demo", "sub-slice" not "partial implementation")
- Flagged ambiguities — terms that look like synonyms but aren't
CONTEXT.md lives in the repo root and is a mandatory read for every agent at session start. Any agent that encounters a term not in CONTEXT.md must stop and propose an addition before using it. The po-agent owns it; all agents can propose updates.
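The "avoid" aliases lend themselves to a mechanical check. The sketch below assumes a glossary already parsed into a dictionary; the actual layout of CONTEXT.md is not specified here, and `find_avoided_terms` is a hypothetical helper. The sample entries come from the protocol examples above.

```python
import re

# Hypothetical parsed glossary: canonical term -> aliases to avoid.
GLOSSARY = {
    "outer Acceptance Test": ["E2E test"],
    "desk check": ["demo"],
    "sub-slice": ["partial implementation"],
}


def find_avoided_terms(text, glossary=GLOSSARY):
    """Return (alias, canonical) pairs for each avoided alias found in text."""
    hits = []
    for canonical, aliases in glossary.items():
        for alias in aliases:
            # Word boundaries keep "demo" from matching "demonstrate".
            pattern = r"\b" + re.escape(alias) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append((alias, canonical))
    return hits


hits = find_avoided_terms("Run the E2E test, then book a demo.")
# hits == [("E2E test", "outer Acceptance Test"), ("demo", "desk check")]
```

A check like this could run on agent output before it is posted, so that drift from the ubiquitous language is caught early.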
Every agent, every session, in this order:
1. Query Linear → do I have an active story in-progress?
If yes → resume it (re-run outer Acceptance Test first)
If no → pull oldest story from ready-for-dev (atomic claim)
2. Read CONTEXT.md → speak the project's language
3. Read project.constraints.yaml → know the priorities
4. Begin — the Linear stage determines what happens next
No handoff notes. No plan files. No conversation summaries. State lives in Linear and the repo — not in context windows.
Two files live in the repo root and are read by all agents:
CONTEXT.md # ubiquitous language — generated from event storming
project.constraints.yaml # trade-off slider output — team priorities
Story content lives in Linear. A snapshot is committed to stories/ when a story moves to ready-for-dev — at that point the story is locked and the snapshot never drifts.
Forge uses an explicit three-level skill hierarchy to prevent the most common agent failure mode: rationalization over process.
L1 RIGID → running-atdd-sessions, running-tdd-loops
(these override everything; no exceptions)
L2 GUIDED → writing-stories, facilitating-inception, deciding-architecture
(structured process with human gates)
L3 MECH → finishing-stories, managing-feature-flags
(mechanical execution after L1/L2 complete)
An agent cannot rationalize "I'll just wire the handler first" — running-atdd-sessions is L1 and wins over plan files, conversation summaries, and implementation suggestions without exception.
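The precedence rule can be modeled as a ranking in which any skill outranks non-skill sources and L1 outranks everything. This is a sketch only: the `resolve` function and the source label "plan-file" are hypothetical, and the skill-to-level mapping copies the hierarchy listed above.

```python
LEVELS = {"L1": 1, "L2": 2, "L3": 3}

# Skill -> level, as listed in the hierarchy above.
SKILLS = {
    "running-atdd-sessions": "L1",
    "running-tdd-loops": "L1",
    "writing-stories": "L2",
    "facilitating-inception": "L2",
    "deciding-architecture": "L2",
    "finishing-stories": "L3",
    "managing-feature-flags": "L3",
}


def resolve(instructions):
    """Pick the winning (source, directive) pair.

    Sources that are not skills (plan files, conversation summaries,
    suggestions) rank below every skill level.
    """
    def rank(item):
        source, _ = item
        return LEVELS.get(SKILLS.get(source), 99)

    return min(instructions, key=rank)


winner = resolve([
    ("plan-file", "wire the handler first"),
    ("finishing-stories", "flip the flag"),
    ("running-atdd-sessions", "write the outer Acceptance Test"),
])
# winner == ("running-atdd-sessions", "write the outer Acceptance Test")
```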
All 21 Forge skills are continuously verified by loopkit, a static analysis tool for agent skill contracts. Every SKILL.md, LOOP.md, and .loopkit.yaml passes with 0 errors and 0 warnings across all validators:
cargo install loopkit
loopkit . --verbose
# 21 skills checked. 0 error(s), 0 warning(s).
Loopkit validates state transitions, enforced states, handoff references, desk check patterns, bug feedback loops, progressive disclosure, naming conventions, terminology, and 10 other dimensions of skill quality.
Meta
- using-forge — precedence rules, agent roles, session start protocol
- resuming-sessions — query Linear + read CONTEXT.md before anything else
Discovery
- facilitating-inception — lean canvas, empathy map, trade-off sliders, event storming, iteration map
- facilitating-event-storming — interactive domain event discovery; final phase produces CONTEXT.md
- establishing-ubiquitous-language — generates and maintains CONTEXT.md from event storming output
- writing-stories — INVEST-compliant story writing with four-gate review
- building-iteration-map — topological sort of dependencies → Linear Projects + Cycles via MCP
Architecture
- deciding-architecture — ADR authoring, service boundary definition
Iteration Zero
- bootstrapping-project — CI/CD, repos, environments, Unleash setup
- validating-test-harness — gate skill: blocks iteration 1 until scaffold is green
Development (L1 Rigid)
- running-atdd-sessions — outer Acceptance Test RED → sub-slice loops → GREEN
- running-tdd-loops — FE component loop and BE CDC contract loop (CDC contracts live inside this loop)
- managing-feature-flags — Unleash REST API integration (lifecycle protocol called from delivery loops)
Quality
- running-desk-checks — per-AC verification, generates desk-check artifact, pauses for human
- writing-acceptance-tests — Playwright/Cypress, UI-interactions only, no backdoors
- running-regression-suite — full suite on test environment
Acceptance & Delivery
- approving-stories — PO smoke test against ACs
- finishing-stories — flag flip + Linear update + post-deploy security check
- threat-modeling — security ACs injected per story from project.constraints.yaml
- securing-pipeline — SAST/DAST/secret scanning gates
git clone https://github.com/loopworx/forge ~/.agents/forge
bash ~/.agents/forge/install.sh
This symlinks all 21 skills into ~/.agents/skills/ where OpenCode, Claude Code, and Hermes discover them automatically.
To verify:
cargo install loopkit
loopkit ~/.agents/forge --verbose
# 21 skills checked. 0 error(s), 0 warning(s).
- Customer value first — every story traces to an empathy map pain point; no story is written without it
- Shared language — agents and humans speak from the same domain model; CONTEXT.md is generated before the first story is written
- INVEST or bust — stories with technical detail, untestable ACs, or no clear customer value never reach the backlog
- Test-first is non-negotiable — the first file edit in any session is always a test file; implementation code is forbidden until the outer Acceptance Test is RED
- Vertical slices — every story is deployable end-to-end; no frontend-only or backend-only stories
- Pull, don't push — agents pull work when ready; no assignment, no queue management, no orchestrator
- Trunk over branches — feature flags make branches unnecessary; continuous deployment makes them dangerous
- Agents have roles — a developer agent that makes architecture decisions is a liability; role separation is enforced by skill design
- State over memory — agents query Linear and read repo artifacts; delivery state survives context windows
Forge skills follow Anthropic's agent skill best practices:
- Description field is the primary selection mechanism — write it first
- SKILL.md stays under 500 lines — use linked reference files for detail
- Add deterministic contract tests for the skill's state machine and handoff routes
- Skills are state machines with explicit gates, not knowledge documents
See the Forge skill authoring guide for the full process.
- Fork the repository
- Draft the description field
- Write the skill body
- Add or update the contract test harness to cover new states/transitions
- Submit a PR with cargo test results from tools/contract-tests
MIT — see LICENSE file.
Forge is inspired by XP and Lean delivery practices, Eric Evans' Domain-Driven Design, Mike Cohn's User Stories Applied, and the hard lessons learned from watching AI agents skip the outer RED.