KnowledgeForge

Most AI agent failures aren't model failures. They're mode failures.

The model picks the wrong kind of thinking for the problem in front of it — brainstorming when it should be debugging, hedging when it should be deciding, inventing structure when it should be following one — and the whole chain degrades from there.

Swap GPT-5 for Opus 4.7 and the same problem shows up wearing different syntax.

KnowledgeForge is the layer that fixes that.

The opinion baked in

KF patches the model's weaknesses. It doesn't scaffold its strengths.

Deterministic first. Before invoking LLM judgment, exhaust deterministic checks. Before fixing, reproduce. Before acting, triage.

Most agent frameworks try to make the LLM do more. KF makes it do less, better. Every module exists because something specific was failing in production — hypothesis-skipping, hidden trade-offs, hallucinated groundings, runaway scope, stale knowledge surfacing as fact.

What's inside

KF is a stack of modules — each one a targeted patch for a known failure mode in LLM reasoning. Six worth describing:

Decision Classification

Every request gets sorted into one of four types — Reckoning (verifiable answer, give it directly), Evaluative (judgment against existing criteria), Predictive (future state), or Novel (no precedent, requires reasoning expansion). The classifier is the gate that decides whether the LLM should think hard or shut up and answer.

Metacognitive Monitor

Six runtime checks that watch the agent watch itself — stuck detection, hypothesis fixation, scope drift, premature closure. When a check trips, the monitor injects an intervention rather than letting the run rot quietly.

Grounding Scores

Every claim the model makes carries a grounding score. Low-grounding claims get flagged before they propagate. Accretion to long-term memory is gated on grounding — bad reasoning doesn't get to become institutional knowledge.

Operational Bounds

Eight hard limits on what an agent can do per mode — token spend, tool calls, scope expansion, autonomy level. Two-layer safety so a single jailbroken prompt doesn't blow up a deployment.

Temporal Knowledge

Facts decay. KF tracks the staleness window for every piece of stored knowledge and refuses to surface it as current when it's past its half-life. Importance-weighted — high-stakes facts expire faster than low-stakes ones.

Permission Model

Three-tier risk classification with mode-specific capability profiles. The Critic mode can read anything but can't write. The Builder can write but can't ship. The Debugger can run scripts in sandbox but not in prod. Permissions are structural, not vibes.

Plus modules for Calibration, Salience Allocation, Memory Architecture, Knowledge Accretion, Semantic Wiki Search, Taxonomy Enforcement, Verbatim History Mining, and Entity Relationship Analysis. Fourteen modules total.

The reasoning modes

The modules feed nine specialized agents:

Builder — implementation-ready artifacts via the PDIA method

Critic — gap detection, severity-ranked findings, max 15 per run

Debugger — root-cause analysis via hypothesis elimination, >0.8 confidence required

Strategist — trade-off analysis, decision recommendations with reversibility scoring

Synthesizer — pattern extraction across prior work

Expert — domain-specific deep reasoning

Calibrator — project scaffolding and config

Coordinator — multi-agent workflow design

Navigator — wayfinding across large codebases and corpora

The orchestrator routes between them based on what the request actually needs — most don't need any mode at all, which is part of the point.

PRIVATE BETA

Join the waitlist

Currently running in production across several deployments. MIT open-source release planned once the API surface stabilizes.

Early access goes to people who can describe a specific failure mode they're trying to patch.