From Agents to Teams: How We Built Forge
February 16, 2026 by Asif Waliuddin

I didn't set out to build a platform. I set out to stop my AI tools from overwriting each other's work.
In late 2025, I was using Claude Code on a project with a growing codebase. One session would make an architectural decision: "JWT for auth, sessions for admin." The next session had no memory of it. I'd explain the same constraints again. Third session, same explanation. The tool was starting from scratch every time.
Then I started running Codex CLI alongside Claude Code. One on the backend, one on the frontend. Both edited package.json. Both touched shared utility files. No coordination mechanism. No locks. No shared understanding of who was working on what.
Inside each tool, agents coordinated fine. Claude Code ran 20 subagents and they stayed aligned because the tool managed that coordination internally. The problem wasn't multi-agent. It was multi-tool. Two separate tools with no shared state.
I'd spent 23 years watching this exact problem play out with human teams. And I knew the solution wasn't better agents. It was coordination infrastructure between the tools themselves.
The Failures Came in Order
File conflicts first. Two tools editing the same file. Silent overwrites. Lost work. Claude refactored a module. Codex updated tests against the pre-refactor interface. Both saved. Tests failed. Neither tool knew the other existed.
Knowledge loss second. Decisions made in Claude Code on Monday gone by Tuesday in Codex CLI. No record of why we chose JWT over sessions. No shared memory of the naming conventions we established. Every tool started from scratch. Same cognitive overload that enterprise teams fight with wikis and onboarding docs. Just renamed to "context rot."
Invisible progress third. Three terminals open, three tools running. Which tool finished? Which is stuck in a loop? Did Codex start tests before Claude finished refactoring? Without visibility across tools, I was the coordination layer. Manually checking each terminal, mentally tracking state.
Governance gaps fourth. Quality checks only happened when I remembered to run them. Security scanning was a separate step. Style consistency enforced at the end, not throughout. The gap between "work happening" and "work being checked" was where problems grew.
Wasted context fifth. Each tool maintained its own view of the project. No shared task list. No shared decision log. I was the message bus between my own tools.
Architecture Followed the Problem
I didn't design three levels on a whiteboard. The architecture emerged from the problems, in order.
Level 1: Governance first. The most immediate need was quality. Catching problems before they compounded. I built the plugin layer: slash commands, specialized agents, knowledge skills, and governance hooks for Claude Code. 21 commands. 22 agents. 29 skills. 6 hooks. Zero dependencies. Install in 30 seconds and your sessions have guardrails. Governance hooks exist because I've watched audit findings pile up from shortcuts nobody caught in real time.
This worked well for a single tool. It didn't solve the coordination problem between tools.
Level 2: State and coordination. The multi-tool failures (file conflicts, knowledge loss, invisible progress) all trace to a missing shared state layer. I built the orchestrator in Rust: a single binary that manages file locks, maintains the knowledge base, tracks tasks with dependency graphs, detects drift from specs, and serves 10 MCP tools that any connected AI tool can access.
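The task-graph piece of that state layer can be sketched in a few lines. This is not Forge's implementation, just a minimal Kahn's-algorithm sketch under the assumption that tasks are named strings: a shared dependency graph yields a start order in which no task begins before its prerequisites finish (e.g. a test run never starts before the refactor it depends on).

```rust
use std::collections::{HashMap, VecDeque};

/// Given `deps[task] = prerequisites of task`, return an order in which
/// every task appears after all of its prerequisites, or `None` on a cycle.
fn ready_order(deps: &HashMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    // indegree[t] = number of unfinished prerequisites of t
    let mut indegree: HashMap<&str, usize> = HashMap::new();
    // dependents[p] = tasks that must wait for p
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (&task, prereqs) in deps {
        indegree.entry(task).or_insert(0);
        for &p in prereqs {
            indegree.entry(p).or_insert(0);
            *indegree.get_mut(task).unwrap() += 1;
            dependents.entry(p).or_default().push(task);
        }
    }
    // Start with tasks that have no prerequisites.
    let mut queue: VecDeque<&str> = indegree
        .iter()
        .filter(|&(_, &d)| d == 0)
        .map(|(&t, _)| t)
        .collect();
    let mut order = Vec::new();
    while let Some(t) = queue.pop_front() {
        order.push(t.to_string());
        if let Some(ds) = dependents.get(t) {
            for &d in ds {
                let e = indegree.get_mut(d).unwrap();
                *e -= 1;
                if *e == 0 {
                    queue.push_back(d);
                }
            }
        }
    }
    // Scheduling fewer tasks than exist means a dependency cycle.
    if order.len() == indegree.len() { Some(order) } else { None }
}

fn main() {
    let mut deps: HashMap<&str, Vec<&str>> = HashMap::new();
    deps.insert("refactor-auth", vec![]);
    deps.insert("run-tests", vec!["refactor-auth"]);
    deps.insert("update-docs", vec!["refactor-auth", "run-tests"]);
    let order = ready_order(&deps).expect("cycle detected");
    // "refactor-auth" always precedes "run-tests" and "update-docs".
    println!("{order:?}");
}
```

Because the queue is seeded from a hash map, ties between independent tasks come out in no particular order; only the dependency constraints are guaranteed.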
Rust was the only option that made sense. The coordination layer needs to be correct under concurrency and deploy as a single binary without runtime dependencies. 4MB executable. 292 tests. No daemon, no database, no network requirements. State lives in .forge/ as plain files.
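For a taste of how plain files can double as a concurrency mechanism, here is a hedged sketch of advisory file locking. The `.forge/locks/` layout and the function names are my illustration, not Forge's actual on-disk format; the load-bearing detail is that `create_new` is atomic, so two tools can never both claim the same source file.

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Try to claim `file` for `owner` by atomically creating a lock file.
/// Returns false if another tool already holds the lock.
fn try_lock(lock_dir: &Path, file: &str, owner: &str) -> bool {
    fs::create_dir_all(lock_dir).expect("create lock dir");
    let lock_path = lock_dir.join(format!("{}.lock", file.replace('/', "_")));
    // `create_new(true)` fails if the file exists: atomic claim.
    match OpenOptions::new().write(true).create_new(true).open(&lock_path) {
        Ok(mut f) => {
            // Record which tool holds the lock, for visibility.
            let _ = writeln!(f, "{owner}");
            true
        }
        Err(_) => false,
    }
}

/// Release a previously claimed lock.
fn unlock(lock_dir: &Path, file: &str) {
    let lock_path = lock_dir.join(format!("{}.lock", file.replace('/', "_")));
    let _ = fs::remove_file(lock_path);
}

fn main() {
    let dir = std::env::temp_dir().join("forge-demo-locks");
    let _ = fs::remove_dir_all(&dir);
    assert!(try_lock(&dir, "package.json", "claude-code")); // first claim wins
    assert!(!try_lock(&dir, "package.json", "codex-cli"));  // second is refused
    unlock(&dir, "package.json");
    assert!(try_lock(&dir, "package.json", "codex-cli"));   // free after release
    println!("lock protocol ok");
}
```

A real implementation would also need to recover stale locks left by crashed processes; this sketch leaves that out.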
Three adapters connect Claude Code, Codex CLI, and Gemini CLI. Each reads its native config format. The orchestrator doesn't care which tool is talking. Policy enforced here. Nowhere else.
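The adapter boundary might look something like this trait sketch. Every name and config path below is an illustrative assumption, not Forge's API; the point is that each adapter normalizes its tool's native config into one shape, so the orchestrator genuinely doesn't care which tool is talking.

```rust
/// The one shape the orchestrator sees, regardless of tool.
#[derive(Debug, PartialEq)]
struct ToolConfig {
    tool: String,
    config_path: String,
}

/// Hypothetical adapter boundary: each tool supplies its identity and
/// native config location; normalization is shared default behavior.
trait ToolAdapter {
    fn tool_name(&self) -> &str;
    fn native_config_path(&self) -> &str;
    fn normalize(&self) -> ToolConfig {
        ToolConfig {
            tool: self.tool_name().to_string(),
            config_path: self.native_config_path().to_string(),
        }
    }
}

struct ClaudeCodeAdapter;
impl ToolAdapter for ClaudeCodeAdapter {
    fn tool_name(&self) -> &str { "claude-code" }
    fn native_config_path(&self) -> &str { ".claude/settings.json" }
}

struct CodexAdapter;
impl ToolAdapter for CodexAdapter {
    fn tool_name(&self) -> &str { "codex-cli" }
    fn native_config_path(&self) -> &str { ".codex/config.toml" }
}

fn main() {
    // The orchestrator iterates over adapters without knowing concrete types,
    // so policy can be enforced in exactly one place.
    let adapters: Vec<Box<dyn ToolAdapter>> =
        vec![Box::new(ClaudeCodeAdapter), Box::new(CodexAdapter)];
    for a in &adapters {
        println!("{:?}", a.normalize());
    }
}
```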
Level 3: Visibility. Governance and coordination work without a UI. The plugin provides commands; the orchestrator provides a TUI. But watching tools coordinate through a terminal has limits. I built the visual layer: a React dashboard with real-time governance HUD, agent activity feed, and the Infinity Terminal.
The Infinity Terminal was the feature I didn't plan. Start a task on your desktop. Close the browser. Open it on your phone during lunch. The session is still running. Network drops, browser restarts, device switches. Sessions survive everything. Built on top of the orchestrator's file-based state layer. Session state is files. Files persist. Sessions persist.
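"Session state is files" can be illustrated with a toy persistence round-trip. The path and line-based format here are assumptions made for the sketch, not Forge's actual layout; what matters is that resuming on another device is just re-reading the same file.

```rust
use std::fs;
use std::path::Path;

/// Persist a session as a plain file: first line is the task,
/// remaining lines are the activity log. (Illustrative format.)
fn save_session(dir: &Path, id: &str, task: &str, log: &[String]) {
    fs::create_dir_all(dir).expect("create session dir");
    let body = format!("{task}\n{}", log.join("\n"));
    fs::write(dir.join(format!("{id}.session")), body).expect("write session");
}

/// Resume a session from disk; works the same after a browser restart,
/// a network drop, or a switch to a different device.
fn load_session(dir: &Path, id: &str) -> Option<(String, Vec<String>)> {
    let body = fs::read_to_string(dir.join(format!("{id}.session"))).ok()?;
    let mut lines = body.lines().map(str::to_string);
    let task = lines.next()?;
    Some((task, lines.collect()))
}

fn main() {
    let dir = std::env::temp_dir().join("forge-demo-sessions");
    save_session(&dir, "s1", "refactor auth", &["step 1 done".into()]);
    // Simulate reopening from anywhere: just load the file again.
    let (task, log) = load_session(&dir, "s1").unwrap();
    assert_eq!(task, "refactor auth");
    assert_eq!(log, vec!["step 1 done".to_string()]);
    println!("session resumed");
}
```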
58 React components. 4,146 tests. 87% coverage.
The Numbers Are the Test Suite
4,438 tests across the platform. Every one traces to a specific failure mode I've encountered, either with AI tools or with human teams over 23 years.
292 tests in the Rust orchestrator cover file locking under concurrent access, task dependency resolution, knowledge classification, governance rule evaluation, and MCP tool responses.
4,146 tests in the UI cover component rendering, state management, terminal resilience, and dashboard data flow.
31/31 launch gates passing. Test coverage isn't a metric to optimize. It's a record of things that went wrong.
In Practice
The graduated entry means you start where the pain is sharpest.
If your problem is governance (Claude Code sessions without guardrails), install the plugin. 30 seconds. No further setup.
If your problem is coordination (multiple AI tools on the same repo, file conflicts, lost knowledge), install the orchestrator. Run forge init. Run forge dashboard.
If your problem is visibility (you need to see everything, from anywhere, on any device), add the UI. The Infinity Terminal alone changes how you work with long-running tasks.
Each level adds capability without requiring the others. But after a week with the knowledge flywheel, starting a new session feels different. Decisions made in Claude Code are available when Codex CLI starts. Conventions are already known. That's when people realize they can't go back. Not because of lock-in (state lives in plain files, fully portable), but because working without shared state across tools starts to feel like working blind.
The Bottom Line
I didn't build Forge because agents within a single tool needed coordination. They already have it. I built it because I watched the same coordination failures I'd spent 23 years solving with humans happen all over again when multiple AI tools worked on the same codebase. Faster, silently, and at scale.
Multi-tool AI coordination is enterprise program management. Same failure modes. Same solutions. Different substrate.
Launching March 2.