/goal Is the Inner Loop. APEX Is the Operating Model.
Claude Code, Codex, and Hermes all launched /goal — a judge-evaluated continuation loop. I'd been running that loop inside APEX for months. The convergence validates the pattern. But the pattern alone isn't enough.
Between April 16 and May 11, 2026 — less than four weeks — three independent platforms shipped the same agentic execution pattern. OpenAI's Codex CLI launched its /goal command on April 16. Anthropic's Claude Code followed with the same feature in version 2.1.139 on May 11. Hermes Agent had been running it before either shipped publicly.
Three independent teams. One pattern. Under four weeks.
That kind of convergence is worth paying attention to. Not because of who got there first, but because of what it means when separate teams all arrive at the same abstraction independently. It means the pattern works, and the industry is figuring that out in real time.
What /goal actually does
The mechanics are consistent across all three implementations. You give the agent a goal statement. The agent accepts it, begins working toward it in turns, and an internal evaluator — a Judge — assesses each completed turn against the original goal. If the goal isn't met and the budget isn't exhausted, the loop continues. When the goal is satisfied, or the turn limit is reached, the agent terminates and surfaces the result. You can interrupt at any point.
Call this pattern Judge-Evaluated Continuation. Generator works. Judge evaluates. Loop continues or closes. It's clean, legible, and effective.
The key architectural insight embedded in this pattern is the separation of generation from evaluation. The generator doesn't grade its own homework. The Judge is a distinct process, applying distinct criteria, making a distinct decision about whether the work is done. Anthropic's research on harness design established why this separation matters: agents evaluating their own output skew reliably positive — confidently approving work that a human would reject. A separate evaluator, configured to be skeptical, is a fundamentally stronger quality mechanism than self-assessment.
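To make the mechanics concrete, here is a minimal sketch of the loop in Python. It is not how Claude Code, Codex, or Hermes implement /goal; the names run_goal, Verdict, toy_generate, and toy_judge are hypothetical, and the generator and judge are stand-ins for model calls. What it shows is the shape of the pattern: the generator and the Judge are separate callables, and only the Judge decides whether the loop closes.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    satisfied: bool   # the Judge's decision for this turn
    feedback: str     # what the generator should fix next turn


def run_goal(
    goal: str,
    generate: Callable[[str, str], str],
    judge: Callable[[str, str], Verdict],
    max_turns: int = 5,
) -> Tuple[str, int, bool]:
    """Run generate/judge turns until the Judge is satisfied or the budget runs out."""
    output, feedback = "", ""
    for turn in range(1, max_turns + 1):
        output = generate(goal, feedback)   # generator works
        verdict = judge(goal, output)       # a separate evaluator grades the turn
        if verdict.satisfied:
            return output, turn, True       # loop closes: goal met
        feedback = verdict.feedback         # loop continues with the Judge's notes
    return output, max_turns, False         # loop closes: turn budget exhausted


# Hypothetical stand-ins for model calls, for illustration only.
def toy_generate(goal: str, feedback: str) -> str:
    return "draft with tests" if "tests" in feedback else "draft"


def toy_judge(goal: str, output: str) -> Verdict:
    if "tests" in output:
        return Verdict(True, "")
    return Verdict(False, "add tests")


result = run_goal("ship a tested draft", toy_generate, toy_judge, max_turns=3)
print(result)
```

With these toy functions the first turn fails the Judge, the feedback flows back into the second turn, and the loop closes on turn two. With a budget of one turn, the same call returns the unfinished draft and reports that the goal was not met.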
The same loop was already APEX's Execution Phase
I published the APEX Framework on April 5, eleven days before Codex CLI shipped /goal. The Execution Phase of APEX describes this same loop: an agent produces output, a review agent evaluates it against specified criteria, the loop iterates or closes, and work surfaces for human verification when it passes the automated gates.
Different label. Same mechanics.
I'm not drawing that comparison to establish priority. I'm drawing it because the convergence is informative. When Anthropic's engineering team, OpenAI's CLI team, and the team building Hermes all land on the same pattern within four weeks of each other, that's not coincidence. It means this pattern is discoverable — that teams running agents in production keep arriving at it independently, because it works.
What it also means: the execution loop has been figured out. Judge-Evaluated Continuation is now established. The harder question is what you build around it.
If you want a mental image for this, think about RoboCop. ED-209 is pure execution — a weapons platform given a directive and a compliance window. "You have twenty seconds to comply." It fires on schedule, escalates on schedule, and falls down the stairs because it has no model of the environment it operates in. No context. No learning. No adaptation. ED-209 is /goal without a Strategic Phase: powerful execution aimed at nothing in particular, with no mechanism to get smarter after it fails.
Murphy is the other architecture. Same execution capability, but wrapped in something that remembers, reasons about context, and learns from what went wrong. He carries his history into every encounter. He adjusts. The film frames this as the human element inside the robotic loop — and that's exactly the structural problem APEX is designed to solve. The execution loop needs a human-informed operating model around it, or you're building ED-209s.
The execution loop needs direction
/goal takes a goal as input. It doesn't help you figure out what goal to set. It doesn't define what "good output" means for your domain. It doesn't establish the criteria the Judge will evaluate against. And after the loop closes, it captures nothing about what happened — so the next run starts from the same baseline.
In APEX terms: /goal covers the Execution Phase, but it has no Strategic Phase feeding it direction and no Reflection Phase capturing learning from it.
The Strategic Phase is where everything that matters happens before an agent runs. It's where you establish what you're building and why — the business context, the user personas, the competitive constraints. It's where you write the specifications the agent executes against. It's where you configure the agents themselves: their identities, their skills, their memory, the full context they carry into every execution cycle. And crucially, it's where you define the quality criteria the Judge will apply.
In APEX, this work organizes across nine domains grouped into three areas. Platform covers Infrastructure, Operational Tooling, and Security and Compliance. Spec covers Business Context, Spec Engineering, and QA Strategic. Config covers Agent Design, Orchestration Design, and QA Operational. Each domain has a clear owner, clear artifacts, and clear boundaries. Business Context is separate from Spec Engineering so that domain knowledge doesn't get tangled with execution instructions. QA Strategic is separate from QA Operational so that humans define quality and agents enforce it — the same principle that makes Judge-Evaluated Continuation work.
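One way to keep that structure honest is to write it down as data. The sketch below encodes the three areas and nine domains named above and checks which domains still lack an owner. The APEX_DOMAINS layout, the helper names, and the ownership map are assumptions made for illustration, not an official APEX schema.

```python
from typing import Dict, List

# The three areas and nine domains as named in the text above.
APEX_DOMAINS: Dict[str, List[str]] = {
    "Platform": ["Infrastructure", "Operational Tooling", "Security and Compliance"],
    "Spec": ["Business Context", "Spec Engineering", "QA Strategic"],
    "Config": ["Agent Design", "Orchestration Design", "QA Operational"],
}


def area_of(domain: str) -> str:
    """Return the area a domain belongs to, or raise KeyError if it is unknown."""
    for area, domains in APEX_DOMAINS.items():
        if domain in domains:
            return area
    raise KeyError(domain)


def unowned(owners: Dict[str, str]) -> List[str]:
    """List every domain that has no entry in the ownership map."""
    return [d for domains in APEX_DOMAINS.values() for d in domains if d not in owners]


# Hypothetical, deliberately incomplete ownership map; role names are illustrative.
owners = {
    "Business Context": "product manager",
    "Spec Engineering": "product manager",
    "Orchestration Design": "tech lead",
    "QA Strategic": "QA lead",
    "QA Operational": "QA lead",
}

missing = unowned(owners)
print(missing)
```

A check like this is trivial, but it turns "each domain has a clear owner" from an intention into something a team can verify before the first execution cycle.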
This matters for /goal specifically because the Judge's evaluation is only as good as the criteria it's evaluating against. When an agent produces output that technically satisfies the goal statement but misses the actual intent — and in my experience this happens more than you'd expect — the gap is almost always in the Strategic setup. The goal statement was underspecified. The agent lacked the business context to interpret ambiguous requirements correctly. The evaluation criteria were implicit rather than explicit. Those are Strategic problems. No iteration budget fixes them.
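A small sketch of what "explicit rather than implicit" can look like in practice: the criteria are named checks that a Judge applies, so a failure points at a specific criterion instead of a vague sense that the output missed the intent. The judge_against function and the two criteria are hypothetical examples, and real criteria would usually need a model or a human to evaluate, not a string test.

```python
from typing import Callable, Dict, List

Criterion = Callable[[str], bool]


def judge_against(criteria: Dict[str, Criterion], output: str) -> List[str]:
    """Return the names of the criteria the output fails; an empty list means pass."""
    return [name for name, check in criteria.items() if not check(output)]


# Hypothetical criteria for a "write a release note" goal.
criteria: Dict[str, Criterion] = {
    "mentions breaking change": lambda text: "breaking" in text.lower(),
    "under 280 characters": lambda text: len(text) <= 280,
}

failures = judge_against(criteria, "Added a faster parser.")
print(failures)
```

The first draft satisfies a goal like "write a release note" in the literal sense, yet it fails a criterion that only someone with the business context would have known to write down. That is the Strategic gap, made visible.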
The execution loop also needs to learn
The Reflection Phase closes the other side of the loop. After execution completes, APEX prescribes three steps: evaluate the output against original intent, reflect on what the metrics reveal about system performance, and calibrate by improving the Strategic configuration before the next cycle.
Without Reflection, every run of /goal starts from the same baseline. The agent configurations don't improve. The specifications don't sharpen. The quality criteria don't tighten. The same drift patterns repeat cycle after cycle.
In my experience running agentic operations, Reflection is the most consistently skipped phase, especially under delivery pressure. Teams ship, move on, start the next goal. The result is predictable: iteration depth stays flat, first-pass acceptance doesn't improve, and the same classes of failure appear cycle after cycle. The agents aren't the problem. The feedback loop is absent.
APEX suggests starting points — First-Pass Acceptance Rate, Iteration Depth, Human Touch Rate, Calibration Impact, Cycle Time — but these are examples, not a fixed dashboard. Every project needs its own KPIs based on what actually matters in that context. The point is that you measure at all. When you can say "iteration depth on API integration tasks dropped from 4.2 to 2.8 after we added auth architecture patterns to Business Context," you're not speculating about whether the system improved. You're reading it from the data. That's what makes calibration decisions actionable rather than guesswork. And it's what separates a system that learns from one that just runs.
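As a rough illustration of reading improvement from the data, the sketch below computes three of those starting-point metrics from a log of runs. The RunRecord fields, the metric definitions (for example, first-pass acceptance as "the Judge passed the work on turn one"), and the sample numbers are all assumptions chosen to mirror the 4.2 to 2.8 example; they are not measurements from a real project or official APEX formulas.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RunRecord:
    task_type: str
    turns: int            # loop turns until the Judge passed the work
    human_touched: bool   # a person had to intervene during the run


def first_pass_acceptance_rate(runs: List[RunRecord]) -> float:
    """Share of runs accepted on the first turn (assumed definition)."""
    return sum(1 for r in runs if r.turns == 1) / len(runs)


def iteration_depth(runs: List[RunRecord]) -> float:
    """Average number of turns per run (assumed definition)."""
    return mean(r.turns for r in runs)


def human_touch_rate(runs: List[RunRecord]) -> float:
    """Share of runs that needed human intervention (assumed definition)."""
    return sum(1 for r in runs if r.human_touched) / len(runs)


# Invented sample data: runs before and after a change to Business Context.
before = [RunRecord("api-integration", t, t > 4) for t in (4, 5, 4, 4, 4)]
after = [RunRecord("api-integration", t, False) for t in (3, 3, 3, 2, 3)]

depth_before = iteration_depth(before)
depth_after = iteration_depth(after)
print(round(depth_before, 2), round(depth_after, 2))
```

The value is less in the arithmetic than in the habit: once each run leaves a record, a calibration change can be compared against the cycles that came before it.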
Single-player vs. team operating model
/goal is designed for one person working with one agent or a small cluster of agents. That's an appropriate scope for an individual practitioner. But most teams running agentic systems at any real scale aren't solo: they're five to ten people who need to coordinate contributions to a shared system, maintain consistent quality standards, and learn from shared execution data.
In an APEX instance, the Strategic Phase is a team artifact. The product manager owns Business Context and Spec Engineering. The tech lead owns Orchestration Design. The QA lead owns QA Strategic and QA Operational. When these domains have clear owners and the artifacts live in a shared workspace, multiple people can contribute to the same system without stepping on each other. The Execution Phase runs with everyone's collective expertise embedded in the agent's context. The Reflection Phase produces data that the whole team acts on together.
/goal can run inside an APEX instance — it's a valid harness choice for the Execution Phase. A general-purpose harness like Claude Code or Codex CLI fits well when work is exploratory and a human is available to steer. But the Strategic artifacts need to exist before it runs, and Reflection needs to close the loop afterward. Without that structure, a team of five people all running /goal independently is five individual workflows. Not a coordinated operating model.
What the convergence signals
I think three independent teams converging on the same execution pattern within four weeks is one of the clearer signals the field has produced this year. It establishes that Judge-Evaluated Continuation is the right abstraction for autonomous agentic execution. That debate is over. The tooling will keep improving, and /goal in some form will be a standard feature of every serious agentic platform within the year.
What's not yet standardized is the scaffolding around it. The Strategic Phase that gives the execution loop its direction. The Reflection Phase that turns each cycle into a learning event. The organizational structure that lets teams coordinate contributions to a shared system rather than each running their own isolated loop.
If you're starting to run autonomous agents at any real scale, the questions worth investing time in aren't about the execution loop itself — that pattern is solved. The questions are: who owns Strategic design in your organization, what does your calibration cadence look like, and are you treating the system itself as the thing that needs to improve each cycle — not just the output?
The hardest part of autonomous agents was never the automation. It was knowing what to automate, and learning from what you got wrong.