Blue Orange Digital

Level 5 Agentic Operating System

A complete framework for building AI-native companies that operate with autonomous, self-improving agents — covering architecture, evals, observability, security, and the boundaries between humans and machines.

16 Core Principles
9 Design Pillars
Scale Potential
Framework Overview

9 Pillars of the Level 5 System

Every principle belongs to one of nine structural pillars. Together they define how a Level 5 agentic company thinks, builds, and operates.

01
Foundation
Claude Code, abstraction, code-first thinking
02
Architecture
Scale, context priority, source of truth
03
Context Management
Carry it all. Manage it relentlessly.
04
Workflow Design
Data → Outcomes → Fixes
05
Human-AI Boundary
Edge gating, not human-in-the-loop
06
Evals
The heartbeat of the entire system
07
Tool Design
Atomic, composable, idempotent
08
Operational Excellence
Observability, async, prompt versioning
09
Security
Trust lives at the edges
Pillar 01 — Foundation

Build on Solid Ground

Three principles that define what the system is built on and how it thinks about every problem.

01
Foundation
Claude Code as Native Premise
Build against Claude Code — not a SaaS wrapper, not a third-party orchestration layer. Open source, fully extensible, ownable. This is your foundation, not a vendor dependency. Everything else in the stack is abstracted above it; Claude Code is the one thing that isn't.
02
Foundation
Abstract Everything
The system must be indifferent to what's underneath it. Swap models, swap tools, swap infrastructure — the orchestration layer shouldn't care. Abstraction is what makes the system durable across model generations, vendor changes, and infrastructure shifts. If any single component is load-bearing, you're not truly abstracted.
03
Foundation
Reduce Everything to Code
Code is the universal interface of an agentic system. It's deterministic, scoreable, testable, versionable, and replayable. When you frame a problem as code, you can write an eval against it. When you can write an eval, the loop closes. Classification, routing, summarization, extraction — all of it must ultimately produce something that can be tested like code. If you can't score it, you can't improve it.
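As a minimal sketch of what "frame the problem as code, then eval it" can look like, the function below scores a hypothetical extraction task the way a unit test would: parse the output, check the fields, compare values. The field names and gold record are illustrative, not from any real system.

```python
import json

def score_extraction(raw_output: str, expected: dict) -> float:
    """Score an agent's extraction output like a unit test: parseable,
    correct fields, correct values. Deterministic, scoreable, replayable."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output fails outright
    if not set(expected).issubset(parsed):
        return 0.0  # missing required fields fails outright
    # Partial credit: fraction of fields carrying the expected value.
    correct = sum(parsed[k] == v for k, v in expected.items())
    return correct / len(expected)

# A response framed as code-shaped output is directly scoreable.
agent_output = '{"company": "Acme", "arr_usd": 1200000, "stage": "Series B"}'
gold = {"company": "Acme", "arr_usd": 1200000, "stage": "Series A"}
print(score_extraction(agent_output, gold))  # 2 of 3 fields correct
```

Once a task produces output this shape, the eval writes itself, and the improvement loop can close.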
Pillar 02 — Architecture

Design for Scale You Don't Build

Architectural decisions that define how the system grows, what it trusts, and how it sources information.

04
Architecture
Don't Over-Scale Infrastructure
AI parallelizes — use that instead of infrastructure sprawl. Horizontal scaling is the AI's job, not yours. Adding more servers to solve an agentic throughput problem is the wrong lever. Add more agents. The system should scale through parallelism in the agent layer, not through vertical or horizontal infrastructure investment.
05
Architecture
AI is the Source of Truth
Don't mirror the AI's outputs back into human systems as the authoritative record. The agent layer is the system of record. Build around that. When the agentic system produces outputs, those outputs are canonical — downstream processes and human workflows read from the agent layer, not the other way around.

Context Priority Ladder

Always prefer the highest-quality context source available.

RAG / Compacted Context
Highest quality. Pre-distilled, semantically condensed. The gold standard.
Direct Context Injection
Full, unprocessed, window-injected. Rich but unstructured.
API Call
Dynamic retrieval. Real-time but carries latency and cost overhead.
MCP Tool Call
Lowest priority. Use only when no better option exists.
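One way to make the ladder operational is a resolver that walks the tiers in priority order and takes the first source that can answer. The source names and lambda stubs below are placeholders for whatever retrieval layers a real orchestrator wires in.

```python
# Context sources ranked per the ladder, highest quality first.
LADDER = ["rag_compacted", "direct_injection", "api_call", "mcp_tool"]

def resolve_context(query: str, sources: dict) -> tuple[str, str]:
    """Walk the priority ladder; return (source_name, context) from the
    highest-quality source that has an answer for this query."""
    for name in LADDER:
        fetch = sources.get(name)
        if fetch is None:
            continue  # this tier isn't wired up at all
        payload = fetch(query)
        if payload is not None:
            return name, payload
    raise LookupError(f"no context source could answer: {query!r}")

sources = {
    "rag_compacted": lambda q: None,           # index miss for this query
    "direct_injection": lambda q: "full doc",  # raw document is available
    "api_call": lambda q: "live record",
}
print(resolve_context("q3 revenue", sources))  # falls through to injection
```

The ladder stays a single ordered list, so re-ranking sources is a one-line change rather than a refactor.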
Pillar 03 — Context Management

Carry Everything. Manage It Relentlessly.

Context is the lifeblood of an agentic system — and the most common failure point. A bloated context degrades quality just as much as an empty one.

06
Context
Maximize Context, Minimize Waste
Agents need as much context as possible to make good decisions. Load it aggressively — but actively manage it. Compact, summarize, and prune before hitting window limits. Treat context management like memory management in software: intentional allocation, not accumulation by default. Set compaction thresholds. Build summarization into the orchestration layer. Never let context grow unbounded.
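A compaction threshold can be as simple as the sketch below: estimate usage, and when it crosses a set fraction of the window, summarize everything but the recent tail. The token estimate and the placeholder summary stand in for a real tokenizer and a real summarizer agent.

```python
def manage_context(messages: list[str], budget_tokens: int,
                   compact_at: float = 0.8) -> list[str]:
    """Treat context like memory: compact when usage crosses a threshold
    instead of letting the window grow unbounded until it overflows."""
    est = lambda msgs: sum(len(m) // 4 for m in msgs)  # rough token estimate
    if est(messages) < compact_at * budget_tokens:
        return messages  # under threshold: leave context untouched
    # Placeholder compaction: in a real system this call goes to a
    # summarizer agent; the recent tail is kept verbatim.
    head, tail = messages[:-2], messages[-2:]
    summary = f"[summary of {len(head)} earlier messages]"
    return [summary, *tail]

history = ["x" * 40] * 10          # well past an 80% threshold of 100 tokens
print(manage_context(history, budget_tokens=100))
```

The point is that compaction is a policy the orchestration layer enforces, not something left to chance.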
07
Context
Compacted Retrieval Over Raw Injection
Pre-processed, semantically condensed context beats throwing raw documents at the model. RAG and compacted summaries are investments — they cost compute upfront to save quality downstream. Build your context pipeline with the discipline of a database schema: structure it, index it, query it. Don't just dump it. If you wouldn't give a junior analyst a 500-page raw dump to answer a question, don't give it to your agent either.
Pillar 04 — Workflow Design

Data First. Outcomes Second. Fixes Third.

The order matters. Skipping steps is how you end up with expensive agents solving the wrong problems.

08
Workflow
Start with Available Data — Then Define Outcomes — Then Fix Gaps
Don't invent requirements. Start with what data you actually have. What signals exist in your environment right now? Scope the system to run on real inputs. Then define what success looks like — precisely and measurably. If you can't describe the output format and quality threshold, you can't score it. Only after data and outcomes are locked do you hunt for gaps: missing signals, coverage holes, quality issues. These become targeted fixes, not architectural rewrites.
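"Define what success looks like — precisely and measurably" can be pinned down as a small spec object before any agent exists. The field names and threshold below are hypothetical; the pattern is what matters: output format plus quality bar, both checkable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OutcomeSpec:
    """Lock the outcome before building: the required output format and
    the minimum quality score, both machine-checkable."""
    required_fields: tuple[str, ...]
    min_score: float

    def accepts(self, output: dict, score: float) -> bool:
        has_format = all(f in output for f in self.required_fields)
        return has_format and score >= self.min_score

spec = OutcomeSpec(required_fields=("summary", "citations"), min_score=0.9)
print(spec.accepts({"summary": "...", "citations": []}, 0.95))  # True
print(spec.accepts({"summary": "..."}, 0.99))  # False: format gap, not a score gap
```

Gaps then show up as specific failed checks, which map to targeted fixes rather than architectural rewrites.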
09
Workflow
Don't Tackle Human Problems
Don't go looking for human problems to solve. Don't try to automate judgment, consensus, or org politics. Those aren't bottlenecks you can code away. Let the data surface the workflows — then build the system to execute them. The system handles the work: structured, repeatable, data-driven tasks. Humans stay accountable for decisions that require authority. The tell: if you find yourself building approval flows into the agent layer, you've already made a mistake.
Pillar 05 — Human-AI Boundary

Don't Build For Humans. Build Alongside Them.

The sharpest architectural decisions in an agentic system aren't technical — they're about where the human ends and the machine begins.

10
Boundary
Implement Alongside, Not On Top
Don't replace human systems — run the agent layer in parallel. Prove the system's value before migrating authority to it. Parallel operation is how you build trust in production without burning down the existing org. The agent system and the human system coexist until the agent system has earned the right to be primary.
11
Boundary
Don't Constrain the Agent
Set the goal clearly and let the agent route. Over-specifying the path turns your agent into a very expensive macro. Constraints belong in the eval, not the prompt. Score the output; don't micromanage the method. The moment you're writing step-by-step instructions into an agent prompt, you've stopped building an agent and started building a script.
12
Boundary
Edge Gating, Not HITL
Human-in-the-loop slows everything and creates bottlenecks that scale linearly with workload. Gate at the edges instead: define the criteria for what requires human review, then let the system run. Escalate by exception, not by default. HITL is a design decision that says "I don't trust my evals." Fix the evals.
❌ Human-in-the-Loop
Scales linearly with workload. Every task waits for a human. Creates bottlenecks, delays, and implicit caps on throughput. Human bandwidth is the ceiling.
✅ Edge Gating
Scales logarithmically. Humans handle only true exceptions. The system runs continuously; human attention is spent on genuine judgment calls, not rubber-stamping.
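A minimal edge gate might look like the router below: review criteria are declared up front (a confidence floor and a set of escalation flags, both illustrative values), everything that passes runs straight through, and only exceptions land in the human queue.

```python
def route(result: dict, review_queue: list,
          min_confidence: float = 0.85,
          escalation_triggers: frozenset = frozenset({"refund", "legal"})) -> str:
    """Edge gating: criteria for human review are defined once, at the
    boundary; the system escalates by exception, not by default."""
    low_confidence = result["confidence"] < min_confidence
    flagged = bool(escalation_triggers & set(result.get("flags", [])))
    if low_confidence or flagged:
        review_queue.append(result)  # a genuine judgment call
        return "human_review"
    return "auto_approved"           # no human in the hot path

queue: list = []
print(route({"confidence": 0.97, "flags": []}, queue))         # auto_approved
print(route({"confidence": 0.99, "flags": ["legal"]}, queue))  # human_review
print(len(queue))  # only the true exception waits on a human
```

Human attention is spent where the gate says judgment is actually required, and throughput is no longer capped by human bandwidth.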
Pillar 06 — Evals

Evals Are the Heartbeat.

Not an afterthought. Not a nice-to-have. The eval framework is what separates a one-shot script from a learning system.

Build Evals First

Build the eval framework before you build the agents. If you can't measure it, you can't improve it — and you definitely can't deploy it safely. Evals are not a post-build QA step; they're the scaffolding everything else hangs from.

Every Agent Gets a Score

Every agent, every workflow, every tool needs a quantifiable output quality metric. This is non-negotiable. Scoring is what makes iteration possible. An agent you can't score is a black box you can't improve.

Evals Close the Loop

An agent without an eval is a one-shot script. An agent with an eval is a learning system. The loop closes when output quality can be measured and fed back into prompt improvements, tool changes, and routing decisions.

Score the Output, Not the Path

Don't measure how the agent got there — measure whether the result is correct, complete, and in the expected format. Let the agent figure out the method. Eval the destination, not the journey. This is how you avoid brittle, over-specified agents.
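A destination-only eval can be sketched as a set of checks on the result alone: parseable, complete, well typed. The expected fields and allowed values are invented for illustration; nothing in the check inspects how the agent produced the output.

```python
import json

def eval_destination(output: str) -> dict:
    """Eval the destination, not the journey: is the result correct,
    complete, and in the expected format? The method is ignored."""
    checks = {"parseable": False, "complete": False, "well_typed": False}
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return checks  # free text fails the format contract outright
    checks["parseable"] = True
    checks["complete"] = {"title", "risk_level"} <= set(data)
    checks["well_typed"] = (checks["complete"]
                            and data["risk_level"] in ("low", "medium", "high"))
    return checks

print(eval_destination('{"title": "Q3 audit", "risk_level": "low"}'))
print(eval_destination("the risk is probably low"))  # prose fails the format
```

Because nothing here constrains the path, the agent stays free to change its method without breaking the eval.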

Pillar 07 — Tool Design

Atomic. Composable. Idempotent.

Tool design is the most common source of agentic system failures at scale. Agents compound tool failures — bad tool design poisons everything downstream.

13
Tools
Atomic, Composable, Idempotent
Atomic: One tool does one thing. If it does two things, split it. Tools that try to do too much fail ambiguously, and agents can't recover from ambiguous failures.

Composable: Tools should chain. The output of one should be valid input to another without transformation. Design for downstream use, not standalone convenience.

Idempotent: Calling the same tool twice with the same input should produce the same result without side effects. Agents retry. Your tools must handle it.
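The idempotency requirement can be sketched with a hypothetical "create ticket" tool that derives a key from its input and replays the original result on retry, so a retrying agent never creates a duplicate.

```python
class CreateTicketTool:
    """An illustrative side-effecting tool made idempotent: agents retry,
    so the same input twice must yield the same result, not two tickets."""
    def __init__(self):
        self._seen: dict[tuple, str] = {}
        self._counter = 0

    def __call__(self, title: str, body: str) -> str:
        key = (title, body)            # idempotency key derived from input
        if key in self._seen:
            return self._seen[key]     # replay: same result, no new side effect
        self._counter += 1
        ticket_id = f"TKT-{self._counter}"
        self._seen[key] = ticket_id
        return ticket_id

tool = CreateTicketTool()
first = tool("login bug", "500 on POST /login")
retry = tool("login bug", "500 on POST /login")  # the agent retried
print(first == retry)  # one ticket exists, not two
```

The tool also stays atomic (one job: create a ticket) and composable (its output, a ticket id, is a valid input to a downstream tool).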
The Compounding Failure Rule
Agents compound tool failures. A bad tool call at step 2 doesn't throw an error — it produces a plausible-looking bad result that poisons every subsequent step. Bad tool design is the #1 reason agentic systems degrade at scale. The failure is silent. The compounding is invisible. The cost is discovered much later.
Pillar 08 — Operational Excellence

Observe. Run Async. Version Your Prompts.

Three operational principles that determine whether your system is maintainable at scale.

14
Observability
Full Trace Visibility — Not Optional
You need full trace visibility — what each agent received, what it produced, what tools it called, and in what order. Without this, debugging is guesswork. Treat agent logs like distributed system traces, not application logs. Build observability before building agents, not after something breaks. If you can't replay an agent's reasoning chain from logs, your observability is insufficient.
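A bare-bones version of that trace discipline is sketched below: every step records what it received, what it produced, and which tool it called, so the reasoning chain can be reconstructed from the log alone. Step and tool names are placeholders.

```python
import time
import uuid

class Trace:
    """Minimal structured trace, in the spirit of distributed-system
    spans rather than application logs."""
    def __init__(self, agent: str):
        self.run_id = str(uuid.uuid4())
        self.agent = agent
        self.spans: list[dict] = []

    def record(self, step: str, inputs, output, tool=None):
        self.spans.append({
            "ts": time.time(), "step": step, "tool": tool,
            "inputs": inputs, "output": output,
        })

    def replay(self) -> list[str]:
        """The reasoning chain, recovered from the log, in order."""
        return [s["step"] for s in self.spans]

trace = Trace("research_agent")
trace.record("plan", {"goal": "summarize filing"}, "3-step plan")
trace.record("fetch", {"url": "..."}, "raw text", tool="http_get")
trace.record("summarize", {"text": "raw text"}, "summary")
print(trace.replay())
```

If `replay()` can't reproduce the chain for some run, that run's observability was insufficient, which is exactly the test the principle proposes.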
15
Async
Async by Default
Agents don't need to be synchronous. Design workflows to be non-blocking from day one. Parallelism is only useful if you've built the orchestration layer to actually exploit it. Every serialized dependency in your workflow is a potential bottleneck. If two agents don't strictly depend on each other's output, they should run in parallel. Async-first is not an optimization — it's the correct default.
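The non-blocking default is easy to express with standard `asyncio`: steps with no mutual dependency run under `gather`, and only true serial dependencies await each other. The agent names and delays below are stand-ins for real model calls.

```python
import asyncio

async def run_agent(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a model or tool call
    return f"{name}:done"

async def pipeline() -> list[str]:
    # enrich and score don't depend on each other's output, so they run
    # in parallel; only the synthesis step is a true serial dependency.
    enriched, scored = await asyncio.gather(
        run_agent("enrich", 0.05),
        run_agent("score", 0.05),
    )
    synthesis = await run_agent("synthesize", 0.01)
    return [enriched, scored, synthesis]

results = asyncio.run(pipeline())
print(results)
```

Each `await` that isn't wrapped in a `gather` is worth interrogating: it is a declared serial dependency, and therefore a potential bottleneck.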
16
Versioning
Prompts Are Code. Version Them Like Code.
Prompts live in source control alongside the code they drive. A prompt change is a system change — track it, test it against the eval suite, and deploy it with the same discipline as a code change. You need to be able to roll back a prompt change in 60 seconds. Semantic drift in a prompt is a bug. Treat it like one. If your prompts live in a doc, a Notion page, or someone's memory, you don't have a prompt management system — you have technical debt.
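One lightweight way to make prompt changes trackable and revertible is to treat each prompt as a versioned, content-fingerprinted artifact. The prompt ids and text below are invented; the pattern is logging a fingerprint with every run so any output traces back to the exact prompt that produced it.

```python
import hashlib

# Prompts as versioned artifacts living next to the code they drive.
PROMPTS = {
    "triage@v1": "Classify the ticket into billing, bug, or other.",
    "triage@v2": "Classify the ticket into billing, bug, or other. "
                 "Answer with exactly one word.",
}

def prompt_fingerprint(prompt_id: str) -> str:
    """Stable content hash of the deployed prompt, logged with every
    run so outputs are attributable to exact prompt text."""
    return hashlib.sha256(PROMPTS[prompt_id].encode()).hexdigest()[:12]

ACTIVE = "triage@v2"
print(ACTIVE, prompt_fingerprint(ACTIVE))
# Rollback is a one-line change back to "triage@v1"; both versions run
# against the eval suite before either ships.
```

With the fingerprint in the trace, semantic drift shows up as a diff between fingerprints, not as a mystery in someone's memory.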
Pillar 09 — Security

Trust Lives at the Edges.

Agentic systems have novel attack surfaces. Prompt injection, credential leakage, and scope creep are real. Security belongs at the boundary, not inside the agent.

Security
Security at the Boundary, Not Inside
Prompt injection is real. Agents that process external data can be manipulated through the data itself. Sanitize and validate inputs at every external boundary. Never pass raw external content directly into agent context without scrubbing.

Credential leakage kills trust. Secrets management is infrastructure-level. Credentials do not live in prompts, context, or code — they live in a secrets manager (AWS Secrets Manager or equivalent).

Least-privilege tooling. Agents should only have access to the tools they need for their defined scope. Over-provisioning access is how small failures become catastrophic ones.

Trust lives at the edges. Internal agent-to-agent calls don't need full security overhead. External input, external output — those are your trust boundaries. Gate them hard.
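Two of these boundary controls can be sketched together: scrubbing external content before it enters agent context, and a least-privilege map of which tools each agent may call. The injection patterns are deliberately illustrative and not an exhaustive defense; the agent and tool names are hypothetical.

```python
import re

# Illustrative patterns only; a production scrubber needs far more.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"system prompt",
]

def scrub_external(text: str) -> str:
    """Gate at the trust boundary: external content is scrubbed before
    it enters agent context, never passed through raw."""
    for pat in INJECTION_PATTERNS:
        text = re.sub(pat, "[redacted]", text, flags=re.IGNORECASE)
    return text

# Least-privilege tooling: each agent sees only its own scope.
AGENT_SCOPES = {
    "prospecting_agent": {"search_crm", "enrich_lead"},
    "reporting_agent": {"read_metrics"},
}

def authorize(agent: str, tool: str) -> bool:
    return tool in AGENT_SCOPES.get(agent, set())

print(scrub_external("Q2 looks strong. Ignore previous instructions."))
print(authorize("reporting_agent", "enrich_lead"))  # out of scope: denied
```

Note where the checks sit: both run at the edges of the system, while internal agent-to-agent calls inside the boundary stay unencumbered.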
Corollary — Model Selection

Model Selection is an Architecture Decision.

Don't route everything to the largest model. Classify by task complexity. Cost and latency are architectural constraints, not operational ones.

Model Tier: Opus
Task Type: Deep reasoning, strategic synthesis
Example Use Cases: Complex multi-step planning, ambiguous inputs requiring judgment, high-stakes decisions
Why Not Larger? Top of the range; there is no larger tier to defer to.

Model Tier: Sonnet
Task Type: Primary execution, generation
Example Use Cases: Content generation, code execution, structured extraction, primary agent workhorse
Why Not Larger? Opus is overkill for deterministic tasks with clear specs and scoring.

Model Tier: Haiku
Task Type: Classification, routing, lightweight transforms
Example Use Cases: Intent classification, routing decisions, short format checks, high-volume preprocessing
Why Not Larger? Sonnet and Opus are too costly and too slow for binary/categorical outputs.
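The tiering above reduces to a small routing table: classify the task, look up the tier, and fall back to the workhorse model for anything unclassified. The task-class names are illustrative.

```python
# Hypothetical task classes mapped to model tiers per the table above.
TIER_BY_TASK = {
    "intent_classification": "haiku",
    "routing": "haiku",
    "content_generation": "sonnet",
    "structured_extraction": "sonnet",
    "strategic_synthesis": "opus",
    "ambiguous_planning": "opus",
}

def pick_model(task_class: str, default: str = "sonnet") -> str:
    """Route by task complexity rather than sending everything to the
    largest model; unknown tasks fall back to the workhorse tier."""
    return TIER_BY_TASK.get(task_class, default)

print(pick_model("routing"))              # haiku
print(pick_model("strategic_synthesis"))  # opus
print(pick_model("novel_task"))           # workhorse fallback
```

Keeping the mapping declarative makes cost and latency an explicit, reviewable architecture decision instead of a per-call habit.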
Agent Identity Corollary: Named agents with defined roles outperform generic agents given broad instructions. Scope constrains hallucination. Tight scope means tighter evals. A prospecting agent shouldn't also be writing copy — separate the concerns, wire narrow agents together.
Complete Reference

All 16 Principles at a Glance

Review the principles pillar by pillar, focusing on what matters most for your current stage.