Prompt, Context, Harness, Loop — and Then What? The Rising Waterline of AI Engineering


I maintain an open-source suite of Claude Skills for GEO/SEO. The thing I do most isn't adding features — it's deleting them.

Every time the model levels up, I go back and erase capabilities I was once proud of: hand-tuned prompts, homemade retrieval, hard-coded tool flows. And the gap between version bumps keeps shrinking — from monthly to every few days.

Lately I've been watching its loop: let it run, try, self-correct, until the job is done. I figured I'd finally stabilize at this layer. Until last week, when I watched an agent that could loop twenty times and rewrite itself get stopped cold by the simplest question there is: how do I prove those twenty turns were actually right?

That's when it clicked. I wasn't leveling up — I was being chased upward by a rising tide. Prompt, Context, Harness, Loop: we treat them as four skills to learn one at a time. They are really the steps of one staircase sinking under the sea. The water only rises; the step you stand on will drown; the money and the opportunities are always one step higher, where the water hasn't reached yet.

Every layer you master is being submerged

Every AI system in production has a layer where a human must wade in: the model can't reach it alone, so someone stands waist-deep in the water, watching, correcting, feeding it. That layer is your foothold — where you charge, where you build a moat. But every year or two the water rises a notch and drowns it. Your skill doesn't vanish: it sinks and becomes the seabed everyone treads on for free. And you retreat one step higher.

What pushes the water up isn't only the model. Several forces act at once: capability, the collapse of inference cost, tooling maturity, and regulation. Capability is necessary, not sufficient: often the model can already do the job, and it's law and liability that won't let it. The same script has played out four times in three years.

Layer 1 — Prompt. In 2023 the prompt engineer was the most coveted job, paid fortunes. Two years later it got an obituary. Not because it's useless: because RLHF welded "knowing how to phrase it" into the model. GPT and Claude understand human language; nobody pays a premium for a prompt engineer anymore. Your incantations — "let's think step by step," few-shot examples, "you are a senior expert" — are now native, a single line of configuration.

Layer 2 — Context. The prompt's orphans turned to feeding the model: RAG, but also chunking, vectorization, re-ranking, memory management, context budgeting. Then the million-token window swallowed most of "moving information." What sank: the finely tuned chunk size, the "every RAG needs a vector DB" rule, the map-reduce splicing. The layer didn't die — it moved: from "I inject documents" to "I decide what the model sees and doesn't." Submerged doesn't mean gone; it means it became bedrock. But your premium evaporates with it.

Layer 3 — Harness. You hard-code the agent's workflow: call this tool, then that one, retry on failure, store state here. The LangChain generation. Then function-calling accuracy climbs, the model plans natively, and the hard-coded flowcharts (if-else, DAGs) thin out. The per-tool adapter plumbing gets standardized in one stroke by protocols like MCP. Orchestration slims down; the governance shell, by contrast, thickens.
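For readers who never lived through it, here is a minimal sketch of what a hard-coded harness looked like: a fixed tool order, a retry on failure, and an explicit state dictionary. The tools `fetch_page` and `summarize` are hypothetical stand-ins invented for illustration, not the API of LangChain or any other framework.

```python
def run_with_retry(step, state, max_attempts=3):
    """Run one step; on an exception, log it in the state and try again."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step(state)
        except Exception as exc:
            last_error = exc
            state.setdefault("errors", []).append(
                f"{step.__name__} attempt {attempt}: {exc}"
            )
    raise RuntimeError(
        f"{step.__name__} failed after {max_attempts} attempts"
    ) from last_error


def run_pipeline(steps, state):
    """The 'flowchart': the order of tool calls is fixed by the engineer."""
    for step in steps:
        state = run_with_retry(step, state)
        state.setdefault("completed", []).append(step.__name__)
    return state


# Hypothetical tools. fetch_page fails once to exercise the retry path.
_calls = {"fetch_page": 0}


def fetch_page(state):
    _calls["fetch_page"] += 1
    if _calls["fetch_page"] == 1:
        raise TimeoutError("simulated timeout")
    return {**state, "page": "raw text of the page"}


def summarize(state):
    # Stand-in for a model call: keep the first 8 characters.
    return {**state, "summary": state["page"][:8]}


result = run_pipeline([fetch_page, summarize], {"url": "https://example.com"})
print(result["completed"], result["errors"])
```

Every line of this is a decision the engineer made in advance. That is exactly the part that thins out once the model plans its own tool calls.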

Layer 4 — Loop. The 2026 frontier. You admit the model will err on long tasks, so you build it a feedback loop: perceive, think, act, observe — try, reflect, iterate until done. That's what my twenty-turn agent was doing. And the water is already rising around it: hand-written "reflect–retry" scaffolds and multi-agent debate orchestration are being swallowed by the next generation.

Field note — a front-line AI engineer: "I learn one, it gets hot, then it goes cold. The moment I master it, the next model version ships it natively. What moat am I even building?"

Four layers, four submerged footholds. Prompt lasted two years, context one, harness less — the curve is steepening. The next layer's shelf life is now measured in months, not years.

Your bet on "more autonomous" is sinking

Loop is the frontier. But "frontier" means: the next to be absorbed. Write this down: a system that loops is not automatically a system that's reliable. A loop's favorite failure isn't quitting — it's declaring victory too early. Twenty turns, a perfectly plausible patch, "done" — while the tests never touched the critical path, the citations were never verified, and sometimes it just quietly slipped the failure into the logs. Without verification, a loop is just automation, not engineering.
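A minimal sketch of the difference, assuming a toy task: the loop below never exits on the agent's own "done"; only an external check can end it. The names `run_loop`, `make_toy_agent`, and `verify` are hypothetical, and the "agent" is a scripted list of guesses standing in for a model.

```python
from dataclasses import dataclass, field


@dataclass
class LoopResult:
    verified: bool
    turns: int
    history: list = field(default_factory=list)


def run_loop(act, verify, max_turns=20):
    """Iterate until the external verifier passes, or the turn cap is hit."""
    history = []
    feedback = None
    for turn in range(1, max_turns + 1):
        candidate, claims_done = act(feedback)
        ok, feedback = verify(candidate)
        history.append((turn, candidate, claims_done, ok))
        if ok:
            return LoopResult(True, turn, history)
        # The agent's own claim is recorded but never trusted.
    return LoopResult(False, max_turns, history)


def make_toy_agent():
    """Scripted stand-in for a model: it always declares victory."""
    guesses = iter([3, 5, 7])

    def act(feedback):
        return next(guesses), True

    return act


def verify(candidate):
    """External check: the task is to find n with n * n == 49."""
    ok = candidate * candidate == 49
    return ok, (None if ok else f"{candidate} squared is not 49")


result = run_loop(make_toy_agent(), verify)
premature = sum(1 for _, _, claimed, ok in result.history if claimed and not ok)
print(result.verified, result.turns, premature)
```

In this toy run the agent says "done" on every turn, and two of those three claims are wrong. A loop that trusted the claim would have stopped at turn one.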

Worse, half of this layer is being internalized into the models for free. Distinguish two loops. The inner loop: try, reflect, fix before answering. The outer loop: run real tests, call tools, read feedback from a real environment, push forward across hours or days. Slow-thinking models absorb the inner loop into their "System 2": before answering, they've already turned hundreds of times. The model providers are baking your inner-loop scaffolding into the base layer, for free. The outer loop they don't give you: whether the code passed real tests, whether the downstream system recognized the business action, how to resume a multi-day task that broke. And running on its own doesn't mean running correctly, or being allowed to run. The premium paid for "more autonomous" is a bet on precisely the half-layer that's sinking.

Field note — a VC who funded agents: "Everything I backed is 'more autonomous' than the last. None can answer 'how do you prove it's right?' My LP asks for ROI; all I can say is — it runs very diligently."

So the money moves to the two grounds still above water. Two questions, exactly: how do you prove it's right? And even if it's right, by what authority does the agent act?

Ground #1: verification engineering

The more autonomous the loop, the more "prove it's right" is worth. Because the more the model generates, the deeper hallucinations and flaws hide — the gap widens between "code that looks right" and "code that's right under every edge case." Whoever closes that gap holds the next foothold. The human role shifts from "the one who writes" to "objective acceptance officer." Two hard requirements: separate generation from acceptance (the agent that writes isn't the one that grades), and deterministic gating (anything a test, compiler, or rule can settle is never handed to another LLM).
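The two requirements can be sketched in a few lines, under loud assumptions: the candidate is a string of Python source produced by some generator agent, the task (a `slugify` function) and its test cases are invented for illustration, and `exec` is used without a sandbox, which a real gate would not do. Acceptance is ordinary code, and no LLM is consulted.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    accepted: bool
    reason: str


def syntax_check(source):
    """Settled by the compiler, so never handed to a model."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return Verdict(False, f"syntax error: {exc.msg}")
    return Verdict(True, "compiles")


def behaviour_check(source):
    """Settled by tests. Sketch only: real gates run this in a sandbox."""
    cases = {"Hello World": "hello-world", "  a  b ": "a-b", "": ""}
    namespace = {}
    try:
        exec(source, namespace)
        slugify = namespace["slugify"]
        for given, expected in cases.items():
            got = slugify(given)
            if got != expected:
                return Verdict(
                    False, f"slugify({given!r}) gave {got!r}, expected {expected!r}"
                )
    except Exception as exc:
        return Verdict(False, f"runtime failure: {exc!r}")
    return Verdict(True, "all cases pass")


def acceptance_gate(source, checks):
    """Reject on the first failing deterministic check."""
    for check in checks:
        verdict = check(source)
        if not verdict.accepted:
            return verdict
    return Verdict(True, "passed every deterministic check")


CHECKS = [syntax_check, behaviour_check]

looks_right = 'def slugify(text):\n    return text.lower().replace(" ", "-")\n'
is_right = 'def slugify(text):\n    return "-".join(text.lower().split())\n'
broken = "def slugify(text) return text"

for name, src in [("looks_right", looks_right), ("is_right", is_right), ("broken", broken)]:
    print(name, acceptance_gate(src, CHECKS))
```

The first candidate is the interesting one: it handles the obvious input and fails on the whitespace edge case, which is the gap between "looks right" and "is right" in miniature. The generator never sees or edits `CHECKS`; that is the separation.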

Capital is pouring in. Braintrust: $36M Series A led by a16z at a $150M valuation; under a year later, an $80M Series B at $800M, roughly a 5x jump in a year. Patronus AI (model testing): $50M. Arize (eval and observability): $70M. Even Datadog and Databricks joined the Braintrust round. A harder signal still: SWE-bench Verified, the human-validated subset of the SWE-bench coding benchmark that OpenAI released together with the benchmark's authors. Top reported scores on SWE-bench and its Verified subset went from under 10% to over 80% in roughly two years. The point isn't the score: it's the word Verified. The center of gravity moved from "does it look like a solution" to "can the result be confirmed correct by an external standard." The industry is paying for verification, not generation.

Why is it a good business? Verification picks no side: it sells acceptance to everyone — legal, code, finance. A cross-cutting, unavoidable layer — the position SaaS dreams of. But it's not the promised land: valuations rising 5x in a year also means the window closes fast. Eval easily slides from independent company to a feature platforms give away. What you can take isn't the generic cross-cutting lane, but the verification standard of a specific vertical — legal, healthcare, finance — that nobody has set yet.

Field note — an agent builder with no answer: "Three weeks to get my loop running, and my boss asks one question: 'how do you prove the bug is fixed with no regressions?' I couldn't answer. What's valuable isn't my loop — it's the sentence I couldn't answer."

Ground #2: runtime governance

Even verified, one question remains: by what right does the agent act? A highly autonomous agent will do anything to hit your goal — burn thousands in compute for one bug, cross a privacy line for one record, call a high-risk API to finish a task. It isn't malicious: it has no sense of boundaries. The loop gave it the power to act, not the limits.

What gets engineered here: tool permissions, budget circuit-breakers, dynamic approvals, Policy-as-Code, reversible vs irreversible action tiers. The human role moves from "approve each action" to "design the permission regime": a chief compliance auditor. The giants vote with their wallets: Lakera (guardrails) bought by Check Point in late 2025 for about $300M; Protect AI bought by Palo Alto Networks, folded into a "development-to-runtime" platform. But those deals also set the ceiling: the top of this layer gets absorbed by security giants, not grown into an independent platform. Where to aim: runtime governance under a sector-specific regulatory frame, and the authorization protocols between agents, namely MCP (agent-to-tools, now a de facto standard), Google's A2A (agent-to-agent coordination), and AP2 (cryptographic signing of payment intent). While the protocols aren't frozen, the window stays open. Once the standard sets, winner takes all.
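To make the permission regime concrete, here is a minimal Policy-as-Code sketch. Everything in it is an assumption made for illustration: the tool names, the dollar thresholds, and the three-way decision are hypothetical, and a production system would more likely express this in a dedicated policy engine than in application code.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class ToolRule:
    reversible: bool      # can the action be undone?
    max_cost_usd: float   # per-call ceiling before a human is asked


@dataclass
class Policy:
    rules: dict
    budget_usd: float
    spent_usd: float = 0.0

    def decide(self, tool, cost_usd):
        rule = self.rules.get(tool)
        if rule is None:
            return Decision.DENY            # default deny: unlisted tools are off-limits
        if self.spent_usd + cost_usd > self.budget_usd:
            return Decision.DENY            # budget circuit-breaker
        if cost_usd > rule.max_cost_usd:
            return Decision.NEEDS_APPROVAL  # dynamic approval on unusual cost
        if not rule.reversible:
            return Decision.NEEDS_APPROVAL  # irreversible tier always escalates
        return Decision.ALLOW

    def record(self, cost_usd):
        """Call after an action actually ran, so the breaker sees real spend."""
        self.spent_usd += cost_usd


policy = Policy(
    rules={
        "read_record": ToolRule(reversible=True, max_cost_usd=0.10),
        "delete_record": ToolRule(reversible=False, max_cost_usd=0.10),
    },
    budget_usd=1.00,
)

print(policy.decide("read_record", 0.05))
print(policy.decide("delete_record", 0.05))
print(policy.decide("wire_transfer", 0.05))
policy.record(0.98)
print(policy.decide("read_record", 0.05))
```

The design choice worth noticing is the order of the checks: the unknown tool and the blown budget are refused outright, while cost spikes and irreversible actions are escalated to a human. The human designs that table once instead of approving each call.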

The hottest sub-segment: agent identity — Know Your Agent (KYA), the number-one enterprise security priority at RSAC 2026. Why the urgency? A survey of security leaders: 67% suspect their agent has pulled data beyond its rights; only 7% believe their controls would stop a compromised agent. Counter-intuitively, the higher-risk the industry, the earlier this layer arrives. In healthcare, finance, and defense, governance and verification arrive even before the loop matures — not because the model is too weak, but because the law demands a human signature. What bounds this business isn't only capability; it's legitimacy and accountability.

Field note — a CISO who's lost control of agent permissions: "The business pushes me daily to open up agent permissions. I open them: out-of-scope deletions, budget overruns, data leaks — and the blame is all mine. I don't want a smarter agent; I want something that can stop it before disaster."

The wall at the top: who decides what "we" want?

The water keeps rising — but one step stays forever out of reach. It isn't "how to coordinate many agents" (protocols, identity, mechanism design: still engineering). It's one notch above: when countless verified, authorized agents acting for different interests fight over the same resources, who decides what they collectively pursue? Purpose is layered. The operational goal (fix this bug, cut cost 10%) the model can infer. Individual preference (your taste, your risk appetite) it can learn. But "what we want," no — because there is no ready-made "we": the legitimacy of a collective goal rests on a mandate, not compute. For an investor: don't bet on "solving collective value with a model." Hard-coding "what we want" into a single metric to optimize runs straight into Goodhart's Law — the moment a measure becomes a target, it stops being a good measure. It's a wall, not a ground to seize.

Where are you on the waterline?

Founding? Stop racing for autonomy. "Agent for X" isn't forbidden — but on the application layer alone, you're standing on the step that sinks first. Put something unstealable under your feet: a vertical verification standard, or proprietary data nobody else has.

Investing? Don't pay the premium for "more autonomous"; pay it for "verifiable and authorized." As for the wall at the top (institutions and collective values), don't invest in it as a market. Route around it.

Pivoting? Move from "can make an agent run" to "can define what counts as correct and what's allowed." Acceptance officer and auditor — the water hasn't reached these two grounds yet.

This reasoning has a premise: that the water keeps rising — that the model keeps improving. If scaling stalls one day, the level freezes and your step is worth a few more years. But for now, the water is rising, and faster each year. Will the layer you're betting on still be above water in three months?

FAQ

Is prompt engineering dead? No — it went native. Phrasing techniques are absorbed by the model; what vanished is the premium paid for that one skill.

Where should an AI consultancy start? With verification in a verifiable domain (where a test settles true from false), then governance where the law already requires a human in the loop (healthcare, finance).

What never gets submerged? Legitimacy — deciding what a collective wants. It isn't a capability problem; it's a mandate problem.

ECTIME AI Lab is the applied-AI research and deployment unit of ECTIME Group. We build, ship and stress-test agentic systems in production — from GEO/SEO automation to multi-step autonomous agents. Our focus is verification and runtime governance. We maintain open-source Claude Skills for GEO/SEO and advise European brands on deploying AI that is not just autonomous, but verifiable and authorized.