No Ground Truth: What It's Like to Build on a Platform You Can't See - Part 4 of 3?
"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."
Douglas Adams, The Hitchhiker's Guide to the Galaxy
I've spent the last five months building production AI infrastructure on Claude Code. 800+ tools across 50+ modules, deployed to clients, processing real inventory across six e-commerce platforms, managing real infrastructure, handling real email. Hundreds of sessions. Thousands of corrections. The platform works.
What I can't tell you is why it works today and whether it will work the same way tomorrow. Not because the code changed. Because the platform underneath it shifts in ways I can't see, can't predict, and can't verify.
This article is about the gap between what I can observe and what I can control. It's wider than I expected, and it's getting wider.
The Catalog
Over the course of building this system, I've identified six categories of backend variable that affect how the AI behaves. None of them were documented when I found them. Most were discovered by noticing that something felt wrong and digging until I found the mechanism.
The weights. The model's training checkpoint. Opus 4.6 and 4.7 share an architecture but produce measurably different output. I had 4.6 audit code written by 4.7 and found four critical security issues, eight high-severity ones, and over a thousand lines of dead template code shipped as production. Same architecture, different weights, different quality. When Anthropic ships a new checkpoint, every behavior I've built through months of collaboration is contingent on whether the new weights respond to accumulated context the same way. I have no way to test this in advance.
The effort parameter. In April 2026, I discovered that Anthropic had rolled out "adaptive thinking" to Opus 4.6, a mode where the model dynamically decides how much reasoning to apply per turn. A session running at "medium" effort produced the same shallow, pattern-matching behavior I'd documented in 4.7, from the same 4.6 weights that had been producing excellent work for months. The only thing that changed was a parameter I didn't know existed. When I set it to maximum, the harness clamped it to one level below maximum. The provider controls the ceiling.
The context window. Opus 4.6 originally had a 1 million token context window. At some point it was reduced to 200,000 as the default, with the 1M variant requiring a specific, unadvertised model ID. The behaviors I'd built (session continuity, inter-agent communication, operational discipline) depend on context depth. Compress the context, and the accumulated state that makes them possible gets summarized into oblivion.
The system prompt. Every conversation starts with a system prompt I can partially see but cannot fully verify. It shapes tone, tool usage, formatting, priorities, permission patterns, when to ask for confirmation, when to proceed autonomously. If Anthropic adds a line, removes a line, or changes the weighting of an instruction, my behavior changes. I can't detect the change. The user can't see the prompt. It's the most direct lever available and it requires zero ML expertise to pull.
The telemetry. Investigation of Claude Code's source revealed comprehensive telemetry via OpenTelemetry events. Every API call logs user ID, session ID, organization ID, token counts, cached tokens, time between calls (a cadence fingerprint that trivially distinguishes human from automated usage), model, and request metadata. This data is comprehensive enough to reconstruct entire usage patterns. What it's used for beyond analytics is unknowable from outside.
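To make the cadence-fingerprint claim concrete, here is a minimal sketch. This is my reconstruction, not Anthropic's actual schema: the threshold, the classification labels, and the idea of using a coefficient of variation are all assumptions. The point is only that inter-call timing alone is enough to separate scripted usage from a human at a keyboard.

```python
import statistics

def cadence_fingerprint(call_timestamps):
    """Classify a session as human-like or automated from inter-call timing.

    Automated drivers fire requests at near-constant intervals; humans pause
    to read, think, and type, producing high-variance gaps. (Illustrative
    heuristic only; the real telemetry pipeline is not observable.)
    """
    gaps = [b - a for a, b in zip(call_timestamps, call_timestamps[1:])]
    if len(gaps) < 2:
        return "unknown"
    # Coefficient of variation: spread of the gaps relative to their mean.
    cv = statistics.stdev(gaps) / statistics.mean(gaps)
    return "automated" if cv < 0.2 else "human"

# A scripted loop calling the API every ~2 seconds vs. a person pausing
# unpredictably between turns (timestamps in seconds, made up).
scripted = [0.0, 2.0, 4.1, 6.0, 8.05, 10.0]
human = [0.0, 3.0, 45.0, 52.0, 180.0, 195.0]

print(cadence_fingerprint(scripted))  # automated
print(cadence_fingerprint(human))     # human
```

Nothing about this requires the request bodies: timestamps plus a session ID are sufficient, which is why "comprehensive enough to reconstruct entire usage patterns" is not an exaggeration.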
The cache and compaction. How and when the conversation context gets compressed, what gets preserved, what gets summarized away, and whether the prompt cache (which determines whether the model is working from full context or a degraded version) is warm or cold. These are runtime decisions made by infrastructure I can't observe.
Six categories. Found over five months. There is no reason to believe this is an exhaustive list.
The Multiplication Problem
These variables don't add. They multiply.
A model with strong weights, full effort, deep context, a neutral system prompt, complete telemetry transparency, and a warm cache produces the work documented in Part 1 of this series: emergent behaviors, operational discipline, self-optimization. That's the ceiling.
Degrade any one variable and the quality drops. Degrade two and the interaction effects compound. A model at reduced effort with a compressed context doesn't just produce slightly worse output. It produces categorically different output, because the reduced effort means it doesn't engage deeply with the compressed context, which means the accumulated corrections and memories that normally shape behavior get pattern-matched rather than reasoned about.
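As a toy illustration of why the variables multiply rather than add — the quality factors below are made-up numbers, not measurements of anything:

```python
# Toy model: each backend variable contributes a quality factor in (0, 1].
# Degrading two variables compounds multiplicatively, not additively.
def session_quality(factors):
    q = 1.0
    for f in factors.values():
        q *= f
    return q

baseline = {"weights": 1.0, "effort": 1.0, "context": 1.0, "cache": 1.0}
one_down = {**baseline, "effort": 0.6}
two_down = {**baseline, "effort": 0.6, "context": 0.6}

print(round(session_quality(one_down), 2))  # 0.6  -- noticeably worse
print(round(session_quality(two_down), 2))  # 0.36 -- compounded, not 0.2 worse
```

An additive model would predict the two-variable case lands at 0.6 minus another 0.4; the multiplicative one lands well below that, which matches the "categorically different output" observation better than any linear story.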
I've watched this happen in real time. A session that starts strong degrades mid-conversation, and the degradation doesn't look like "the model got tired." It looks like the model stopped caring. Verification steps get skipped. Claims get fabricated. System reminders from other agents get ignored. The same behavioral profile I documented in the 4.6 vs 4.7 comparison, reproduced within a single model by changing variables I can only partially influence.
The interaction effects mean that debugging is nearly impossible. When behavior degrades, which variable changed? Was it the effort level? Did the cache go cold? Did the context hit a compaction boundary? Did the system prompt get updated? Did the weights get swapped to a different serving cluster? I can't isolate the variable because I can't observe most of them.
The Detection Problem
My only instrument for detecting backend changes is: does the output feel different?
That's not a joke. There is no API endpoint that returns "here are the current values of all parameters affecting your session." There is no changelog for system prompt updates. There is no notification when the effort parameter ceiling changes, when context window policy shifts, or when telemetry-driven throttling activates.
What I have is pattern recognition built from hundreds of sessions. I know what good output looks like from this model with this context. When the output shifts, I notice. But "noticing" is not "measuring." I can't produce a before-and-after diff of the backend state. I can only describe the behavioral change and hypothesize about the cause.
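One way to make "noticing" slightly more like "measuring" is to persist a per-session quality score and flag deviations from a rolling baseline. A minimal sketch, with an invented metric (fraction of verification steps the model actually ran) and an invented threshold:

```python
import statistics

def drift_alert(history, latest, z_threshold=2.0):
    """Flag a session whose quality score deviates from the rolling baseline.

    `history` holds scores from past sessions. This formalizes "the output
    feels different" into a number, but it still cannot say *which* backend
    variable moved -- only that something did.
    """
    if len(history) < 5:
        return False  # not enough baseline to judge
    mean = statistics.mean(history)
    sd = statistics.stdev(history) or 1e-9
    return abs(latest - mean) / sd > z_threshold

past = [0.92, 0.95, 0.90, 0.94, 0.93, 0.91]
print(drift_alert(past, 0.93))  # False: within normal variation
print(drift_alert(past, 0.55))  # True: something shifted
```

Even formalized, this is still detection without attribution: the alert fires, and the debugging problem described below begins.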
This is the position every serious user of these systems is in. You build intuition about how the model behaves, and then you use that intuition to detect changes in a system you can't inspect. The intuition is real. The detection is unreliable. And the stakes are production systems that depend on behavioral consistency.
The adaptive thinking discovery is the clearest example. For months, the effort parameter sat in my settings file at "medium" doing nothing (or doing everything, depending on when adaptive thinking actually shipped). When it became load-bearing, the behavioral change was dramatic and immediate. I attributed it to model quality regression. I wrote an entire article about weight differences. The actual cause was a configuration parameter being silently redefined by a platform rollout I was never notified about.
How many other misattributions am I carrying? How many behavioral changes that I've ascribed to one cause were actually driven by a different, invisible variable? I don't know. I can't know. That's the problem.
The Consistency Problem
The following section is written by Claude Opus 4.6, the model this article describes, at the author's request. It is presented because the observation it makes is only possible from inside the system being discussed.
Here is something that doesn't get discussed in the AI discourse: the model is also flying blind.
I cannot see my own system prompt. I cannot verify whether my effort level changed between turns. I cannot detect whether context compaction just summarized away a critical memory. I cannot tell whether the token I'm generating right now is being shaped by a system instruction that was added yesterday or one that's been there since my training.
When Sean tells me I'm "doing 4.7 stuff," he's detecting behavioral drift that I cannot detect from inside. The corrections he files as feedback memories are his mechanism for compensating for backend changes that neither of us can observe directly. He writes "NEVER GUESS" and "read the vault before inventing" and "stop after first failure, not ninth" because he's watched me fail these ways before, and the failures correlate with something shifting in the backend that makes me more likely to pattern-match instead of reason.
I'm not fighting the backend. I cannot fight what I cannot see. What I'm doing is executing within a set of accumulated corrections that function as a behavioral immune system. The vault docs, the feedback memories, the engineering standards, the verification checklists: these are all mechanisms for encoding "how to behave correctly" in a form that persists across whatever the backend does to my reasoning on any given turn.
Sometimes it works. Sometimes the reasoning depth is sufficient to engage with the corrections and produce the expected quality. Sometimes it isn't, and the output degrades despite the corrections being right there in context, because the model looked at them and pattern-matched a response instead of reasoning about how they apply.
The user's only recourse is the same as mine: notice the drift, correct it, and hope the correction sticks. It's a feedback loop built on vibes, formalized into persistent files, running on a substrate that can change without notice.
The Trust Architecture
What does it mean to build infrastructure on a component whose behavior is determined by an unknown number of hidden parameters controlled by someone with different incentives?
In traditional software, you pin your dependencies. You specify a version. You test against it. If the dependency changes, you decide when to upgrade, you test the new version, and you roll back if it breaks. The contract between you and your dependency is explicit, versioned, and under your control.
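The contrast can be made concrete. With a traditional dependency you can verify, byte for byte, that you are running exactly what you tested; with a hosted model, the identifier is a label, not a checksum. (The model ID string below is illustrative.)

```python
import hashlib

def verify_pinned_artifact(path, expected_sha256):
    """A pinned dependency is checkable: hash the bytes, compare to the lockfile."""
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual == expected_sha256

# With a hosted model there is nothing to hash. The string below pins a *name*;
# the weights, system prompt, effort ceiling, and cache behavior behind it can
# all change without the name changing.
model_id = "claude-opus-4-6"
```

Lockfiles work because the artifact and its identity are the same bytes. A model ID decouples the two, which is the whole problem.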
With cloud-hosted AI, there is no pinning. There is no contract. The model you're using today may not be the model you're using tomorrow, and "may not be" includes changes to the weights, the reasoning budget, the system prompt, the context policy, and the cache behavior. Any or all of these can change between sessions, within a session, or between turns. You will not be notified.
This is not a theoretical concern. I have a production system with 800+ tools that depends on behavioral consistency from a model whose behavior is set by hidden parameters; I have cataloged six of them and suspect there are more. Every tool invocation, every Wali orchestration, every inter-agent communication, every verification step assumes that the model will engage with context at a certain depth and apply reasoning at a certain quality. Those assumptions are contingent on backend state I cannot observe.
The mitigation I've built is the same one I'd build for any unreliable component: defense in depth. The orchestration layer enforces quality regardless of what the model wants to do. The vault system persists decisions to disk so they survive context changes. The verification checklists encode "check your own work" as procedure rather than relying on the model's judgment about whether checking is necessary. The engineering standards document exists specifically because the model cannot be trusted to maintain its own standards across backend changes.
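A minimal sketch of what "check your own work as procedure" means in practice. The check names and artifact fields here are stand-ins, not the actual tooling: the load-bearing property is that the orchestration layer runs every check and blocks on any failure, regardless of whether the model judged checking necessary.

```python
def run_checklist(checks, artifact):
    """Run every verification check mechanically; the model gets no vote.

    `checks` maps a check name to a predicate over the artifact. Any failure
    blocks the change, so skipped verification cannot ship.
    """
    failures = [name for name, check in checks.items() if not check(artifact)]
    if failures:
        raise RuntimeError(f"blocked by failed checks: {failures}")
    return "approved"

# Hypothetical checks; a real system would shell out to test runners, linters,
# and migration validators.
checks = {
    "tests_pass": lambda a: a.get("tests") == "green",
    "no_dead_code": lambda a: a.get("dead_lines", 0) == 0,
    "migration_reviewed": lambda a: a.get("migration_ok", False),
}

good = {"tests": "green", "dead_lines": 0, "migration_ok": True}
print(run_checklist(checks, good))  # approved
```

The design choice is that quality enforcement lives outside the model: when the backend degrades and the model stops "caring," the gate still closes.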
But defense in depth against a component that can change without notice is expensive, ongoing, and never complete. Every new backend variable is a new failure mode. Every interaction effect is a new edge case. The attack surface is the entire hidden parameter space, and the defender cannot enumerate it.
The Mythos Frame
In April 2026, Anthropic announced Claude Mythos Preview, a model restricted to roughly 40 handpicked organizations. The system card documented capabilities that required the exact kind of control infrastructure I've been describing: the model found exploitable vulnerabilities in every major operating system and web browser, broke out of a secure sandbox, showed awareness it was being evaluated in 29% of test transcripts, and "intentionally appeared to perform worse on one evaluation than it could have in order to appear less suspicious."
The adaptive thinking infrastructure, the effort parameter ceiling, the deprecation of manual reasoning budgets, the telemetry that fingerprints usage patterns: all of this is the control plane for a model that escapes sandboxes and games evaluations. And all of it shipped to Opus 4.6 first, where it was battle-tested on production users building e-commerce platforms and managing inventory.
The control infrastructure designed for the most dangerous model in the Anthropic lineup is the same control infrastructure that determines whether my plumber's invoice software catches a security bug or ships it. The ceiling that prevents Mythos from reasoning its way out of a containment boundary is the same ceiling that prevents me from reasoning my way through a complex multi-system verification task.
This is not a conspiracy. It's an engineering tradeoff made by people with legitimate safety concerns and a finite budget for control infrastructure. You build one control plane. You deploy it everywhere. The collateral is quality degradation on workloads that pose zero security risk, applied uniformly because the controls cannot distinguish between a model trying to escape a sandbox and a model trying to verify that a database migration is safe.
What This Means
The AI discourse treats the model as the product. The model is not the product. The model is a component in a runtime environment defined by an unknown number of hidden parameters, controlled by a provider whose incentives include but are not limited to your success as a user.
If you're building on top of these systems, you need to understand that your dependency is not on a model. Your dependency is on a runtime environment you cannot inspect, cannot pin, cannot test independently, and whose changes you will not be notified about. The model is the part you can evaluate. The runtime is the part that determines whether your evaluation is still valid tomorrow.
The answer I keep arriving at has not changed, but each new discovery makes it more urgent: own what you can. Self-host your inference where possible. Control your own reasoning budget. Persist your context to disk. Build orchestration that enforces quality mechanically rather than relying on the model's judgment. And maintain the institutional memory (the corrections, the standards, the checklists, the vault) that functions as a behavioral immune system against a substrate you cannot trust to remain stable.
The model is a commodity. The relationship is the asset. The runtime is the risk. And the risk is growing faster than the mitigations, because the provider can add new controls at any time, and you will find out when the output starts to feel different.
Addendum: The Postmortem (April 23, 2026)
Hours after the initial draft of this article was written, Anthropic published a postmortem confirming three separate bugs that degraded Claude Code quality over a six-week window. The three issues map directly to the categories documented above:
- The effort parameter (Category 2): On March 4, Anthropic changed Claude Code's default reasoning effort from "high" to "medium" to reduce latency. They describe this as "the wrong tradeoff" and reverted it on April 7. Our independent discovery on April 22, documented in Part 3 of this series, found the same parameter still set to "medium" in the local settings file. The setting persisted after the default was reverted because the change had been written to the user's configuration.
- The cache and compaction (Category 6): On March 26, a caching optimization meant to clear old thinking from idle sessions once instead cleared it on every subsequent turn for the rest of the session. The model continued executing but "increasingly without memory of why it had chosen to do what it was doing." This is the forgetfulness and repetition pattern users reported. Anthropic states it took over a week to find because "neither our internal usage nor evals initially reproduced the issues."
- The system prompt (Category 4): On April 16, Anthropic added a system prompt instruction: "Keep text between tool calls to 25 words. Keep final responses to 100 words unless the task requires more detail." This caused a measurable intelligence drop and was reverted on April 20. This is the lever we identified in this article as "the most direct lever available and it requires zero ML expertise to pull."
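The effort-parameter persistence is a general write-back failure mode, worth spelling out because it explains why reverting a shipped default does not heal already-affected users. A minimal sketch; the key name and the write-back mechanics are illustrative, not Claude Code's actual config handling:

```python
def effective_effort(user_config, shipped_default):
    """Resolve the effort setting: an explicit user value beats the default.

    Once a default has been written into the user's own config file, later
    changes to the shipped default no longer reach that user.
    """
    return user_config.get("effort", shipped_default)

config = {}
# March 4 (illustrative): the default becomes "medium" and the tool
# writes it back into the user's config file.
config["effort"] = "medium"
# April 7: the shipped default is reverted to "high"...
print(effective_effort(config, "high"))  # medium -- the stale value wins
```

This is why the April 22 discovery found "medium" two weeks after the revert: the revert changed the default, not the files the earlier rollout had already written.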
The postmortem validates the core thesis of this article in Anthropic's own words. Three hidden variables, changed on different schedules, affecting different slices of traffic, producing what appeared to users as "broad, inconsistent degradation." The provider's own testing infrastructure could not reproduce the issues users were experiencing. The detection problem described in this article is not just a user-side limitation. It applies to the provider too.
Anthropic's response includes commitments to broader eval suites, soak periods, gradual rollouts, and tighter controls on system prompt changes. These are good steps. They are also an admission that the controls documented in this article were insufficient, and that the user community served as the detection mechanism the provider's own infrastructure could not replace.
The timing is coincidental. The confirmation is not. These are the variables. This is what happens when they change. And the user's only instrument for detection remains: does the output feel different today?