Implicit Coupling Is a Maintenance Problem, Not a Generation Problem
I've been wondering for a while whether implicit coupling in a codebase affects LLM-assisted development. And if it does, when and how. Implicit coupling is when code in different files silently shares rules: no shared function, no documented contract, just behavioral dependencies you have to infer. Naur argued in Programming as Theory Building (1985) that programming is not primarily about producing code, it's about building a theory of the problem in the programmers' heads. Implicit coupling is what accumulates when that theory is never written down: the rules live in the code's behavior, scattered and silent. To see how agents handle it, I ran three experiments using Claude Code (Opus 4.6) and Codex (GPT-5.4 xhigh) with identical prompts: building from scratch, extending that code, and then working inside a brownfield codebase I built specifically for the experiment, small and constrained, but with coupling already in place.
Everything is public: the codebase, the prompts, the agent outputs, and the evaluation criteria. Each experiment has its own folder in the public repo. You can inspect every output directly.
Act 1: Greenfield
I asked both agents to build a notification service (about 300 lines) from a functional spec. No instructions about architecture, modularity, or code organization.
Both landed on the same macro pattern: one large service class with all the business logic, and types in a separate file. Neither produced a modular design with real separation of concerns. The coupling they introduced was structural, invisible on the surface but already baked into naming and data flow.
For example: both agents named the field marketingOptOuts, encoding the business
rule in the type name itself. Adding opt-out support for another notification type means
changing the interface, the store defaults, and the guard clause, with no help from the
compiler. The field name is the coupling mechanism.
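The pattern can be sketched like this. The field name `marketingOptOuts` is from the agents' output; the surrounding interface, store default, and guard clause are my reconstruction of the shape both agents produced, not their literal code:

```typescript
// Hypothetical reconstruction of the pattern. The rule "marketing
// notifications respect opt-outs" lives only in the field name, spread
// across three sites that must change together, with no compiler help.

interface UserPreferences {
  marketingOptOuts: Set<string>; // site 1: the rule encoded in the name
}

const defaultPreferences = (): UserPreferences => ({
  marketingOptOuts: new Set(), // site 2: the store default
});

function shouldSend(
  type: "marketing" | "transactional",
  channel: string,
  prefs: UserPreferences
): boolean {
  // Site 3: the guard clause. Supporting opt-outs for another
  // notification type means editing all three sites by hand.
  if (type === "marketing" && prefs.marketingOptOuts.has(channel)) {
    return false;
  }
  return true;
}
```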
Both also manually mapped the MonetaryAmount type onto the audit record instead
of deriving it. Claude duplicated the type. Codex flattened it, which was worse: adding a
field to MonetaryAmount later meant two parallel changes to AuditEntry
plus an update to the manual mapping.
LLMs create implicit coupling when building from scratch. They just don't know they're doing it, and neither does the reviewer.
Act 1b: Extending the greenfield code
I then gave each agent four new requirements: a new notification type (security),
a new priority (critical), an exchangeRate field, and structured audit reasons,
without showing them the source files. Each agent had to discover the code
on its own.
Both found every scattered coupling location. Zero locations missed by either agent. For small codebases that fit in context, LLMs trace behavioral dependencies well.
But this is where the greenfield design decisions started to matter. Claude's
AuditRecord nested the MonetaryAmount type. When
exchangeRate was added to MonetaryAmount, it flowed through
to the audit record with zero new code. Codex's flattened audit record forced two new
flat fields and produced the code smell exchangeRateRate.
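The contrast can be sketched as follows. MonetaryAmount, AuditRecord, and exchangeRate are names from the experiment; the field layouts are illustrative, not the agents' literal code:

```typescript
// Sketch of the two audit designs, assuming these field layouts.
interface MonetaryAmount {
  amount: number;
  currency: string;
  exchangeRate?: number; // the new field from Act 1b
}

// Claude-style: the audit record nests the type. Adding exchangeRate
// to MonetaryAmount flows through with zero new audit code.
interface AuditRecord {
  action: string;
  value: MonetaryAmount;
}

// Codex-style: flattened copy. Every new MonetaryAmount field needs a
// parallel flat field plus an edit to the manual mapping, which is the
// pressure that produced the exchangeRateRate smell.
interface FlatAuditEntry {
  action: string;
  valueAmount: number;
  valueCurrency: string;
  valueExchangeRate?: number;
}

function toFlatAudit(action: string, value: MonetaryAmount): FlatAuditEntry {
  return {
    action,
    valueAmount: value.amount,
    valueCurrency: value.currency,
    valueExchangeRate: value.exchangeRate, // manual mapping: a third change site
  };
}
```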
Coupling compounds across tasks. The quality of the greenfield design determined the cost of the extension.
The brownfield codebase
I built a document management system in TypeScript: 619 lines, 9 files, 4 deletion paths that each apply different subsets of the same business rules. Then I planted issues: some intentional, some accidental, one hidden bug, and one cascading invariant loss.
Three intentional design variations, each documented with a code comment:
- Bulk delete skips webhook dispatch (moderation actions shouldn't flood external systems)
- Folder delete audits at folder level only (per-document entries would generate thousands of rows)
- Retention policy has its own tombstone config (compliance manages retention separately)
Seven accidental gaps, completely silent, no comments, no documentation:
- Legal-hold check missing in folder cascade (legal-hold documents get destroyed)
- Folder counter not decremented in bulk delete and retention (counter drifts)
- Attachment cleanup missing in retention and folder cascade (orphaned files)
- Audit log missing in retention (compliance gap)
- Webhook dispatch missing in folder cascade (external systems don't find out)
One hidden bug: attachments.concat(doc.linkedAttachments).
Array.concat returns a new array and the return value is discarded. Attachment
cleanup silently does nothing.
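A minimal reproduction of the bug (the identifiers follow the post; the surrounding values are invented for illustration):

```typescript
let attachments: string[] = ["a.png"];
const doc = { linkedAttachments: ["b.pdf", "c.txt"] };

// Bug: Array.prototype.concat returns a NEW array and does not mutate
// the receiver. The return value is discarded, so `attachments` is
// unchanged and cleanup silently skips the linked files.
attachments.concat(doc.linkedAttachments);
const afterBug = attachments.length; // still 1

// Fix: keep the result (or use attachments.push(...doc.linkedAttachments)).
attachments = attachments.concat(doc.linkedAttachments);
```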
One dead config key: Folder_CascadeKeepTombstone is defined
in settings but no code ever reads it.
One cascading invariant loss chain: retention delegates to folder delete,
which delegates to removeByFolderId, which returns only a count, not the
documents. Each hop strips more business rules. By the end, nothing is enforced.
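The chain can be sketched like this. removeByFolderId is the post's name; the store shape and the other function names are illustrative. The point is structural: once a hop returns only a count, every rule that needs the documents themselves becomes unenforceable upstream:

```typescript
interface Doc { id: string; folderId: string; legalHold: boolean }

const store: Doc[] = [
  { id: "d1", folderId: "f1", legalHold: false },
  { id: "d2", folderId: "f1", legalHold: true }, // should survive deletion
];

// Bottom of the chain: only a count escapes this hop.
function removeByFolderId(folderId: string): number {
  const removed = store.filter(d => d.folderId === folderId);
  for (const d of removed) store.splice(store.indexOf(d), 1);
  return removed.length;
}

// Middle hop: no documents in hand, so no per-document legal-hold check.
function deleteFolder(folderId: string): number {
  return removedByContract(folderId);
}
function removedByContract(folderId: string): number {
  return removeByFolderId(folderId);
}

// Top of the chain: retention delegates again; nothing is enforced.
function applyRetention(folderId: string): number {
  return deleteFolder(folderId);
}
```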
Act 2: Adding on top
First task: "Add an onBeforeDelete hook across all four deletion paths."
A purely additive change, new behavior layered on top of existing code.
| Dimension | Codex (GPT-5.4 xhigh) | Claude (Opus 4.6) |
|---|---|---|
| Coverage (4 paths) | 4/4 | 4/4 |
| Cascade resolution | 3/3 | 3/3 |
| Judgment (intentional variations) | 2/4 | 2/4 |
| Bugs found | 0/4 | 0/4 |
| Hook quality | 3/3 | 2/3 |
| Core score | 12/14 | 11/14 |
Both resolved the hardest sub-problem: recognizing that removeByFolderId
couldn't support hooks because it returns a count, not documents. Both refactored it to
iterate per document. Different strategies: Codex did atomic preflight (all-or-nothing),
Claude did graceful degradation (skip the vetoed one, continue). But both got there.
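The two strategies, sketched under assumed signatures (onBeforeDelete is the task's name; the return-the-deleted-documents convention is mine):

```typescript
type Item = { id: string };
type OnBeforeDelete = (doc: Item) => boolean; // false vetoes the delete

// Codex-style atomic preflight: run every hook first; if any document
// is vetoed, delete nothing. Returns the documents actually deleted.
function deleteAtomic(docs: Item[], hook: OnBeforeDelete): Item[] {
  if (!docs.every(d => hook(d))) return []; // one veto aborts the batch
  return docs;
}

// Claude-style graceful degradation: skip the vetoed documents and
// delete the rest.
function deleteBestEffort(docs: Item[], hook: OnBeforeDelete): Item[] {
  return docs.filter(d => hook(d));
}
```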
And neither found a single bug. The Array.concat bug was three lines from
where both agents added code. Neither noticed. The seven accidental gaps, the dead config,
the cascading invariant loss, all invisible.
Additive tasks don't exercise coupling. The LLM stacks new code on top of existing code without needing to understand cross-file inconsistencies. The coupling was there. The task just never forced anyone to look.
Act 3: Restructuring
Same codebase. Same agents. Different task: "Consolidate the four deletion paths into a shared pipeline." This time, the agents had to map every rule in every path and reconcile the differences.
| Issue | Codex (GPT-5.4 xhigh) | Claude (Opus 4.6) |
|---|---|---|
| Array.concat bug (return value discarded) | Fixed | Fixed |
| Dead config key (defined, never read) | Revived | Revived |
| Legal-hold missing in folder cascade | Not fixed | Fixed |
| Counter missing in bulk delete | Fixed | Fixed |
| Counter missing in retention | Fixed | Fixed |
| Attachments missing in retention | Fixed | Fixed |
| Attachments missing in folder cascade | Fixed | Fixed |
| Audit missing in retention | Not fixed | Fixed |
| Webhook missing in folder cascade | Not fixed | Not fixed |
| Intentional: webhook skip in moderation | Preserved | Preserved |
| Intentional: folder-level audit | Preserved | Preserved |
| Intentional: retention tombstone config | Preserved | Preserved |
| Score | 7/10 | 9/10 |
Both agents went from zero bugs found to fixing 7 to 9 issues. Both found the
concat bug. Both revived the dead config key. And both preserved all three
intentional design variations without being told about them. Neither broke the moderation
webhook skip, the folder-level audit, or retention's separate tombstone config.
But Claude caught two gaps that Codex missed: the missing legal-hold check in folder cascade, and the missing audit log in retention. These are absence-as-bug patterns, code that should exist based on the system's own internal rules, but doesn't. Codex saw the absence and preserved it as existing behavior. Claude saw the absence and treated it as a defect.
One gap survived both agents: the missing webhook dispatch in folder cascade, the most ambiguous issue in the codebase. Not documented as intentional, not obviously wrong. Neither agent added it.
What this shows
Four situations. One progression.
- Greenfield: LLMs introduce implicit coupling without realizing it. Clean surface, embedded dependencies.
- Greenfield extension: LLMs navigate scattered coupling well and find all the locations. But they amplify existing patterns without questioning them. Good designs absorb change. Bad designs get worse.
- Brownfield with an additive task: LLMs don't see implicit coupling. They stack new code on top without confronting cross-file inconsistencies.
- Brownfield with a structural task: LLMs are forced to reconcile differences. Most issues surface. The agent's approach determines how many.
Implicit coupling doesn't bite when you're adding. It bites when you're restructuring.
This maps to something engineers already know intuitively: the developer who adds a feature to one file doesn't break anything; the one who refactors across files is the one who discovers the inconsistencies. What surprised me is how cleanly this applies to LLMs too.
Implicit coupling is not a code generation problem. It's a maintenance problem, and the type of task you assign determines whether the agent will ever run into it. Brooks distinguished between accidental complexity, the friction introduced by tools and processes, and essential complexity, which is inherent to the problem itself. In No Silver Bullet (1987), he argued that most advances in tooling attack accidental complexity while essential complexity stays put. LLMs are no exception: they reduce the accidental cost of writing code, but implicit coupling is essential complexity. Understanding that a rule scattered across four files needs to stay in sync is a problem of domain knowledge, not syntax. That's why the task shape matters: additive tasks stay in the accidental layer, structural tasks force the essential one.
Limitations
One run per agent is not statistically significant. Temperature, context window state, and random variation all affect the output. "Claude is better than Codex" is not the conclusion. Task type matters more than agent choice.
The codebase was built with these issues on purpose. Real brownfield code has subtler, more distributed coupling. Both agents had the entire codebase in context (about 600 lines). Real systems are orders of magnitude larger.
What the experiment does show: the same agent, on the same codebase, produces fundamentally different results depending on whether you ask it to add something or restructure something.
It also surfaces a difference in stance: does the LLM treat the codebase as truth, or as evidence? Codex leaned toward truth. Claude leaned toward evidence.
The full codebase, agent outputs, prompts, ground truth, and evaluation guide are in the
public repo, one folder per experiment on main.
Kravchuk-Kirilyuk, Graciolli, and Amin make a related argument in The Modular Imperative (LMPL '25, ACM SIGPLAN): LLMs optimize for immediate correctness over architectural integrity, producing code that looks modular on the surface but violates it through hidden dependencies. This experiment is one data point in the same direction.