Abstractions are one of the central ideas of computer science, as well as of building and engineering complex systems in general. In essence, an abstraction is an interface over a system or process, somehow enhancing the way we interact with the underlying system. A famous example of an abstraction is a programming language, which hides the details of how the underlying machine executes the provided code. Loops and function calls are compiled into lower-level primitives of comparisons and jumps, the memory allocator takes care of storing and loading data from memory, the majority of programmers in 2025 have never seen a single line of the instruction set architecture their code runs on, and most code today does not even run on physical machines, but on virtual machines like the browsers we built on top of them.
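To see that dissolution happen, here is a quick illustration of my own: CPython's `dis` module will happily show a three-line loop turning into iterator bookkeeping, a conditional, and jumps.

```python
import dis

def count_up(n):
    total = 0
    for i in range(n):
        total += i
    return total

# Prints the bytecode: the `for` loop disappears into iterator setup,
# a FOR_ITER that decides whether to continue, and backward jumps.
dis.dis(count_up)
```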
An important aspect of an abstraction is its leakiness, which, when I think about it, does not really do justice to abstractions. The Wikipedia page on Leaky Abstraction states:
> A leaky abstraction in software development refers to a design flaw where an abstraction, intended to simplify and hide the underlying complexity of a system, fails to completely do so. This results in some of the implementation details becoming exposed or 'leaking' through the abstraction, forcing users to have knowledge of these underlying complexities to effectively use or troubleshoot the system.
It is followed by the Law of Leaky Abstractions, by Joel Spolsky:
> All non-trivial abstractions, to some degree, are leaky.
Given that all abstractions are leaky, I feel that we should not really talk about what an abstraction leaks, but rather what it hides, because that is the point we really care about. I think I came close to this idea when writing The Lies About Abstraction, but failed to grok, or articulate, it at the time. Abstractions really only ever preserve the properties we measure. Take SQL as an example: it is an abstraction for communicating with a storage system, which we use to instruct the system to store and load data. SQL engines make strong guarantees on the "functional correctness" of the engine; it should not matter whether we are using 2.12 or 3.71, SQLite works hard to ensure that the same set of queries yields the same set of answers. It does not make strong guarantees on the running time of a query, though. If the underlying model changes, your query may now be slower or faster. If, as Hyrum's Law predicts, you relied on observable but not guaranteed behavior, you might find yourself at odds with the users of your service.
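To make this concrete, here is a toy illustration (mine, not from the SQLite documentation): the abstraction pins down the answer, but not the plan that produces it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany("INSERT INTO users (country) VALUES (?)",
                 [("TR",), ("US",), ("DE",), ("TR",)])

query = "SELECT COUNT(*) FROM users WHERE country = ?"

# Guaranteed: the answer.
print(conn.execute(query, ("TR",)).fetchone())  # (2,)

# Not guaranteed: how it is computed. This plan (a full scan here, an
# index lookup if one existed) is free to change between versions.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("TR",)):
    print(row)
```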
Such a scenario is prevalent in virtually all domains of software engineering. Software systems are typically evolving systems with a set of primary objectives, and in the course of this evolution any behavior that is not explicitly tested or guaranteed is subject to change. Consider semantic versioning, which states that a MAJOR version is for when you make incompatible API changes. This assumes a specific definition of compatibility, one that largely boils down to the new version not throwing a compile error, at least from the perspective of the user. More concretely, the secondary condition on top of compilation is "documented behavior shall not change", which is a social condition far from machine checkability; hence the truly guaranteed behaviors are always the tested ones.
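Here is a hypothetical example of how that plays out (the library and function names are made up for illustration): a release that both semver and the documentation call compatible can still break a caller who depended on unguaranteed behavior.

```python
# v1.2.0 of an imaginary library: documented as "returns the IDs of
# active users". Nothing is promised about ordering.
def active_user_ids(users):
    return [u["id"] for u in users if u["active"]]

# v1.3.0: same documented contract, still imports and runs fine, but the
# new implementation deduplicates through a set, so the never-guaranteed
# insertion order is gone.
def active_user_ids_v2(users):
    return list({u["id"] for u in users if u["active"]})

users = [{"id": 3, "active": True}, {"id": 1, "active": True}]
print(active_user_ids(users))     # [3, 1]
print(active_user_ids_v2(users))  # some order, just not a promised one

# A caller asserting on [3, 1] (Hyrum's Law in action) breaks on the
# "compatible" release, because only tested, documented behavior was
# ever really guaranteed.
```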
With that, let me get a step closer to the point. A few months ago, I wrote my favorite blog post to date on the limits of AI-assisted programming, whose answer I promptly gave away in the title: Verifiability is the Limit. I won't reiterate it all here, but the central idea is "what you can verify marks the limit of what you can create", regardless of the method of creation, be it human, LLM, or magic; the genie of the magic oil lamp was a cautionary tale for all of us. Closer to the end of the blog post, I posed a question and gave some preliminary thoughts on possible answers.
> If verifiability is the limit, if it’s the bottleneck of using LLMs for programming, the natural question is how do we raise the limit, how do we make it easier to verify? ...We should rely much more on declarative random testing methods such as property-based testing, where the programmer defines a predicate that should hold over all possible inputs, and this predicate is tested by generating random inputs and passing them to the program. Such methods have been used to enhance the reliability of programs in many domains, but they haven’t caught up with the software engineering community. The advantage of such methods is that having one general predicate is much easier to inspect and understand than many unit tests, in addition to the added testing strength...
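To make the quoted idea concrete, here is a minimal sketch using Hypothesis, a Python property-based testing library; the run-length encoder is a toy I made up for the example.

```python
from hypothesis import given, strategies as st

def run_length_encode(xs):
    """[1, 1, 2] -> [(1, 2), (2, 1)]"""
    out = []
    for x in xs:
        if out and out[-1][0] == x:
            out[-1] = (x, out[-1][1] + 1)
        else:
            out.append((x, 1))
    return out

def run_length_decode(pairs):
    return [x for x, count in pairs for _ in range(count)]

# One predicate over all inputs: decoding an encoding returns the
# original list. Hypothesis generates and shrinks the inputs.
@given(st.lists(st.integers()))
def test_round_trip(xs):
    assert run_length_decode(run_length_encode(xs)) == xs
```

One predicate like this replaces a pile of hand-picked unit tests, and the generator does the enumeration for you.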
For the past month, I have been at Datadog working on an LLM-guided continuous optimizer powered by observability and random testing, and day by day I became more convinced that the future holds crazy possibilities for combining LLMs with verifiable abstractions, of which I see surprisingly little. (In the coming weeks, the team I worked with will release a blog post announcing the project, of which I am also an author; I will link it here.) The industry as a whole has been spending an insane amount of time and effort making it easier to program by prompting, because prompting solves a multitude of problems where programmers traditionally had to specify many details they do not really care about, such as what type of container to use for a specific scenario. Instead, they only convey their intent, and the LLM acts as the programmer making those decisions, letting the human focus on the high-level details. Similar to how "higher-level languages" abstract away memory management, prompting abstracts away anything and everything you don't specify.

This property of prompting, which is its central value proposition, is also its greatest weakness, even setting hallucinations aside, because a prompt has no executable semantics. Given the context, the temperature, the model, the random number generator, and the mood of {{insert your favorite CEO}}, the same prompt will lead to different execution semantics. Programs, on the other hand, have testable semantics: they are executable, analyzable, and quite strict in what they do and how. In short, the prompt-to-program abstraction is not a verifiable one, which is what stirred up the discussion in the first place.
Well, I already gave away the spoiler: why are we fixating so much on prompts when we could just use programs instead?
People already use LLMs to translate code between languages, but they still rely on the LLM to do the translation correctly, even though translation comes with one of the most popular correctness properties of all time: equivalence, which we can programmatically test for. This problem is traditionally called translation validation, and it is far from solved; we definitely do not have the capability to prove that two programs will behave exactly the same way for every single input (it's undecidable, so we never will). We do, however, have the capability to do differential testing, which has produced some of the greatest testing successes in the industry by finding bugs in C compilers and SQL engines, and which is a central piece of modern practice for developing high-assurance software.
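A minimal sketch of that differential setup (the two deduplication functions are stand-ins I made up; in the real scenario the candidate would be the LLM-produced translation, possibly sitting behind a process or FFI boundary):

```python
from hypothesis import given, strategies as st

def dedup_reference(xs):
    """The oracle: remove duplicates, keeping first-occurrence order."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def dedup_candidate(xs):
    # Stand-in for the translated/optimized version under test.
    return list(dict.fromkeys(xs))

# The differential property: reference and candidate agree on every
# generated input. Any disagreement is a concrete counterexample.
@given(st.lists(st.integers()))
def test_candidate_matches_reference(xs):
    assert dedup_candidate(xs) == dedup_reference(xs)
```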
I posit that some of the most interesting use cases of AI will emerge from using programs as sources instead of focusing on prompts.
- Translating entire libraries and programs across ecosystems through a continuous feedback loop based on counterexample-guided program synthesis (a minimal sketch of such a loop follows this list).
- Optimizing programs for cache friendliness or memory consumption, securing them against side-channel attacks, automatically switching between equivalent programs with different readability or density characteristics, and auto-parallelizing single-threaded programs.
- Breaking existing abstractions at lower levels by replacing them with verifiable translations: instead of accepting the bytecode or assembly emitted by the compiler as a given, using instrumentation and observations from the running program to perform verifiable machine-level optimizations, similar to what a JIT compiler would do.
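Here is the sketch promised in the first bullet: a counterexample-guided loop in which differential testing gatekeeps the model's output. The `ask_llm_to_translate` function is hypothetical; it stands in for whatever model or API produces the candidate.

```python
def ask_llm_to_translate(source_code, feedback):
    """Hypothetical: call your model of choice, return a callable candidate."""
    raise NotImplementedError

def find_counterexample(reference, candidate, gen_input, attempts=1000):
    """Differential testing step: an input where the two disagree, or None."""
    for _ in range(attempts):
        x = gen_input()
        if reference(x) != candidate(x):
            return x
    return None

def synthesize_translation(reference, source_code, gen_input, max_rounds=5):
    feedback = ""
    for _ in range(max_rounds):
        candidate = ask_llm_to_translate(source_code, feedback)
        bad_input = find_counterexample(reference, candidate, gen_input)
        if bad_input is None:
            return candidate  # survived the testing budget
        # Feed the failing input back so the next attempt can address it.
        feedback = (f"On input {bad_input!r} your version returned "
                    f"{candidate(bad_input)!r}; expected {reference(bad_input)!r}.")
    raise RuntimeError("no candidate passed differential testing")
```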
Make no mistake, there are already people out there claiming they can do these things, or that LLMs will be able to do these things; I do not believe such claims without the accompanying translation validation. Verifiability is the limit: even if LLM capabilities increase 1000x from today, it is impossible to trust that a change in your machine executable fits your intent, or even to specify such changes in many cases. "Machine go brrr" is not the specification we should be using; tests and measurements with concrete results are.
The bottleneck of these approaches today is not just LLMs being unable to apply such changes, but also the fact that random testing still has a long way to go. The translations I mentioned rely on fast feedback loops to be operational, while getting fuzzers to generate complex inputs is still a research project. Catching concurrency bugs is a relatively young area, building testing targets for comparing functions or programs across languages is non-trivial, and deterministic simulation testing is here to stay, but its history is a short one, with a few promising companies pushing it forward. Programming languages are as impure as they come, hurting testability wherever they can; although the functional programming community, as well as effect systems, push for making impurity explicit, I am still not sure how to test that two unordered hash maps in two languages behave with the same characteristics.
Let me tie the two parts of the article together as I finish. Although all abstractions are leaky, the measure of an abstraction is its testability. If we can test some behavior, we can guarantee it. Such guarantees allow us to look beneath the abstraction and break it while keeping the guaranteed behavior intact. I hypothesize that, given sufficient improvements in random testing, this "break the abstraction, optimize what you can observe" approach has the potential to become a major way of optimizing programs and a centerpiece of programming.