This post is about one particularly thorny obstacle on the journey to developing a mechanisation of staged logic for my PhD thesis.
First, some background on staged logic. It is a specification logic for higher-order programs with effects.
Here is the syntax of its formulae. It is an untyped lambda calculus with specifications and logical connectives.
Like other logics, it comes with an entailment relation, initially defined as a lifting of metalogic implication: one formula entails another if every model of the first is also a model of the second.
This definition is known under many names in different communities.

- In other logics, such as separation logic, it is written ⊢ and called entailment, as a lifting of metalogic implication
- In TLA, where the two objects are temporal formulae representing state machines, it is written ⇒ and called refinement
- When the two objects are programs, it is called program approximation, with the symmetric version called behavioural equivalence. In the context of compiler correctness, it is also a simple form of forward (or backward, depending on who you ask) simulation
- When the two objects are languages, it is called inclusion and written ⊆
- When the two objects are specifications, the relation is called subsumption
- When the two objects are types, it is instead called subtyping, written <:
Entailment intuitively means that the models or behaviours of the left-hand side are contained in those of the right-hand side.
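The common thread in all of these readings can be sketched as inclusion of sets of behaviours. The following toy Python rendering is entirely my own analogy (not the staged-logic definition, whose formula appears above); all names in it are made up for illustration:

```python
# Model each object (formula, program, language, type) by the set of
# behaviours it admits; entailment / refinement / inclusion / subtyping
# is then just the subset relation on those sets.
def entails(lhs, rhs):
    return lhs <= rhs  # set inclusion

# A "stronger" spec admits fewer behaviours than a "weaker" one.
spec_strong = {('input', 1, 'output', 2)}
spec_weak = {('input', 1, 'output', 2),
             ('input', 1, 'output', 3)}

assert entails(spec_strong, spec_weak)       # stronger entails weaker
assert not entails(spec_weak, spec_strong)   # but not the other way
```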
We started with this and it worked great for interactive proofs in Rocq for a while.
Delimited continuations and the bubble-up semantics
However, there were problems with shift and the bubble-up semantics. We had shift results and no way to relate them under sequencing. Associativity of sequencing broke, and rewriting wouldn't work because the Proper instances it requires could not be proved.
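To make the control operator concrete for readers who haven't met it: here is a minimal CPS encoding of shift/reset in Python. This is a standard textbook-style sketch, not the staged logic's bubble-up semantics, and every name in it is my own:

```python
# Minimal CPS (continuation-passing style) encoding of shift/reset.
# A computation is a function that takes a continuation k and returns
# the final answer.

def unit(v):
    """Return v to the current continuation."""
    return lambda k: k(v)

def bind(m, f):
    """Sequence: run m, then run f applied to its result."""
    return lambda k: m(lambda v: f(v)(k))

def reset(m):
    """Delimit the continuation: run m to completion inside the prompt."""
    return unit(m(lambda x: x))

def shift(f):
    """Capture the continuation up to the nearest reset as a function
    and hand it to f; the captured context is discarded from below."""
    return lambda k: f(lambda v: unit(k(v)))(lambda x: x)

def run(m):
    """Run a complete computation with the identity continuation."""
    return m(lambda x: x)

# reset ((shift (fun k -> k (k 2))) + 1):
# shift captures the context [_ + 1], so k (k 2) = (2 + 1) + 1 = 4.
example = reset(bind(shift(lambda k: bind(k(2), k)),
                     lambda x: unit(x + 1)))
```

Running `run(example)` applies the captured continuation twice, illustrating why results involving shift cannot simply be compared syntactically: the surrounding sequencing context is part of their meaning.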
Many indexed non-solutions
One of these attempts was a relation we called gentails, for "generalised entails" (a name which also autocorrects poorly).
A similar problem in the lambda calculus
It turns out that this issue shows up in the lambda calculus.
Consider how to prove that these two expressions are equivalent. According to the definition above, we would have to compare their evaluation behaviour. Suppose we have a big-step semantics. Lambdas are values under it, so each side evaluates immediately to itself; in other words, no more evaluation takes place. If we try this for the two programs above, we end up with distinct syntactic objects that are not equal, though we can clearly see that they behave the same.
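The post's concrete expressions were rendered as math, so as a stand-in, here is a tiny call-by-value big-step evaluator in Python applied to a standard pair with the same flavour: the identity function λx. x and its eta-expansion λx. (λy. y) x. Both are already values, so evaluation returns them unchanged and syntactically unequal, yet applying them to the same closed argument yields the same result:

```python
# Terms: ('var', x) | ('lam', x, body) | ('app', f, a)

def subst(t, x, v):
    """Substitute value v for variable x in term t.
    (Naive, capture-permitting; fine for the closed terms used here.)"""
    tag = t[0]
    if tag == 'var':
        return v if t[1] == x else t
    if tag == 'lam':
        return t if t[1] == x else ('lam', t[1], subst(t[2], x, v))
    return ('app', subst(t[1], x, v), subst(t[2], x, v))

def evaluate(t):
    """Call-by-value big-step evaluation. Lambdas are values:
    evaluation stops as soon as it reaches one."""
    if t[0] == 'lam':
        return t
    if t[0] == 'app':
        f = evaluate(t[1])
        a = evaluate(t[2])
        return evaluate(subst(f[2], f[1], a))
    raise ValueError('stuck: free variable')

ID = ('lam', 'x', ('var', 'x'))                    # λx. x
ETA = ('lam', 'x', ('app', ('lam', 'y', ('var', 'y')),
                    ('var', 'x')))                 # λx. (λy. y) x

# Both sides are already values, so evaluation returns them as-is...
assert evaluate(ID) == ID and evaluate(ETA) == ETA
# ...and they are distinct syntactic objects:
assert ID != ETA
# Yet applied to the same closed argument, they agree:
arg = ('lam', 'z', ('var', 'z'))
assert evaluate(('app', ID, arg)) == evaluate(('app', ETA, arg))
```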
The way I originally described it to people (with the view that staged logic was a logic) was that we had "syntax in the model", which led to a rabbit hole of looking into denotational approaches to interpret that syntax.
However, staged logic is equally viewed as an abstract programming language, and though it seems obvious in hindsight, the problem we had with bubbles is exactly the one described here with the lambda calculus.
Contextual equivalence
The solution is a stronger notion of equivalence: contextual equivalence (with the asymmetric version being called contextual refinement).
It seems customary to call it the "gold standard" for a notion of equivalence, so I will repeat that here.
Why does it solve the problem? Consider the statement again.
We know that these expressions are equivalent informally because no matter the argument we supply to each, we'll get the same result. We might even have justified this to ourselves by mentally evaluating both sides with some symbolic value.
In other words, if we only had some way to continue evaluating these expressions, we could easily tell that they were equivalent. The problem is that we can't do this, because they are already values - the big-step semantics has nothing more to say about them.
If only we had a way to describe hypothetical, future inputs to those programs...
Well, that's what contextual refinement offers. It quantifies over all contexts, so in every context where the programs don't get stuck and produce derivations, we are able to continue running them. In other words, the context gives us a way to talk about how each lambda will eventually be used.
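The quantification is over all contexts, so it can't be tested by enumeration, but probing two programs with a few applied contexts conveys the idea. This is a loose Python analogy (plain callables standing in for closed lambda terms, contexts as functions from program to observation), not a proof technique:

```python
# Two observationally equivalent programs, as Python callables:
identity = lambda x: x
eta_expanded = lambda x: (lambda y: y)(x)

# Contexts as functions: each takes a program and yields an observation.
contexts = [
    lambda p: p(42),                   # C = [.] 42
    lambda p: p(p)("hello"),           # C = ([.] [.]) "hello"
    lambda p: p(lambda z: z + 1)(2),   # C = ([.] (lambda z. z+1)) 2
]

# The two programs are distinct objects (the analogue of being
# syntactically unequal values)...
assert identity is not eta_expanded
# ...but every probing context observes the same behaviour:
assert all(c(identity) == c(eta_expanded) for c in contexts)
```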
How would we actually prove this particular equivalence? Direct induction over contexts is known to be "notoriously difficult". The two tenable proof methods are bisimulations and logical relations. I am more familiar with the latter, as it is much more commonly used in mechanised settings.
The idea there is to strengthen the definition of equivalence so that it says more precisely when two expressions are equivalent. To cope with advanced language features, the definition will have to be inductive over the structure of types and/or the number of evaluation steps the program takes.
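As an illustration of what "inductive over the number of steps" means, here is my own sketch of a standard step-indexed logical relation for an untyped call-by-value calculus (not the definition from the paper cited below): two lambdas are related at index n when arguments related at any smaller index yield related results.

```latex
% Value relation: lambdas related at index n if, for every j < n and
% every pair of arguments related at j, the applications are related.
(\lambda x.\, e_1,\ \lambda x'.\, e_2) \in V_n \iff
  \forall j < n.\ \forall (v_1, v_2) \in V_j.\
    (e_1[v_1/x],\ e_2[v_2/x']) \in E_j

% Expression relation: if e_1 reaches a value within the step budget,
% e_2 reaches a value related at the remaining index.
(e_1, e_2) \in E_j \iff
  \forall i \le j.\ e_1 \to^{i} v_1 \implies
    \exists v_2.\ e_2 \to^{*} v_2 \wedge (v_1, v_2) \in V_{j-i}
```

The step index is what makes the definition well-founded in the absence of types: the recursion decreases on n, not on any type structure.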
It should also come as no surprise that the Proper instances we agonised over for so long are just simple corollaries of contextual equivalence. Hence this relation can be used for rewriting.
With that, we seem to be finally able to formalise staged logic.
Conclusion
Discovering all this was a long and agonising journey, as the connections between these topics are typically discussed in very disparate settings. I still don't know of any source which connects them like this, so I wanted to put it all in words, so that the next person who needs this connection need not suffer a similar fate.
I'm grateful to the authors of a paper which pointed this out particularly clearly: Program equivalence in an untyped, call-by-value functional language with uncurried functions. Moreover, their work is in an untyped setting, a relatively interesting and underexplored one in which to apply logical relations.