Why touch grass when you can soar above it?
The code is available here.
I recommend checking out my previous post, Designing a Constrained Decoding Engine for Structured Text Generation in Large Language Models, before reading this one if you haven’t already.
By inspecting our design, we observe several key aspects of the implementation:
Considering these factors, C++ and Rust are the optimal choices. They offer zero-cost abstractions, allow low-level operations, can export a C ABI, and are heavily optimized for performance. While I personally prefer Rust for its near zero-cost safety features and modern toolchain, modern C++ is equally capable under these criteria.
If the requirements for efficiency and ease of embedding were lifted, virtually any language with sufficient expressive power could be used.
engine.initialize()
Implementing this method requires us to initialize the grammar and the tokenizer's vocabulary and configuration first. While initializing the vocabulary and configuration is straightforward, initializing the grammar involves constructing certain data structures from the EBNF text to facilitate a more efficient implementation later on. Specifically, we want to create an abstract syntax tree (AST) of the EBNF, validate the AST, and then simplify the AST into a high-level intermediate representation (HIR).
Fortunately, in Rust we have the ebnf library that handles most of the work for us. We only need to make minor modifications1 to:
any! and except! special nonterminals
It is possible for an EBNF snippet to be syntactically correct yet semantically incorrect. For example: A::=B;, where B is undefined. Hence, we will validate the AST against:
This step is trickier. To make efficient implementation easier, we would like to do the following:
Remove useless rules
Useless rules are rules that never contribute to text recognition. For example:
S ::= 'ab'S | 'ab'A | 'ab'B;
S ::= S;
A ::= 'c'd;
B ::= 'a'B;
C ::= 'dc';
Assuming S is the starting nonterminal, the rule C ::= 'dc'; is unused and will be removed, as it contributes nothing to the derivations starting from S. Similarly, the rule S ::= S; is considered useless because it recursively refers to itself indefinitely without leading to a terminal state. On the other hand, the rule B ::= 'a'B; will be retained. Although this production never terminates, it allows for the expression of constraints on infinite sequences, thus enhancing the grammar's expressiveness.3
Flatten nested rules with nonterminals
Nested rules refer to rules containing groups, options and/or repetitions. We can view groups, options and/or repetitions as "anonymous nonterminals." For example,
A ::= ("a"|"b")"c"*;
is equivalent to:
A ::= GroupA1 C;
GroupA1 ::= "a"|"b";
C ::= "c"*;
To unify representation, we will replace all “anonymous nonterminals” with explicitly named nonterminals.
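As a sketch, extracting a group into a fresh nonterminal might look like the following. The tuple-of-alternatives encoding for anonymous groups is made up for illustration, not the ebnf crate's AST.

```python
def flatten_production(lhs, production, rules, counter=[0]):
    """Replace inline groups (tuples of alternatives) with named nonterminals."""
    flat = []
    for symbol in production:
        if isinstance(symbol, tuple):  # an anonymous group like ("a"|"b")
            counter[0] += 1
            name = f"Group{lhs}{counter[0]}"
            rules[name] = [list(alternative) for alternative in symbol]
            flat.append(name)
        else:
            flat.append(symbol)
    return flat

rules = {}
# A ::= ("a"|"b") C;  where C ::= "c"*; has already been split out
rules["A"] = [flatten_production("A", [(['"a"'], ['"b"']), "C"], rules)]
assert rules["A"] == [["GroupA1", "C"]]
assert rules["GroupA1"] == [['"a"'], ['"b"']]
```

Options and repetitions would be handled the same way, with the ? or * operator attached to the freshly named nonterminal.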
Flatten operators with precedence
The AST of filter ::= 'a' , 'b' , 'c' | 'd'; looks like this:
- lhs: filter
  rhs:
    Operator:
      - Terminal: a
      - Concatenation
      - Operator:
          - Terminal: b
          - Concatenation
          - Operator:
              - Terminal: c
              - Alternation
              - Terminal: d
This representation makes perfect sense for an EBNF AST but is inefficient for manipulation due to its hierarchy’s misalignment with operator precedence. Additionally, deep tree traversal can lead to inefficiencies such as pointer chasing and cache misses, as well as implementation challenges like stack overflow due to excessive recursion4.
We prefer a tree-like representation that aligns with operator precedence and is relatively shallow, like this:
- lhs: filter
  rhs:
    Operator:
      Alternations:
        - Concatenations:
            - Terminal: d
        - Concatenations:
            - Terminal: a
            - Terminal: b
            - Terminal: c
Group rules with same left-hand side together
Well, most people already group rules with the same left-hand side together. But maybe some translation layers (which convert Python classes to EBNF schemas) become lazy in certain cases. Anyway, we will turn A::='a'; A::='b'; into A::='a'|'b'; to ensure consistency.
Merge consecutive terminals
Obviously A::='b''c''d'; can be simplified to A::='bcd';.
Deduplicate alternations
Obviously A::='a'|'a'; can be simplified to A::='a';. Ensuring that the alternations of the same nonterminal are unique is helpful in later optimizations.
Remove unit rules
Rules like A::=B; B::=C; can be simplified to A::=C;. EBNF-unique recursion semantics should be preserved in this process: expressions like A ::= B+; B ::= C?; become A ::= B*;.
Expand repetitions and options
Note that:
A::=B?; is equivalent to A::=""|B;.
A::=B*; is equivalent to A::=""|B|AB;5.
A::=B+; is equivalent to A::=B|AB;.
This simplification, along with flattening nested rules with nonterminals, removes EBNF-unique recursion semantics, allowing a cleaner implementation.
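The three equivalences can be written down directly. The grammar encoding below is hypothetical, and left recursion is used on purpose since the Earley algorithm handles it well:

```python
def expand_repetition(lhs, symbol, operator):
    """Return productions replacing `lhs ::= symbol<operator>;`,
    where a production is a list of symbols ([] is the empty string)."""
    if operator == "?":
        return [[], [symbol]]                  # "" | B
    if operator == "*":
        return [[], [symbol], [lhs, symbol]]   # "" | B | A B  (left recursion)
    if operator == "+":
        return [[symbol], [lhs, symbol]]       # B | A B
    raise ValueError(f"unknown operator: {operator}")

assert expand_repetition("A", "B", "?") == [[], ["B"]]
assert expand_repetition("A", "B", "*") == [[], ["B"], ["A", "B"]]
assert expand_repetition("A", "B", "+") == [["B"], ["A", "B"]]
```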
Merge nonterminals with same productions
A ::= B|C;
B ::= 'a';
C ::= 'a';
can be simplified to:
A ::= B|B;
B ::= 'a';
Eliminate nullable rules
Consider this grammar:
A::=B? B? B? B? B?;
B ::= 'b';
And this string: 'bbbbb'.
If we do not remove nullable rules (nonterminals that accept the empty string), then at the first 'b', we need to check:
whether B? B? B? B? B? accepts the character, since it is the direct derivation of A.
whether B? B? B? B? accepts the character, since B? is nullable, allowing us to skip it.
whether B? B? B? accepts the character, since B? B? is nullable, allowing us to skip it.
whether B? B? accepts the character, since B? B? B? is nullable, allowing us to skip it.
whether B? accepts the character, since B? B? B? B? is nullable, allowing us to skip it.
When advancing to the next 'b', we must update our position in all five variations and repeat the above checks for each variation. A naive implementation may lead to exponential time complexity. While a more advanced approach might deduplicate the rules to reduce runtime costs, the most efficient strategy is to rewrite the grammar to avoid nullable rules entirely, so we will not have any runtime costs.
The Grammar Rewriting Procedure
Identify All Nullable Symbols: Nullable symbols are those that can directly produce the empty string (e.g., A ::= '';) or whose production consists entirely of nullable symbols (e.g., A ::= B? B?;).
Modify Productions Involving Nullable Symbols: For each nullable symbol within a production, create two additional productions: one with the symbol and one without it. If removing the symbol results in an empty production, do not add it6. For example, A ::= B?; is expanded to A ::= B;.
The code implementing all the procedures described above is available in my forked ebnf repository.
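The rewriting procedure can be sketched as a fixpoint computation for nullability plus a production rewrite. The dict encoding is hypothetical:

```python
import itertools

def nullable_symbols(grammar):
    """Fixpoint: a nonterminal is nullable if some production is empty or
    consists solely of nullable nonterminals."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            if lhs not in nullable and any(
                    all(s in nullable for s in p) for p in productions):
                nullable.add(lhs)
                changed = True
    return nullable

def eliminate_nullable(grammar, nullable):
    """For every production, add variants with each subset of nullable
    symbols removed; drop variants that become empty."""
    result = {}
    for lhs, productions in grammar.items():
        variants = set()
        for production in productions:
            optional = [i for i, s in enumerate(production) if s in nullable]
            for kept in itertools.product([False, True], repeat=len(optional)):
                dropped = {i for i, keep in zip(optional, kept) if not keep}
                variant = tuple(s for i, s in enumerate(production)
                                if i not in dropped)
                if variant:  # never add an empty production
                    variants.add(variant)
        result[lhs] = sorted(list(v) for v in variants)
    return result

grammar = {"A": [["B", "B"]], "B": [[], ["'b'"]]}
nullable = nullable_symbols(grammar)
assert nullable == {"A", "B"}
assert eliminate_nullable(grammar, nullable) == {
    "A": [["B"], ["B", "B"]], "B": [["'b'"]]}
```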
engine.modify_possible_logits()
Unlike the AST transformations, which only require basic programming skills and straightforward tree traversal and construction algorithms7, implementing this function needs advanced algorithms and more aggressive optimizations.
Let’s revisit our problem. After our engine is initialized with LNF, we need to
The first three steps are similar to the operations of a parser in streaming mode. Since our grammar is a proper superset of context-free grammars, we should examine algorithms capable of parsing any context-free language. This excludes LL(k) and LR(k) parsers (such as yacc), which only handle a subset of context-free languages. The remaining candidates are Earley, GLR, GLL, and parsing with derivatives. I prefer a modified Earley algorithm because:
except! special nonterminal semantics without compromising its original time complexity.
If you are unfamiliar with how the original Earley algorithm operates, I highly recommend reading Loup Vaillant's blog, which remains one of the best resources on Earley parsing even after ten years.
Operate on byte level rather than token level
Traditional parsers operate on token sequences produced by a lexer. However, in our context, using a separate lexer is impractical: each LLM call generates only one logits vector and one sampled token. Requiring all terminals in KBNF to be composed of LLM tokens would make creating a KBNF schema extremely tedious, and encoding terminals into LLM tokens is infeasible since terminals could be tokenized in unexpected ways. Therefore, we will encode all LLM token strings and KBNF terminals in UTF-8, enabling our engine to operate at the byte level.
This also means we need a way to indicate how far we are in a given terminal, since a terminal may consist of multiple bytes. We can achieve this goal by adding one extra field stateID to all Earley items.
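A byte-level Earley item might then look like the following sketch. The field names are illustrative, not the actual kbnf layout:

```python
from dataclasses import dataclass

# A hypothetical byte-level Earley item. `state_id` records how far we are
# inside the terminal (or regular expression) under the dot, since a single
# terminal may span multiple UTF-8 bytes.
@dataclass(frozen=True)
class EarleyItem:
    nonterminal_id: int  # left-hand side of the dotted rule
    production_id: int   # which alternative of the nonterminal
    dot_position: int    # index of the symbol after the dot
    start_column: int    # Earley set where recognition of this rule began
    state_id: int        # progress inside the current terminal

# Scanning one byte inside a multi-byte terminal just bumps state_id:
item = EarleyItem(0, 0, 0, 0, state_id=0)
scanned = EarleyItem(0, 0, 0, 0, state_id=item.state_id + 1)
assert scanned.state_id == 1
```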
Discard completed Earley items
The Earley parsing algorithm typically retains all completed items to construct a parsing forest. However, since our goal is only to recognize strings, we can discard completed Earley items (i.e., those for which completions have been executed) in an Earley set. This optimization allows us to compress the set of to-be-completed Earley items into a set of tuples (<nonterminal>, <start_column>), e.g., compressing (A -> a., 1) into (A, 1).
Deploy Leo’s optimization
Leo’s optimization, a modification on the original Earley algorithm, enables parsing all LR(k) grammars in linear time. Despite its effectiveness, it is often not explained in plain terms. I will restate Leo’s optimization here. Consider this grammar with the string 'aaaaa':
A::='a'A|'a';
The corresponding Earley sets:
======0======
(A ->.a A, 0)
(A ->.a, 0)
======1======
(A -> a.A, 0)
(A -> a., 0)
(A ->.a A, 1)
(A ->.a, 1)
======2======
(A -> a.A, 1)
(A -> a., 1)
(A ->.a A, 2)
(A ->.a, 2)
(A -> a A., 0)
======3======
(A -> a.A, 2)
(A -> a., 2)
(A ->.a A, 3)
(A ->.a, 3)
(A -> a A., 1)
(A -> a A., 0)
======4======
(A -> a.A, 3)
(A -> a., 3)
(A ->.a A, 4)
(A ->.a, 4)
(A -> a A., 2)
(A -> a A., 1)
(A -> a A., 0)
======5======
(A -> a.A, 4)
(A -> a., 4)
(A ->.a A, 5)
(A ->.a, 5)
(A -> a A., 3)
(A -> a A., 2)
(A -> a A., 1)
(A -> a A., 0)
Note that the repeated processing of (A -> a A., x) leads to quadratic time complexity. Also, note that the completed Earley items’ chain
(A -> a., 4)=>(A -> a A., 3)=>(A -> a A., 2)=>(A -> a A., 1)=>(A -> a A., 0)
where the first completed item triggers the second completed item, which triggers the third, and so on, is the only way (A -> a.A, 0) in set 0 becomes (A -> a A., 0) in set 4. Similarly, this chain is also the only way (A -> a.A, 0) in set 0 becomes (A -> a A., 0) in set 3, in set 2, and so on. This motivates us to cache (A -> a A., 0) and A in some way, since we know the completion of A will always eventually lead to (A -> a A., 0). This can be stated more formally with the pseudocode:
Leo optimization (lazily evaluated)
def find_leo_item(earley_set, nonterminal):
    # find the items whose postdot symbol is `nonterminal`
    filtered = list(filter(lambda item: item.current_symbol == nonterminal,
                           earley_set))
    if len(filtered) == 1:  # there exists exactly one parent
        temp = filtered[0].advance_dot()  # create a new item
        if temp.dot_index == len(temp.rule):
            # reaches the end of the rule after advancing
            return temp  # This is the Leo item
    return None
def find_topmost_leo_item(earley_set, nonterminal, last_leo_item):
    if nonterminal in earley_set.leo_items:
        return earley_set.leo_items[nonterminal]
    # one nonterminal will only correspond to one chain in a set;
    # otherwise it is no longer a chain.
    leo_item = find_leo_item(earley_set, nonterminal)
    if leo_item is not None:
        last_leo_item = find_topmost_leo_item(
            leo_item.initial_earley_set,
            leo_item.nonterminal, leo_item)
        earley_set.leo_items[nonterminal] = last_leo_item
    return last_leo_item
def leo_complete(earley_item, earley_set):
leo_item = find_topmost_leo_item(earley_item.initial_earley_set,
earley_item.nonterminal, None)
if leo_item is not None:
earley_set.leo_items[earley_item.nonterminal] = leo_item
earley_complete(leo_item, earley_set)
return True
return False
def complete(earley_item, earley_set):
if not leo_complete(earley_item, earley_set):
earley_complete(earley_item, earley_set)
Finding the topmost Leo item is not too complex: it is essentially the chain of Earley items with all the intermediate items thrown away. The definition of the Leo item constraint, however, is subtle. If we use this constraint instead:
def find_leo_item(earley_set, nonterminal):
    # find all items one step away from completion first
    filtered = list(filter(
        lambda item: item.advance_dot().dot_index == len(item.rule),
        earley_set))
    if len(filtered) == 1:  # there exists exactly one
        if filtered[0].current_symbol == nonterminal:
            # then check the postdot symbol
            return filtered[0].advance_dot()  # This will NOT work!
    return None
Then the code will NOT work! This post shows a counterexample.
Use contiguous buffers for Earley sets and HIR
A naive implementation of Earley sets might utilize Vec<Vec<T>>, a 2D nested vector. However, this setup isn’t optimal due to potential cache misses; the inner vectors’ buffers may not be allocated in one contiguous memory block.
A more efficient approach is to use a “jagged vector,” similar to a jagged array but with the capability to append data to the last row and create new rows as needed. A jagged vector utilizes a contiguous buffer as its element storage and a separate dense array of indices to locate rows within the buffer.
By employing multiple hierarchical index arrays, this structure also enables efficient access to flattened LNF grammar, where struct[nonterminalID][production_id][rule_ID] corresponds to a single dotted symbol in an Earley item.
I implemented the jagged array in my jaggedarray crate.
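Here is an illustrative Python model of such a jagged vector — one contiguous buffer plus an offset array. This is a sketch of the idea, not the API of the jaggedarray crate:

```python
class JaggedVec:
    """A jagged vector: one contiguous element buffer plus an offset array.
    Row i spans buffer[offsets[i]:offsets[i + 1]]."""
    def __init__(self):
        self.buffer = []    # all elements live contiguously here
        self.offsets = [0]  # len(offsets) - 1 == number of rows

    def new_row(self):
        # a new row starts (empty) at the current end of the buffer
        self.offsets.append(self.offsets[-1])

    def push(self, value):
        # append to the last row
        self.buffer.append(value)
        self.offsets[-1] += 1

    def row(self, i):
        return self.buffer[self.offsets[i]:self.offsets[i + 1]]

sets = JaggedVec()
sets.new_row(); sets.push("item0"); sets.push("item1")
sets.new_row(); sets.push("item2")
assert sets.row(0) == ["item0", "item1"]
assert sets.row(1) == ["item2"]
assert sets.buffer == ["item0", "item1", "item2"]  # contiguous storage
```

A Rust version would use a `Vec<T>` buffer and a `Vec<usize>` offset array, keeping all Earley items in one allocation.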
Define the representation of a regular expression
Various representations exist for regular expressions, including naive backtracking, densely represented deterministic finite automata (DFA), sparsely represented DFA, and Thompson nondeterministic finite automata (NFA). I prefer using the lazily built dense DFA as the default option because it maintains linear time complexity and is relatively fast. Other options, like a fully compiled DFA, are configurable.
Extend the Earley algorithm to handle regular expressions
We will integrate regular expressions into the Earley algorithm by adding a stateID to each Earley item to represent the current state within a DFA. When we process a regular expression during the scanning stage of the Earley algorithm:
if the DFA can transition on the input byte, we advance the item and update its stateID.
if the DFA reaches an accepting state, we also complete the item and discard its stateID.
Since empty rules are already handled during the generation of LNF, there is no need to check if the regular expression can accept an empty string here.
except!
To support except!, we will:
Define the representation of except!
The simplest and probably the most efficient way to represent except! is to use a regular expression8 in unanchored search mode.
Extend the Earley algorithm to handle except!
We will again use the stateID of an Earley item to represent the current state within the DFA. With bit manipulation, we can store the current maximum repetition in the stateID as well.
When we process the regular expression in except! during the scanning stage of the Earley algorithm:
we advance the DFA on the input byte and update the stateID.
An Earley item can be represented as the tuple (nonterminalID, dot_position, production_id, start_position, stateID). Some fields usually require a narrower range than others; for example, it's unlikely for there to be more than 256 items in a single production or more than 256 productions for a single nonterminal. This allows us to use smaller integers, such as u16u8u8u16u16 or u8u8u8u8u32. With generics, the most optimal combination can be selected based on user configuration and specific grammars. This method also applies to Leo items, to-be-completed Earley items, and LNF nodes.
Filter tokens by possible first bytes
Once an Earley set enters the scanning stage (where there are no items left to be completed or predicted), we know precisely which bytes can potentially be accepted. Efficient prefix-search data structures (like HashMap<u8, Box<[u8]>>) can be utilized to filter out a potentially large number of tokens without running the Earley recognizer.
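The idea can be sketched as follows, with a made-up vocabulary; a real implementation would precompute this index once per tokenizer:

```python
from collections import defaultdict

# Index every token in the vocabulary by its first UTF-8 byte. Then, given
# the set of bytes the current Earley set can accept, collect candidate
# tokens without running the recognizer. The vocabulary is illustrative.
vocabulary = {0: b"hello", 1: b" world", 2: b"{", 3: b'"key"'}

tokens_by_first_byte = defaultdict(list)
for token_id, token_bytes in vocabulary.items():
    tokens_by_first_byte[token_bytes[0]].append(token_id)

def candidate_tokens(allowed_first_bytes):
    return sorted(
        token_id
        for byte in allowed_first_bytes
        for token_id in tokens_by_first_byte[byte]
    )

# If the grammar currently only accepts '{' or '"', only tokens 2 and 3
# need to be fed to the Earley recognizer.
assert candidate_tokens({ord('{'), ord('"')}) == [2, 3]
```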
Cache Possible Tokens
Theoretical Discussion
Unlike regular expressions, the computational model equivalent to context-free grammars is the pushdown automaton (PDA), which is not a finite state automaton. This implies that it is theoretically impossible to fully cache all states of a given context-free grammar. Moreover, even for LL(k) or LR(k) grammars, subsets of context-free grammars equivalent to deterministic pushdown automata, finite state automata may still not simulate them due to the infinite capacity of the stack in a PDA.
So why bother caching possible tokens? There are two main reasons:
Simplicity of many grammars: A significant number of grammars are not exceedingly complex. For example, regular grammars can always be fully cached, which is one reason tools like Outlines and Guidance initially supported only regular expressions.
Finite output length: When the output length is finite, the number of states involved is also finite. Given that the same grammar is often reused to generate outputs of similar lengths, it is likely that the same states will be reached repeatedly, making caching effective.
Indeed, there are scenarios where neither of these conditions is met. Hence, enabling cache support in our system will be optional.
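Structurally, such an optional cache is just a memo table keyed by some hashable fingerprint of the recognizer state; the names below are illustrative, not the actual kbnf API:

```python
class TokenCache:
    """Memoize the set of allowed tokens per recognizer-state fingerprint."""
    def __init__(self):
        self._cache = {}

    def get_or_compute(self, fingerprint, compute_allowed_tokens):
        if fingerprint not in self._cache:
            self._cache[fingerprint] = compute_allowed_tokens()
        return self._cache[fingerprint]

cache = TokenCache()
calls = []

def compute():
    calls.append(1)  # stands in for running the Earley recognizer
    return frozenset({2, 3})

assert cache.get_or_compute(("A", 0), compute) == frozenset({2, 3})
assert cache.get_or_compute(("A", 0), compute) == frozenset({2, 3})
assert len(calls) == 1  # the second lookup hits the cache
```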
The Algorithm
The actual caching algorithm is complex. Naively caching all Earley sets for a given input sequence is very inefficient. We need a method to compact Earley sets. For Earley sets S(0), S(1), ..., S(m), we have the following observations:
Earley sets S(i) for which n = max(start position in D) < i < m do not influence accepted bytes, where D consists of items in S(m) not created at m.
S(m) do not influence accepted bytes.
For an item (nonterminal, start_position, ...) in S(m), if S(start_position) contains a Leo item corresponding to nonterminal, then the start_position can be changed to the Leo item's start position without affecting accepted bytes, given that the Leo item transition is still maintained.
These observations lead to our compacting algorithm:
S(m) and apply observation 3 wherever possible.
S(m) to position n and update Leo items.
n.
This compacted Earley set can then serve as an index in the cache. What's more, even if caching is disabled, this compacting algorithm might still be useful, as it reduces memory usage on long inputs and possibly cache misses.
Additional heuristic caching strategies might involve using nonterminal IDs as cache indices and computing the union of possible tokens over all dotted nonterminals. I will leave the decision on whether to use these strategies to the user.
Cache mode
It is theoretically possible to cache all states of the Earley recognizer up to m input bytes. However, its time complexity is equivalent to enumerating all possible ASTs with inputs up to m bytes. I suspect this will only work for simple grammars or short inputs. Lazy mode sounds like a more reasonable default option.
The other APIs are trivial to implement.
Let’s revisit our goals to assess whether we have achieved them.
Partially achieved. KBNF is straightforward to use, though it is not an out-of-the-box solution. To address this, I plan to develop a Python library called “Formatron” as a high-level wrapper for KBNF, aimed at achieving user-friendliness comparable to Outlines and Guidance.
Theoretically speaking, yes. We have successfully achieved linear time complexity for every LR(k) grammar and quadratic time complexity for every unambiguous grammar. For general context-free grammars, things are more ambiguous (pun intended): while subcubic algorithms exist (although with a large constant), all other general-purpose parsing algorithms (like Earley, GLR, GLL…) are indeed cubic, like ours.
Practically, it’s always a work in progress. While significant efforts have been made to speed up the engine, there are still unexplored techniques such as SIMD, multithreading, GPU computing, and LR algorithms. I believe there is always room for improvement.
Partially achieved. We have provided several configurable options, but it’s possible that users may desire additional flexibility—such as support for dynamic grammars or parsing expression grammars. Gathering more user feedback will help us identify and implement the additional flexibility needed.
Implementing a practical engine for constrained decoding is much more exciting than designing it. It challenges us to apply traditional computer science knowledge to new demands in the era of large language models.
In the future, I plan to write a follow-up post on designing the Python library “Formatron.”
Well, sort of. You can definitely argue that we also need user-friendly errors for our EBNF variant to enhance usability, and I think you are right. The problem is that the library does not have these features, and implementing them is time-consuming, so I will omit them. At least for now. ↩
It is technically possible to support more complex nonterminals that only contain terminals after substituting all nonterminals and grouping. However, this feature is harder to implement so I will leave it for now. ↩
Is this particularly useful? Maybe not. Does this affect implementation’s efficiency? Definitely not. So if EBNF by definition allows it, why don’t we allow it? ↩
That’s why I code all the steps before with a while loop and a stack on the heap. ↩
This is the left recursion version. You may want the right recursion version. I prefer left recursion because I will use Earley Algorithm later. ↩
Note that accepting an empty string at the beginning is useless in our context. ↩
Well, not that simple if you, like me, do not want to use recursion and hate reference counting and unnecessary cloning. ↩
After typing this down I realized I could support full regular expression as negative lookaheads easily. Whether that is a good design choice and how much interference it has on other optimizations are unclear though. ↩