Why touch grass when you can soar above it?
Structured text generation means to create text in specific formats, such as JSON, markdown, HTML, or code. Typically, prompting a model to produce text in these formats works well most of the time. But what if you need it to work perfectly every time? That’s where constrained decoding comes into play. It prevents the generation of unwanted tokens by adjusting the model’s output probabilities. Combined with prompt engineering, constrained decoding ensures the output strictly follows the desired format. Moreover, if constrained decoding limits the possible outputs to a single token, there’s no need to call the model again, which can significantly speed up the generation process.
While it might seem straightforward, implementing constrained decoding correctly is difficult. Consider a simple scenario where we need the model to output in the format XXXXX*[YYYY]*. Here, XXXXX should not include *[ and YYYY should not include both *[ and ]*. It is tempting to quickly write:
# some logic
in_y = False
if last_token == "*[":
in_y = True
if in_y:
logits[find_token_id("*[")] = float("-inf") # tweak the probability
if last_token == "]*":
# logic to end generation
# some logic
Unfortunately, this code fails for several reasons:
*[ may not exist in the model’s vocabulary at all.*[ into two tokens(* and [) and bypass our constraint.a*[ or *[a or whatever token containing *[ and bypass our constraint.The correct version:
# some logic
for token in vocabulary:
if '*[' in token and token.index('*]') != len(token)-1:
logits[find_token_id(token)] = float("-inf")
in_y = False
if output_text.endswith("]*"):
# logic to end generation
if "*[" in output_text:
in_y = True
if in_y:
for token in vocabulary:
if '*[' in token:
logits[find_token_id(token)] = float("-inf")
if output_text.endswith("*") and '[' in token:
logits[find_token_id(token)] = float("-inf")
# some logic
Even handcoding a simple format requires handling many nuances. Now, imagine handcoding constrained decoding for full JSON. This example should demonstrate why handcoding constrained decoding in general is not a good idea.
Out-of-the-box solutions such as Guidance, Outline, Jsonformer, and others, can be very useful and might perfectly meet your needs. However, they might not be the best fit if you:
To enable users to specify a format, we need a method to describe that format. A long-established approach is to use a metasyntax notation, a format designed to describe other formats. To select a metasyntax notation, we need to ask ourselves: “How much expressiveness do we need?” Considering our goal to support all commonly used formats, we should examine the most complex yet practical formats—programming languages. They turn out to be mostly context-free. Then, a variant of Extended Backus-Naur Form(EBNF) is a good choice:
Besides all the standard features of EBNF(except exceptions1), we also want:
digits ::= #'[0-9]'; (*A nonterminal that accepts one digit*)
(*Yes, this weird thing is a comment in ebnf*)
Adding regular expression improves usability. It also enables potential optimizations since regular expression can be compiled into determistic finite automata(DFA).
Optional concatenation symbol
a = 'a',a; (*pedantic EBNF*)
a = 'a'a; (*This is fine*)
a = 'a' a; (*This is fine as well*)
The concatenation symbol , is traditionally used to separate elements but often merely adds syntactic noise. Allowing its omission simplifies the grammar visually.
Regex-like operators extension
digits ::= #'[0-9]';
digits ::= digits+; (*A nonterminal that accepts one or more digit*)
aaa ::= 'a'*; (*A nonterminal that accepts zero or more 'a'*)
aaa ::= 'a'?; (*A nonterminal that accepts zero or one 'a'*)
Many EBNF variants already use this extension, enhancing usability.
2024/05/21 update: Removed any! special nonterminal. After further investigation I found its semantics difficult to define clearly. Regex should be enough.
except!(<strings>,[n]) special nonterminal
start ::= except!('\n\n')'\n\n';(*Accept a text sequence
where the only '\n\n' is at the end*)
(*'114514\n\n' will be accepted*)
(*'114514' will not be accepted*)
(*'114\n\n514\n\n' will not be accepted*)
start ::= except!('\n\n', 10)'\n\n';(*Accept a text sequence
that at most consists of 10 bytes
where the only '\n\n' is at the end*)
(*'114514\n\n' will be accepted*)
(*'114514' will not be accepted*)
(*'114\n\n514\n\n' will not be accepted*)
end ::= '\n\n'; (*The nonterminal in except!,
after full expansion,
must only contain alternation of strings.*)
start ::= except!(end)end;(*Accept a text sequence
where the only '\n\n' is at the end*)
(*'114514\n\n' will be accepted*)
(*'114514' will not be accepted*)
(*'114\n\n514\n\n' will not be accepted*)
I know it looks ugly, but it’s necessary. Without this extension, we still can’t easily express the format XXXXX*[YYYY]* where XXXXX should not include *[ and YYYY should not include both *[ and ]*.
Can’t we just augment regular expression with negative lookahead?
Most regular expression engines either do not support lookahead at all or support the arbitrary version. Modifying these engines to extend or limit their capabilities would introduce significant complexities.
Furthermore, augmenting regular expressions in this way would require a way to embed nonterminals within the regular expressions, leading to added complexity. Such changes could also break compatibility with standard regular expression syntax.
Can’t we just use except!(<strings>) to represent one token and treat it like a normal nonterminal?
It won’t work. Consider this hypothetical syntax:
X ::= except!('*[')|except!('*[')X;
'*[' in the middle does not prevent the model from outputting '*' and [ tokens separately.'*[' in the generated text, then the semantics will be confusing. For instance, theoretically '*'X should allow *[ at the beginning (where the first '*' comes from the terminal and the second from X, which only bans *[ and not [); however, this proposed definition would reject *[ at the beginning.X would essentially recreate the original except! syntax but with a more convoluted representation and implementation.The fundamental issue is that to implement except! semantics, the nonterminal must be stateful to track the text it has accepted. However, nonterminal recursion in EBNF or context-free languages is not designed to be stateful.
Can’t we just use more natural syntax like except!(<strings>)*?
This extended semantics is not natural at all in the context of context-free languages. It is context-sensitive. As discussed above, we need an integrated method to handle its repetition, but except!(<strings>)* makes it appear as a simple combination of except!(<strings>) and *. In other words, a nonterminal repeated 0 or more times. As the discussion above indicates, this is not the correct interpretation. An integrated, though less elegant syntax, will clearly indicate that its semantics differ significantly from other parts of the grammar.
Update 2024/05/15: I decide to name my EBNF variant as KBNF(Koishi’s BNF). Long live Touhou Project!
Fortunately, considering the use cases, our theoretical API can be very simple:
engine.initialize(): Initializes the engine with grammar, vocabulary, configuration, etc.engine.modify_possible_logits(): A high-level method that takes a new token and a mutable logits slice to:
engine.advance(): A low-level method that takes a new token, updates internal states, and returns flags to indicate:
engine.current_possible_tokens(): Returns current possible tokens based on the engine’s state.update_logits_from_tokens(): Modifies logits based on the current possible tokens. These three low-level methods support scenarios where the engine’s states need updating during prefill stages.engine.reset(): Resets the engine’s internal states to their initial conditions.engine.clone(): Clones the engine’s internal states but does not clone vocabulary and grammar.
Designing a practical engine for constrained decoding is an exciting experience, revealing how many intricacies are involved in creating a library that, at first glance, appears straightforward. Most challenges stem from the our requirement that the engine must be practical for real-world applications. I have written the implementation post here.
The standard exception symbol, per ISO standard, is a weird thing and has too much limitations to be useful. ↩
Regular expression in this post refers to the regular expression as it is defined in wikipedia. Many programming languages have extended regular expression to include features such as arbitrary lookarounds and recursion, which essentially turns it into context-free or context-sensitive languages and is almost impossible to implement efficiently. ↩