Xintong Sun's blog


Implementing a Constrained Decoding Engine for Structured Text Generation in Large Language Models

The code is available here.

I recommend checking out my previous post, Designing a Constrained Decoding Engine for Structured Text Generation in Large Language Models, before reading this one if you haven’t already.

Choosing a programming language

By inspecting our design, we observe several key aspects of the implementation:

  1. Complex Implementation: The design is complex and will likely lead to a complex implementation.
  2. Performance Constraints: The implementation is primarily CPU-bound or memory-bound, rather than IO-bound.
  3. Efficiency: The implementation must be efficient.
  4. Integration Flexibility: The program must be easy to integrate into other applications and languages.
  5. Stability of Demand: Changes in implementation due to shifting demands are extremely unlikely.

Considering these factors, C++ and Rust are the optimal choices. They offer zero-cost abstractions, allow low-level operations, can export a C ABI, and are heavily optimized for performance. While I personally prefer Rust for its near zero-cost safety features and modern toolchain, modern C++ is equally capable under these criteria.

If the requirements for efficiency and ease of embedding were lifted, virtually any language with sufficient expressive power could be used.
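As a concrete sliver of the embedding story, here is a minimal sketch of what exporting a C ABI from Rust looks like. The function name and signature are purely illustrative, not KBNF's actual interface:

```rust
// Hypothetical sketch of exporting a C ABI from Rust; the function name and
// signature are illustrative stand-ins, not KBNF's actual interface.
// `#[no_mangle]` keeps the symbol name stable, and `extern "C"` selects the
// C calling convention, so any language with a C FFI can call this.
#[no_mangle]
pub extern "C" fn engine_accepts_token(state: u32, token_id: u32) -> bool {
    // Placeholder logic standing in for a real engine-state lookup.
    state != 0 && token_id < 50_000
}
```

Compiled as a `cdylib`, such a function can be loaded from Python via `ctypes`, from C++ via a plain header declaration, and so on, which is exactly the flexibility criterion above.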

Implement core API

engine.initialize()

Implementing this method requires us to initialize the grammar, the tokenizer's vocabulary, and the configuration. While initializing the vocabulary and configuration is straightforward, initializing the grammar involves constructing certain data structures from the EBNF text to facilitate a more efficient implementation later on. Specifically, we want to create an abstract syntax tree (AST) of the EBNF, validate the AST, and then simplify the AST into a high-level intermediate representation (HIR).
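The grammar-initialization steps can be sketched as a simple pipeline. All type and function names below are hypothetical placeholders, not the actual KBNF API:

```rust
// Illustrative pipeline for grammar initialization; every name here is a
// hypothetical stand-in for the real implementation.
struct Ast;          // parsed EBNF syntax tree
struct ValidatedAst; // AST that passed semantic checks
struct Hir;          // simplified high-level intermediate representation

fn parse_ebnf(_text: &str) -> Ast {
    Ast // stand-in for the real EBNF parser
}

fn validate(_ast: Ast) -> Result<ValidatedAst, String> {
    Ok(ValidatedAst) // stand-in for the semantic checks described below
}

fn lower_to_hir(_ast: ValidatedAst) -> Hir {
    Hir // stand-in for the AST-to-HIR simplification
}

fn initialize_grammar(grammar_text: &str) -> Result<Hir, String> {
    let ast = parse_ebnf(grammar_text);
    let validated = validate(ast)?;
    Ok(lower_to_hir(validated))
}
```

The point of the staged design is that each pass has one job: parsing rejects syntactic errors, validation rejects semantic errors, and lowering produces a representation tuned for the hot path.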

Construct EBNF AST

Fortunately, in Rust we have the ebnf library, which handles most of the work for us. We only need to make minor modifications1 to:

  1. Support any! and except! special nonterminals
  2. Support comments
  3. Check regular expressions’ syntactic correctness
  4. Intern strings for regular expressions, nonterminals and terminals.
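Item 4 deserves a quick illustration. A string interner maps each distinct string to a small integer id, so later passes can compare symbols by id instead of by text. The sketch below is a minimal version, not the interner the library actually uses:

```rust
use std::collections::HashMap;

/// Minimal string-interner sketch (illustrative, not the library's actual
/// implementation). Each distinct string gets a small integer id, so
/// nonterminals, terminals, and regexes can be compared by id.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    /// Returns the id for `s`, allocating a new one on first sight.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_string());
        self.ids.insert(s.to_string(), id);
        id
    }

    /// Maps an id back to the original string.
    fn resolve(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}
```

Interning pays off later: the parser's hot loops only shuffle `u32`s around, and the original text is recovered with `resolve` only when printing errors.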

Validate the AST

It is possible for an EBNF snippet to be syntactically correct yet semantically invalid. For example, in A::=B;, the nonterminal B is undefined. Hence, we validate the AST by checking for:

  1. Undefined nonterminals
  2. Invalid excepted nonterminals
    • All excepted nonterminals must directly contain nonempty terminals combined with concatenations and alternations2.
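The first check is a straightforward set-membership pass: collect every nonterminal that appears on a left-hand side, then flag any right-hand-side reference outside that set. A sketch, with a simplified rule representation that stands in for the real AST:

```rust
use std::collections::HashSet;

/// Sketch of the undefined-nonterminal check. `rules` is a simplified
/// stand-in for the AST: each entry maps a defined nonterminal to the
/// nonterminals its right-hand side references.
fn undefined_nonterminals<'a>(rules: &[(&'a str, Vec<&'a str>)]) -> Vec<&'a str> {
    // Every left-hand side is a defined nonterminal.
    let defined: HashSet<&str> = rules.iter().map(|(lhs, _)| *lhs).collect();
    // Any referenced nonterminal outside that set is undefined.
    let mut missing: Vec<&'a str> = rules
        .iter()
        .flat_map(|(_, rhs)| rhs.iter().copied())
        .filter(|nt| !defined.contains(nt))
        .collect();
    missing.sort();
    missing.dedup();
    missing
}
```

For the A::=B; example above, this pass would report B as undefined, while a self-referential rule like A::=A; passes (it is defined, merely useless).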

Obtain HIR

This step is trickier. To make efficient implementation easier, we would like to do the following:

The code implementing all the procedures described above is available in my forked ebnf repository.

engine.modify_possible_logits()

Unlike the AST transformations, which only require basic programming skills and straightforward tree traversal and construction algorithms7, implementing this function requires advanced algorithms and more aggressive optimizations.

Choosing an algorithm

Let’s revisit our problem. After our engine is initialized with LNF, we need to

  1. Determine if the current engine state can accept the input token.
  2. Update the state if possible.
  3. Identify all possible tokens that can be accepted by the updated state.
  4. Update the logits accordingly.
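The four steps above can be sketched as a small per-token loop. The engine type and its methods below are hypothetical stand-ins for KBNF's actual API:

```rust
/// Hypothetical engine sketch; the type and its methods are illustrative
/// stand-ins, not KBNF's actual API.
struct Engine {
    state: u32,
}

impl Engine {
    /// Steps 1–2: check whether the token is acceptable and, if so,
    /// advance the internal state.
    fn try_accept(&mut self, token_id: u32) -> bool {
        if token_id == u32::MAX {
            return false; // stand-in for a rejected token
        }
        self.state = self.state.wrapping_add(1); // stand-in state update
        true
    }

    /// Step 3: compute which tokens the updated state can accept.
    fn allowed_tokens(&self, vocab_size: usize) -> Vec<bool> {
        // Stand-in rule: pretend only even token ids are acceptable.
        (0..vocab_size).map(|t| t % 2 == 0).collect()
    }
}

/// Step 4: mask out the logits of tokens the grammar cannot accept, so the
/// sampler can never pick them.
fn modify_possible_logits(engine: &Engine, logits: &mut [f32]) {
    let allowed = engine.allowed_tokens(logits.len());
    for (logit, ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY;
        }
    }
}
```

Setting disallowed logits to negative infinity means their probability becomes exactly zero after softmax, which is what guarantees the output stays inside the grammar.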

The first three steps are similar to the operations of a parser in streaming mode. Since our grammar is a proper superset of context-free grammars, we should examine algorithms capable of parsing any context-free language. This excludes LL(k) and LR(k) parsers (such as yacc), which only handle a subset of context-free languages. The remaining candidates are Earley, GLR, GLL, and parsing with derivatives. I prefer a modified Earley algorithm because:

Modifications to the original Earley algorithm

If you are unfamiliar with how the original Earley algorithm operates, I highly recommend reading Loup Vaillant’s blog, which remains one of the best resources on Earley parsing even after ten years.
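As a tiny refresher, the central data structure of Earley parsing is the "item": a record of how far we have progressed through one rule and where in the input that attempt began. A minimal sketch (field layout here is illustrative, not the engine's actual representation):

```rust
/// Illustrative Earley item; the actual engine representation differs.
/// An item records partial progress through one grammar rule.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct EarleyItem {
    rule: usize,  // index of the grammar rule being matched
    dot: usize,   // how many symbols of the rule's body are consumed
    start: usize, // input position where this attempt began
}

impl EarleyItem {
    /// The "scan"/"complete" step moves the dot one symbol forward;
    /// the rule and start position never change.
    fn advance(self) -> EarleyItem {
        EarleyItem {
            dot: self.dot + 1,
            ..self
        }
    }
}
```

The algorithm's three operations (predict, scan, complete) all just create or advance items like this in per-position item sets, which is what makes the algorithm both general and amenable to the modifications discussed here.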

The other APIs are trivial to implement.

Do we meet our goals?

Let’s revisit our goals to assess whether we have achieved them.

Users should be able to easily use the engine to specify commonly used formats or patterns

Partially achieved. KBNF is straightforward to use, though it is not an out-of-the-box solution. To address this, I plan to develop a Python library called “Formatron” as a high-level wrapper for KBNF, aimed at achieving user-friendliness comparable to Outlines and Guidance.

The engine should implement formats in the most efficient way. In other words, if a user does not use extra expressiveness, then they should not incur unnecessary overhead

Theoretically speaking, yes. We have successfully achieved linear time complexity for every LR(k) grammar and quadratic time complexity for every unambiguous grammar. For general context-free grammars, things are more ambiguous (pun intended): while subcubic algorithms exist (albeit with a large constant factor), all other general-purpose parsing algorithms (Earley, GLR, GLL, and so on) are cubic, like ours.

Practically, it’s always a work in progress. While significant efforts have been made to speed up the engine, there are still unexplored techniques such as SIMD, multithreading, GPU computing, and LR algorithms. I believe there is always room for improvement.

Users should have the flexibility to select their preferred options when implementation choices involve compromises

Partially achieved. We have provided several configurable options, but it’s possible that users may desire additional flexibility—such as support for dynamic grammars or parsing expression grammars. Gathering more user feedback will help us identify and implement the additional flexibility needed.

Conclusion

Implementing a practical engine for constrained decoding is much more exciting than designing it. It challenges us to apply traditional computer science knowledge to new demands in the era of large language models.

In the future, I plan to write a follow-up post on designing the Python library “Formatron.”

  1. Well, sort of. You can definitely argue that we also need user-friendly errors for our EBNF variant to enhance usability, and I think you are right. The problem is that the library does not have these features, and implementing them is time-consuming, so I will omit them. At least for now.

  2. It is technically possible to support more complex nonterminals that only contain terminals after substituting all nonterminals and grouping. However, this feature is harder to implement, so I will leave it for now.

  3. Is this particularly useful? Maybe not. Does this affect the implementation’s efficiency? Definitely not. So if EBNF by definition allows it, why don’t we allow it?

  4. That’s why I code all the steps before with a while loop and a stack on the heap. 

  5. This is the left recursion version. You may want the right recursion version. I prefer left recursion because I will use Earley Algorithm later. 

  6. Note that accepting an empty string at the beginning is useless in our context. 

  7. Well, not that simple if you, like me, do not want to use recursion and hate reference counting and unnecessary cloning. 

  8. After typing this down I realized I could support full regular expression as negative lookaheads easily. Whether that is a good design choice and how much interference it has on other optimizations are unclear though.