Formatron v0.4.9
Formatron empowers everyone to control the output format of language models with minimal overhead.
utils.py File Reference

Namespaces

namespace  formatron
 
namespace  formatron.integrations
 This subpackage contains integrations with other frameworks and libraries.
 
namespace  formatron.integrations.utils
 

Functions

bytes formatron.integrations.utils._multiple_replace (typing.Dict[bytes, bytes] replacements, re.Pattern[bytes] regex, bytes text)
 
typing.Dict[int, bytes] formatron.integrations.utils.get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors=None)
 Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors.
 
typing.List[typing.Callable] formatron.integrations.utils.autodetect_processors (typing.Dict[str, int] vocab)
 Autodetect vocabulary processors.
 
 formatron.integrations.utils.update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char)
 Vocabulary processor for <0xHH> tokens (used in llama tokenizers)
 
 formatron.integrations.utils.update_vocab_sentencepiece (typing.Dict[bytes, bytes] token_to_char)
 Vocabulary processor for the ▁ token (used in SentencePiece tokenizers)
 
 formatron.integrations.utils.update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char)
 Vocabulary processor for GPT2 style token mangling, like from a space to Ġ (used in huggingface bytelevel preprocessors)
 
 formatron.integrations.utils._huggingface_bytelevel_decoder ()
 I hate legacy code.
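
The processors listed above all describe the same idea: build a table from a mangled token fragment to the raw bytes it stands for, then apply that table to each vocabulary entry. The following is a minimal, standalone sketch of that idea, not Formatron's actual implementation. Every function name in it (`multiple_replace`, `byte_fallback_table`, `sentencepiece_table`, `gpt2_bytelevel_table`, `unmangle`) is hypothetical and chosen for illustration only. The GPT-2 table follows the widely used byte-to-unicode convention of byte-level BPE tokenizers.

```python
import re
from typing import Dict


def multiple_replace(replacements: Dict[bytes, bytes],
                     regex: "re.Pattern[bytes]",
                     text: bytes) -> bytes:
    # Single pass over `text`: each regex match is looked up in the table.
    return regex.sub(lambda m: replacements[m.group()], text)


def byte_fallback_table() -> Dict[bytes, bytes]:
    # Assumed convention: b"<0x0A>" stands for the single byte 0x0A, and so on.
    return {f"<0x{b:02X}>".encode("utf-8"): bytes([b]) for b in range(256)}


def sentencepiece_table() -> Dict[bytes, bytes]:
    # U+2581 (LOWER ONE EIGHTH BLOCK) marks a space in SentencePiece vocabularies.
    return {"\u2581".encode("utf-8"): b" "}


def gpt2_bytelevel_table() -> Dict[bytes, bytes]:
    # Inverse of the GPT-2 byte-to-unicode map: printable bytes keep their own
    # code point; the remaining bytes are shifted to U+0100 and above, in order.
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    table: Dict[bytes, bytes] = {}
    shift = 0
    for b in range(256):
        if b in printable:
            ch = chr(b)
        else:
            ch = chr(256 + shift)
            shift += 1
        table[ch.encode("utf-8")] = bytes([b])
    return table


def unmangle(token: str, table: Dict[bytes, bytes]) -> bytes:
    # Longest keys first so that a longer fragment is never shadowed by a shorter one.
    keys = sorted(table, key=len, reverse=True)
    regex = re.compile(b"|".join(re.escape(k) for k in keys))
    return multiple_replace(table, regex, token.encode("utf-8"))


if __name__ == "__main__":
    print(unmangle("<0x0A>", byte_fallback_table()))
    print(unmangle("\u2581the", sentencepiece_table()))
    print(unmangle("\u0120Hello", gpt2_bytelevel_table()))
```

In this sketch, bytes that match no table entry pass through unchanged, which is why the same `unmangle` helper can be applied with each table in turn.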