Formatron v0.4.11

Formatron empowers everyone to control the output format of language models with minimal overhead.

| Functions | |
| bytes | _multiple_replace (typing.Dict[bytes, bytes] replacements, re.Pattern[bytes] regex, bytes text) |
| typing.Dict[int, bytes] | get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors=None) |
| | Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors. |
| typing.List[typing.Callable] | autodetect_processors (typing.Dict[str, int] vocab) |
| | Autodetect vocabulary processors. |
| | update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char) |
| | Vocabulary processor for <0xHH> tokens (used in llama tokenizers). |
| | update_vocab_sentencepiece (typing.Dict[bytes, bytes] token_to_char) |
| | Vocabulary processor for the ▁ token (used in sentencepiece tokenizers). |
| | update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char) |
| | Vocabulary processor for GPT-2 style token mangling, such as a space becoming Ġ (used in Hugging Face byte-level preprocessors). |
| | _huggingface_bytelevel_decoder () |
| | I hate legacy code. |
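
The listing gives `_multiple_replace` a signature but no description. Its parameters (a replacement dictionary, a compiled bytes pattern, and the text) suggest a single-pass substitution. The following is a minimal sketch under that assumption, not Formatron's implementation; the function name `sketch_multiple_replace` and the sample data are hypothetical.

```python
import re
import typing


def sketch_multiple_replace(replacements: typing.Dict[bytes, bytes],
                            regex: "re.Pattern[bytes]",
                            text: bytes) -> bytes:
    # Hypothetical helper: replace every match with its mapped value in one pass.
    return regex.sub(lambda match: replacements[match.group(0)], text)


# Hypothetical sample data: a sentencepiece marker and one <0xHH> token.
replacements = {"\u2581".encode("utf-8"): b" ", b"<0x0A>": b"\n"}

# Longest keys first, so that a longer alternative wins over a shorter prefix.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile(b"|".join(re.escape(key) for key in keys))

result = sketch_multiple_replace(replacements, pattern, "\u2581hello<0x0A>".encode("utf-8"))
print(result)
```

Compiling the pattern once and reusing it for every token avoids rebuilding the alternation for each vocabulary entry.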
      
typing.List[typing.Callable] formatron.integrations.utils.autodetect_processors (typing.Dict[str, int] vocab)

typing.Dict[int, bytes] formatron.integrations.utils.get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors = None)

Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors.

Parameters
| vocab | The mangled vocabulary. |
| processors | List of callables with signature (token_to_char: typing.Dict[bytes, bytes]) -> None. Each callable can "unmangle" encoded characters back to the original characters. If None, processors are auto-detected. |
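
To illustrate the processor contract described above, here is a minimal sketch. It assumes each processor fills `token_to_char` with mappings from mangled byte sequences to original bytes, and that those mappings are then substituted into every UTF-8 encoded token. That assumption and all names prefixed with `sketch_` are hypothetical; this is not Formatron's own code.

```python
import re
import typing

Processor = typing.Callable[[typing.Dict[bytes, bytes]], None]


def sketch_sentencepiece_processor(token_to_char: typing.Dict[bytes, bytes]) -> None:
    # Hypothetical processor: the marker "▁" (U+2581) stands for a space.
    token_to_char["\u2581".encode("utf-8")] = b" "


def sketch_0xhh_processor(token_to_char: typing.Dict[bytes, bytes]) -> None:
    # Hypothetical processor: "<0xHH>" tokens stand for the single byte 0xHH.
    for value in range(256):
        token_to_char[("<0x%02X>" % value).encode("utf-8")] = bytes([value])


def sketch_original_characters(vocab: typing.Dict[str, int],
                               processors: typing.List[Processor]
                               ) -> typing.Dict[int, bytes]:
    # Let every processor register its mangled-to-original mappings.
    token_to_char: typing.Dict[bytes, bytes] = {}
    for processor in processors:
        processor(token_to_char)
    if not token_to_char:
        # Nothing to unmangle: just encode each token.
        return {token_id: token.encode("utf-8") for token, token_id in vocab.items()}
    # Longest keys first, so longer mangled sequences take precedence.
    keys = sorted(token_to_char, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(key) for key in keys))
    return {
        token_id: pattern.sub(lambda m: token_to_char[m.group(0)], token.encode("utf-8"))
        for token, token_id in vocab.items()
    }


# Hypothetical three-token vocabulary.
vocab = {"\u2581Hello": 0, "<0x0A>": 1, "world": 2}
original = sketch_original_characters(
    vocab, [sketch_sentencepiece_processor, sketch_0xhh_processor]
)
print(original)
```

Because the result is keyed by token ID and holds raw bytes, tokens that are not valid UTF-8 on their own (such as a lone `<0xHH>` byte) can still be represented.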
formatron.integrations.utils.update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char)

formatron.integrations.utils.update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char)
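
For context on the GPT-2 style mangling that `update_vocab_dot_G` targets: byte-level tokenizers map each of the 256 byte values to a printable Unicode character, so a space appears as Ġ and a newline as Ċ. The sketch below rebuilds that well-known GPT-2 table and fills a `token_to_char` dictionary from it. The function names are hypothetical, and this is an illustration of the technique, not Formatron's implementation.

```python
import typing


def sketch_gpt2_bytes_to_unicode() -> typing.Dict[int, str]:
    # Bytes that are already printable keep their own code point.
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("\u00a1"), ord("\u00ac") + 1))
                 + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for value in range(256):
        if value not in printable:
            # Remaining bytes are shifted to code points starting at 256.
            byte_values.append(value)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def sketch_update_vocab_bytelevel(token_to_char: typing.Dict[bytes, bytes]) -> None:
    # Hypothetical processor: register every mangled character whose UTF-8
    # encoding differs from the original byte it stands for.
    for value, char in sketch_gpt2_bytes_to_unicode().items():
        mangled = char.encode("utf-8")
        original = bytes([value])
        if mangled != original:
            token_to_char[mangled] = original


token_to_char: typing.Dict[bytes, bytes] = {}
sketch_update_vocab_bytelevel(token_to_char)
space = token_to_char["\u0120".encode("utf-8")]    # Ġ
newline = token_to_char["\u010a".encode("utf-8")]  # Ċ
print(space, newline)
```

Printable ASCII characters map to themselves, so the sketch skips them and only registers entries that actually change the bytes.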