Formatron v0.4.9
Formatron empowers everyone to control the output format of language models with minimal overhead.
Functions
| Return type | Function | Description |
| bytes | _multiple_replace (typing.Dict[bytes, bytes] replacements, re.Pattern[bytes] regex, bytes text) | |
| typing.Dict[int, bytes] | get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors=None) | Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors. |
| typing.List[typing.Callable] | autodetect_processors (typing.Dict[str, int] vocab) | Autodetect vocabulary processors. |
| | update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char) | Vocabulary processor for <0xHH> tokens (used in llama tokenizers). |
| | update_vocab_sentencepiece (typing.Dict[bytes, bytes] token_to_char) | Vocabulary processor for the ▁ token (used in sentencepiece tokenizers). |
| | update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char) | Vocabulary processor for GPT2-style token mangling, such as from a space to Ġ (used in huggingface bytelevel preprocessors). |
| | _huggingface_bytelevel_decoder () | I hate legacy code. |
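For context on the GPT2-style mangling handled by update_vocab_dot_G: byte-level tokenizers map each of the 256 byte values to a printable Unicode character, so a space appears as Ġ and a newline as Ċ. The sketch below reproduces the widely used GPT-2 byte-to-character table and inverts it. It is an illustration only and not Formatron's implementation; the helper names `bytes_to_unicode` and `unmangle_bytelevel` are assumptions made for this example.

```python
from typing import Dict


def bytes_to_unicode() -> Dict[int, str]:
    """Build the GPT-2 style table: byte value -> printable character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Every remaining byte is shifted to the first unused code point >= 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def unmangle_bytelevel(token: str) -> bytes:
    """Hypothetical helper: turn a mangled token back into raw bytes."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    return bytes(char_to_byte[c] for c in token)


print(unmangle_bytelevel("Ġhello"))  # b' hello'
```

The table is a bijection over all 256 byte values, which is why the inverse lookup in `unmangle_bytelevel` can recover arbitrary bytes, including ones that are not valid UTF-8 on their own.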
typing.List[typing.Callable] formatron.integrations.utils.autodetect_processors (typing.Dict[str, int] vocab)
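The detection rules are not documented on this page. As a rough idea of what autodetection could look like, the sketch below scans a vocabulary for the three mangling styles listed above. It is a guess at the approach and not Formatron's actual logic: the function name `detect_mangling_styles` is hypothetical, and it returns style labels rather than the processor callables that autodetect_processors returns.

```python
import re
from typing import Dict, List

# Matches tokens such as "<0x0A>" that stand for a single raw byte.
_HEX_TOKEN = re.compile(r"<0x[0-9A-Fa-f]{2}>")


def detect_mangling_styles(vocab: Dict[str, int]) -> List[str]:
    """Hypothetical heuristic: report which mangling styles a vocab shows."""
    tokens = list(vocab)
    styles: List[str] = []
    if any(_HEX_TOKEN.fullmatch(t) for t in tokens):
        styles.append("0xHH")
    if any("\u2581" in t for t in tokens):  # U+2581 is the sentencepiece space marker
        styles.append("sentencepiece")
    if any("\u0120" in t for t in tokens):  # U+0120 is the GPT2 byte-level space
        styles.append("dot_G")
    return styles


print(detect_mangling_styles({"<0x0A>": 0, "\u2581the": 1}))  # ['0xHH', 'sentencepiece']
```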
typing.Dict[int, bytes] formatron.integrations.utils.get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors = None)
Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors.
| vocab | The mangled vocabulary. |
| processors | List of callables with signature (token_to_char: typing.Dict[bytes, bytes])->None. Callables can be used to "unmangle" encoded characters to original characters. If None, processors will be auto-detected. |
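To make the processor contract concrete, the following sketch shows two callables with the signature (token_to_char: typing.Dict[bytes, bytes])->None and a small driver that applies them to a mangled vocabulary. It is written under assumptions and does not import or reproduce Formatron's code: `add_hex_tokens`, `add_sentencepiece_space`, and `unmangle_vocab` are hypothetical names, and the single-pass regex substitution is only one plausible way to apply the replacements.

```python
import re
from typing import Callable, Dict, List

Processor = Callable[[Dict[bytes, bytes]], None]


def add_hex_tokens(token_to_char: Dict[bytes, bytes]) -> None:
    """Hypothetical processor: map b'<0xHH>' to the single byte 0xHH."""
    for b in range(256):
        token_to_char[f"<0x{b:02X}>".encode("utf-8")] = bytes([b])


def add_sentencepiece_space(token_to_char: Dict[bytes, bytes]) -> None:
    """Hypothetical processor: map the U+2581 marker back to a space."""
    token_to_char["\u2581".encode("utf-8")] = b" "


def unmangle_vocab(vocab: Dict[str, int], processors: List[Processor]) -> Dict[int, bytes]:
    """Apply every processor, then rewrite each token in one regex pass."""
    token_to_char: Dict[bytes, bytes] = {}
    for processor in processors:
        processor(token_to_char)  # each processor mutates the mapping in place
    if not token_to_char:
        return {token_id: token.encode("utf-8") for token, token_id in vocab.items()}
    # Longest keys first so a short key never shadows a longer one.
    keys = sorted(token_to_char, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(k) for k in keys))
    return {
        token_id: pattern.sub(lambda m: token_to_char[m.group(0)], token.encode("utf-8"))
        for token, token_id in vocab.items()
    }


vocab = {"\u2581Hello": 0, "<0x0A>": 1, "world": 2}
print(unmangle_vocab(vocab, [add_hex_tokens, add_sentencepiece_space]))
# {0: b' Hello', 1: b'\n', 2: b'world'}
```

Having processors fill a shared bytes-to-bytes mapping, instead of rewriting tokens themselves, lets several mangling schemes be combined and applied in a single substitution pass.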
formatron.integrations.utils.update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char)
formatron.integrations.utils.update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char)