Formatron v0.4.9
Formatron empowers everyone to control the output format of language models with minimal overhead.
Functions
bytes | _multiple_replace (typing.Dict[bytes, bytes] replacements, re.Pattern[bytes] regex, bytes text)
typing.Dict[int, bytes] | get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors=None)
    Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors.
typing.List[typing.Callable] | autodetect_processors (typing.Dict[str, int] vocab)
    Autodetect vocabulary processors.
update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char)
    Vocabulary processor for <0xHH> tokens (used in Llama tokenizers).
update_vocab_sentencepiece (typing.Dict[bytes, bytes] token_to_char)
    Vocabulary processor for the ▁ token (used in sentencepiece tokenizers).
update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char)
    Vocabulary processor for GPT2-style token mangling, such as a space becoming Ġ (used in Hugging Face byte-level preprocessors).
_huggingface_bytelevel_decoder ()
    I hate legacy code.
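To make the three kinds of mangling concrete, here is a minimal sketch of what processors with the documented signature (token_to_char: typing.Dict[bytes, bytes]) -> None could look like. The sketch_* names are hypothetical and the bodies are assumptions based on the publicly known token conventions (Llama byte-fallback tokens, the sentencepiece U+2581 marker, and the GPT-2 byte-to-unicode table); they are not taken from Formatron's source.

```python
import typing


def sketch_update_vocab_0xHH(token_to_char: typing.Dict[bytes, bytes]) -> None:
    """Map each "<0xHH>" spelling to the single raw byte it names (assumed uppercase hex)."""
    for value in range(256):
        token_to_char[f"<0x{value:02X}>".encode("utf-8")] = bytes([value])


def sketch_update_vocab_sentencepiece(token_to_char: typing.Dict[bytes, bytes]) -> None:
    """Map the sentencepiece word-boundary marker U+2581 to a plain space."""
    token_to_char["\u2581".encode("utf-8")] = b" "


def sketch_update_vocab_byte_level(token_to_char: typing.Dict[bytes, bytes]) -> None:
    """Invert the GPT-2 byte-to-unicode table, e.g. U+0120 back to a space."""
    # Bytes that GPT-2 keeps as the code point with the same number.
    printable = set(range(ord("!"), ord("~") + 1))
    printable |= set(range(0xA1, 0xAC + 1))
    printable |= set(range(0xAE, 0xFF + 1))
    next_code_point = 256  # every other byte is shifted to U+0100 and upward
    for value in range(256):
        if value in printable:
            mangled = chr(value)
        else:
            mangled = chr(next_code_point)
            next_code_point += 1
        encoded = mangled.encode("utf-8")
        # Record only the spellings that differ from the raw byte.
        if encoded != bytes([value]):
            token_to_char[encoded] = bytes([value])
```

Each function only fills in the shared token_to_char table, so several of them can be applied to the same dictionary one after another.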
typing.List[typing.Callable] formatron.integrations.utils.autodetect_processors (typing.Dict[str, int] vocab)
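The page does not describe how detection works. As a rough illustration only, one plausible heuristic is to look for the characteristic markers of each mangling scheme in the vocabulary. Everything in this sketch is an assumption: the function name, the 0.2 threshold, and the string labels it returns (the documented function returns callables instead).

```python
import typing


def sketch_detect_mangling(
    vocab: typing.Dict[str, int], threshold: float = 0.2
) -> typing.List[str]:
    """Return labels for the mangling schemes that appear to be present in vocab."""
    if not vocab:
        return []

    def share(marker: str) -> float:
        # Fraction of tokens that contain the marker character.
        return sum(marker in token for token in vocab) / len(vocab)

    detected = []
    # Byte-fallback tokens are spelled out literally, e.g. "<0x0A>".
    if any(
        len(token) == 6 and token.startswith("<0x") and token.endswith(">")
        for token in vocab
    ):
        detected.append("0xHH")
    # U+2581 marks word boundaries in sentencepiece vocabularies.
    if share("\u2581") > threshold:
        detected.append("sentencepiece")
    # U+0120 stands for a space in GPT-2 style byte-level vocabularies.
    if share("\u0120") > threshold:
        detected.append("dot_G")
    return detected
```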
typing.Dict[int, bytes] formatron.integrations.utils.get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors = None)
Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors.
Parameters
vocab | The mangled vocabulary.
processors | List of callables with signature (token_to_char: typing.Dict[bytes, bytes]) -> None. Each callable can be used to "unmangle" encoded characters back to the original characters. If None, processors are auto-detected.
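The flow described above can be sketched as follows: each processor fills a shared token_to_char table, and every token in the mangled vocabulary is then rewritten to raw bytes in a single regex pass. This is a sketch under stated assumptions, not Formatron's implementation. The names sketch_get_original_characters and unmangle_visible_space are hypothetical, the ␣ (U+2423) marker is an invented example of a custom mangling, and the sketch requires an explicit processor list instead of auto-detecting one.

```python
import re
import typing

Processor = typing.Callable[[typing.Dict[bytes, bytes]], None]


def unmangle_visible_space(token_to_char: typing.Dict[bytes, bytes]) -> None:
    """Hypothetical custom processor: a tokenizer that writes spaces as U+2423."""
    token_to_char["\u2423".encode("utf-8")] = b" "


def sketch_get_original_characters(
    vocab: typing.Dict[str, int], processors: typing.List[Processor]
) -> typing.Dict[int, bytes]:
    """Return {token id: raw bytes} after applying every processor's replacements."""
    token_to_char: typing.Dict[bytes, bytes] = {}
    for processor in processors:
        processor(token_to_char)
    if not token_to_char:
        # Nothing to unmangle: just encode each token as UTF-8.
        return {token_id: token.encode("utf-8") for token, token_id in vocab.items()}
    # Try longer spellings first so that they win over their own prefixes.
    keys = sorted(token_to_char, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(key) for key in keys))
    return {
        token_id: pattern.sub(
            lambda match: token_to_char[match.group(0)], token.encode("utf-8")
        )
        for token, token_id in vocab.items()
    }


original = sketch_get_original_characters(
    {"\u2423Hello": 0, "world": 1}, [unmangle_visible_space]
)
```

Compiling one alternation pattern and substituting through a lookup function keeps the rewrite to a single pass per token, regardless of how many replacements the processors registered.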
formatron.integrations.utils.update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char)
formatron.integrations.utils.update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char)