Formatron v0.4.9
Formatron empowers everyone to control the output format of language models with minimal overhead.
formatron.integrations.utils Namespace Reference

Functions

bytes _multiple_replace (typing.Dict[bytes, bytes] replacements, re.Pattern[bytes] regex, bytes text)
 
typing.Dict[int, bytes] get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors=None)
 Get a mapping from token ID to the token's original characters, unmangled to raw UTF-8 bytes by the provided processors.
 
typing.List[typing.Callable] autodetect_processors (typing.Dict[str, int] vocab)
 Autodetect vocabulary processors.
 
 update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char)
 Vocabulary processor for <0xHH> tokens (used in llama tokenizers)
 
 update_vocab_sentencepiece (typing.Dict[bytes, bytes] token_to_char)
 Vocabulary processor for ▁ token (used in sentencepiece tokenizers)
 
 update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char)
 Vocabulary processor for GPT2 style token mangling, such as from a space to Ġ or from \n to Ċ (used in huggingface bytelevel preprocessors)
 
 _huggingface_bytelevel_decoder ()
 I hate legacy code.
 

Function Documentation

◆ _huggingface_bytelevel_decoder()

formatron.integrations.utils._huggingface_bytelevel_decoder ( )
protected

I hate legacy code.

Definition at line 83 of file utils.py.

◆ _multiple_replace()

bytes formatron.integrations.utils._multiple_replace ( typing.Dict[bytes, bytes] replacements,
re.Pattern[bytes] regex,
bytes text )
protected

Definition at line 7 of file utils.py.
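
No description is given for this helper, but the signature suggests a single-pass, multi-pattern substitution: a compiled bytes regex finds each key, and the dictionary supplies its replacement. The following is a minimal sketch of that reading, not necessarily Formatron's implementation. The name `multiple_replace_sketch` and the way the regex is built are assumptions made for illustration.

```python
import re
import typing


def multiple_replace_sketch(replacements: typing.Dict[bytes, bytes],
                            regex: "re.Pattern[bytes]",
                            text: bytes) -> bytes:
    # Hypothetical helper: replace each regex match with its dictionary value.
    return regex.sub(lambda match: replacements[match.group()], text)


replacements = {"\u2581".encode("utf-8"): b" ", b"<0x0A>": b"\n"}
# Assumption: the regex is an alternation of the escaped keys, longest first,
# so a longer key is never shadowed by a shorter one that is its prefix.
keys = sorted(map(re.escape, replacements), key=len, reverse=True)
regex = re.compile(b"|".join(keys))

result = multiple_replace_sketch(replacements, regex, "\u2581Hello<0x0A>".encode("utf-8"))
```

Doing all replacements in one `regex.sub` pass means the output of one replacement is never rescanned and rewritten by another.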

◆ autodetect_processors()

typing.List[typing.Callable] formatron.integrations.utils.autodetect_processors ( typing.Dict[str, int] vocab)

Autodetect vocabulary processors.

Definition at line 41 of file utils.py.
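
This page does not say how detection works. One plausible approach, assumed here purely for illustration, is to look for marker characters in the vocabulary: byte-fallback tokens such as `<0x0A>`, the SentencePiece marker ▁ (U+2581), or the GPT-2 marker Ġ (U+0120). The function name, the 20% threshold, and the string labels below are all hypothetical. The real function returns callables rather than labels.

```python
import typing


def detect_mangling_sketch(vocab: typing.Dict[str, int]) -> typing.List[str]:
    # Hypothetical heuristic; the 0.2 threshold is an illustrative assumption.
    tokens = list(vocab)
    if not tokens:
        return []

    def share(marker: str) -> float:
        # Fraction of tokens that contain the marker character.
        return sum(marker in token for token in tokens) / len(tokens)

    detected = []
    # Byte-fallback tokens such as "<0x0A>" hint at a llama-style vocabulary.
    if any(len(t) == 6 and t.startswith("<0x") and t.endswith(">") for t in tokens):
        detected.append("0xHH")
    if share("\u2581") > 0.2:
        detected.append("sentencepiece")
    elif share("\u0120") > 0.2:
        detected.append("dot_G")
    return detected


llama_like = detect_mangling_sketch({"\u2581a": 0, "\u2581b": 1, "<0x0A>": 2, "c": 3})
gpt2_like = detect_mangling_sketch({"\u0120a": 0, "b": 1})
```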

◆ get_original_characters()

typing.Dict[int, bytes] formatron.integrations.utils.get_original_characters ( typing.Dict[str, int] vocab,
typing.Optional[list[typing.Callable]] processors = None )

Get a mapping from token ID to the token's original characters, unmangled to raw UTF-8 bytes by the provided processors.

Parameters
vocab: The mangled vocabulary.
processors: List of callables with signature (token_to_char: typing.Dict[bytes, bytes])->None. Callables can be used to "unmangle" encoded characters to original characters. If None, processors will be auto-detected.

Definition at line 20 of file utils.py.
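
The processor contract described above (each callable fills a shared `token_to_char` dictionary in place) can be sketched without importing Formatron. This is an illustrative approximation under stated assumptions, not the library's code: `original_characters_sketch` and `custom_processor` are hypothetical names, and the sketch requires explicit processors instead of auto-detecting them.

```python
import re
import typing

Processor = typing.Callable[[typing.Dict[bytes, bytes]], None]


def custom_processor(token_to_char: typing.Dict[bytes, bytes]) -> None:
    # Hypothetical processor: register two mangled sequences and their originals.
    token_to_char["\u2581".encode("utf-8")] = b" "
    token_to_char[b"<0x0A>"] = b"\n"


def original_characters_sketch(vocab: typing.Dict[str, int],
                               processors: typing.List[Processor]) -> typing.Dict[int, bytes]:
    # Let every processor register its mangled -> original byte mappings.
    token_to_char: typing.Dict[bytes, bytes] = {}
    for processor in processors:
        processor(token_to_char)
    if not token_to_char:
        # Nothing to unmangle: just encode each token as UTF-8.
        return {token_id: token.encode("utf-8") for token, token_id in vocab.items()}
    # One alternation over all mangled sequences, longest first.
    keys = sorted(map(re.escape, token_to_char), key=len, reverse=True)
    regex = re.compile(b"|".join(keys))
    return {
        token_id: regex.sub(lambda m: token_to_char[m.group()], token.encode("utf-8"))
        for token, token_id in vocab.items()
    }


vocab = {"\u2581Hello": 0, "<0x0A>": 1, "world": 2}
unmangled = original_characters_sketch(vocab, [custom_processor])
```

Note that the result is keyed by token ID, matching the `typing.Dict[int, bytes]` return type, while the input vocabulary is keyed by token string.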

◆ update_vocab_0xHH()

formatron.integrations.utils.update_vocab_0xHH ( typing.Dict[bytes, bytes] token_to_char)

Vocabulary processor for <0xHH> tokens (used in llama tokenizers)

Definition at line 58 of file utils.py.
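
A processor of this kind could register one entry per byte value, mapping the literal token text `<0xHH>` to the single byte `HH`. The sketch below assumes uppercase two-digit hex, as in tokens like `<0x0A>`. The function name is hypothetical and the body is an illustration, not the library's source.

```python
import typing


def update_vocab_0xHH_sketch(token_to_char: typing.Dict[bytes, bytes]) -> None:
    # Hypothetical processor: map each "<0xHH>" token to the single byte HH.
    for value in range(256):
        token_to_char[f"<0x{value:02X}>".encode("utf-8")] = bytes([value])


token_to_char: typing.Dict[bytes, bytes] = {}
update_vocab_0xHH_sketch(token_to_char)
```

Mapping to raw bytes rather than to characters matters here: a single `<0xHH>` token may carry only one byte of a multi-byte UTF-8 character.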

◆ update_vocab_dot_G()

formatron.integrations.utils.update_vocab_dot_G ( typing.Dict[bytes, bytes] token_to_char)

Vocabulary processor for GPT2 style token mangling, such as from a space to Ġ or from \n to Ċ (used in huggingface bytelevel preprocessors)

Definition at line 73 of file utils.py.
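
GPT-2 style byte-level mangling assigns every byte a printable Unicode character: printable non-space bytes keep their own code point, and the remaining bytes are shifted to code points starting at 256. A processor can undo this by registering the inverse table. The sketch below rebuilds that table from the published GPT-2 scheme. The function names are hypothetical, and whether Formatron builds its table this way is an assumption.

```python
import typing


def gpt2_byte_decoder_sketch() -> typing.Dict[bytes, bytes]:
    # Bytes that GPT-2 keeps as the same code point (printable, non-space).
    kept = (list(range(ord("!"), ord("~") + 1))
            + list(range(0xA1, 0xAC + 1))
            + list(range(0xAE, 0xFF + 1)))
    decoder: typing.Dict[bytes, bytes] = {}
    shifted = 0
    for byte in range(256):
        if byte in kept:
            code_point = byte
        else:
            # Remaining bytes are assigned code points 256, 257, ... in order.
            code_point = 256 + shifted
            shifted += 1
        # Key: UTF-8 bytes of the mangled character; value: the original byte.
        decoder[chr(code_point).encode("utf-8")] = bytes([byte])
    return decoder


def update_vocab_dot_G_sketch(token_to_char: typing.Dict[bytes, bytes]) -> None:
    # Hypothetical processor: register the whole inverse byte-level table.
    token_to_char.update(gpt2_byte_decoder_sketch())


decoder = gpt2_byte_decoder_sketch()
```

Under this scheme a space (0x20) becomes Ġ (U+0120) and a newline (0x0A) becomes Ċ (U+010A).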

◆ update_vocab_sentencepiece()

formatron.integrations.utils.update_vocab_sentencepiece ( typing.Dict[bytes, bytes] token_to_char)

Vocabulary processor for ▁ token (used in sentencepiece tokenizers)

Definition at line 66 of file utils.py.
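
The ▁ marker is U+2581 (LOWER ONE EIGHTH BLOCK), a three-byte sequence in UTF-8, not the ASCII underscore. A processor for it only needs to register that one sequence as a space. The sketch below is a hedged illustration with a hypothetical function name, not the library's source.

```python
import typing


def update_vocab_sentencepiece_sketch(token_to_char: typing.Dict[bytes, bytes]) -> None:
    # Hypothetical processor: register "▁" (U+2581) as a mangled space.
    token_to_char["\u2581".encode("utf-8")] = b" "


token_to_char: typing.Dict[bytes, bytes] = {}
update_vocab_sentencepiece_sketch(token_to_char)

marker = "\u2581".encode("utf-8")
# The ASCII underscore in "snake_case" is a different character and is left alone.
text = "\u2581snake_case".encode("utf-8").replace(marker, token_to_char[marker])
```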