Functions
bytes	_multiple_replace (typing.Dict[bytes, bytes] replacements, re.Pattern[bytes] regex, bytes text)

typing.Dict[int, bytes]	get_original_characters (typing.Dict[str, int] vocab, typing.Optional[list[typing.Callable]] processors=None)
	Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors.

typing.List[typing.Callable]	autodetect_processors (typing.Dict[str, int] vocab)
	Autodetect vocabulary processors.

	update_vocab_0xHH (typing.Dict[bytes, bytes] token_to_char)
	Vocabulary processor for <0xHH> byte tokens (used in Llama tokenizers).

	update_vocab_sentencepiece (typing.Dict[bytes, bytes] token_to_char)
	Vocabulary processor for the ▁ token (used in SentencePiece tokenizers).

	update_vocab_dot_G (typing.Dict[bytes, bytes] token_to_char)
	Vocabulary processor for GPT-2 style token mangling, such as space to Ġ and \n to Ċ (used in Hugging Face byte-level preprocessors).

	_huggingface_bytelevel_decoder ()
	I hate legacy code.

Function Documentation

formatron.integrations.utils._huggingface_bytelevel_decoder ( )

protected

I hate legacy code.

Definition at line 83 of file utils.py.

bytes formatron.integrations.utils._multiple_replace	(	typing.Dict[bytes, bytes]	replacements,
		re.Pattern[bytes]	regex,
		bytes	text )

protected

Definition at line 7 of file utils.py.
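
The signature suggests a single-pass, multi-pattern substitution: a precompiled bytes regex matches any key of `replacements`, and each match is swapped for its mapped value. A minimal sketch of that technique follows. The name `multiple_replace_sketch`, the sample table, and the longest-first alternation are assumptions made for illustration, not the library's code.

```python
import re
from typing import Dict


def multiple_replace_sketch(replacements: Dict[bytes, bytes],
                            regex: "re.Pattern[bytes]",
                            text: bytes) -> bytes:
    """Replace every regex match in `text` with its entry in `replacements`."""
    # Each match is looked up in the table, so one pass handles all keys.
    return regex.sub(lambda match: replacements[match.group(0)], text)


# Hypothetical table; longer keys come first so they win over shorter ones.
table = {b"<0x0A>": b"\n", "\u2581".encode("utf-8"): b" "}
keys = sorted(table, key=len, reverse=True)
pattern = re.compile(b"|".join(re.escape(key) for key in keys))

result = multiple_replace_sketch(table, pattern, "\u2581Hello<0x0A>".encode("utf-8"))
```

Compiling the pattern once and reusing it for every token avoids rebuilding the alternation for each call, which is presumably why the regex is passed in as a parameter.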

typing.List[typing.Callable] formatron.integrations.utils.autodetect_processors ( typing.Dict[str, int] vocab )

Autodetect vocabulary processors.

Definition at line 41 of file utils.py.
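
The documentation does not say how detection works. One plausible approach, sketched here purely as an assumption, is to look for each scheme's marker among the vocabulary keys: `<0xHH>` byte tokens, the `▁` prefix, or the `Ġ` prefix. The function name, the returned labels, and the 20% threshold are all hypothetical.

```python
import re
from typing import Dict, List

BYTE_TOKEN = re.compile(r"<0x[0-9A-Fa-f]{2}>")


def detect_mangling_sketch(vocab: Dict[str, int], threshold: float = 0.2) -> List[str]:
    """Return labels for the mangling schemes that seem present in `vocab`."""
    if not vocab:
        return []

    def share(marker: str) -> float:
        # Fraction of tokens that contain the marker character.
        return sum(marker in token for token in vocab) / len(vocab)

    found = []
    # Byte-fallback tokens are distinctive, so a single hit is enough.
    if any(BYTE_TOKEN.fullmatch(token) for token in vocab):
        found.append("0xHH")
    # Prefix markers should appear in a sizeable share of tokens.
    if share("\u2581") > threshold:
        found.append("sentencepiece")
    elif share("\u0120") > threshold:
        found.append("dot_G")
    return found


labels = detect_mangling_sketch({"\u2581the": 0, "\u2581a": 1, "<0x0A>": 2, "ing": 3})
```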

typing.Dict[int, bytes] formatron.integrations.utils.get_original_characters	(	typing.Dict[str, int]	vocab,
		typing.Optional[list[typing.Callable]]	processors = None )

Get a vocabulary of original characters unmangled to raw UTF-8 bytes by the provided processors.

Parameters

vocab	The mangled vocabulary.
processors	List of callables with signature (token_to_char: typing.Dict[bytes, bytes]) -> None. Each callable is used to "unmangle" encoded characters back to their original characters. If None, processors are auto-detected.

Definition at line 20 of file utils.py.
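
Going by the parameter description, each processor receives a shared `token_to_char` table and fills in its own mangled-to-original mappings in place. The sketch below shows a hypothetical custom processor with the documented signature and one way such a table could be applied to a vocabulary. `add_pipe_as_space` and `unmangle_sketch` are illustrative names, and the body is an assumption, not the library's implementation.

```python
import re
from typing import Callable, Dict, List


def add_pipe_as_space(token_to_char: Dict[bytes, bytes]) -> None:
    """Hypothetical processor: treat '|' as a mangled space."""
    token_to_char[b"|"] = b" "


def unmangle_sketch(vocab: Dict[str, int],
                    processors: List[Callable[[Dict[bytes, bytes]], None]]
                    ) -> Dict[int, bytes]:
    """Map each token id to its token bytes with all mangling undone."""
    token_to_char: Dict[bytes, bytes] = {}
    for processor in processors:
        processor(token_to_char)  # each processor mutates the shared table
    if not token_to_char:
        # Nothing to undo: just encode the tokens as UTF-8.
        return {token_id: token.encode("utf-8") for token, token_id in vocab.items()}
    keys = sorted(token_to_char, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(key) for key in keys))
    return {
        token_id: pattern.sub(lambda m: token_to_char[m.group(0)], token.encode("utf-8"))
        for token, token_id in vocab.items()
    }


original = unmangle_sketch({"|hello": 0, "world": 1}, [add_pipe_as_space])
```

Note that the result is keyed by token id rather than token text, matching the documented return type typing.Dict[int, bytes].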

formatron.integrations.utils.update_vocab_0xHH ( typing.Dict[bytes, bytes] token_to_char )

Vocabulary processor for <0xHH> byte tokens (used in Llama tokenizers).

Definition at line 58 of file utils.py.
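
Byte-fallback tokens such as `<0x0A>` each stand for a single raw byte. A processor for them could register all 256 spellings, as in this sketch. It assumes two uppercase hex digits per token, and `add_byte_tokens` is a hypothetical name.

```python
from typing import Dict


def add_byte_tokens(token_to_char: Dict[bytes, bytes]) -> None:
    """Register <0x00> .. <0xFF>, each mapped to the raw byte it names."""
    for value in range(256):
        token_to_char[f"<0x{value:02X}>".encode("utf-8")] = bytes([value])


byte_table: Dict[bytes, bytes] = {}
add_byte_tokens(byte_table)
```

With this table, the token `<0x0A>` resolves to the single byte 0x0A, a newline.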

formatron.integrations.utils.update_vocab_dot_G ( typing.Dict[bytes, bytes] token_to_char )

Vocabulary processor for GPT-2 style token mangling, such as space to Ġ and \n to Ċ (used in Hugging Face byte-level preprocessors).

Definition at line 73 of file utils.py.
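
GPT-2 style byte-level tokenizers map every byte to a printable Unicode character: printable bytes keep their own code point, and the rest are shifted to code points starting at U+0100. Inverting that mapping recovers the raw bytes. The sketch below rebuilds the inverse table along the lines of the well-known GPT-2 `bytes_to_unicode` scheme; `bytelevel_decoder_sketch` is a hypothetical name, and this is not claimed to be the library's code.

```python
from typing import Dict


def bytelevel_decoder_sketch() -> Dict[bytes, bytes]:
    """Map the UTF-8 form of each stand-in character to the byte it replaces."""
    # Printable bytes keep their own code point.
    kept = (list(range(ord("!"), ord("~") + 1))
            + list(range(0xA1, 0xAC + 1))
            + list(range(0xAE, 0xFF + 1)))
    byte_values = kept[:]
    code_points = kept[:]
    shifted = 0
    # Every other byte is shifted to the next free code point from U+0100.
    for value in range(256):
        if value not in kept:
            byte_values.append(value)
            code_points.append(256 + shifted)
            shifted += 1
    return {chr(cp).encode("utf-8"): bytes([b])
            for cp, b in zip(code_points, byte_values)}


decoder = bytelevel_decoder_sketch()
space_marker = "\u0120".encode("utf-8")    # Ġ stands for a space
newline_marker = "\u010a".encode("utf-8")  # Ċ stands for a newline
```

Under this scheme a token spelled `ĠHello` decodes to a space followed by `Hello`.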

formatron.integrations.utils.update_vocab_sentencepiece ( typing.Dict[bytes, bytes] token_to_char )

Vocabulary processor for the ▁ token (used in SentencePiece tokenizers).

Definition at line 66 of file utils.py.
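
SentencePiece writes a leading space as `▁` (U+2581), so undoing it takes a single table entry. A minimal sketch, with `add_sentencepiece_space` as a hypothetical name:

```python
from typing import Dict


def add_sentencepiece_space(token_to_char: Dict[bytes, bytes]) -> None:
    """Map the UTF-8 bytes of U+2581 to an ASCII space."""
    token_to_char["\u2581".encode("utf-8")] = b" "


sp_table: Dict[bytes, bytes] = {}
add_sentencepiece_space(sp_table)

# Apply the single mapping to one mangled token.
marker, space = next(iter(sp_table.items()))
decoded = "\u2581Hello".encode("utf-8").replace(marker, space)
```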