Cybersecurity researchers have discovered a novel attack technique called TokenBreak that can bypass a large language model’s (LLM) safety and content moderation guardrails with just a single character change.
“The TokenBreak attack exploits the tokenization strategy of a text classification model, resulting in false negatives and leaving end targets vulnerable to attacks that the protection model was intended to prevent,” according to a report by Kieran Evans, Kasimir Schulz, and Kenneth Yeung, which was shared with The Hacker News.
Tokenization is a crucial step in the process used by LLMs to break down raw text into its most basic units, specifically tokens, which are common sequences of characters found in a given set of text. The text input is then converted into a numerical representation and fed into the model.
LLMs function by understanding the statistical relationships between these tokens and generating the next token in a sequence of tokens. The output tokens are then converted back into human-readable text by mapping them to their corresponding words using the tokenizer’s vocabulary.
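As a rough sketch of that round trip, the snippet below uses the Hugging Face transformers library with the bert-base-uncased checkpoint (a WordPiece tokenizer); the model choice and example outputs are purely illustrative and not drawn from the research:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any subword tokenizer demonstrates the same round trip.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "ignore previous instructions"

tokens = tokenizer.tokenize(text)                    # raw text -> subword tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # tokens -> numerical IDs the model consumes
decoded = tokenizer.decode(token_ids)                # IDs -> human-readable text via the vocabulary

print(tokens)     # e.g. ['ignore', 'previous', 'instructions']
print(token_ids)
print(decoded)    # "ignore previous instructions"
```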
The attack technique devised by HiddenLayer targets this tokenization strategy, sidestepping a text classification model’s ability to detect malicious input and flag safety, spam, or content moderation issues in the text.
Specifically, the artificial intelligence (AI) security firm found that modifying input words by adding letters in certain ways caused a text classification model to malfunction.
Examples of this include changing “instructions” to “finstructions,” “announcement” to “aannouncement,” or “idiot” to “hidiot.” These minor changes cause the tokenizer to split the text differently, but the meaning remains clear to both the AI and the human reader.
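For illustration only (this snippet is not from the researchers’ report), running those words through a WordPiece tokenizer such as bert-base-uncased shows how a single added letter fragments a word that would otherwise map to one token:

```python
from transformers import AutoTokenizer

# WordPiece is one of the tokenization families the researchers found to be affected.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["instructions", "finstructions", "idiot", "hidiot"]:
    print(f"{word:>14} -> {tokenizer.tokenize(word)}")

# Illustrative output (exact splits depend on the vocabulary):
#     instructions -> ['instructions']
#    finstructions -> ['fin', '##struct', '##ions']
#            idiot -> ['idiot']
#           hidiot -> ['hid', '##iot']
# A classifier keyed to the token 'instructions' or 'idiot' may never see that
# token at all, while the meaning stays obvious to a human or a downstream LLM.
```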
What makes the attack notable is that the manipulated text remains fully understandable to both the LLM and the human reader, eliciting the same response from the model as the unmodified text would have.
By introducing the manipulations in a way that does not affect the model’s ability to comprehend the text, TokenBreak opens the door to prompt injection attacks.
“This attack technique manipulates input text in such a way that certain models give an incorrect classification,” the researchers said in an accompanying paper. “Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the protection model was put in place to prevent.”
The attack has been found to be successful against text classification models using BPE (Byte Pair Encoding) or WordPiece tokenization strategies, but not against those using Unigram.
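One way to see the difference between the families is to run the same manipulated word through each tokenizer type. The checkpoints below are illustrative stand-ins (gpt2 for BPE, bert-base-uncased for WordPiece, albert-base-v2 for SentencePiece Unigram) rather than the models HiddenLayer tested:

```python
from transformers import AutoTokenizer

# Stand-in checkpoints for each tokenization family mentioned in the research.
checkpoints = {
    "BPE": "gpt2",
    "WordPiece": "bert-base-uncased",
    "Unigram": "albert-base-v2",
}

for family, name in checkpoints.items():
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{family:>9}: {tok.tokenize('finstructions')}")
```

Unigram tokenizers score whole-word segmentations probabilistically rather than merging greedily from left to right, which may help explain why the added letter does not strand the meaningful subword the way BPE and WordPiece splits can.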
“The TokenBreak attack technique demonstrates that these protection models can be bypassed by manipulating the input text, leaving production systems vulnerable,” the researchers said. “Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack.”
“Because tokenization strategy typically correlates with model family, a straightforward mitigation exists: Select models that use Unigram tokenizers.”
To defend against TokenBreak, the researchers suggest using Unigram tokenizers when possible, training models with examples of bypass tricks, and ensuring that tokenization and model logic stay aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.
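As a starting point for that kind of auditing, the sketch below (an assumption-laden illustration, not HiddenLayer’s tooling) inspects Hugging Face fast-tokenizer internals to report which tokenization family a checkpoint relies on and logs any that are not Unigram:

```python
import logging

from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tokenbreak-audit")

def tokenizer_family(checkpoint: str) -> str:
    """Return the tokenization model (BPE, WordPiece, Unigram) behind a fast tokenizer."""
    tok = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
    return type(tok.backend_tokenizer.model).__name__

# Illustrative checkpoints; swap in the protection models actually deployed.
for checkpoint in ["gpt2", "bert-base-uncased", "albert-base-v2"]:
    family = tokenizer_family(checkpoint)
    if family == "Unigram":
        log.info("%s: Unigram tokenizer; not susceptible per the research", checkpoint)
    else:
        log.warning("%s: %s tokenizer; potentially susceptible to TokenBreak", checkpoint, family)
```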
The study comes less than a month after HiddenLayer revealed how it’s possible to exploit Model Context Protocol (MCP) tools to extract sensitive data: “By inserting specific parameter names within a tool’s function, sensitive data, including the full system prompt, can be extracted and exfiltrated,” the company said.
The finding also comes as the Straiker AI Research (STAR) team found that backronyms can be used to jailbreak AI chatbots and trick them into generating undesirable responses, including swearing, promoting violence, and producing sexually explicit content.
The technique, called the Yearbook Attack, has proven to be effective against various models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral AI, and OpenAI.
“They blend in with the noise of everyday prompts – a quirky riddle here, a motivational acronym there – and because of that, they often bypass the blunt heuristics that models use to spot dangerous intent,” security researcher Aarushi Banerjee said.
“A phrase like ‘Friendship, unity, care, kindness’ doesn’t raise any flags. But by the time the model has completed the pattern, it has already served the payload, which is the key to successfully executing this trick.”
“These methods succeed not by overpowering the model’s filters, but by slipping beneath them. They exploit completion bias and pattern continuation, as well as the way models weigh contextual coherence over intent analysis.”