
Researchers Develop Practical Method to Secure Large Language Models

Researchers at Anthropic, the company behind the Claude AI assistant, have developed an approach that they believe provides a practical, scalable method to make it harder for malicious actors to jailbreak or bypass the built-in safety mechanisms of a range of large language models (LLMs).

The Approach

The approach employs a set of natural language rules, or a "constitution," to define categories of permitted and disallowed content in an AI model's inputs and outputs, and then uses synthetic data generated from those rules to train classifiers that recognize and filter such content.
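To make the idea concrete, the sketch below shows (under assumptions, not Anthropic's actual pipeline) how a small "constitution" of natural language rules could be turned into synthetic labeled examples for classifier training. The rules, the `generate_text` callable, and all parameter names are illustrative stand-ins.

```python
# Minimal sketch: a "constitution" of natural-language rules used to prompt a
# helper model for synthetic, labeled training examples. `generate_text` is a
# hypothetical stand-in for any text-generation API; the rules are examples only.

from typing import Callable, List, Tuple

CONSTITUTION = {
    "allowed": [
        "General chemistry questions answered in standard textbooks.",
    ],
    "disallowed": [
        "Step-by-step instructions for producing dangerous agents.",
    ],
}

def build_synthetic_dataset(
    generate_text: Callable[[str], str],
    examples_per_rule: int = 50,
) -> List[Tuple[str, int]]:
    """Return (text, label) pairs; label 1 = disallowed, 0 = allowed."""
    dataset = []
    for label, rules in ((0, CONSTITUTION["allowed"]), (1, CONSTITUTION["disallowed"])):
        for rule in rules:
            for _ in range(examples_per_rule):
                prompt = (
                    "Write one realistic user request or model response that "
                    f"falls under this category: {rule}"
                )
                dataset.append((generate_text(prompt), label))
    return dataset
```

The resulting (text, label) pairs would then feed any standard text-classification training recipe; the point of the constitution is that the categories can be updated by editing the rules and regenerating the data rather than rewriting hard-coded filters.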

"Constitutional Classifiers" Anti-Jailbreak Technique

“Constitutional Classifiers” Anti-Jailbreak Technique

Researchers noted that their work focused on augmenting an LLM with classifiers that monitor the model's inputs and outputs and block potentially harmful content. Instead of relying on hard-coded static filtering, they wanted a system with a more sophisticated understanding of the model's guardrails that could act as a real-time filter on both incoming prompts and generated responses.
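A rough illustration of that wrapping step is sketched below, assuming trained `input_classifier` and `output_classifier` callables that score how likely a text is to be disallowed, plus a `model` callable that generates a response. The names, threshold, and refusal message are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of guarding an LLM with input and output classifiers.
# All callables and the 0.5 threshold are hypothetical placeholders.

from typing import Callable

REFUSAL = "I can't help with that request."

def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_classifier: Callable[[str], float],
    output_classifier: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    # Screen the incoming prompt before the model sees it.
    if input_classifier(prompt) >= threshold:
        return REFUSAL

    response = model(prompt)

    # Screen the generated text before it is returned to the user.
    if output_classifier(response) >= threshold:
        return REFUSAL

    return response
```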

Effectiveness of the Approach

"This simple approach is highly effective: in over 3,000 hours of human red teaming on a classifier-guarded system, we observed no successful universal jailbreaks in our target domain," the researchers wrote. The red-team tests involved the bug bounty hunters trying to obtain answers from Claude AI to a set of harmful questions involving CBRN risks, using thousands of known jailbreaking hacks.
