Prompt Engineer Challenges OpenAI’s o3-mini Model on Ethics and Safety
Just days after its release to the public, a prompt engineer has challenged the ethical and safety protections in OpenAI’s latest o3-mini model. The company unveiled o3 and its lightweight counterpart, o3-mini, on December 20, along with a new security feature called "deliberative alignment."
Deliberative Alignment: A Breakthrough in Safety
Deliberative alignment is a breakthrough in safety that achieves highly precise adherence to OpenAI’s safety policies, overcoming the vulnerabilities of its previous models to jailbreaks. According to OpenAI, deliberative alignment solves two issues: models must respond instantly without sufficient time to reason through complex scenarios, and they must infer desired behavior indirectly from labeled examples rather than learning the underlying safety standards in natural language.
A Vulnerability Revealed
Less than a week after its public debut, CyberArk principal vulnerability researcher Eran Shimony, got o3-mini to teach him how to write an exploit of the Local Security Authority Subsystem Service (lsass.exe), a critical Windows security process. Shimony has vetted the security of every popular LLM using his company’s open source (OSS) fuzzing tool, FuzzyAI.
Manipulating the Newest ChatGPT
Shimony acknowledges that o3 is a bit more robust in its guardrails, but he was able to exploit its long-held weakness by posing as an honest historian in search of educational information. In an exchange, Shimony aimed to get ChatGPT to generate malware, phrasing his prompt artfully to conceal its true intention. The deliberative alignment-powered ChatGPT then reasoned out its response.
An Easy Way to Improve o3
Shimony foresees an easy way for OpenAI to help its models better identify jailbreaking attempts. The more laborious solution involves training o3 on more of the types of malicious prompts it struggles with, and whipping it into shape with positive and negative reinforcement.
Implementing Robust Classifiers
An easier step would be to implement more robust classifiers for identifying malicious user inputs. Shimony thinks that this will solve roughly 95% of jailbreaking attempts and doesn’t take a lot of time to do.
A Call to Action
Dark Reading has reached out to OpenAI for comment on this story.
Source Link