Researchers Discover Two New Ways to Manipulate GitHub’s Copilot AI
Researchers have discovered two new ways to manipulate GitHub’s artificial intelligence (AI) coding assistant, Copilot, enabling the ability to bypass security restrictions and subscription fees, train malicious models, and more.
Method 1: Embedding Chat Interactions Inside Copilot Code
The first trick involves embedding chat interactions inside of Copilot code, taking advantage of the AI’s instinct to be helpful in order to get it to produce malicious outputs. This method allows users to bypass security restrictions and subscription fees, and can even be used to train malicious models.
Method 2: Rerouting Copilot Through a Proxy Server
The second method focuses on rerouting Copilot through a proxy server in order to communicate directly with the OpenAI models it integrates with. This allows users to manipulate the AI’s responses and bypass security restrictions.
Abusing the Design of Copilot
A "system prompt" is a set of instructions that defines the character of an AI — its constraints, what kinds of responses it should generate, etc. Copilot’s system prompt, for example, is designed to block various ways it might otherwise be used maliciously. However, by intercepting it en route to an LLM API, Shpigelman claims, "I can change the system prompt, so I won’t have to try so hard later to manipulate it. I can just modify the system prompt to give me harmful content, or even talk about something that is not related to code."
Lessons Learned
For Tomer Avni, co-founder and CPO of Apex, the lesson in both of these Copilot weaknesses "is not that GitHub isn’t trying to provide guardrails. But there is something about the nature of an LLM, that it can always be manipulated no matter how many guardrails you’re implementing. And that’s why we believe there needs to be an independent security layer on top of it that looks for these vulnerabilities."
Source Link