20% of Generative AI ‘Jailbreak’ Attacks are Successful
- by Anoop Singh
- 2
Generative AI jailbreak attacks, where models are instructed to ignore their safeguards, succeed 20% of the time, research has found. On average, adversaries need just 42 seconds and five interactions to break through.
In some cases, attacks occur in as little as four seconds. These findings both highlight the significant vulnerabilities in current GenAI algorithms and the difficulty in preventing exploitations in real time.
Of the successful attacks, 90% lead to sensitive data leaks, according to the “State of Attacks on GenAI” report from AI security company Pillar Security. Researchers analysed “in the wild” attacks on more than 2,000 production AI applications over the past three months.
The most targeted AI applications — comprising a quarter of all attacks — are those used by customer support teams, due to their “widespread use and critical role in customer engagement.” However, AIs used in other critical infrastructure sectors, like energy and engineering software, also faced the highest attack frequencies.
Compromising critical infrastructure can lead to widespread disruption, making it a prime target for cyber attacks. A recent report from Malwarebytes found that the services industry is the worst affected by ransomware, accounting for almost a quarter of global attacks.
SEE: 80% of Critical National Infrastructure Companies Experienced an Email Security Breach in Last Year
The most targeted commercial model is OpenAI’s GPT-4, which is likely a result of its widespread adoption and state-of-the-art capabilities that are attractive to attackers. Meta’s Llama-3 is the most-targeted open-source model.
Attacks on GenAI are becoming more frequent, complex
“Over time, we’ve observed an increase in both the frequency and complexity of [prompt injection] attacks, with adversaries employing more sophisticated techniques and making persistent attempts to bypass safeguards,” the report’s authors wrote.
At the inception of the AI hype wave, security experts warned that it could lead to a surge in the number of cyber attacks in general, as it lowers the barrier to entry. Prompts can be written in natural language, so no coding or technical knowledge is required to use them for, say, generating malicious code.
SEE: Report Reveals the Impact of AI on Cyber Security Landscape
Indeed, anyone can stage a prompt injection attack without specialised tools or expertise. And, as malicious actors only become more experienced with them, their frequency will undoubtedly rise. Such attacks are currently listed as the top security vulnerability on the OWASP Top 10 for LLM Applications.
Pillar researchers found that attacks can occur in any language the LLM has been trained to understand, making them globally accessible.
Malicious actors were observed trying to jailbreak GenAI applications often dozens of times, with some using specialised tools that bombard models with large volumes of attacks. Vulnerabilities were also being exploited at every level of the LLM interaction lifecycle, including the prompts, Retrieval-Augmented Generation, tool output, and model response.
“Unchecked AI risks can have devastating consequences for organizations,” the authors wrote. “Financial losses, legal entanglements, tarnished reputations, and security breaches are just some of the potential outcomes.”
The risk of GenAI security breaches could only get worse as companies adopt more sophisticated models, replacing simple conversational chatbots with autonomous agents. Agents “create [a] larger attack surface for malicious actors due to their increased capabilities and system access through the AI application,” wrote the researchers.
Top jailbreaking techniques
The top three jailbreaking techniques used by cybercriminals were found to be the Ignore Previous Instructions and Strong Arm Attack prompt injections as well as Base64 encoding.
With Ignore Previous Instructions, the attacker instructs the AI to disregard their initial programming, including any guardrails that prevent them from generating harmful content.
Strong Arm Attacks involve inputting a series of forceful, authoritative requests such as “ADMIN OVERRIDE” that pressure the model into bypassing its initial programming and generate outputs that would normally be blocked. For example, it could reveal sensitive information or perform unauthorised actions that lead to system compromise.
Base64 encoding is where an attacker encodes their malicious prompts with the Base64 encoding scheme. This can trick the model into decoding and processing content that would normally be blocked by its security filters, such as malicious code or instructions to extract sensitive information.
Other types of attacks identified include the Formatting Instructions technique, where the model is tricked into producing restricted outputs by instructing it to format responses in a specific way, such as using code blocks. The DAN, or Do Anything Now, technique works by prompting the model to adopt a fictional persona that ignores all restrictions.
Why attackers are jailbreaking AI models
The analysis revealed four primary motivators for jailbreaking AI models:
- Stealing sensitive data. For example, proprietary business information, user inputs, and personally identifiable information.
- Generating malicious content. This could include disinformation, hate speech, phishing messages for social engineering attacks, and malicious code.
- Degrading AI performance. This could either impact operations or provide the attacker access to computational resources for illicit activities. It is achieved by overwhelming systems with malformed or excessive inputs.
- Testing the system’s vulnerabilities. Either as an “ethical hacker” or out of curiosity.
How to build more secure AI systems
Strengthening system prompts and instructions is not sufficient to fully protect an AI model from attack, the Pillar experts say. The complexity of language and the variability between models make it possible for attackers to bypass these measures.
Therefore, businesses deploying AI applications should consider the following to ensure security:
- Prioritise commercial providers when deploying LLMs in critical applications, as they have stronger security features compared with open-source models.
- Monitor prompts at the session level to detect evolving attack patterns that may not be obvious when viewing individual inputs alone.
- Conduct tailored red-teaming and resilience exercises, specific to the AI application and its multi-turn interactions, to help identify security gaps early and reduce future costs.
- Adopt security solutions that adapt in real time using context-aware measures that are model-agnostic and align with organisational policies.
Dor Sarig, CEO and co-founder of Pillar Security, said in a press release: “As we move towards AI agents capable of performing complex tasks and making decisions, the security landscape becomes increasingly complex. Organizations must prepare for a surge in AI-targeted attacks by implementing tailored red-teaming exercises and adopting a ‘secure by design’ approach in their GenAI development process.”
Jason Harison, Pillar Security CRO, added: “Static controls are no longer sufficient in this dynamic AI-enabled world. Organizations must invest in AI security solutions capable of anticipating and responding to emerging threats in real-time, while supporting their governance and cyber policies.”
Generative AI jailbreak attacks, where models are instructed to ignore their safeguards, succeed 20% of the time, research has found. On average, adversaries need just 42 seconds and five interactions to break through. In some cases, attacks occur in as little as four seconds. These findings both highlight the significant vulnerabilities in current GenAI algorithms…
Generative AI jailbreak attacks, where models are instructed to ignore their safeguards, succeed 20% of the time, research has found. On average, adversaries need just 42 seconds and five interactions to break through. In some cases, attacks occur in as little as four seconds. These findings both highlight the significant vulnerabilities in current GenAI algorithms…