// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM
Jailbreaking
Tricking an AI, especially a chatbot, into bypassing its safety rules and generating forbidden content or actions.
TECHNICAL DEFINITION
Jailbreaking refers to the act of crafting specific prompts or input sequences to circumvent the safety mechanisms and ethical guardrails of a large language model (LLM), inducing it to generate harmful, unethical, or restricted content it was designed to refuse.
BACKGROUND
Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behaviour. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.
READ MORE ON WIKIPEDIASYNONYMS & ALIASES
- Prompt bypass
- Safety bypass
- Guardrail circumvention
- Model subversion
USAGE NOTE
Researchers constantly develop new methods to prevent jailbreaking attempts on LLMs.
DEVELOPERS
Organizations developing technology related to Jailbreaking.
Develops large language models (LLMs) and actively researches and implements safety measures, red teaming, and alignment techniques to make their models, like ChatGPT, more resistant to adversarial prompt engineering, including jailbreaking attempts.
A leading AI safety company known for developing Constitutional AI, a method designed to train AI models (like Claude) to be less susceptible to harmful instructions and adversarial prompts, thereby directly addressing jailbreaking vulnerabilities.
Conducts extensive research in AI safety and responsible AI development for its advanced language models. They employ red teaming and adversarial testing to identify and mitigate vulnerabilities like jailbreaking and prompt injection attacks.
Through its Responsible AI initiatives and AI platform development (Azure AI, Copilot), Microsoft invests significantly in developing robust defenses and safety mechanisms to protect its AI models from various adversarial attacks, including prompt-based manipulation and jailbreaking.
This company offers AI firewall and security solutions specifically designed to protect AI systems from adversarial attacks, data poisoning, and model manipulation, which includes mitigating the risks associated with prompt injection and jailbreaking.
Develops frameworks like the Adversarial ML Threat Matrix (ATT&CK for ML) which categorizes and describes adversarial tactics, including prompt injection and manipulation techniques used in AI jailbreaking, helping organizations understand and defend against them.
Engages in research and development of large language models (e.g., Llama family) and actively explores techniques for improving their safety, robustness, and alignment, including methods to prevent and detect adversarial prompts and jailbreaking attempts.