// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Jailbreaking

Tricking an AI, especially a chatbot, into bypassing its safety rules and generating forbidden content or actions.

TECHNICAL DEFINITION

Jailbreaking refers to the act of crafting specific prompts or input sequences to circumvent the safety mechanisms and ethical guardrails of a large language model (LLM), inducing it to generate harmful, unethical, or restricted content it was designed to refuse.

BACKGROUND

Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behaviour. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.

READ MORE ON WIKIPEDIA

SYNONYMS & ALIASES

  • Prompt bypass
  • Safety bypass
  • Guardrail circumvention
  • Model subversion

USAGE NOTE

Researchers constantly develop new methods to prevent jailbreaking attempts on LLMs.

DEVELOPERS

Organizations developing technology related to Jailbreaking.

  • OpenAI

    Develops large language models (LLMs) and actively researches and implements safety measures, red teaming, and alignment techniques to make their models, like ChatGPT, more resistant to adversarial prompt engineering, including jailbreaking attempts.

  • Anthropic

    A leading AI safety company known for developing Constitutional AI, a method designed to train AI models (like Claude) to be less susceptible to harmful instructions and adversarial prompts, thereby directly addressing jailbreaking vulnerabilities.

  • Google DeepMind

    Conducts extensive research in AI safety and responsible AI development for its advanced language models. They employ red teaming and adversarial testing to identify and mitigate vulnerabilities like jailbreaking and prompt injection attacks.

  • Microsoft

    Through its Responsible AI initiatives and AI platform development (Azure AI, Copilot), Microsoft invests significantly in developing robust defenses and safety mechanisms to protect its AI models from various adversarial attacks, including prompt-based manipulation and jailbreaking.

  • Robust Intelligence

    This company offers AI firewall and security solutions specifically designed to protect AI systems from adversarial attacks, data poisoning, and model manipulation, which includes mitigating the risks associated with prompt injection and jailbreaking.

  • MITRE Corporation

    Develops frameworks like the Adversarial ML Threat Matrix (ATT&CK for ML) which categorizes and describes adversarial tactics, including prompt injection and manipulation techniques used in AI jailbreaking, helping organizations understand and defend against them.

  • Meta AI

    Engages in research and development of large language models (e.g., Llama family) and actively explores techniques for improving their safety, robustness, and alignment, including methods to prevent and detect adversarial prompts and jailbreaking attempts.

RELATED TERMS IN AI ETHICS & SAFETY