// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Jailbreak

A technique used to bypass an AI model's safety filters or ethical guidelines, often to make it generate content it was designed to refuse.

TECHNICAL DEFINITION

A jailbreak is an adversarial prompting technique designed to circumvent the safety mechanisms and ethical alignment policies of large language models (LLMs), inducing them to generate prohibited or harmful content.

BACKGROUND

Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behaviour. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.

READ MORE ON WIKIPEDIA

SYNONYMS & ALIASES

  • Prompt injection (malicious)
  • Safety bypass
  • Filter evasion
  • Adversarial prompting

USAGE NOTE

AI developers constantly work to patch jailbreak vulnerabilities to maintain model safety and integrity.

DEVELOPERS

Organizations developing technology related to Jailbreak.

  • OpenAI

    A leading AI research and deployment company that develops large language models like GPT and is heavily invested in AI safety, alignment, and red-teaming efforts to understand and prevent 'jailbreaking' of their models.

  • Google DeepMind

    A global leader in AI research and development, Google DeepMind (and Google AI) focuses on building advanced AI systems and conducts extensive research into AI safety, responsible AI, and mitigating adversarial attacks, including techniques to make models robust against jailbreaks.

  • Anthropic

    An AI safety and research company known for developing 'Constitutional AI' as a method to align AI models with human values and make them more resistant to harmful outputs and jailbreaking techniques.

  • Microsoft

    Through Microsoft Research and Azure AI, the company integrates advanced AI into its products and conducts significant research into responsible AI, model security, and robustness against adversarial prompt engineering and jailbreak attempts.

  • Meta AI (Facebook AI Research - FAIR)

    Meta AI develops and open-sources large language models (e.g., Llama) and actively researches methods to improve their safety, robustness, and resistance to misuse, including various forms of adversarial prompting and jailbreaking.

  • AI Safety Institute (e.g., UK AI Safety Institute)

    Government-backed organizations dedicated to advanced AI safety research. Their mandate includes understanding, evaluating, and mitigating frontier AI risks, which directly involves studying and developing defenses against model jailbreaks and adversarial misuse.

  • Stanford University (e.g., Center for Research on Foundation Models - CRFM)

    Academic research groups like Stanford's CRFM and Institute for Human-Centered AI (HAI) conduct cutting-edge research into the capabilities, risks, and safety of foundation models, including deep dives into prompt engineering, adversarial attacks, and jailbreaking methodologies to develop mitigation strategies.

RELATED TERMS IN PROMPTING & LOGIC