// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Safety Training

The process of teaching an AI system to avoid generating harmful, biased, or unsafe content.

TECHNICAL DEFINITION

Safety training involves fine-tuning or pre-training AI models, especially LLMs, with curated datasets and reinforcement learning from human feedback (RLHF) to instill ethical guidelines, reduce bias, and prevent the generation of harmful, toxic, or misleading content.

BACKGROUND

Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behaviour. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.

READ MORE ON WIKIPEDIA

SYNONYMS & ALIASES

  • Safety fine-tuning
  • Ethical training
  • Guardrail training
  • Responsible AI training

USAGE NOTE

Safety training is an ongoing effort to make AI models more responsible and less prone to misuse.

DEVELOPERS

Organizations developing technology related to Safety Training.

  • Anthropic

    An AI safety and research company focused on building reliable, interpretable, and steerable AI systems. They developed the 'Constitutional AI' technique, a method for training AI models to adhere to a set of principles or a 'constitution,' reducing the need for extensive human feedback to police harmful outputs.

  • OpenAI

    A major AI research and deployment company that heavily utilizes safety training techniques like Reinforcement Learning from Human Feedback (RLHF) and extensive red teaming to align its models, such as GPT-4, with human values and reduce harmful or biased outputs.

  • Google DeepMind

    The consolidated AI division at Google, responsible for developing models like Gemini. They have extensive research teams dedicated to AI safety, alignment, and ethics, developing scalable oversight methods and safety filters that are integrated into their foundational models.

  • Scale AI

    A data platform that provides high-quality training data for AI applications. They offer services specifically for safety training, including data annotation for fine-tuning, human feedback for RLHF, and managed red teaming services to identify and mitigate model vulnerabilities before deployment.

  • Robust Intelligence

    An AI security company that provides a platform for testing and validating AI models against security, ethical, and operational risks. Their technology includes automated red teaming and continuous validation to ensure models are safe and robust after deployment.

  • Credo AI

    An AI governance platform that helps organizations operationalize responsible AI. Their software allows companies to measure, manage, and report on AI risks related to fairness, transparency, and safety, ensuring models comply with internal policies and external regulations.

  • Meta AI

    The artificial intelligence laboratory of Meta Platforms. In developing their open-source Llama models, they have invested in safety-specific tuning, including techniques to reduce toxicity and bias, and have published extensive research on responsible AI practices and safety evaluations.

  • UK AI Safety Institute

    A government-backed organization, the first of its kind, focused on advancing AI safety for the public interest. It is tasked with testing the safety of advanced AI models, developing new evaluation techniques, and conducting foundational safety research.

RELATED TERMS IN AI ETHICS & SAFETY