// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM
Safety Training
The process of teaching an AI system to avoid generating harmful, biased, or unsafe content.
TECHNICAL DEFINITION
Safety training involves fine-tuning or pre-training AI models, especially LLMs, with curated datasets and reinforcement learning from human feedback (RLHF) to instill ethical guidelines, reduce bias, and prevent the generation of harmful, toxic, or misleading content.
BACKGROUND
Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behaviour. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.
READ MORE ON WIKIPEDIASYNONYMS & ALIASES
- Safety fine-tuning
- Ethical training
- Guardrail training
- Responsible AI training
USAGE NOTE
Safety training is an ongoing effort to make AI models more responsible and less prone to misuse.
DEVELOPERS
Organizations developing technology related to Safety Training.
An AI safety and research company focused on building reliable, interpretable, and steerable AI systems. They developed the 'Constitutional AI' technique, a method for training AI models to adhere to a set of principles or a 'constitution,' reducing the need for extensive human feedback to police harmful outputs.
A major AI research and deployment company that heavily utilizes safety training techniques like Reinforcement Learning from Human Feedback (RLHF) and extensive red teaming to align its models, such as GPT-4, with human values and reduce harmful or biased outputs.
The consolidated AI division at Google, responsible for developing models like Gemini. They have extensive research teams dedicated to AI safety, alignment, and ethics, developing scalable oversight methods and safety filters that are integrated into their foundational models.
A data platform that provides high-quality training data for AI applications. They offer services specifically for safety training, including data annotation for fine-tuning, human feedback for RLHF, and managed red teaming services to identify and mitigate model vulnerabilities before deployment.
An AI security company that provides a platform for testing and validating AI models against security, ethical, and operational risks. Their technology includes automated red teaming and continuous validation to ensure models are safe and robust after deployment.
An AI governance platform that helps organizations operationalize responsible AI. Their software allows companies to measure, manage, and report on AI risks related to fairness, transparency, and safety, ensuring models comply with internal policies and external regulations.
The artificial intelligence laboratory of Meta Platforms. In developing their open-source Llama models, they have invested in safety-specific tuning, including techniques to reduce toxicity and bias, and have published extensive research on responsible AI practices and safety evaluations.
A government-backed organization, the first of its kind, focused on advancing AI safety for the public interest. It is tasked with testing the safety of advanced AI models, developing new evaluation techniques, and conducting foundational safety research.