// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM
Content Filtering
Automatically detecting and blocking inappropriate or unwanted content generated by or fed into an AI system.
TECHNICAL DEFINITION
Content filtering utilizes machine learning models, often separate from the primary AI, to identify and block or flag specific types of undesirable content (e.g., hate speech, explicit material, spam) in inputs or outputs of AI systems, like LLMs or image generators.
BACKGROUND
Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behaviour. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.
READ MORE ON WIKIPEDIASYNONYMS & ALIASES
- Content moderation (automated)
- Output filtering
- Input sanitization
- Harmful content detection
USAGE NOTE
Content filtering is a common technique to ensure AI outputs adhere to platform policies and legal standards.
DEVELOPERS
Organizations developing technology related to Content Filtering.
Develops a Moderation API that checks content against its safety policies, classifying text into categories like hate, self-harm, and violence. It's a key tool for developers to filter prompts and model outputs.
Develops the Perspective API, which uses machine learning models to score the perceived impact a comment might have on a conversation. It helps developers and publishers detect abusive comments and toxic language.
Offers Azure AI Content Safety, a service that detects and filters harmful user-generated and AI-generated content in applications. It analyzes text and images for sexual content, violence, hate, and self-harm.
Pioneered the 'Constitutional AI' approach, a method for training AI systems to adhere to a set of principles or a 'constitution.' This technique is an advanced form of built-in content filtering to prevent harmful or unethical outputs.
Provides enterprise-grade AI models for content moderation. Their APIs can classify text, images, video, and audio for a wide range of harmful and unwanted content classes, enabling platforms to filter content at scale.
As a provider of large language models, Cohere has built-in safety and content moderation features to prevent the generation of harmful content. They provide tools for developers to ensure responsible AI use.
Offers solutions for AI safety and alignment, including data annotation and model evaluation to identify and filter out unsafe, biased, or toxic content. Their services help fine-tune models to adhere to specific content policies.
Provides a Trust & Safety platform that uses AI to detect and moderate harmful content in real-time. The technology is designed to filter a wide spectrum of violations, including hate speech, disinformation, and graphic content.