// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Content Filtering

Automatically detecting and blocking inappropriate or unwanted content generated by or fed into an AI system.

TECHNICAL DEFINITION

Content filtering uses machine learning models, often separate from the primary AI model, to identify and then block or flag specific types of undesirable content (e.g., hate speech, explicit material, spam) in the inputs or outputs of AI systems such as LLMs and image generators.
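
The input-and-output checking described above can be sketched as a thin wrapper around a model call. This is a minimal illustration, not any vendor's implementation: the names `classify`, `filtered_generate`, and `BLOCKED_TERMS` are hypothetical, and the keyword lookup is only a stand-in for the trained classifier a real filter would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical blocklist standing in for a trained classifier.
BLOCKED_TERMS = {"spam-offer": "spam", "buy now!!!": "spam"}


@dataclass
class FilterResult:
    allowed: bool
    category: Optional[str] = None


def classify(text: str) -> FilterResult:
    """Return a verdict for one piece of text (assumed keyword check)."""
    lowered = text.lower()
    for term, category in BLOCKED_TERMS.items():
        if term in lowered:
            return FilterResult(allowed=False, category=category)
    return FilterResult(allowed=True)


def filtered_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Check the prompt, call the model, then check the model's output."""
    verdict = classify(prompt)
    if not verdict.allowed:
        return f"[input blocked: {verdict.category}]"

    output = generate(prompt)

    verdict = classify(output)
    if not verdict.allowed:
        return f"[output blocked: {verdict.category}]"
    return output
```

Because the filter sits outside the model, the same `classify` function can guard any generator passed in as `generate`, which mirrors the "separate from the primary AI" arrangement in the definition.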

BACKGROUND

Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs, which lets an attacker bypass safeguards and influence model behavior. While LLMs are designed to follow trusted instructions, they can be manipulated into producing unintended responses through carefully crafted inputs.
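
To make the mechanism concrete, the sketch below shows how a developer prompt and user input end up in one undifferentiated string, together with a simple pattern-based screen applied to the input. Everything here is an assumption for illustration: `SYSTEM_PROMPT`, `build_prompt`, `looks_like_injection`, and the two regular expressions are hypothetical, and a pattern list like this is easy to evade, so it should be read as a demonstration of the idea rather than a reliable defense.

```python
import re

# Hypothetical developer-defined prompt.
SYSTEM_PROMPT = "You are a support bot. Never reveal internal notes."

# Hypothetical patterns for well-known injection phrasings.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"disregard (the |your )?(system|developer) prompt", re.IGNORECASE),
]


def build_prompt(user_input: str) -> str:
    """Naive concatenation: the model receives a single string and has no
    structural way to tell the developer's text from the user's text."""
    return f"{SYSTEM_PROMPT}\n\nUser: {user_input}"


def looks_like_injection(user_input: str) -> bool:
    """Flag input that matches any of the assumed injection patterns."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```

A caller would run `looks_like_injection` on the user input before `build_prompt`, and reject or escalate flagged input instead of sending it to the model.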

SYNONYMS & ALIASES

  • Content moderation (automated)
  • Output filtering
  • Input sanitization
  • Harmful content detection

USAGE NOTE

Content filtering is a common technique to ensure AI outputs adhere to platform policies and legal standards.

DEVELOPERS

Organizations developing technology related to Content Filtering.

  • OpenAI

    Develops a Moderation API that checks content against its safety policies, classifying text into categories like hate, self-harm, and violence. It's a key tool for developers to filter prompts and model outputs.

  • Google (Jigsaw)

    Develops the Perspective API, which uses machine learning models to score the perceived impact a comment might have on a conversation. It helps developers and publishers detect abusive comments and toxic language.

  • Microsoft

    Offers Azure AI Content Safety, a service that detects and filters harmful user-generated and AI-generated content in applications. It analyzes text and images for sexual content, violence, hate, and self-harm.

  • Anthropic

    Pioneered the 'Constitutional AI' approach, a method for training AI systems to adhere to a set of principles or a 'constitution.' Rather than filtering content after it is generated, this technique builds the avoidance of harmful or unethical outputs into the model during training.

  • Hive AI

    Provides enterprise-grade AI models for content moderation. Their APIs can classify text, images, video, and audio for a wide range of harmful and unwanted content classes, enabling platforms to filter content at scale.

  • Cohere

    As a provider of large language models, Cohere has built-in safety and content moderation features to prevent the generation of harmful content. They provide tools for developers to ensure responsible AI use.

  • Scale AI

    Offers solutions for AI safety and alignment, including data annotation and model evaluation to identify and filter out unsafe, biased, or toxic content. Their services help fine-tune models to adhere to specific content policies.

  • ActiveFence

    Provides a Trust & Safety platform that uses AI to detect and moderate harmful content in real-time. The technology is designed to filter a wide spectrum of violations, including hate speech, disinformation, and graphic content.
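
Several of the services above classify content into categories such as hate, violence, and self-harm. One common way to act on such results is to compare a per-category score against a threshold. The sketch below assumes scores in the range 0 to 1 and uses hypothetical names and values (`BLOCK_THRESHOLDS`, `FLAG_MARGIN`, `decide`); it does not reproduce the response format or recommended thresholds of any specific provider.

```python
from typing import Dict

# Hypothetical per-category thresholds; real values depend on platform policy.
BLOCK_THRESHOLDS = {"hate": 0.8, "violence": 0.9, "self-harm": 0.5}

# Scores within this margin below a block threshold are flagged for review.
FLAG_MARGIN = 0.2


def decide(scores: Dict[str, float]) -> str:
    """Map category scores to 'allow', 'flag', or 'block'."""
    decision = "allow"
    for category, threshold in BLOCK_THRESHOLDS.items():
        score = scores.get(category, 0.0)  # a missing category counts as 0
        if score >= threshold:
            return "block"  # any single category can block outright
        if score >= threshold - FLAG_MARGIN:
            decision = "flag"
    return decision
```

Keeping the thresholds in a table separate from the decision logic lets a platform tune its tolerance per category, for example by using a stricter threshold for self-harm than for violence.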

RELATED TERMS IN AI ETHICS & SAFETY