// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Multi-Modal

AI models that can understand and generate content across different types of data, such as text, images, audio, and video, not just one type.

TECHNICAL DEFINITION

Multi-modal AI models can process, interpret, and generate information across multiple distinct modalities, such as text, images, audio, and video, so that a single model can relate and act on diverse data types together rather than handling each in isolation.
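
As an illustration of the definition above, a multi-modal request can be modeled as an ordered list of typed parts, each routed to a modality-specific encoder before the model reasons over them together. The Python sketch below is a toy: the `Part` class, the `ENCODERS` table, and the stand-in encoder functions are hypothetical names invented for this example and do not correspond to any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Part:
    """One piece of a multi-modal input (hypothetical structure)."""
    modality: str   # e.g. "text", "image", "audio"
    payload: bytes  # raw content for that modality


# Stand-in encoders. Real systems use learned networks; these only
# summarize the payload so the routing logic can run end to end.
def encode_text(payload: bytes) -> List[float]:
    return [float(len(payload.split()))]  # word count


def encode_image(payload: bytes) -> List[float]:
    return [float(len(payload))]  # byte count


ENCODERS: Dict[str, Callable[[bytes], List[float]]] = {
    "text": encode_text,
    "image": encode_image,
}


def encode_request(parts: List[Part]) -> List[List[float]]:
    """Route each part to the encoder for its modality, preserving order."""
    encoded = []
    for part in parts:
        if part.modality not in ENCODERS:
            raise ValueError(f"unsupported modality: {part.modality}")
        encoded.append(ENCODERS[part.modality](part.payload))
    return encoded


request = [
    Part("text", b"What is shown in this picture?"),
    Part("image", b"\x89PNG-placeholder-bytes"),
]
features = encode_request(request)
```

In a real model the per-modality features would be projected into a shared representation and passed to a common backbone; the sketch stops at the routing step because that is the part the definition describes.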

BACKGROUND

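Early machine learning systems were typically built for a single modality, for example a language model for text or a convolutional network for images. Multi-modal models instead map several input types into a shared representation, so that one model can relate, say, a caption to the image it describes. Accepting more input types also widens the attack surface. Prompt injection is a cybersecurity exploit in which innocuous-looking inputs are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behavior. In a multi-modal system, such crafted instructions can arrive inside an image or an audio clip as well as in text.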

SYNONYMS & ALIASES

Cross-modal AI
Multi-sensory AI
Unified AI

USAGE NOTE

Multi-modal models are advancing applications like image captioning, video summarization, and AI assistants that can see and hear.

DEVELOPERS

Organizations developing multi-modal technology.

OpenAI
Developer of the GPT series of models, including GPT-4o, which has advanced multi-modal capabilities, processing text, audio, and image inputs and outputs. Also created Sora, a text-to-video model, and DALL-E 3 for text-to-image generation.
Google
Creator of the Gemini family of models (Pro, Ultra, Flash), which are natively multi-modal and designed to understand and reason across text, images, video, audio, and code. This technology powers products like Google's AI Overviews and Vertex AI.
Meta AI
Conducts research and develops multi-modal models like ImageBind, which learns a joint embedding across six modalities. Their Llama family of models also incorporates vision capabilities, and they are a key developer of models for AR/VR applications.
Anthropic
Developer of the Claude family of AI models. Their Claude 3 models possess strong vision capabilities, allowing them to analyze and interpret images, charts, graphs, and documents.
Runway
An applied AI research company focused on creative tools. They are known for their text-to-video and video-to-video models, such as Gen-2, which enable users to generate video content from text prompts or existing images.
Stability AI
Known for developing the open-source Stable Diffusion text-to-image model. The company also develops models for other modalities, including audio (Stable Audio) and video, contributing to the open multi-modal ecosystem.
Microsoft
Integrates multi-modal capabilities across its product suite, including Azure AI and Microsoft 365 Copilot. They conduct their own research, developing models like Kosmos-2, a Multimodal Large Language Model (MLLM) capable of grounding text to visual elements.
Midjourney
An independent research lab that produces a proprietary AI program for creating detailed, artistic images from textual descriptions. It is among the most widely used text-to-image generation services.
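
The joint embedding idea mentioned in the Meta AI entry can be sketched in a few lines: once items from different modalities live in one vector space, cross-modal retrieval reduces to a nearest-neighbor search. The vectors below are hand-written toy values, not outputs of ImageBind or any real model, and `cosine` and `nearest` are hypothetical helper names used only for this illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy shared embedding space, keyed by (modality, label).
# Values are made up so that related items point in similar directions.
EMBEDDINGS: Dict[Tuple[str, str], List[float]] = {
    ("image", "dog photo"): [0.9, 0.1, 0.0],
    ("audio", "barking clip"): [0.8, 0.2, 0.1],
    ("image", "car photo"): [0.0, 0.1, 0.9],
    ("audio", "engine clip"): [0.1, 0.0, 0.8],
}


def nearest(query: List[float], modality: str) -> str:
    """Return the label of the closest stored item in the given modality."""
    candidates = [
        (label, cosine(query, vector))
        for (item_modality, label), vector in EMBEDDINGS.items()
        if item_modality == modality
    ]
    return max(candidates, key=lambda pair: pair[1])[0]


# Toy embedding standing in for the text query "a dog".
text_query = [1.0, 0.0, 0.0]
best_audio = nearest(text_query, "audio")
best_image = nearest(text_query, "image")
```

With these toy values the text query retrieves the barking clip among audio items and the dog photo among images, which is the behavior a shared embedding space is meant to enable.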
