// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Vision Transformer

A Vision Transformer (ViT) applies the Transformer architecture, originally developed for text, to image recognition by treating image patches like words in a sentence.

TECHNICAL DEFINITION

A Vision Transformer (ViT) is a deep learning model that adapts the Transformer architecture, traditionally used for natural language processing, to computer vision tasks by dividing images into fixed-size patches and processing them as a sequence of embeddings.
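The patch-splitting step in the definition above can be sketched in a few lines. This is a minimal illustration in plain Python, not a reference implementation: the function name `image_to_patches`, the nested-list image layout (height x width x channels), and the toy 4 x 4 image are all assumptions made for the example. Real implementations operate on tensors and follow this step with a learned linear projection and position embeddings, which are omitted here.

```python
from typing import List

# Hypothetical image layout for this sketch: height x width x channels.
Image = List[List[List[float]]]


def image_to_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W x C image into flattened, non-overlapping square patches.

    Patches are returned in row-major order (left to right, top to bottom),
    each flattened to a vector of length patch_size * patch_size * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the pixels of one patch into a single vector.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 RGB image whose pixel values encode their own (row, column) position.
toy_image = [[[float(r), float(c), 0.0] for c in range(4)] for r in range(4)]
toy_patches = image_to_patches(toy_image, patch_size=2)

print(len(toy_patches))     # 4 patches (a 2 x 2 grid)
print(len(toy_patches[0]))  # 12 values per patch (2 * 2 * 3)
```

By the same arithmetic, a 224 x 224 RGB image cut into 16 x 16 patches gives a sequence of 14 x 14 = 196 patches, each a vector of 16 x 16 x 3 = 768 values. That sequence is what the Transformer processes in place of word tokens.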

BACKGROUND

Prompt engineering is the process of structuring natural language inputs to produce specified outputs from a generative artificial intelligence (GenAI) model. Context engineering is the related area of software engineering that focuses on the management of non-prompt and prompt contexts supplied to the GenAI model, such as system instructions, metadata, API tools and tokens.

SYNONYMS & ALIASES

ViT
Image Transformer

USAGE NOTE

ViTs are widely used for image classification and other vision tasks. When pre-trained on sufficiently large datasets, they often match or outperform convolutional neural networks (CNNs).

DEVELOPERS

Organizations developing technology related to Vision Transformer.

Google (Google AI / DeepMind)
Pioneered the Vision Transformer (ViT) architecture and continues to develop advanced vision transformer models and applications, including multimodal models that integrate vision and language.
Meta AI (formerly Facebook AI Research - FAIR)
A leading research lab making significant contributions to computer vision and self-supervised learning with vision transformers, developing models like DINO, MAE, and various foundational architectures.
Microsoft Research
Conducts extensive research on large-scale vision models, including the development of advanced transformer architectures such as the Swin Transformer, and their integration into various AI systems and services.
Hugging Face
Provides a widely used open-source library ('Transformers') for accessing, fine-tuning, and deploying pre-trained Vision Transformers and other transformer models, which makes the library a common building block in AI engineering and prompt design workflows.
NVIDIA
Develops the high-performance GPUs, software platforms (e.g., CUDA, cuDNN, TensorRT), and research initiatives that are critical for efficiently training, optimizing, and deploying large-scale Vision Transformers.
OpenAI
While primarily known for LLMs, OpenAI has developed influential multimodal models such as CLIP, whose image encoders include Vision Transformer variants, and DALL-E, which applies Transformers to image generation. Both pair images with text and so bear directly on prompt design.
Baidu Research
A major AI research institute, particularly strong in China, that actively develops and applies Vision Transformers for a wide range of computer vision tasks and integrated AI solutions.
Salesforce Research
Engages in research on multimodal AI, including the development and application of vision transformers for enterprise solutions, focusing on areas like visual question answering and image-text understanding.

RELATED TERMS IN MODEL ARCHITECTURE
