// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Vision Transformer

A Vision Transformer (ViT) applies the Transformer architecture, originally developed for text, to image recognition by treating image patches like words in a sentence.

TECHNICAL DEFINITION

A Vision Transformer (ViT) is a deep learning model that adapts the Transformer architecture, traditionally used for natural language processing, to computer vision tasks by dividing images into fixed-size patches and processing them as a sequence of embeddings.
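The patch-and-embed step in this definition can be illustrated with a minimal sketch. It uses only the Python standard library and a toy 4x4 single-channel image. The function names `split_into_patches` and `embed_patches` are hypothetical, and the random projection stands in for weights that a real model would learn. Position embeddings and the Transformer encoder itself are left out, and a practical implementation would use a tensor library rather than nested lists.

```python
import random


def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append every channel value of this pixel
            patches.append(patch)
    return patches


def embed_patches(patches, projection):
    """Linearly project each flattened patch.

    `projection` holds one row of patch_dim weights per output embedding dimension.
    """
    return [
        [sum(value * weight for value, weight in zip(patch, row)) for row in projection]
        for patch in patches
    ]


# Toy 4x4 single-channel image whose pixel value is its row-major index.
image = [[[row * 4 + col] for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)

# Hypothetical random projection standing in for learned weights (assumption).
rng = random.Random(0)
embed_dim = 3
patch_dim = len(patches[0])
projection = [[rng.uniform(-1, 1) for _ in range(patch_dim)] for _ in range(embed_dim)]
tokens = embed_patches(patches, projection)

print(len(patches), patch_dim)  # 4 4  (four patches, each with 2 * 2 * 1 values)
print(patches[0])               # [0, 1, 4, 5]
```

The resulting `tokens` list is the sequence of embeddings: one vector of length `embed_dim` per patch, in the same order a sentence would present its words.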

BACKGROUND

Prompt engineering is the process of structuring natural language inputs to produce specified outputs from a generative artificial intelligence (GenAI) model. Context engineering is the related area of software engineering that focuses on the management of non-prompt contexts supplied to the GenAI model, such as metadata, API tools, and tokens.

SYNONYMS & ALIASES

  • ViT
  • Image Transformer

USAGE NOTE

ViTs are widely used for image classification and other vision tasks. When pre-trained on large datasets, they can match or outperform convolutional neural networks (CNNs).

DEVELOPERS

Organizations developing technology related to Vision Transformers.

  • Google (Google AI / DeepMind)

    Pioneered the Vision Transformer (ViT) architecture and continues to develop advanced vision transformer models and applications, including multimodal models that integrate vision and language.

  • Meta AI (formerly Facebook AI Research - FAIR)

    A leading research lab making significant contributions to computer vision and self-supervised learning with vision transformers, developing models like DINO, MAE, and various foundational architectures.

  • Microsoft Research

    Conducts extensive research on large-scale vision models, including the development of advanced transformer architectures such as the Swin Transformer, and their integration into various AI systems and services.

  • Hugging Face

    Provides a widely used open-source library ('Transformers') for accessing, fine-tuning, and deploying pre-trained Vision Transformers and other transformer models, making the library central to many AI engineering and prompt design workflows.

  • NVIDIA

    Develops the high-performance GPUs, software platforms (e.g., CUDA, cuDNN, TensorRT), and research initiatives that are critical for efficiently training, optimizing, and deploying large-scale Vision Transformers.

  • OpenAI

    While primarily known for LLMs, OpenAI has developed influential multimodal models such as CLIP, whose image encoder can be a Vision Transformer, and DALL-E, a text-to-image model. Both are driven by text prompts, which links them directly to prompt design.

  • Baidu Research

    A major AI research institute with a particularly strong presence in China that actively develops and applies Vision Transformers for a wide range of computer vision tasks and integrated AI solutions.

  • Salesforce Research

    Engages in research on multimodal AI, including the development and application of vision transformers for enterprise solutions, focusing on areas like visual question answering and image-text understanding.

RELATED TERMS IN MODEL ARCHITECTURE