// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Model Serving

Model serving is the act of running a deployed model to provide predictions or inferences to other applications.

TECHNICAL DEFINITION

Model serving refers to the infrastructure and processes that host a deployed machine learning model, exposing an API endpoint for real-time or batch inference requests from client applications.
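
As a rough illustration of the definition above, the following minimal sketch wraps a toy model behind an HTTP endpoint using only the Python standard library. Everything named here is an assumption made for illustration: the fixed linear "model" (WEIGHTS, BIAS), the /predict path, the port, and the {"instances": [...]} request shape. It is not the API of any particular serving platform.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "deployed model": a fixed linear scorer standing in for a real artifact.
WEIGHTS = [0.5, -1.0, 2.0]
BIAS = 0.25


def predict(instance):
    """Score one feature vector with the toy linear model."""
    if len(instance) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(instance)}")
    return sum(w * x for w, x in zip(WEIGHTS, instance)) + BIAS


def handle_payload(body: bytes) -> dict:
    """Decode a JSON request body and run inference on every instance in it.

    A one-element list behaves like a real-time request; a longer list
    behaves like a small batch request.
    """
    request = json.loads(body)
    return {"predictions": [predict(inst) for inst in request["instances"]]}


class PredictHandler(BaseHTTPRequestHandler):
    """HTTP front end: POST /predict with a JSON body returns JSON predictions."""

    def do_POST(self):
        if self.path != "/predict":  # assumed endpoint path
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            response = handle_payload(self.rfile.read(length))
        except (ValueError, KeyError, TypeError) as exc:
            # Malformed JSON, missing "instances", or wrongly shaped features.
            self.send_error(400, str(exc))
            return
        payload = json.dumps(response).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) a server bound to localhost on the given port."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling make_server().serve_forever() would start the endpoint locally; a client could then POST a body such as {"instances": [[1, 1, 1]]} to /predict. Production serving systems add concerns this sketch leaves out, such as authentication, model loading, and concurrency.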

BACKGROUND

A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation. LLMs can typically generate, summarize, translate and analyze text in many contexts, and are a foundational technology behind modern chatbots. Biased or inaccurate training data can make an LLM's output less reliable.

SYNONYMS & ALIASES

  • Inference serving
  • Prediction service
  • Model endpoint

USAGE NOTE

Model serving platforms often handle scaling, load balancing, and latency optimization.
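
To make the load balancing and latency points concrete, here is a small sketch of round-robin request distribution across model replicas with per-replica latency tracking. The RoundRobinBalancer class and the replica callables are hypothetical names introduced only for this example; real platforms typically handle this at the infrastructure level rather than in application code.

```python
import itertools
import time


class RoundRobinBalancer:
    """Send each request to the next replica in turn and record its latency."""

    def __init__(self, replicas):
        # replicas: mapping of replica name -> callable that runs inference
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = dict(replicas)
        self._cycle = itertools.cycle(list(self._replicas))
        self.latencies = {name: [] for name in self._replicas}

    def infer(self, features):
        """Run one request; return (replica name, prediction)."""
        name = next(self._cycle)
        start = time.perf_counter()
        result = self._replicas[name](features)
        self.latencies[name].append(time.perf_counter() - start)
        return name, result


# Two stand-in replicas that both "serve" the same trivial model (a sum).
balancer = RoundRobinBalancer({"replica-a": sum, "replica-b": sum})
served_by = [balancer.infer([1, 2, 3])[0] for _ in range(4)]
```

After these four calls, served_by alternates between the two replica names, and balancer.latencies holds two timing samples per replica that could feed a scaling or routing decision.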

DEVELOPERS

Organizations developing technology related to model serving.

  • Amazon Web Services (AWS)

    AWS offers Amazon SageMaker, a fully managed service for building, training, and deploying machine learning models. Its model serving capabilities include real-time inference endpoints, serverless inference endpoints, and batch transform.

  • Google Cloud

    Google Cloud's Vertex AI provides a unified platform for ML development. Its model serving options include online predictions (real-time inference via HTTP endpoints) and batch predictions, with features such as auto-scaling, monitoring, and model versioning.

  • Microsoft Azure

    Azure Machine Learning is an MLOps platform that includes capabilities for deploying and serving machine learning models. It supports real-time and batch inferencing, managed endpoints, and integration with Azure Kubernetes Service (AKS) for scalable deployments.

  • Hugging Face

    Hugging Face, widely known for its natural language processing tooling, provides an inference API and dedicated infrastructure for deploying and serving a wide range of transformer models, so that developers can integrate them into their applications.

  • Databricks

    Databricks offers a unified platform for data and AI, with MLflow as a key component for managing the ML lifecycle. Its model serving features enable deployment of MLflow models to dedicated REST API endpoints for real-time inference.

  • NVIDIA

    NVIDIA develops Triton Inference Server, open-source inference serving software that aims to maximize GPU utilization and provides a standardized way to deploy AI models from multiple frameworks (TensorFlow, PyTorch, ONNX Runtime, etc.) on GPUs or CPUs.

  • Seldon

    Seldon is an MLOps company providing open-source and enterprise solutions for deploying, monitoring, and managing machine learning models at scale on Kubernetes. Its Seldon Core platform focuses on model serving, explainability, and drift detection.

  • KServe (formerly KFServing)

    KServe is an open-source project that provides a Kubernetes-native platform for serving machine learning models. It enables serverless inference, auto-scaling, canary rollouts, and multi-framework support for deploying models from TensorFlow, PyTorch, scikit-learn, XGBoost, and more.

RELATED TERMS IN MLOPS & DEPLOYMENT