// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Model Serving

Model serving is the act of running a deployed model to provide predictions or inferences to other applications.

TECHNICAL DEFINITION

Model serving refers to the infrastructure and processes that host a deployed machine learning model, exposing an API endpoint for real-time or batch inference requests from client applications.
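As a rough illustration of the definition above, the following is a minimal sketch of a model exposed behind an HTTP endpoint, using only the Python standard library. The `predict` function is a hypothetical stand-in for a real trained model, and the `/predict` path and the `instances`/`predictions` field names are assumptions chosen for the example, not a fixed standard. A single request can carry one instance (real-time style) or many (batch style).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in model: returns the mean of the input features."""
    return sum(features) / len(features)


def handle_inference(body: bytes):
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        instances = json.loads(body)["instances"]
        # One instance behaves like a real-time request, many like a small batch.
        predictions = [predict(x) for x in instances]
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        error = {"error": "expected {'instances': [[numbers], ...]}"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"predictions": predictions}).encode()


class InferenceHandler(BaseHTTPRequestHandler):
    """Exposes handle_inference at POST /predict (path is an assumption)."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_inference(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8080):
    """Start the server; this call blocks until the process is stopped."""
    HTTPServer((host, port), InferenceHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. A client application would then POST a body such as `{"instances": [[1, 2, 3]]}` to `/predict`. Keeping `handle_inference` separate from the HTTP handler means the request logic can be exercised without opening a socket.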

BACKGROUND

A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation. LLMs can typically generate, summarize, translate, and analyze text in many contexts, and are a foundational technology behind modern chatbots. Biased or inaccurate training data can make an LLM's output less reliable.

SYNONYMS & ALIASES

Inference serving
Prediction service
Model endpoint

USAGE NOTE

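Model serving platforms often handle operational concerns beyond running the model itself: scaling capacity with request volume, load balancing requests across model instances, and keeping inference latency low.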
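The concerns named in this note can be sketched in a few lines. The example below is a simplified illustration, not how any particular platform works: `RoundRobinBalancer` and `replicas_needed` are hypothetical names, replicas are modeled as plain Python callables, and the scaling rule assumes each replica has a known, fixed request capacity.

```python
import itertools
import time


class RoundRobinBalancer:
    """Spread requests evenly across model replicas and record per-call latency."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._cycle = itertools.cycle(range(len(self.replicas)))
        self.latencies = []  # seconds, one entry per handled request

    def handle(self, request):
        """Send the request to the next replica in turn; return (index, result)."""
        index = next(self._cycle)
        start = time.perf_counter()
        result = self.replicas[index](request)
        self.latencies.append(time.perf_counter() - start)
        return index, result

    def mean_latency(self):
        """Average observed latency in seconds, or 0.0 before any request."""
        if not self.latencies:
            return 0.0
        return sum(self.latencies) / len(self.latencies)


def replicas_needed(requests_per_second, capacity_per_replica):
    """Scaling rule of thumb (assumed): ceiling of load over capacity, at least 1."""
    return max(1, -(-requests_per_second // capacity_per_replica))
```

For instance, under these assumptions `replicas_needed(250, 100)` asks for three replicas, and a balancer built over three replicas cycles through them in order. Production systems usually use richer signals (queue depth, GPU utilization, tail latency) than this sketch does.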

DEVELOPERS

Organizations developing technology related to model serving.

Amazon Web Services (AWS)
AWS offers Amazon SageMaker, a fully managed service for building, training, and deploying machine learning models. Its model serving capabilities include real-time inference, batch transform, and serverless inference endpoints.
Google Cloud
Google Cloud's Vertex AI provides a unified platform for ML development, offering robust model serving options including online predictions (real-time inference via HTTP endpoints) and batch predictions, with features like auto-scaling, monitoring, and model versioning.
Microsoft Azure
Azure Machine Learning provides a comprehensive MLOps platform, including powerful capabilities for deploying and serving machine learning models. It supports real-time and batch inferencing, managed endpoints, and integration with Azure Kubernetes Service (AKS) for scalable deployments.
Hugging Face
Hugging Face, widely known for its natural language processing tools, provides an inference API and dedicated infrastructure for deploying and serving a wide range of transformer models, so developers can integrate them into their applications.
Databricks
Databricks offers a unified platform for data and AI, with MLflow as a key component for managing the ML lifecycle. Its model serving features enable deployment of MLflow models to dedicated REST API endpoints for real-time inference.
NVIDIA
NVIDIA develops the Triton Inference Server, open-source inference serving software that aims to maximize GPU utilization and provides a standardized way to deploy AI models from multiple frameworks (TensorFlow, PyTorch, ONNX Runtime, etc.) on GPUs or CPUs.
Seldon
Seldon is an MLOps company providing open-source and enterprise solutions for deploying, monitoring, and managing machine learning models at scale on Kubernetes. Its Seldon Core platform focuses on model serving, explainability, and drift detection.
KServe (formerly KFServing)
KServe is an open-source project that provides a Kubernetes-native platform for serving machine learning models. It enables serverless inference, auto-scaling, canary rollouts, and multi-framework support for deploying models from TensorFlow, PyTorch, scikit-learn, XGBoost, and more.
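Several of the entries above mention model versioning and canary rollouts. The general idea of a canary rollout, sending a small share of traffic to a new model version while the rest stays on the current one, can be sketched as follows. This is a generic illustration and not the API of KServe or any other product listed here; `choose_version`, the `"stable"`/`"canary"` labels, and the use of a request ID as the routing key are all assumptions made for the example.

```python
import hashlib


def choose_version(request_id: str, canary_percent: int) -> str:
    """Deterministically route roughly canary_percent% of requests to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the request ID to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


def route(request_id: str, payload, models: dict, canary_percent: int):
    """Call whichever model version the request ID maps to; return (version, result)."""
    version = choose_version(request_id, canary_percent)
    return version, models[version](payload)
```

Hashing the request ID rather than drawing a random number keeps routing repeatable: the same ID always reaches the same version, which makes it easier to compare the two versions' behavior. Raising `canary_percent` step by step toward 100 completes the rollout, and setting it back to 0 rolls it back.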

RELATED TERMS IN MLOPS & DEPLOYMENT
