// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM
Model Serving
Model serving is the act of running a deployed model to provide predictions or inferences to other applications.
TECHNICAL DEFINITION
Model serving refers to the infrastructure and processes that host a deployed machine learning model, exposing an API endpoint for real-time or batch inference requests from client applications.
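As an illustration of the definition above, here is a minimal sketch of a serving endpoint built only on the Python standard library. Everything specific in it is an assumption made for the example rather than the convention of any particular platform: the `/predict` path, the `instances` and `predictions` JSON keys, the fixed `WEIGHTS`, and the function names. The same handler covers a single real-time request (one instance) and a small batch (several instances in one call).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = [0.5, -0.25, 2.0]


def predict(features):
    """Score one feature vector with the placeholder model."""
    return sum(w * x for w, x in zip(WEIGHTS, features))


def handle_predict_request(raw_body):
    """Turn a raw JSON request body into (HTTP status, JSON response bytes)."""
    try:
        instances = json.loads(raw_body)["instances"]
        predictions = [predict(row) for row in instances]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with an 'instances' list"}
        return 400, json.dumps(error).encode("utf-8")
    return 200, json.dumps({"predictions": predictions}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Expose the model over HTTP at the assumed POST /predict path."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_predict_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block and answer inference requests on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint, and a client application would then POST a body such as `{"instances": [[1, 2, 3]]}` to `http://127.0.0.1:8080/predict`. Keeping the request logic in `handle_predict_request` separate from the HTTP handler is a design choice for this sketch so the inference path can be exercised without opening a socket.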
BACKGROUND
A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation. LLMs can typically generate, summarize, translate and analyze text in many contexts, and are a foundational technology behind modern chatbots. Biased or inaccurate training data can make an LLM's output less reliable.
SYNONYMS & ALIASES
- Inference serving
- Prediction service
- Model endpoint
USAGE NOTE
Model serving platforms often handle scaling, load balancing, and latency optimization.
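To make the load-balancing and latency part of this note concrete, the sketch below spreads requests across identical model replicas in round-robin order and records how long each call takes. It is a simplified, single-threaded illustration, not how any specific platform implements it. `RoundRobinBalancer` and `placeholder_model` are hypothetical names, and the replicas are plain Python callables standing in for separate model server processes.

```python
import itertools
import time


def placeholder_model(features):
    """Hypothetical stand-in for one deployed copy of a model."""
    return sum(features)


class RoundRobinBalancer:
    """Cycle requests across replicas and record per-replica latency.

    Not thread-safe: a real serving platform would also need locking,
    health checks, and scaling logic, all of which are omitted here.
    """

    def __init__(self, replicas):
        self._replicas = list(replicas)
        if not self._replicas:
            raise ValueError("at least one replica is required")
        self._next_index = itertools.cycle(range(len(self._replicas)))
        # One list of observed latencies (in seconds) per replica.
        self.latencies = [[] for _ in self._replicas]

    def predict(self, features):
        """Send one request to the next replica; return (replica index, result)."""
        index = next(self._next_index)
        start = time.perf_counter()
        result = self._replicas[index](features)
        self.latencies[index].append(time.perf_counter() - start)
        return index, result


balancer = RoundRobinBalancer([placeholder_model] * 3)
routed = [balancer.predict([1, 2, 3])[0] for _ in range(6)]
print(routed)  # each replica index appears twice, in rotation
```

The recorded latencies could feed a scaling decision, for example adding a replica when the recent average exceeds a target, but that policy is outside this sketch.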
DEVELOPERS
Organizations and projects developing technology related to model serving.
AWS offers Amazon SageMaker, a fully managed service for building, training, and deploying machine learning models. Its model serving options include real-time inference endpoints, serverless inference endpoints, and batch transform jobs.
Google Cloud's Vertex AI provides a unified platform for ML development. Its model serving options include online predictions (real-time inference via HTTP endpoints) and batch predictions, with features such as auto-scaling, monitoring, and model versioning.
Azure Machine Learning provides an MLOps platform that includes deploying and serving machine learning models. It supports real-time and batch inferencing, managed endpoints, and integration with Azure Kubernetes Service (AKS) for scalable deployments.
Hugging Face, known for its work in natural language processing, provides an inference API and dedicated infrastructure for deploying and serving transformer models, so developers can integrate hosted models into their applications.
Databricks offers a unified platform for data and AI, with MLflow as a key component for managing the ML lifecycle. Its model serving features enable deployment of MLflow models to dedicated REST API endpoints for real-time inference.
NVIDIA develops the Triton Inference Server, open-source inference serving software that aims to maximize GPU utilization and provides a standardized way to deploy AI models from multiple frameworks (TensorFlow, PyTorch, ONNX Runtime, etc.) on GPUs or CPUs.
Seldon is an MLOps company providing open-source and enterprise solutions for deploying, monitoring, and managing machine learning models at scale on Kubernetes. Its Seldon Core platform focuses on model serving, explainability, and drift detection.
KServe is an open-source project that provides a Kubernetes-native platform for serving machine learning models. It enables serverless inference, auto-scaling, canary rollouts, and multi-framework support for deploying models from TensorFlow, PyTorch, scikit-learn, XGBoost, and more.