// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Test Set

A portion of the dataset held back from training to evaluate how well the trained model performs on new, unseen data.

TECHNICAL DEFINITION

A distinct subset of a dataset, separate from the training set, used to provide an unbiased evaluation of a machine learning model's generalization ability and performance on unseen data after training is complete.
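As a minimal sketch of the definition above, the following Python holds back a fixed fraction of a dataset as the test set. The function name split_train_test, the 20% fraction, and the seed are illustrative assumptions, not part of any particular library.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and hold back a fraction as the test set.

    Hypothetical helper for illustration; returns (train_set, test_set).
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled items are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(100))  # stand-in for real labeled examples
train_set, test_set = split_train_test(examples, test_fraction=0.2, seed=42)
```

The two subsets are disjoint by construction, so no test example is seen during training. A random split like this assumes the examples are independent; time-ordered or grouped data usually needs a split that respects that structure.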

BACKGROUND

Prompt engineering is the process of structuring natural language inputs to produce specified outputs from a generative artificial intelligence (GenAI) model. Context engineering is the related area of software engineering that focuses on the management of non-prompt and prompt contexts supplied to the GenAI model, such as system instructions, metadata, API tools and tokens.

SYNONYMS & ALIASES

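Hold-out set
Evaluation set
Validation set (loose usage only; strictly, a validation set is a separate split used for model selection and hyperparameter tuning during development)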

USAGE NOTE

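The test set should reflect the data the model will encounter in real-world use; otherwise the performance estimate is unreliable. It should also stay out of training and hyperparameter tuning, because a model evaluated on data it has already been tuned against will look better than it really is.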
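One rough way to check whether a test set resembles real-world data is to compare label proportions against a sample drawn from production. The sketch below assumes a classification task; the function names and the spam/ham data are hypothetical, and a gap in label proportions is only one of several possible signals.

```python
from collections import Counter


def label_proportions(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_proportion_gap(test_labels, reference_labels):
    """Largest absolute difference in label share between the two samples."""
    test_p = label_proportions(test_labels)
    ref_p = label_proportions(reference_labels)
    all_labels = set(test_p) | set(ref_p)
    return max(abs(test_p.get(l, 0.0) - ref_p.get(l, 0.0)) for l in all_labels)


# Made-up data: the test set under-represents "spam" relative to production.
test_labels = ["spam"] * 10 + ["ham"] * 90
production_sample = ["spam"] * 30 + ["ham"] * 70
gap = max_proportion_gap(test_labels, production_sample)
```

A large gap suggests the test set may over- or under-represent some classes relative to live traffic. What counts as "large" depends on the application, and matching label proportions does not guarantee that the inputs themselves are distributed alike.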

DEVELOPERS

Organizations developing technology related to test sets.

Scale AI
Develops a data-centric platform for AI development that includes services for generating, annotating, and curating high-quality datasets used for training and testing large language models.
Hugging Face
An open-source platform that provides tools and resources for machine learning. Their 'Datasets' library is a standard for accessing and managing datasets, and they host leaderboards that evaluate models against standardized test sets.
LangSmith
Developed by LangChain, LangSmith is a platform for debugging, testing, evaluating, and monitoring LLM applications. It allows developers to create custom datasets (test sets) and run evaluators to score model outputs.
Weights & Biases
An MLOps platform that provides tools for tracking experiments, versioning data, and managing models. Their products help teams create, manage, and evaluate models against test sets, especially for LLM-based applications.
Arize AI
An ML observability platform that helps teams monitor and troubleshoot AI in production. The platform enables the evaluation of model performance against specific data slices or 'golden' test sets to detect issues like drift and performance degradation.
Kolena
A machine learning testing platform designed for creating and managing curated test suites. It enables teams to go beyond aggregate metrics by running fine-grained tests on specific scenarios to identify model failure points.
Galileo
Provides a data intelligence platform specifically for unstructured data, helping teams build high-quality NLP models. Their tools automatically find and fix data errors in training and test sets, ensuring more reliable evaluation.
Arthur
An AI performance company that offers a platform for monitoring, measuring, and improving machine learning models. It includes robust capabilities for LLM evaluation, allowing users to test models for accuracy, fairness, and toxicity using curated test sets.
