// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM
Datasheets
Detailed documentation for datasets used to train AI models, explaining their characteristics, collection, and potential biases.
TECHNICAL DEFINITION
Datasheets for Datasets are structured documents providing comprehensive metadata about a dataset used in machine learning, detailing its motivation, composition, collection process, preprocessing steps, known biases, and recommended uses to promote transparency and responsible AI.
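As a rough illustration of the definition above, a datasheet can be modeled as a small structured record. This is a minimal sketch, not a standard schema: the `Datasheet` class, its field names, the `missing_sections` helper, and the `toy-reviews` example values are all hypothetical and simply mirror the sections named in the definition.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class Datasheet:
    """Hypothetical record; fields mirror the sections named in the definition."""
    name: str
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    known_biases: List[str] = field(default_factory=list)
    recommended_uses: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [key for key, value in asdict(self).items() if not value]


# Made-up example values, for illustration only.
sheet = Datasheet(
    name="toy-reviews",
    motivation="Benchmark sentiment classifiers on short product reviews.",
    composition="10,000 English reviews, one label per review.",
    collection_process="Sampled from a public review site.",
    known_biases=["Over-represents electronics products."],
)

# Serialize the datasheet so it can be stored next to the dataset.
print(json.dumps(asdict(sheet), indent=2))

# List the sections that still need to be written.
print(sheet.missing_sections())
```

Keeping the datasheet as structured data rather than free text makes it straightforward to flag incomplete sections before a dataset is shared or used for training.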
BACKGROUND
The ethics of artificial intelligence covers a broad range of topics within AI that are considered to have particular ethical stakes. This includes algorithmic biases, fairness, accountability, transparency, privacy, and regulation, particularly where systems influence or automate human decision-making. It also covers various emerging or potential future challenges such as machine ethics, lethal autonomous weapon systems, arms race dynamics, AI safety and alignment, technological unemployment, AI-enabled misinformation, how to treat certain AI systems if they have a moral status, artificial superintelligence and existential risks.
SYNONYMS & ALIASES
- Dataset documentation
- Data manifest
- Data inventory
USAGE NOTE
Creating datasheets helps ensure transparency and responsible use of training data for AI models.
DEVELOPERS
Organizations developing technology related to Datasheets.
Google: A co-originator of the 'Datasheets for Datasets' concept through its research arm. It has also developed tools such as the open-source Model Card Toolkit and Know Your Data, which help practitioners document and understand AI datasets and models, in line with the principles of datasheets.
IBM: Offers 'AI FactSheets' as part of its watsonx.governance platform. This commercial product automates the collection of model development facts directly from data sources and model repositories, providing transparency and supporting governance. It applies the datasheet concept to enterprise use.
Hugging Face: Integrates 'Dataset Cards' across its platform, a practical and widely adopted implementation of the datasheet concept. These standardized markdown files accompany each dataset and provide structured information on its creation, content, intended use, and limitations, promoting responsible data sharing.
Microsoft: Develops and integrates responsible AI tools into its Azure Machine Learning platform. The Responsible AI dashboard includes data analysis components that help generate the insights needed for comprehensive datasheet documentation.
Databricks: Provides Unity Catalog, a unified governance solution for data and AI assets. It enables organizations to capture and manage metadata, lineage, and documentation for datasets, which are foundational components for creating and maintaining datasheets at enterprise scale.
DAIR (Distributed AI Research Institute): A research institute founded by Dr. Timnit Gebru, the lead author of the original 'Datasheets for Datasets' paper. DAIR continues to research and advocate for documentation standards and tools that promote transparency and accountability in AI development.
Weights & Biases: An MLOps platform whose 'Artifacts' feature supports versioning of datasets and models. It provides a framework for attaching rich, version-controlled documentation and metadata to datasets, so that datasheets can be maintained within the development workflow.
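The markdown 'Dataset Cards' described above suggest one simple way to publish a datasheet alongside a dataset. The sketch below renders section text into a markdown file body using only plain string formatting. It does not use any vendor's API, and the `render_card` function, the section titles, and the `toy-reviews` example are hypothetical.

```python
def render_card(name: str, sections: dict) -> str:
    """Render a datasheet as markdown with one heading per section."""
    lines = [f"# Datasheet: {name}", ""]
    for title, body in sections.items():
        lines.append(f"## {title}")
        # Empty sections stay visible so documentation gaps are obvious.
        lines.append(body if body else "_Not yet documented._")
        lines.append("")
    return "\n".join(lines)


# Made-up example values, for illustration only.
card = render_card(
    "toy-reviews",
    {
        "Motivation": "Benchmark sentiment classifiers on short product reviews.",
        "Known biases": "",
    },
)
print(card)
```

Writing the result to a file stored next to the data lets the documentation be versioned together with the dataset it describes.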