// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

Datasheets

Detailed documentation for datasets used to train AI models, explaining their characteristics, collection, and potential biases.

TECHNICAL DEFINITION

Datasheets for Datasets are structured documents providing comprehensive metadata about a dataset used in machine learning, detailing its motivation, composition, collection process, preprocessing steps, known biases, and recommended uses to promote transparency and responsible AI.

BACKGROUND

The ethics of artificial intelligence covers a broad range of topics within AI that are considered to have particular ethical stakes. This includes algorithmic biases, fairness, accountability, transparency, privacy, and regulation, particularly where systems influence or automate human decision-making. It also covers various emerging or potential future challenges such as machine ethics, lethal autonomous weapon systems, arms race dynamics, AI safety and alignment, technological unemployment, AI-enabled misinformation, how to treat certain AI systems if they have a moral status, artificial superintelligence and existential risks.

READ MORE ON WIKIPEDIA

SYNONYMS & ALIASES

  • Dataset documentation
  • data manifest
  • data inventory

USAGE NOTE

Creating datasheets helps ensure transparency and responsible use of training data for AI models.

DEVELOPERS

Organizations developing technology related to Datasheets.

  • Google

    A co-originator of the 'Datasheets for Datasets' concept through its research arm. Google has developed open-source tools like the Model Card Toolkit and the Know Your Data tool, which help practitioners document and understand AI datasets and models, aligning directly with the principles of datasheets.

  • IBM

    Offers 'AI FactSheets' as part of its Watsonx.governance platform. This commercial product automates the collection of model development facts directly from data sources and model repositories, providing transparency and facilitating governance, which is a direct implementation of the datasheet concept for enterprise use.

  • Hugging Face

    Integrates 'Dataset Cards' across its platform, a practical and widely adopted implementation of the datasheet concept. These standardized markdown files accompany each dataset, providing structured information on its creation, content, intended use, and limitations, promoting responsible data sharing.

  • Microsoft

    Develops and integrates responsible AI tools into its Azure Machine Learning platform. The Responsible AI dashboard includes components for data analysis that help generate the critical insights required for comprehensive datasheet documentation.

  • Databricks

    Provides the Unity Catalog, a unified governance solution for data and AI assets. It enables organizations to capture and manage metadata, lineage, and documentation for datasets, which are foundational components for creating and maintaining datasheets at an enterprise scale.

  • DAIR Institute (Distributed AI Research Institute)

    A research institute founded by Dr. Timnit Gebru, the lead author of the original 'Datasheets for Datasets' paper. DAIR continues to research and advocate for documentation standards and tools that promote transparency and accountability in AI development.

  • Weights & Biases

    An MLOps platform whose 'Artifacts' feature allows for versioning of datasets and models. This system provides a framework for attaching rich, version-controlled documentation and metadata to datasets, enabling the practical implementation of datasheets within the development workflow.

RELATED TERMS IN AI ETHICS & SAFETY