AI Preferences                                                T. Vaughan
Internet-Draft                                   Common Crawl Foundation
Intended status: Informational                           20 January 2025
Expires: 24 July 2025


     Vocabulary for Expressing Content Preferences for AI Training
                      draft-vaughan-aipref-vocab-00

Abstract

   This document proposes a vocabulary for expressing content
   preferences for rightsholders who wish to manage the use of their
   content in AI training.  This vocabulary allows publishers to
   express preferences through metadata or content-delivery protocols.
   The vocabulary can be applied at different levels of granularity and
   incorporates preferences for permissions, usage scope, and data
   retention, providing a foundation for interoperability across
   various Internet protocols.

About This Document

   This note is to be removed before publishing as an RFC.

   The latest revision of this draft can be found at
   https://thunderpoot.github.io/draft-vaughan-aipref-vocab/draft-vaughan-aipref-vocab.html.
   Status information for this document may be found at
   https://datatracker.ietf.org/doc/draft-vaughan-aipref-vocab/.

   Discussion of this document takes place on the AI Preferences
   mailing list (mailto:ai-control@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/ai-control/.  Subscribe at
   https://www.ietf.org/mailman/listinfo/ai-control/.

   Source for this draft and an issue tracker can be found at
   https://github.com/thunderpoot/draft-vaughan-aipref-vocab.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Vaughan                   Expires 24 July 2025                  [Page 1]

Internet-Draft                AIPREF Vocab                  January 2025

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 24 July 2025.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions and Definitions
   3.  Scope
   4.  Vocabulary Elements / Preference Signals
     4.1.  Permission
     4.2.  Purpose
     4.3.  Temporal Restrictions
     4.4.  Content-Specific Granularity
     4.5.  Content Type
     4.6.  Derivative Content
     4.7.  Data Retention
     4.8.  Preference Persistence
     4.9.  Precedence
     4.10. Geographic Restrictions
   5.  Implementation Considerations
     5.1.  HTTP Headers
     5.2.  Robots Exclusion Protocol (REP)
     5.3.  Tags for Sub-Document Level Control
     5.4.  “Well-Known” Locations
     5.5.  Embedded Metadata
     5.6.  Content Credentials (ISO 22144)
     5.7.  ISCC (ISO 24138)
   6.  Example Usage Scenarios
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Table of Preference Signals
   Acknowledgments
   Author's Address

1.  Introduction

   As AI models become more reliant on large-scale data (driven by
   scaling laws that link model performance to dataset size), content
   publishers seek ways to control how their content is used in
   training these models.  This draft provides a vocabulary that
   enables publishers to signal their preferences about the use of
   their content in AI training.

2.  Conventions and Definitions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Scope

   The AI-PREF vocabulary is limited to expressing content preferences
   for AI training and does not include enforcement mechanisms or
   client authentication.

   Default opt-in or opt-out statuses are beyond the scope of this
   proposal, as it focuses solely on establishing a standard for
   signalling explicit preferences.  Where no preferences are
   signalled, whether this constitutes an opt-in or an opt-out should
   be determined downstream, at the policy level.

   Preference signals are advisory.

4.  Vocabulary Elements / Preference Signals

4.1.  Permission

   Basic indicators of whether content can be used for AI training.

   *  *allow_training*: Boolean

   *  *restricted_training*: public, non-commercial, internal, licensed

4.2.  Purpose

   Defines acceptable uses in training.

   *  *purpose*: String:

      -  *generation*: Creating models that are capable of generating
         content

      -  *embedding*: Converting content to vector representations

      -  *classification*: Categorising or labelling content

      -  *summary*: Creating condensed versions of content

      -  *paraphrase*: Creating derivative versions of content

      -  *quotation*: Repetition of a passage or fragment of original
         content

      -  *translation*: Converting content between languages

4.3.  Temporal Restrictions

   Specifies the date range for training use.

   *  *effective_date*: ISO 8601 Date string

   *  *expiration_date*: ISO 8601 Date string

4.4.  Content-Specific Granularity

   Defines the scope of applicability, that is, the level at which
   preferences apply within the content.

   *  *scope*: global, content-specific, conditional

4.5.  Content Type

   Specifies content types the preference applies to.

   *  *mime_type*: text, image, video, audio, application (see
      [RFC2046])

4.6.  Derivative Content

   Allows or restricts derivatives like summaries.
   *  *allow_derivatives*: Boolean

   *  *derivative_type*: summary, paraphrase, translation

4.7.  Data Retention

   Defines how long content may be retained after training.

   *  *retention_period*: ISO 8601 Duration string (e.g.,
      P3Y6M4DT12H30M5S)

4.8.  Preference Persistence

   Indicates whether preferences should persist in derived datasets or
   whether persistence is optional.  A derived dataset is the result of
   processing, transforming, or extracting information from the
   original source, such as aggregated statistics and summaries, or
   subsets of data.

   *  *metadata_persistence*: Boolean

4.9.  Precedence

   Conflicts should be resolved by assigning precedence values (e.g.,
   high, medium, low) to rules, with a defined hierarchy that allows
   content producers to override publishers, domain operators, and
   others as necessary.

   *  *precedence*: Sets priority when preferences conflict with other
      layered preferences.

4.10.  Geographic Restrictions

   Specifies the regions where preferences apply, identified by
   ISO 3166-1 codes.

   *  *geo_limitations*: Specifies geographic regions where training
      permissions apply.

5.  Implementation Considerations

   The AI-PREF vocabulary can be implemented through various
   mechanisms, depending on the needs and existing infrastructure of
   content publishers.  Approaches include, but are not limited to,
   HTTP headers, possible extensions to [RFC9309] ([PURPOSE]), and (for
   example) tags and other embedded data (such as EXIF) for
   sub-document-level control.

5.1.  HTTP Headers

   Publishers can use HTTP headers to communicate AI-PREF preferences
   directly in response to client requests.  This approach allows
   fine-grained control and easy integration into existing server
   configurations.
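
   As a minimal sketch of how such a header value might be produced on
   the server side and read back by a client, the Python below assumes
   the semicolon-separated key=value form used in this section's
   example header.  This draft does not define a formal grammar, so the
   function names, the LIST_KEYS set, and the true/false spelling of
   Booleans are all illustrative assumptions rather than part of the
   vocabulary.

```python
# Hypothetical helpers: the draft does not specify a header grammar.
# Assumption: pairs are joined with "; " and list values with ",".

# Assumption: only "purpose" carries a comma-separated list of values.
LIST_KEYS = {"purpose"}


def format_ai_pref(prefs):
    """Serialise a dict of preferences into an AI-PREF header value."""
    parts = []
    for key, value in prefs.items():
        if isinstance(value, bool):
            text = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            text = ",".join(value)
        else:
            text = str(value)
        parts.append(f"{key}={text}")
    return "; ".join(parts)


def parse_ai_pref(header_value):
    """Parse an AI-PREF header value back into a dict of preferences."""
    prefs = {}
    for part in header_value.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing or doubled semicolon
        key, _, value = part.partition("=")
        key, value = key.strip(), value.strip()
        if key in LIST_KEYS:
            prefs[key] = [item.strip() for item in value.split(",")]
        elif value in ("true", "false"):
            prefs[key] = value == "true"
        else:
            prefs[key] = value  # e.g. an ISO 8601 duration, kept as text
    return prefs


example = {
    "allow_training": True,
    "purpose": ["generation", "classification"],
    "retention_period": "P3Y6M4DT12H30M5S",
}
header_value = format_ai_pref(example)
print("AI-PREF:", header_value)
```

   Because preference signals are advisory, a parser of this kind only
   reports what the publisher expressed; deciding how to act on it is
   left to the consumer.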
   *Example header:*

   AI-PREF: allow_training=true; purpose=generation,classification; retention_period=P3Y6M4DT12H30M5S

   This header specifies that the content can be used for generation
   and classification, with a retention period of 3 years, 6 months,
   4 days, 12 hours, 30 minutes, and 5 seconds.  The syntax and options
   should be carefully chosen to ensure compatibility with common web
   servers and clients.

5.2.  Robots Exclusion Protocol (REP)

   For publishers who already use REP (as defined in RFC9309
   (https://datatracker.ietf.org/doc/rfc9309/)), extending REP rules to
   include AI-PREF preferences could be beneficial.

   Example rule:

   User-agent: *
   Allow-Training: non-commercial
   Purpose: embedding, summarisation

   This REP rule specifies that all user agents are allowed to use the
   content for non-commercial AI training, limited to embedding and
   summarisation purposes.  Further extensions to REP could specify
   additional constraints, such as geographic limitations or temporal
   restrictions.

5.3.  Tags for Sub-Document Level Control

   To specify AI-PREF preferences at the level of individual HTML
   documents or specific parts of a document, tags and HTML attributes
   can be used.

   Example tag:

   Example HTML attribute: