On May 9, 2025, the Copyright Office released a “pre-publication” draft of the third and final volume of its study on the intersection of artificial intelligence and copyright law, titled “Generative AI Training” (the “Report”). The first two reports in the series covered legal issues around deepfakes and the copyrightability of AI-generated works, respectively.
The Report is the product of the Copyright Office’s yearslong engagement with stakeholders, which officially commenced with the Copyright Office’s August 2023 notice of inquiry calling for comment across a range of copyright issues pertaining to AI.
The Report (i) provides a technical primer explaining how copyright-protected works are used in the development of AI models; (ii) addresses whether various activities across the development and deployment of AI applications constitute infringement under existing US copyright laws; and (iii) considers whether any statutory defenses to infringement apply and the feasibility of various licensing schemes.
Does Generative AI Training Infringe Copyright?
The Report examines potential infringement arguments in relation to activities across the development, deployment, and use of generative AI, from the collection of content used to train AI models through to the generation of outputs in response to user prompts. The Report states that the steps in producing a training dataset, which include downloading the data, transferring it across different storage media, and modifying it for use in training, “clearly implicate the right of reproduction.”
Turning to training itself, the Report affirms that the works comprising a dataset are reproduced when they are “shown” to the model, though a factual question remains as to whether such copies are too transitory to satisfy the fixation requirement (under the Copyright Act, a “copy” must be “fixed,” meaning it must be “sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration”). The Report also addresses an area of significant controversy, stating that the model weights (that is, the parameters the model derives from the training data and uses to make predictions) may themselves be copies or derivative works of the works in a training dataset. On this point, the Report acknowledges that courts have reached different conclusions on whether model weights embody the underlying works in the training data and suggests that the determination “turns on whether the model has retained or memorized substantial protectable expression from the work(s) at issue.”
Having considered potential infringement arguments with respect to the training phase, the Report turns to issues around the deployment and use of generative AI models. The Report first addresses retrieval-augmented generation (RAG), a technique in which an AI model supplements its output with information from outside its training dataset, such as information obtained from the open web via a search engine or from a retrieval database established specifically for that purpose. The Report also briefly addresses the outputs of generative AI models, which, it notes, have been shown to “produce near exact replicas of still images from movies, copyrightable characters, or text from news stories.” According to the Report, such cases implicate the reproduction and derivative work rights. Although the Report indicates that “[t]hese infringement issues…will be addressed in later Part of this Report[,]” no such discussion appears in the current draft.
Does Fair Use Apply?
The Report identifies fair use as the primary defense available against claims of copyright infringement involving generative AI. Accordingly, the Report assesses whether each of the four fair use factors favors the application of the doctrine to generative AI technologies.
The first fair use factor, which considers the purpose and character of the infringing use, is discussed at some length. The Report expresses the view that generative AI training will often be transformative, but cautions that “transformativeness is a matter of degree, and how transformative or justified a use is will depend on the functionality of the model and how it is deployed.” Where an AI model generates outputs that are similar or identical to the works it was trained on, the use will not be transformative. On the other hand, the Report identifies AI models used for research purposes, or models deployed with guardrails that prevent them from producing outputs that substitute for the works in their training datasets, as paradigmatic transformative uses.
In considering the second fair use factor, the nature of the copyrighted work, the Report observes that many AI models are trained on a variety of different data types. These datasets will typically encompass both expressive and functional works, as well as published and unpublished works. The Report therefore concedes that the assessment under the second factor will depend on the model and the works used for training.
The third fair use factor considers the amount and substantiality of the portion of the copyrighted work used. The Report comments that AI model training ordinarily involves the copying of entire works, but accepts that such wholesale copying “appears to be practically necessary for some forms of training for many generative AI models.” The Report also treats how much of the copyrighted work is made available to the end user, and whether there are measures to limit the amount of the work reproduced for the end user, as relevant considerations under the third factor.
The final fair use factor assesses the effect of the use on the potential market for or value of the copyrighted work. This factor, often considered the most significant aspect of the fair use assessment, involves several related but distinct potential impacts on the market for the copyrighted work. The first of these concerns market substitution that results in lost sales to the rightsholder. Although the Report notes “competing perspectives on whether or how the outputs of generative AI can substitute for the originals,” it acknowledges that the danger of lost sales is especially present with regard to works created for the specific purpose of AI training. Relatedly, the Report suggests that the fourth factor can also involve an analysis of “harms caused where a generative AI model’s outputs, even if not substantially similar to a specific copyrighted work, compete in the market for that type of work.” Such impacts may occur where the AI model is capable of generating an output in the style of a copyrighted work. The Report then examines potential impacts on licensing opportunities for copyrighted works, noting that a nascent licensing market for AI training has recently emerged.
In weighing the factors, the Report is noncommittal, given the fact-specific nature of the fair use analysis and the varied ways in which existing works are used to train AI models. It does note, however, that “uses for purposes of non-commercial research or analysis that do not enable portions of the works to be reproduced in the outputs are likely to be fair,” while uses that involve the copying of expressive works from pirated sources, or that generate outputs capable of substituting for the copyrighted works, are unlikely to be fair.
In addition to fair use, the Report surveys international approaches that may provide avenues for AI developers to use copyrighted materials in training generative AI models. For example, under the Directive on Copyright in the Digital Single Market (DSM Directive), the European Union has implemented a limited exception to copyright for text and data mining. The exception allows copyright owners to opt out, and the recent EU AI Act expressly requires developers to respect DSM Directive opt-outs.
Can Content for AI Training Be Licensed?
The Report concludes by exploring potential strategies for enabling AI developers to license content for the training of generative AI models. First, the Report considers voluntary licensing, whereby the rightsholder and licensee freely negotiate the terms of a license between themselves. Instead of striking deals with individual rightsholders, licensees might also enter into agreements with collective rights management organizations, which administer licenses and collect and distribute fees on behalf of multiple rightsholders. The Report acknowledges several practical difficulties with voluntary licensing. The financial and operational feasibility of voluntary licensing for AI training has been called into question in light of the sheer volume and diversity of content required for training datasets. Some commentators have also predicted that voluntary licensing of training data will provide only negligible revenue to individual licensors, because license fees would need to be divided among a massive number of rightsholders. Despite these hurdles, the Report notes a growing number of reported license deals between AI developers and content owners.
As an alternative to voluntary licensing, the Report discusses the possibility of compulsory licensing for AI training data. Under compulsory licensing, remuneration is provided without a specific agreement between the rightsholder and the user, through an administrative scheme established by statute. Although compulsory licensing might provide a solution in situations where negotiating licenses in a free market would be unfeasible, the Report notes several disadvantages: compulsory licenses are generally disfavored because they derogate from rightsholders’ ability to control the distribution of their works; they are complex and costly to administer; and they are inflexible and slow to adapt to technological and market changes.
The Report also considers extended collective licensing (ECL), a hybrid of voluntary and compulsory licensing. Under ECL, a collective rights management organization is authorized to administer collective licenses for specific classes of works, but rightsholders may opt out and instead negotiate individual licenses with licensees. Last, the Report reflects on the possibility of a statutory opt-out mechanism, similar to the text and data mining exception in the DSM Directive, that would allow rightsholders to indicate that they do not consent to the use of their works for AI training.
The Report concludes that voluntary licensing should be the preferred method in areas where it is feasible. For cases where voluntary licensing is unworkable, alternative approaches like ECL may be explored.
Next Steps
Although the Report does not carry the force of law, it represents the Copyright Office’s position following years of engagement with stakeholders across the AI ecosystem. As the Report notes, a number of significant infringement actions are working their way through the courts, and the decisions in those actions will offer some clarity on the permissibility of AI training under copyright law. Concurrently, market practices around licensing data continue to evolve. Despite these developments, a case-by-case assessment will likely remain necessary, given the fact-specific nature of the infringement and fair use analyses and the highly varied methods of training and deploying AI.
While this uncertainty is likely to persist, organizations across the AI ecosystem (including developers, deployers, and users, as well as creators of content used to train AI) can take several steps. First, they can continue to monitor developments in the courts as the various cases on the application of copyright to AI progress. More proactively, organizations should implement effective AI governance programs that document how AI is being deployed within the company and the sources of AI training data. Finally, organizations should continuously evaluate the value of their data, and potential opportunities to use and license that data for AI, as the licensing market evolves.