On May 9, 2025, the U.S. Copyright Office released a pre-publication version of Part 3 in its ongoing Copyright and Artificial Intelligence study, offering a comprehensive legal and policy analysis of how generative AI systems use copyrighted content in training. While a final version is forthcoming, no substantive changes are expected.
Titled Generative AI Training, this third installment does not propose new laws or regulatory exceptions. Instead, it outlines a pragmatic, balanced framework centered on voluntary licensing, contextual fair use, and transparency, shifting the government’s posture away from litigation and toward scalable, cooperative solutions. For cybersecurity, information governance, and eDiscovery professionals, the report offers timely guidance on navigating the growing compliance and litigation risks tied to large-scale AI deployment.
A Complex Legal Question: Is AI Training Lawful?
At the heart of the report is a complex question: when AI systems ingest copyrighted works to learn language, image, or audio patterns, is that process lawful? The Copyright Office examines this issue by breaking down the stages of AI model development—collection, training, deployment, and output—and assessing where each step might intersect with a creator’s exclusive rights under copyright law. The answer, as with many issues involving emerging technology, is “it depends.”
Applying the Fair Use Test
The Office begins its analysis by applying the four-factor fair use test to AI model training. This judicially established and statutorily codified doctrine provides the most prominent defense to copyright infringement and is now at the core of numerous lawsuits involving AI companies. The report confirms that certain training activities, such as those conducted for research, accessibility, or non-commercial analytical purposes, are more likely to qualify as fair use. These uses, the Office explains, align with traditional examples of fair use in the case law, particularly where the resulting outputs do not resemble the original works and do not compete in the same market.
When Commercial Use Increases Risk
However, the situation changes markedly when AI developers use copyrighted content to train commercial models whose outputs are designed to replicate or substitute for human-created works. In such cases, the purpose and character of the use becomes more commercial and potentially less transformative. If the output of a generative AI model closely resembles a song, article, photograph, or illustration that was included in its training dataset, the original rightsholder may have grounds for a reproduction or derivative work claim.
The Memorization Concern and Embedded Content
This potential for infringement becomes more pronounced when considering how these models “memorize” parts of their training data. Although AI companies often describe their systems as statistical tools that do not store or recall specific works, the reality is more nuanced. The report references research showing that models can and do reproduce verbatim excerpts of training data, especially when prompted in ways that echo the original material. In one example, language models were shown to generate copyrighted lyrics and news articles with little prompting. The Copyright Office argues that when such behavior occurs, it likely means that the copyrighted work is embedded in the model’s weights—a process that may itself constitute infringement.
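The kind of verbatim overlap the Office describes can be checked mechanically. The sketch below is a minimal, hypothetical illustration: it flags long word n-grams shared between a model output and a known source text. Real memorization audits work at the token level against large indexed corpora; the function names and the n-gram threshold here are assumptions for illustration only.

```python
# Minimal sketch: flag long verbatim overlaps between a model output and a
# known source text. Hypothetical illustration; production audits compare
# token n-grams against large indexed training corpora, not single strings.

def shared_ngrams(output: str, source: str, n: int = 8) -> set[str]:
    """Return word n-grams that appear verbatim in both texts."""
    def ngrams(text: str) -> set[str]:
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(output) & ngrams(source)

def looks_memorized(output: str, source: str, n: int = 8) -> bool:
    """Heuristic: any shared run of n consecutive words suggests memorization."""
    return bool(shared_ngrams(output, source, n))
```

A shared run of eight or more consecutive words is the sort of signal that, per the report, suggests the source work may be embedded in the model's weights rather than merely "learned from."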
Liability Remains Unsettled
This possibility opens a legal gray area around liability. Who is responsible when an AI system outputs infringing content—the developer who trained the model, the company that deployed it, or the user who prompted it? While the Office does not take a definitive position, it acknowledges that these questions are likely to be resolved by courts and possibly Congress in future deliberations. For now, it recommends caution and transparency, urging developers and deployers to document their training practices and provide mechanisms for creators to opt out.
Licensing as a Viable Alternative
Rather than advocate for immediate legislative change, the Office recommends expanding the use of voluntary licensing regimes. Licensing, it argues, offers a flexible and scalable solution to the content rights issues that AI training raises. Several sectors, including music publishing and academic content, already operate under collective licensing frameworks. These arrangements allow creators to pool rights and negotiate access through intermediaries, reducing the transaction costs of individual permissions. The Office also explores the possibility of Extended Collective Licensing (ECL), a model used in countries like Denmark and Canada, where licensing bodies can grant permissions on behalf of a broad class of rights holders, including those who do not actively participate, so long as they retain the right to opt out.
Importantly, the report does not endorse compulsory licensing, nor does it recommend creating new statutory exceptions for AI training. Instead, it emphasizes that the current Copyright Act, with its fair use doctrine and licensing mechanisms, is capable of accommodating AI development—if interpreted and applied thoughtfully. This measured approach reflects both the legal complexity of the issue and the Office’s statutory role as an advisor rather than a regulator.
A Patchwork of International Policies
The international landscape adds another layer of complication. The report surveys several national regimes and finds substantial divergence. The European Union, for example, has adopted a text and data mining exception that allows AI training on publicly available materials, subject to opt-outs. Israel has interpreted its fair use doctrine broadly to accommodate AI development, while Japan has enacted an express statutory exception permitting the use of works for information analysis, including machine learning. The United Kingdom, after initially proposing a broad exception, pulled back in response to public opposition and is now re-evaluating its stance. For global companies developing or deploying AI models, this patchwork of rules presents compliance challenges that may increase the value of universally accepted licensing standards or industry-wide codes of conduct.
Technical Design Implications
The Office also delves into the technical architecture of AI systems, especially the distinction between training and deployment. It notes that AI models are not standalone entities but parts of larger systems that often include retrieval mechanisms, safety filters, and user interfaces. One popular design, retrieval-augmented generation (RAG), allows models to fetch live data from third-party sources like search engines or proprietary databases to augment their responses. While technically efficient, this practice may also raise new concerns around unauthorized reproduction, especially if entire articles or images are pulled into outputs without attribution or licensing.
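The retrieval step is where these concerns concentrate, and it can be sketched in a few lines. The example below is a toy illustration of the RAG pattern, not any particular product's implementation: it retrieves the most relevant passage from a small in-memory corpus by word overlap and prepends it to the prompt. The corpus, function names, and scoring method are all assumptions; real systems use vector search and call an actual model API.

```python
# Toy sketch of retrieval-augmented generation (RAG): fetch the most
# relevant passage from a small in-memory corpus and prepend it to the
# user's prompt. Illustrative only; production systems use vector search
# over external sources and send the augmented prompt to a model API.

CORPUS = {
    "doc1": "The fair use doctrine weighs four statutory factors.",
    "doc2": "Retrieval systems fetch passages to ground model answers.",
}

def retrieve(query: str, corpus: dict[str, str]) -> str:
    """Score passages by word overlap with the query; return the best match."""
    q = set(query.lower().split())
    def score(text: str) -> int:
        return len(q & set(text.lower().split()))
    return max(corpus.values(), key=score)

def build_prompt(query: str, corpus: dict[str, str]) -> str:
    """Assemble the augmented prompt a RAG system would send to a model."""
    context = retrieve(query, corpus)
    return f"Context: {context}\n\nQuestion: {query}"
```

The copyright exposure the Office flags sits in the `retrieve` step: if it pulls entire copyrighted articles or images into the context verbatim, those works can surface in outputs without attribution or licensing.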
This insight is particularly relevant for eDiscovery professionals, who are increasingly called upon to investigate how AI-generated content is sourced, stored, and served. Legal teams may need to trace a model’s training data, examine how prompts were constructed, and assess whether outputs reflect unauthorized uses of copyrighted works. The same is true for cybersecurity analysts tasked with auditing third-party AI services. A model trained on pirated books or scraped websites may expose an organization to intellectual property liability, even if the organization had no direct role in the model’s development.
A Call to Action for Governance
For information governance leaders, the report is a call to action. As generative AI becomes embedded in document review, content generation, and knowledge management tools, organizations must develop internal policies to manage the sourcing, licensing, and storage of training data. This includes tracking the origins of any third-party models, clarifying usage terms, and implementing guardrails to prevent infringing outputs. It also means building opt-out and metadata preservation systems that respect the rights of original creators and align with applicable laws.
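One way to operationalize this kind of sourcing and licensing tracking is a provenance record per training document, gated on a license allowlist. The sketch below is a minimal illustration under stated assumptions: the field names, the allowlist entries, and the clearance rule are hypothetical, not drawn from the report or any standard.

```python
# Sketch of a training-data provenance record, one possible way to track
# sourcing and licensing as discussed above. Field names and the license
# allowlist are assumptions for illustration, not an established schema.

import hashlib
from dataclasses import dataclass

APPROVED_LICENSES = {"CC-BY-4.0", "licensed-commercial", "public-domain"}

@dataclass(frozen=True)
class ProvenanceRecord:
    source_url: str
    license: str
    sha256: str  # content hash, so later audits can verify integrity

def record_document(source_url: str, license: str, content: bytes) -> ProvenanceRecord:
    """Create an audit-ready record for one training document."""
    return ProvenanceRecord(source_url, license, hashlib.sha256(content).hexdigest())

def is_cleared(record: ProvenanceRecord) -> bool:
    """Gate ingestion on an approved-license allowlist."""
    return record.license in APPROVED_LICENSES
```

Storing a content hash alongside the license claim lets a later audit, or an eDiscovery request, verify exactly which version of a document entered the training pipeline.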
Cybersecurity professionals, meanwhile, must begin treating training data and model weights as sensitive assets. Just as software supply chains are monitored for vulnerabilities, AI content pipelines must be secured against unauthorized data ingestion and deployment. The report’s findings suggest that some models are trained on content obtained from pirated or unauthorized sources, including shadow libraries and torrent trackers. These practices pose legal risks that must be factored into risk assessments and vendor procurement decisions.
A Balanced, Forward-Looking Approach
Perhaps the most striking aspect of the report is what it does not do. It does not take sides in ongoing litigation, such as the high-profile lawsuits involving authors, music publishers, and AI firms. It does not call for sweeping regulatory reform, nor does it attempt to resolve every dispute. Instead, it provides a detailed map of the current terrain, acknowledging the uncertainty, identifying the pressure points, and encouraging stakeholders to proceed with transparency, good faith, and an eye toward sustainable development.
In a space often marked by hyperbole and alarmism, the Copyright Office’s approach is refreshingly pragmatic. It recognizes that AI is a transformative technology, but not one exempt from legal or ethical scrutiny. It acknowledges the legitimate concerns of creators while also validating the real-world constraints developers face when trying to build competitive and compliant systems.
For professionals operating at the intersection of technology, law, and compliance, the report offers both a caution and a compass. AI training may not yet be fully governed by settled law, but it is no longer operating in a vacuum. The rules are forming, the expectations are rising, and the consequences—legal, financial, and reputational—are increasingly real.
News Sources
- Copyright and Artificial Intelligence (U.S. Copyright Office)
- U.S. Copyright Office. “Copyright and Artificial Intelligence, Part 3: Generative AI Training.” (2025) (PDF)
- News/Media Alliance Applauds the Copyright Office’s AI Study Report on Fair Use (News/Media Alliance)
- Trump fires top US copyright official (POLITICO)
Assisted by GAI and LLM Technologies
Source: HaystackID published with permission from ComplexDiscovery OÜ