New York Times vs. OpenAI: A Legal Battle Over Copyright in the Age of AI

In a landmark case that could set a precedent for future copyright disputes involving artificial intelligence (AI), the New York Times (NYT) has sued OpenAI, the creator of ChatGPT, for allegedly using its articles without permission. This ongoing legal battle has taken a new twist as OpenAI has requested that the NYT prove the originality of its articles, asking for detailed source materials for each copyrighted work. This development adds a new layer of complexity to the case, which already involves multiple lawsuits from various rightsholders, including record labels, book authors, visual artists, and other newspapers.

The lawsuit, initially filed in December 2023, alleges that OpenAI used millions of NYT articles to train its AI models, specifically ChatGPT, without offering any compensation. OpenAI’s defense hinges on the argument that the material was publicly available and, therefore, falls under ‘fair use’. The NYT, however, contends that OpenAI’s actions constitute copyright infringement and is seeking compensation.

On July 1, OpenAI’s legal team filed a request in a New York district court, asking the judge to compel the NYT to provide documents proving the originality of its articles. This includes interview memos, reporters’ notes, records of files, and other materials cited in the articles. OpenAI maintains that it does not seek to discover the identities of confidential sources. However, the NYT argues that the request is overly broad, unprecedented, and serves no purpose other than to harass the newspaper for pursuing its lawsuit.

“OpenAI is not entitled to unbounded discovery into nearly 100 years of underlying reporters’ files, on the off chance that such a frolic might conceivably raise a doubt about the validity of The Times’s registered copyrights,” read a filing by NYT’s legal team. The newspaper’s legal representatives also emphasized that the request invades the reporters’ privilege and could have a chilling effect on journalistic practices.

The crux of OpenAI’s argument is that it needs to differentiate between ‘expressive, original, human-authored content’ and ‘non-expressive, non-original, or non-human-authored content’. OpenAI’s lawyers argue that the request is crucial to understanding what parts of the NYT’s articles are original and therefore subject to copyright protection. “Having chosen to put directly at issue how the Times created the works at issue—including the methods, time, labor, and investment—OpenAI has a right to discovery into the same,” OpenAI’s filing read.

The NYT, however, refutes this claim. “OpenAI claims that the reporters’ notes underlying the asserted works may shed light on whether The Times’s news articles are really original, expressive content—but that is not how copyright law works. The expressive nature of a work is determined by reference to the work itself,” the newspaper stated in its response.

This case is being closely watched as it could have far-reaching implications for the tech industry and content creators alike. The NYT claims to be the first major U.S. media company to sue OpenAI over copyright infringement, and the outcome of this case could set a legal precedent for how AI companies can use publicly available content to train their models.

OpenAI has faced similar lawsuits from other entities. The nonprofit Center for Investigative Reporting (CIR) recently filed a lawsuit against both OpenAI and Microsoft, accusing them of using copyrighted materials from its publications, such as Mother Jones and Reveal, to train their AI models. “OpenAI and Microsoft started vacuuming up our stories to make their product more powerful, but they never asked for permission or offered compensation,” said Monika Bauerlein, CEO of the Center for Investigative Reporting.

OpenAI has also begun signing licensing agreements with some news organizations to mitigate these legal challenges. The company has inked deals with The Associated Press, The Wall Street Journal, The Atlantic, Prisa Media, Le Monde, Financial Times, and Business Insider’s parent company Axel Springer. However, these agreements cover only a fraction of the content required for AI models to continuously improve.

Synthetic data has been proposed as a potential solution to reduce reliance on copyrighted material. This data is artificially generated rather than collected from real-world sources, and can be produced by machine learning algorithms. OpenAI’s CEO, Sam Altman, has expressed interest in this approach but also voiced concerns about its feasibility. “As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine,” Altman said at a tech conference in May 2023.

As it stands, the case between the NYT and OpenAI remains unresolved, with both parties awaiting the court’s next decision. The outcome of this case could significantly impact how AI companies approach the use of copyrighted materials and how content creators protect their intellectual property in the digital age.

Assisted by GAI and LLM Technologies

SOURCE: HaystackID
