Why Is The New York Times Suing OpenAI?

In a lawsuit filed in federal court, the New York Times alleges that OpenAI and Microsoft's products free-ride on the substantial investment the organization has made in journalism by copying text from copyrighted work without permission.

The New York Times, in a lawsuit filed in federal court in the Southern District of New York, is suing OpenAI, the creator of the popular ChatGPT generative artificial intelligence chatbot, for copyright infringement and trademark dilution.

“For more than 170 years, The Times has given the world deeply reported, expert, independent journalism. Times journalists go where the story is, often at great risk and cost, to inform the public about important and pressing issues. They bear witness to conflict and disasters, provide accountability for the use of power, and illuminate truths that would otherwise go unseen. Their essential work is made possible through the efforts of a large and expensive organization that provides legal, security, and operational support, as well as editors who ensure their journalism meets the highest standards of accuracy and fairness.”

The Times’ complaint contends that Microsoft and OpenAI’s products are threatening the independent journalism that is crucial for the health of American democracy, by stealing millions of published works spanning over a century. Microsoft is one of OpenAI’s biggest partners and has invested at least $13 billion into the startup. Microsoft’s supercomputers power OpenAI’s research and, in turn, Microsoft integrates OpenAI products into its own offerings, such as Bing Search and Microsoft Office.

The complaint claims that OpenAI and Microsoft have copied and used millions of copyrighted articles, opinion pieces and investigations without permission or payment, and that generative artificial intelligence tools relying on large language models (LLMs) were built by copying text from the Times’ copyrighted work.

The Times’ complaint claims that OpenAI’s ChatGPT and Microsoft’s Bing Copilot seek to free-ride on the organization’s substantial investment in its journalism by using published work to build products without permission, and that OpenAI gave particular emphasis to content from The New York Times.

The complaint also shows how ChatGPT, OpenAI’s LLM-based generative AI chatbot, can generate output that recites NYT content verbatim, depriving the NYT of subscription, licensing and advertising revenue.

Microsoft’s integration of LLMs trained on NYT content, on the other hand, has boosted the company’s market capitalization by over a trillion dollars in the past year alone, and OpenAI now has a valuation north of $90 billion.

“The integration of GPT-4 into Microsoft’s Bing search engine increased the search engine’s usage and advertising revenues associated with it. Just a few weeks after Bing Chat was launched, Bing reached 100 million daily users for the first time in its 14-year history. Similarly, page visits on Bing rose 15.8% in the first approximately six weeks after Bing Chat was unveiled.”

The Times also claims that it entered into negotiations with both Microsoft and OpenAI “in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products.” These negotiations, however, have not led to a resolution, and the NYT continues to allege systematic copyright infringement.

“A business model based on mass copyright infringement”

OpenAI was created in December 2015 as a non-profit company dedicated to research in artificial intelligence, and promised that its work would be open and freely available to the public. A little over three years later, in March 2019, the company morphed into a profit-seeking enterprise.

This transition was “built in large part on the unlicensed exploitation of copyrighted works belonging to The Times and others.”

OpenAI released the first two iterations of its flagship generative artificial intelligence model, GPT1 and GPT2, as open-source projects, with details of the training set, design and hardware. Since becoming a for-profit company, OpenAI has released GPT3.5 and GPT4, the design and training of which have been kept entirely secret.

A version of ChatGPT powered by GPT3.5 is available for free. OpenAI charges $20 per month for GPT4, and has become a commercial success owing to the popularity of ChatGPT Enterprise and the ChatGPT API, which enables incorporation into bespoke applications. Over 80% of Fortune 500 companies use ChatGPT, and OpenAI makes over $80 million in subscription revenues a month.

How do generative artificial intelligence models work?

OpenAI and Microsoft’s generative AI offerings rely on a kind of computer program called a large language model, or LLM. An LLM works by predicting words that are likely to follow a given string of words based on the billions of examples that it has been trained on.
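The idea of predicting the next word from examples can be illustrated with a deliberately tiny sketch. It is not how GPT models are built: it simply counts which word most often follows another in a made-up training sentence. The function names and the sample text are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical training text, invented for this example.
corpus = "the court heard the case and the court ruled"
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

A real LLM replaces the simple counting table with a neural network and looks at far more than one preceding word, but the task is the same: given some text, guess what comes next.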

LLMs encode the information from the training corpus that they use to make predictions as numbers called parameters. There are approximately 1.76 trillion parameters in the GPT4 LLM. The process of training involves setting the values for an LLM’s parameters by storing encoded copies of the training works in memory, repeatedly passing them through the model with words redacted or masked out, and adjusting the parameters to minimize the difference between the masked-out words and the words that the model predicts to fill them in.
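The training loop described above can be sketched at toy scale. The example below assumes a model with just two parameters, one score for each candidate word that might fill a masked-out slot after the word "court", and nudges those scores step by step so that the predictions move closer to the hidden words. The data, the vocabulary and the learning rate are all invented for illustration; production models adjust billions of parameters with far more elaborate machinery.

```python
import math

# Hypothetical toy data: (context word, masked-out next word) pairs.
pairs = [("court", "ruled"), ("court", "ruled"), ("court", "heard")]
vocab = ["heard", "ruled"]

# One adjustable parameter (a score) per candidate word, for the context "court".
params = {w: 0.0 for w in vocab}

def probabilities(params):
    """Softmax: turn raw scores into probabilities that sum to 1."""
    exps = {w: math.exp(s) for w, s in params.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def loss(params):
    """Average cross-entropy between the predictions and the masked-out words."""
    probs = probabilities(params)
    return -sum(math.log(probs[target]) for _, target in pairs) / len(pairs)

initial_loss = loss(params)
learning_rate = 0.5
for _ in range(200):
    probs = probabilities(params)
    for w in vocab:
        # Gradient of the loss for each score: predicted minus observed frequency.
        observed = sum(1 for _, t in pairs if t == w) / len(pairs)
        params[w] -= learning_rate * (probs[w] - observed)

final_probs = probabilities(params)
print(round(final_probs["ruled"], 2), loss(params) < initial_loss)
```

After training, the model assigns "ruled" roughly the share of the time it actually appeared in the masked slots, which is what minimizing the difference between predictions and hidden words amounts to in this simplified setting.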

Models trained in this way also exhibit a behavior called memorization. Given the right prompt, a model will repeat large portions of the materials it was trained on, demonstrating that LLM parameters encode retrievable copies of many of their original training works.
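Memorization can also be shown in miniature. The sketch below assumes a toy model that remembers which word followed each pair of words in a single invented passage; prompted with the first two words, it reproduces the rest of the passage word for word. This lookup-table model is a stand-in for illustration only and says nothing about how OpenAI's systems store text internally.

```python
from collections import Counter, defaultdict

def train(text, order=2):
    """Record which word followed each run of `order` words in the text."""
    words = text.split()
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context][words[i + order]] += 1
    return table

def generate(table, prompt, max_words=20, order=2):
    """Extend the prompt by repeatedly appending the most likely next word."""
    out = prompt.split()
    for _ in range(max_words):
        followers = table.get(tuple(out[-order:]))
        if not followers:
            break  # the model has never seen this context, so it stops
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# Hypothetical training passage, invented for this example.
passage = ("reporters filed the story from the courthouse steps "
           "just before the evening deadline")
table = train(passage)
completion = generate(table, "reporters filed")
print(completion == passage)
```

Because each two-word context in the passage has exactly one continuation, the "right prompt" is enough to pull the whole passage back out of the model.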

The training dataset for OpenAI’s LLMs includes an internal corpus called WebText, which includes the text content of 45 million links posted by Reddit users, and a dataset called Common Crawl, which is a copy of the Internet made by a non-profit organization of the same name. In both of these datasets, the New York Times’ domain is the most highly represented proprietary source, and the third most represented source overall, behind Wikipedia and a database of US patent documents.

The complaint presents several examples of ChatGPT and Bing Copilot providing users with verbatim excerpts from articles published in the NYT, as well as examples of AI hallucinations, a phenomenon where AI chatbots produce false information and attribute it to real sources.

The New York Times’ legal claim

Microsoft and OpenAI have used “almost a century’s worth of copyrighted content, for which they have not paid The Times fair compensation. This lost market value of The Times’s copyrighted content represents a significant harm to The Times.”

The complaint alleges that “if individuals can access The Times’s highly valuable content through Defendants’ own products without having to pay for it and without having to navigate through The Times’s paywall, many will likely do so. Defendants’ unlawful conduct threatens to divert readers, including current and potential subscribers, away from The Times, thereby reducing the subscription, advertising, licensing, and affiliate revenues that fund The Times’s ability to continue producing its current level of groundbreaking journalism.”

The New York Times has not included a monetary demand in the complaint, but claims that Microsoft and OpenAI should be “held responsible for billions of dollars in statutory and actual damages related to the unlawful copying and use of the Times’ uniquely valuable works.”

Hamza Hashim serves as an Assistant Editor for The Friday Times, and is an educator. He is an alumnus of Swarthmore College.