On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.
While it strikes me as perfectly plausible that the Books2 dataset contains Silverman’s book, this quote from the complaint seems obviously false.
First, even if the model never saw a single word of the book’s text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book’s Wikipedia page.
Second, it’s not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particular good at producing a summary.
We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT’s training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be equally able to summarize the rare book as it is Silverman’s book.
I chose “The Ruby of Kishmoor” at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn’t even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn’t know anything about the story and it isn’t part of its training data.
If ChatGPT’s ability to summarize Silverman’s book comes from the book itself being part of the training data, why can it not do the same for other books?
As the commentor points out, I could recreate this result using a smaller offline model and an excerpt from the Wikipedia page for the book.
You are treating publicly available information as free from copyright, which is not the case. Wikipedia content is covered by the Creative Commons Attribution-ShareAlike License 4.0. Images might be covered by different licenses. Online articles about the book are also covered by copyright unless explicitly stated otherwise.
My understanding is that the copyright applies to reproductions of the work, which this is not. If I provide a summary of a copyrighted summary of a copyrighted work, am I in violation of either copyright because I created a new derivative summary?
Not a lawyer so I can’t be sure. To my understanding a summary of a work is not a violation of copyright because the summary is transformative (serves a completely different purpose to the original work). But you probably can’t copy someone else’s summary, because now you are making a derivative that serves the same purpose as the original.
So here are the issues with LLMs in this regard:
LLMs have been shown to produce verbatim or almost-verbatim copies of their training data
LLMs can’t figure out where their output came from so they can’t tell their user whether the output closely matches any existing work, and if it does what license it is distributed under
You can argue that by its nature, an LLM is only ever producing derivative works of its training data, even if they are not the verbatim or almost-verbatim copies I already mentioned
Quoting this comment from the HN thread:
As the commentor points out, I could recreate this result using a smaller offline model and an excerpt from the Wikipedia page for the book.
You are treating publicly available information as free from copyright, which is not the case. Wikipedia content is covered by the Creative Commons Attribution-ShareAlike License 4.0. Images might be covered by different licenses. Online articles about the book are also covered by copyright unless explicitly stated otherwise.
My understanding is that the copyright applies to reproductions of the work, which this is not. If I provide a summary of a copyrighted summary of a copyrighted work, am I in violation of either copyright because I created a new derivative summary?
Not a lawyer so I can’t be sure. To my understanding a summary of a work is not a violation of copyright because the summary is transformative (serves a completely different purpose to the original work). But you probably can’t copy someone else’s summary, because now you are making a derivative that serves the same purpose as the original.
So here are the issues with LLMs in this regard:
Don’t all three of those points apply to humans?
Aren’t summaries and reviews covered under fair use? Otherwise Newspapers have been violating copyrights for hundreds of years.