• Zetaphor@zemmy.cc
    link
    fedilink
    English
    arrow-up
    17
    arrow-down
    3
    ·
    edit-2
    1 year ago

    Quoting this comment from the HN thread:

    On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.

    While it strikes me as perfectly plausible that the Books2 dataset contains Silverman’s book, this quote from the complaint seems obviously false.

    First, even if the model never saw a single word of the book’s text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book’s Wikipedia page.

    Second, it’s not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particular good at producing a summary.

    We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT’s training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be equally able to summarize the rare book as it is Silverman’s book.

    I chose “The Ruby of Kishmoor” at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn’t even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn’t know anything about the story and it isn’t part of its training data.

    If ChatGPT’s ability to summarize Silverman’s book comes from the book itself being part of the training data, why can it not do the same for other books?

    As the commentor points out, I could recreate this result using a smaller offline model and an excerpt from the Wikipedia page for the book.

    • patatahooligan@lemmy.world
      link
      fedilink
      English
      arrow-up
      8
      ·
      1 year ago

      You are treating publicly available information as free from copyright, which is not the case. Wikipedia content is covered by the Creative Commons Attribution-ShareAlike License 4.0. Images might be covered by different licenses. Online articles about the book are also covered by copyright unless explicitly stated otherwise.

      • Zetaphor@zemmy.cc
        link
        fedilink
        English
        arrow-up
        7
        arrow-down
        4
        ·
        edit-2
        1 year ago

        My understanding is that the copyright applies to reproductions of the work, which this is not. If I provide a summary of a copyrighted summary of a copyrighted work, am I in violation of either copyright because I created a new derivative summary?

        • patatahooligan@lemmy.world
          link
          fedilink
          English
          arrow-up
          5
          arrow-down
          1
          ·
          1 year ago

          Not a lawyer so I can’t be sure. To my understanding a summary of a work is not a violation of copyright because the summary is transformative (serves a completely different purpose to the original work). But you probably can’t copy someone else’s summary, because now you are making a derivative that serves the same purpose as the original.

          So here are the issues with LLMs in this regard:

          • LLMs have been shown to produce verbatim or almost-verbatim copies of their training data
          • LLMs can’t figure out where their output came from so they can’t tell their user whether the output closely matches any existing work, and if it does what license it is distributed under
          • You can argue that by its nature, an LLM is only ever producing derivative works of its training data, even if they are not the verbatim or almost-verbatim copies I already mentioned
        • Banzai51@midwest.social
          link
          fedilink
          English
          arrow-up
          4
          ·
          1 year ago

          Aren’t summaries and reviews covered under fair use? Otherwise Newspapers have been violating copyrights for hundreds of years.