• internet-user@lemmy.ml
    1 year ago

    I listened to the interview with Apollo’s dev, and the interviewer brought up a good point (the only good point I’ve heard on the other side of this). Natural language models are becoming very popular, and lots of companies are building them. To do this, they are scraping the web, especially places like Reddit. It sounds like Reddit wants to capitalize on this by raising its API prices to these (absurdly) high levels.

    • Lvxferre@lemmy.mlOP
      1 year ago

      I mentioned this in another discussion, but even if Reddit is trying to milk the people paying for API access to feed their NLP models, this is still a bad move. Even the ones who stay behind on Reddit will be less engaged, and lack of engagement in this case means shorter, decontextualised sentences that convey almost nothing. It’s the difference between a well-thought-out reply and a “lol”.

      So the data will quickly become useless, and even the ones who might pay for it at the start will eventually say “Why bother? Reddit only adds noise to our models” and stop paying.

      And it’s a bit off-topic, but about NLP, I think that brute-forcing (just feeding it more data) is counter-productive in the long run, too. Humans actually learn language (how to use it, not just how to utter grammatically sound but meaningless sentences) with considerably less exposure.

      • internet-user@lemmy.ml
        1 year ago

        Yes, very good points. I am not an ML expert by any means, but it does seem like companies are in a bit of an arms race right now, and are just trying to grow large models without doing it properly.