The whole article is quite funny, especially the lists of most used tankie words, or the branding of foreignpolicy as a left-wing news source.

    • redtea
      English · 23 upvotes · edited · 1 year ago

      USSR, soviet, the, art, of

      Nvivo is a comrade. Nice.

      Edit, #10 is:

      capitalism, yes, why, capitalist, yep

      And tbf, this does about sum up most of our conversations.

      • 133arc585@lemmy.ml
        English · 28 upvotes, 1 downvote · 1 year ago

        As someone who did some natural language processing research in undergrad, they obviously have no idea what they’re doing. To get meaningful data you need[1] to remove words such as “the”, “is”, “it”, etc. And that’s not the only normalization you need to do.

        What’s offensive for something claiming to be an academic paper is their lack of explanation of their data processing techniques. Meaningful conclusions can only be made if your data is reasonable. And to make sure you have meaningful data, especially when the source is extremely noisy human-generated online comments, you need to do several things to process your data before you can feed it into an analysis. The goal of publishing academic research is not only to publish a result, but to publish methodology to enable independent reproducibility: if you have the paper, and the data, you should be able to follow the methods and come to the same conclusions; if you can’t, the paper’s bad. Yes, these details are boring, and a lot of people will put them in an appendix instead of in the main body of the paper, but if you’re being honest you do provide these details.

        They also don’t even pretend to be objective; the paper reads more like a speculative opinion piece on sociology than it does a “data-driven” paper. Their assumptions drive their analysis and thus their conclusions. Moreover, when they attempt to make the distinction between TOXICITY and SEVERE_TOXICITY, they are not making these objective categories: the definitions they give are pure air and the distinction between the two categories is purely subjective.

        It’s honestly an embarrassment; I wouldn’t want my name on a paper of such poor quality. I wouldn’t want my university to be named on it either (nor do I think the university would want to be named on such a paper).

        Either these are genuinely ignorant undergrads who don’t realize that they’re producing wildly questionable and meaningless “research”, or they’re dishonest grifters taking federal taxpayer money[2] and producing garbage.

        Being published on arXiv is not automatically a bad thing, but it makes me wonder if they were rejected from peer-reviewed journals. They can’t argue that they didn’t want to, or couldn’t afford to, pay to submit to a “real” journal, since they are receiving outside funding.


        1. Stopwords aren’t totally useless: they can matter at early stages in the pipeline, depending on what you’re doing. For example, being grammatical terms, they can help get a proper parse tree. But this type of analysis, sentiment analysis, is not using a full parse tree, and leaving stopwords in only increases noise and decreases the ability of the model to produce meaningful results.

        2. The researchers have received nearly half a million USD in federal taxpayer money through an NSF grant.
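To make the stopword point above concrete, here is a minimal sketch of a "most used words" count with and without filtering. The stopword list, the sample comments, and the helper names `tokenize` and `top_words` are all made up for illustration; this is not the paper's pipeline, which is not described.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "is", "it", "a", "an", "of", "and", "to", "in", "that"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(comments, n=5, stopwords=frozenset()):
    """Return the n most common tokens across all comments, skipping stopwords."""
    counts = Counter(
        tok
        for comment in comments
        for tok in tokenize(comment)
        if tok not in stopwords
    )
    return counts.most_common(n)

# Invented sample comments, only for demonstration.
comments = [
    "The art of the Soviet poster is underrated.",
    "Thanks, that is the best summary of the USSR I have read.",
    "Sorry, it is the capitalist framing that is the problem.",
]

raw = top_words(comments, n=1)                            # no filtering
filtered = top_words(comments, n=3, stopwords=STOPWORDS)  # stopwords removed
print(raw)
print(filtered)
```

Without filtering, “the” comes out on top simply because it appears in every sentence; once it is filtered, only content words remain. Stopword removal is just one of the normalization steps mentioned above.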

        • CriticalResist8A
          English · 22 upvotes · 1 year ago

          One of them is an associate prof, and the other is the dean of the tech and engineering department at his university 💀

          The last one is a PhD candidate, but that info may be a bit outdated

          • 133arc585@lemmy.ml
            English · 20 upvotes, 1 downvote · 1 year ago

            Oh good god. I had given them the benefit of the doubt and assumed there was no way an actual professor would be any of the names on it. I figured such poor work could only be explained by being ignorant undergrads. I genuinely would question their previous work if they are comfortable publishing this garbage.

            This is downright shameful. I’d be embarrassed to be a student of these profs, or of the department.

            Now I’m genuinely curious if they embezzled some of the NSF money, or are otherwise being paid for this? I extremely rarely take up the whole “paid shill” angle, because frankly it’s almost never the case, but how in the everloving shit would these people produce and publish such trash and not feel embarrassed?

        • WhatWouldKarlDo
          English · 13 upvotes · 1 year ago

          I didn’t actually notice that! I got as far as the word “the”, and tuned out. It’s hilarious that words like “thanks” and “sorry” rate so highly on this very toxic extremist community.

      • 🏳️‍⚧️Edward [it/its]
        English · 11 upvotes · edited · 1 year ago

        Well, yes. But also, as long as most comments have “the” at least once, “the” will be listed.

        Orange gallbladder golfcart

      • REEEEvolution
        English · 9 upvotes · 1 year ago

        As the most used word. If anything, it shows how badly the data was compiled. “the” is basically white noise and they have not even filtered it out.
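A rough way to see why “the” is white noise: measure what fraction of comments contain each word at least once (document frequency) and flag anything that shows up almost everywhere. This is only a sketch; the sample comments, the 0.8 cutoff, and the helper name `document_frequency` are assumptions chosen for illustration.

```python
import re

def document_frequency(comments):
    """Map each word to the fraction of comments that contain it at least once."""
    counts = {}
    for comment in comments:
        # A set ensures each word is counted once per comment.
        for tok in set(re.findall(r"[a-z']+", comment.lower())):
            counts[tok] = counts.get(tok, 0) + 1
    return {tok: n / len(comments) for tok, n in counts.items()}

# Invented sample comments, only for demonstration.
comments = [
    "Thanks for the link.",
    "The paper is bad.",
    "Sorry, the data is noisy.",
    "Why is the method missing?",
]

df = document_frequency(comments)

# Hypothetical cutoff: words present in more than 80% of comments carry no signal.
too_common = sorted(word for word, frac in df.items() if frac > 0.8)
print(too_common)
```

A word that appears in nearly every comment says nothing about any particular community, which is why it gets dropped before ranking "most used" words.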

    • REEEEvolution
      English · 10 upvotes · edited · 1 year ago

      “inshallah” - I laughed harder than I should have.

      “Chen” - The master gets his deserved mention, I see.

      and lots of thanking - Very toxic, yes, yes.