• @CriticalResist8MA
    8
    3 years ago

    UPDATE: the script is functional and could be used for other things as well (as long as you want to match a list of keywords against a series of files), but I’m pulling out of this project because the sheer number of files is fucking up my hard drive. It’s a bit old now, and it suddenly started slowing down the whole computer just trying to access them, so I decided to save my drive and stop while I was ahead.

    However, I can send the script to anyone who’s interested and explain how to use it (pretty straightforward tho).

    I downloaded the 19-gig text-only archive and it’s currently unzipping. In parallel, I’m writing a Python script to parse all the files for certain keywords. It will then sort the matches neatly into subfolders named after the keyword for easy going through.
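
    The gist of the script is something like this (a bare-bones sketch, not my exact code; the folder paths and keyword list here are placeholders):

    ```python
    import shutil
    from pathlib import Path

    ARCHIVE_DIR = Path("parler_archive")  # placeholder: wherever you unzipped the archive
    OUT_DIR = Path("sorted")              # matches get copied into subfolders here
    KEYWORDS = ["trump", "maga"]          # placeholder: swap in the full keyword list

    # Walk every file in the archive; copy each match into a subfolder
    # named after the keyword it matched.
    for f in ARCHIVE_DIR.rglob("*"):
        if not f.is_file():
            continue
        text = f.read_text(encoding="utf-8", errors="ignore").lower()
        for kw in KEYWORDS:
            if kw in text:
                dest = OUT_DIR / kw
                dest.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, dest / f.name)
    ```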

    I noticed you can safely delete the files that are 6KB or less (8KB on disk), as they only contain HTML code and no actual user-generated data.
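
    If you want to automate that cleanup, something like this does it (the archive path is a placeholder again):

    ```python
    from pathlib import Path

    ARCHIVE_DIR = Path("parler_archive")  # placeholder path

    # Files of 6KB or less are just boilerplate HTML, no user-generated data.
    for f in ARCHIVE_DIR.rglob("*"):
        if f.is_file() and f.stat().st_size <= 6 * 1024:
            f.unlink()
    ```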

    I’m not a great coder so bear with me lol, but I’ve got the “steps” of what I want down; now it’s just a matter of putting them into code.

    • @CriticalResist8MA
      2
      3 years ago

      Update: the script is functional, if a bit primitive. By the time this comment is 2 hours old the archive will have finished unzipping (why they made it a zip when it only saves 3 gigs is beyond me, but fine), and I’ll be able to run the script. Not sure exactly how long it will take, but it’s quite fast: ~15 seconds of runtime for 200 files works out to about 40 hours of running overall, which is not unrealistic at all. I mean, I can leave the computer running overnight lol, no problem.
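
      (For the maths: 15 seconds per 200 files is 0.075 s per file, and 40 hours is 144,000 seconds, so that estimate implies roughly 1.9 million files to churn through.)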

      If anyone has interesting keywords, feel free to reach out. Here is the current list: https://zerobin.net/?89346f4a2ba5bd87#wehvA6U2wSsfCRMA4Wr0fu2qtSOfqwApwyj8lS/UfGk= (WARNING: it contains clear-text racist words and other dogwhistles, for the purpose of matching them of course).

      The matches are not strict, meaning the keyword can be anywhere on the page. If someone wrote “I’m suretrumpis goingtowin” (e.g. made a typo), it will match “trump” all the same.
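
      In code terms it’s just a plain substring test, nothing fancier:

      ```python
      text = "I'm suretrumpis goingtowin".lower()
      if "trump" in text:  # matches even inside run-together words
          print("match")
      ```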

    • @CriticalResist8MA
      4
      3 years ago

      It’s not really in a convenient format anyway. You have to pay for the bandwidth you use to download the archive, which I can understand because 32 TB is a lot of data, but you also have to download the whole thing.

      A torrent would make more sense, but again, that means they’d have to seed 32 TB of data and keep it on their computer somewhere. Either that or distribute it between selected seeders, so everyone can start seeding, say, 2 TB of data at a time.
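
      (At 2 TB per seeder, that’s 16 seeders to cover the full 32 TB.)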

      Still, I’m downloading the text-only 19-gig archive and will see what’s in there. I don’t think I’ll have the patience to go through all of it. Hope it doesn’t put me on any list lmao

      • lemmygrabber
        3
        3 years ago

        Slightly off topic but why put it in S3 buckets instead of just hosting the file on a server? What are the advantages of it?

        edit: nvm just watched a video about it

        • @CriticalResist8MA
          4
          3 years ago

          What were the reasons they gave? I would guess speed, reliability, and the overall size of the archive. My web host only gives me 100 gigs (which is plenty for a normal website).

          • lemmygrabber
            3
            3 years ago

            The biggest reason is that it’s convenient (more so than buying a VPS and self-hosting) and very cheap: something like USD 0.0080 per GB (I’m erring on the high side because I don’t remember the exact number). Part of the cost is also offset to the downloader, so it’s easier on the uploader financially. Plus there’s fault tolerance, because the data is mirrored to two other physical locations.
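
            For scale, and assuming that figure is a monthly storage rate: 32 TB is roughly 32,768 GB, so at USD 0.008 per GB the whole archive would cost somewhere around USD 260 a month to store.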

            I was curious because my employer also uses something similar (called MinIO) for hosting their datasets. But they self-host the MinIO clusters, so I’m still not sure what benefit it provides them, especially since the data on that cluster is only accessed internally.