• @CriticalResist8MA
    8
    3 years ago

    UPDATE: the script is functional and could be used for other things as well (anything where you want to match a list of keywords against a series of files), but I’m pulling out of this project because the sheer number of files is fucking up my hard drive. The drive is a bit old now, and suddenly the whole computer was slowing down trying to access them, so I decided to save my drive and stop while I was ahead.

    However, I can send the script to anyone who’s interested and explain how to use it (pretty straightforward tho).

    I downloaded the 19-gig list and it’s currently unzipping. In parallel, I’m writing a Python script to search all the files for certain keywords. It will then put the matching files neatly into subfolders named after the keyword, so they’re easy to go through.
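
    For anyone curious, the sorting step could look roughly like the sketch below. This is not the actual script, just a minimal illustration: the function name `sort_by_keyword`, the folder arguments, and the choice to copy rather than move files are all assumptions. It also assumes every keyword is usable as a folder name.

```python
import shutil
from pathlib import Path


def sort_by_keyword(source_dir, output_dir, keywords):
    """Copy each file into one subfolder per keyword found in its text.

    Hypothetical helper: names and behavior are illustrative assumptions.
    Returns a dict mapping each matched keyword to the file names it matched.
    """
    source, output = Path(source_dir), Path(output_dir)
    matches = {}
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue
        # Read leniently: scraped pages may contain invalid bytes.
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        for keyword in keywords:
            if keyword.lower() in text:
                # One subfolder per keyword, created on first match.
                target = output / keyword
                target.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, target / path.name)
                matches.setdefault(keyword, []).append(path.name)
    return matches
```

    Copying instead of moving means a file that matches several keywords ends up in each matching subfolder, at the cost of extra disk space.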

    I noticed you can safely delete the files that weigh 6KB or less (8KB on disk), as they only contain HTML code and no actual user-generated data.
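
    A cautious way to act on that is to list the small files first and only delete them after a look. A minimal sketch, assuming "6KB" means 6 * 1024 bytes of actual file size (not size on disk); `find_small_files` and `SIZE_LIMIT` are made-up names:

```python
from pathlib import Path

SIZE_LIMIT = 6 * 1024  # bytes; assumes "6KB" means 6 * 1024 bytes


def find_small_files(folder, limit=SIZE_LIMIT):
    """Return files whose size is at or below the limit, without deleting them."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.stat().st_size <= limit
    )


# Example use (hypothetical folder name), deleting only after reviewing the list:
# for path in find_small_files("archive"):
#     path.unlink()
```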

    I’m not a great coder so bear with me lol, but I got the “steps” of what I want down, now it’s only a matter of putting it into code.

    • @CriticalResist8MA
      2
      3 years ago

      Update: the script is functional, if a bit primitive. When this comment is 2 hours old the archive will have finished unzipping (why make it a zip when it only saves 3 gigs is beyond me, but fine), and I’ll be able to run the script. Not sure exactly how long it will take, but it’s quite fast. ~15 seconds of runtime for 200 files means about 40 hours of running overall, which is not unrealistic at all. I mean, I can leave the computer running overnight lol, no problem.
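
      The estimate is simple proportional scaling. The sketch below reproduces it; the total file count is not stated anywhere above, so the roughly 1.9 million figure is only what the 40-hour number implies, not a measured value. Function names are made up for illustration.

```python
SECONDS_PER_BATCH = 15   # measured runtime quoted above
FILES_PER_BATCH = 200    # files processed in that time


def estimated_hours(total_files):
    """Scale the measured batch time linearly to a total file count."""
    return total_files * SECONDS_PER_BATCH / FILES_PER_BATCH / 3600


def files_for_hours(hours):
    """Inverse: how many files a given runtime would cover at the same rate."""
    return hours * 3600 * FILES_PER_BATCH / SECONDS_PER_BATCH


# 40 hours at 15 s per 200 files corresponds to 1,920,000 files.
print(files_for_hours(40))
print(estimated_hours(1_920_000))
```

      This assumes the per-file time stays constant, which may not hold once the disk starts struggling.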

      If anyone has interesting keywords feel free to reach out. Here is the current list: https://zerobin.net/?89346f4a2ba5bd87#wehvA6U2wSsfCRMA4Wr0fu2qtSOfqwApwyj8lS/UfGk= (WARNING that it contains clear-text racist words and other dogwhistles, for the purpose of matching them of course).

      The matches are not strict, meaning the keyword can be anywhere on the page. If someone wrote “I’m suretrumpis goingtowin” (e.g. made a typo), it will match “trump” all the same.
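
      To show the difference, here is a small sketch comparing that kind of loose substring match with a stricter whole-word match. Both function names are made up, and ignoring case is an assumption on top of what is described above.

```python
import re


def loose_match(keyword, text):
    """Substring match: the keyword may appear anywhere, even inside a word."""
    return keyword.lower() in text.lower()


def strict_match(keyword, text):
    """Whole-word match: the keyword must stand on its own between word boundaries."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


sample = "I'm suretrumpis goingtowin"
print(loose_match("trump", sample))   # True: found inside "suretrumpis"
print(strict_match("trump", sample))  # False: not a standalone word
```

      The loose version catches typos and run-together words, but it will also produce more false positives on short keywords.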