• Muad'DibberA · 10 points · 2 years ago

    Thank you for your service. We’ll have to properly resurrect some of the better posts here now.

    • Lenins2ndCat · 3 points · 2 years ago

      It can’t just be a few posts resurrected. The entire thing was incredibly valuable and it needs to be restored in a browsable and searchable format, posts and comments. A lot of people used the search function on it all the time for things they needed or things they wanted to learn.

    • @DongFangHongOPM · 7 points · edited · 2 years ago

      Yeah, I wrote up a quick Python script to call the Pushshift API, which gets a list of post IDs from the subreddit. Then, for each post, you can use the Reddit JSON API to get a JSON document with all of the information in the submission. Finally, I insert that JSON into a database. Here’s my code if you’re interested:

      import datetime
      import pymongo
      import requests
      import time
      
      # Pushshift query parameters for the subreddit being archived
      subreddit = 'GenZedong'
      sort = 'asc'
      sort_type = 'created_utc'
      size = 100  # posts per Pushshift request
      
      # Local MongoDB instance that stores one document per archived post
      client = pymongo.MongoClient('mongodb://localhost:27017/')
      db = client.subredditArchiveDB
      collection = db[subreddit]
      
      def main():
          query_params = {
              'subreddit': subreddit,
              'sort': sort,
              'sort_type': sort_type,
              'size': size,
              'after': 1646299223, # use this to start the search after a specific timestamp
          }
      
          while True:
              r = requests.get('https://api.pushshift.io/reddit/search/submission/', params=query_params)
              r.raise_for_status()
      
              j = r.json()
              posts = j['data']
              if not posts:
                  # No more posts from Pushshift, so the archive is complete
                  break
      
              for post in posts:
                  post_id = post['id']
                  created_utc = post['created_utc']
                  timestamp = datetime.datetime.utcfromtimestamp(created_utc)
                  timestamp_str = timestamp.strftime("%Y-%m-%d %H:%M")
      
                  # Fetch the full submission (post + comments) from Reddit's json API
                  reddit_r = requests.get(f'https://www.reddit.com/comments/{post_id}/.json', headers={'User-Agent': 'Subreddit archiver', 'Cookie': 'Paste your Reddit browser cookie here (needed to access quarantined subreddit)' })
                  reddit_r.raise_for_status()
                  reddit_json = reddit_r.json()
      
                  post_archive = {
                      'id': post_id,
                      'timestamp': timestamp,
                      'json': reddit_json
                  }
      
                  collection.insert_one(post_archive)
                  print(f'Added {post_id} from {timestamp_str} to the collection')
      
                  # Be polite to Reddit's rate limit between requests
                  time.sleep(1)
      
              # Resume the next Pushshift query after the newest post seen so far
              query_params['after'] = created_utc
      
      
      if __name__ == '__main__':
          main()
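
      Once it’s running, an easy way to sanity-check the archive is to query the same MongoDB collection. This is just a rough sketch under the same assumptions as the script above (local MongoDB, the subredditArchiveDB database, one document per post), and it relies on the first listing in Reddit’s comments response holding the submission itself:

      import pymongo
      
      client = pymongo.MongoClient('mongodb://localhost:27017/')
      collection = client.subredditArchiveDB['GenZedong']
      
      # How many posts made it into the archive so far?
      print(collection.count_documents({}))
      
      # Show the five most recently archived posts with their titles
      for doc in collection.find().sort('timestamp', -1).limit(5):
          submission = doc['json'][0]['data']['children'][0]['data']
          print(doc['timestamp'], submission.get('title'))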
      
        • @DongFangHongOPM · 3 points · 2 years ago

          You can use the cookie that Reddit stores in your browser. An easy way to get it is to open the browser dev tools, switch to the Network tab, load Reddit, and then click on the request that was made to reddit.com. You should see a list of request headers, one of which is Cookie. Copy that value and paste it into the code.
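
          If you’d rather not paste the cookie straight into the script, a rough sketch of an alternative is to read it from an environment variable (the REDDIT_COOKIE name is just my choice, and abc123 is a placeholder post ID):

          import os
          import requests
          
          # Copy the Cookie header value from the dev tools Network tab and
          # export it before running, e.g.  export REDDIT_COOKIE='...'
          headers = {
              'User-Agent': 'Subreddit archiver',
              'Cookie': os.environ['REDDIT_COOKIE'],  # needed for the quarantined subreddit
          }
          
          post_id = 'abc123'  # placeholder; in the script this comes from Pushshift
          r = requests.get(f'https://www.reddit.com/comments/{post_id}/.json', headers=headers)
          r.raise_for_status()
          print(r.json()[0]['data']['children'][0]['data']['title'])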

          • red_red_revolution · 2 points · edited · 2 years ago

            I know this sounds dumb but I have no idea what you’re talking about. What tools or programs do I need to open this and explore the subreddit? Do I need to download anything? Or know code?

            • @DongFangHongOPM · 1 point · 2 years ago

              No, it’s not dumb. If your goal is just to be able to explore the content on /r/GenZhou, that would be pretty difficult to do right now. I don’t know if you’ve taken a look at the archive file, but it’s essentially just a bunch of JavaScript code that stores the data. It’s pretty much impossible to read easily as-is, even for a programmer. The next step is going to be formatting the data so that it becomes human-readable. Some folks are already starting to work on that. Hopefully we can eventually view everything that was in GenZhou, but on a Lemmy site.
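
              For anyone who does want to poke at the raw data, here’s a rough sketch of that formatting step, assuming you have one post saved as the raw Reddit JSON my script fetches (the filename post.json is just an example):

              import json
              
              # Load one archived post (the raw output of Reddit's /comments/{id}/.json)
              with open('post.json') as f:
                  listings = json.load(f)
              
              # The first listing holds the submission, the second its comments
              submission = listings[0]['data']['children'][0]['data']
              print(submission['title'])
              print(submission.get('selftext', ''))
              print('---')
              
              for child in listings[1]['data']['children']:
                  if child['kind'] != 't1':
                      continue  # skip the "load more comments" placeholders
                  comment = child['data']
                  print(f"{comment['author']}: {comment['body']}\n")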

      • @holdengreen · 1 point · 2 years ago

        I was using another script but this looks better.