Thoughts and ideas on FOSS software that needs to be built. Personally, I've been thinking about working on an offline marxists.org (like arch-wiki-docs).

  • Muad'Dibber · 2 years ago

    Converting most works on marxists.org to markdown would be a worthwhile goal, but probably a ton of work, even with HTML -> markdown converters. They do provide backups of the whole site as zips, though, so you wouldn't have to scrape everything. But if all those works were in markdown, anyone could easily create front ends, viewers, epubs, etc. using those files as sources.

    We've already got some open-source comms platforms going (like this one, Matrix, Mastodon), so it'd be good to contribute to those.

      • Arsen6331 ☭ · 2 years ago

        If they provide zip files, it shouldn't be too hard to make a custom HTML to Markdown converter that goes through the archive and converts each file concurrently, creating the same directory structure but containing Markdown files instead of HTML. Markdown is an extremely simple format.
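
        A rough sketch of what that could look like, assuming Python with the third-party html2text package; the source/destination directory names are placeholders, not anything marxists.org actually ships:

        ```python
        # Walk an unpacked archive of HTML files, convert each one to Markdown
        # concurrently, and mirror the directory tree on the output side.
        from concurrent.futures import ThreadPoolExecutor
        from pathlib import Path

        import html2text  # pip install html2text

        SRC = Path("marxists-org-html")  # unpacked site archive (placeholder name)
        DST = Path("marxists-org-md")

        def convert(src_file: Path) -> None:
            dst_file = (DST / src_file.relative_to(SRC)).with_suffix(".md")
            dst_file.parent.mkdir(parents=True, exist_ok=True)
            markdown = html2text.html2text(src_file.read_text(errors="replace"))
            dst_file.write_text(markdown)

        with ThreadPoolExecutor() as pool:
            # list() forces the result iterator so worker exceptions surface here.
            list(pool.map(convert, SRC.rglob("*.htm*")))
        ```

        Swapping in a ProcessPoolExecutor would be a one-line change if the conversion turns out to be CPU-bound rather than I/O-bound.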

          • Muad'Dibber · 2 years ago

        Once you start parsing the marxists.org HTML, you’ll see what I mean. There’s no standardization at all. Half the section headers are just bolds, there’s a ton of weird styling in the HTML itself… no one thought to clean any of that up before throwing it up there.

            • Arsen6331 ☭ · 2 years ago

              I had a look at the HTML using the browser inspector, and I see what you mean, but theoretically that just requires some trial and error while setting up the Markdown conversion filters/rules. The much bigger problem is that I can't actually find the zip archive to take a closer look, and the lack of standardization also applies to links, so the site isn't a very good candidate for a crawler to collect the HTML pages.
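
              For instance, one such rule could promote paragraphs that contain nothing but bold text to real headings before conversion. A sketch assuming BeautifulSoup (h3 is an arbitrary choice of heading level):

              ```python
              # Pre-process a page so a generic HTML -> Markdown converter sees
              # proper headings instead of bold-only paragraphs.
              from bs4 import BeautifulSoup  # pip install beautifulsoup4

              def promote_bold_headers(html: str) -> str:
                  soup = BeautifulSoup(html, "html.parser")
                  for p in soup.find_all("p"):
                      # Ignore whitespace-only text nodes between tags.
                      children = [c for c in p.children
                                  if not (isinstance(c, str) and not c.strip())]
                      # A paragraph whose only child is a <b>/<strong> tag is very
                      # likely a section header in disguise.
                      if len(children) == 1 and getattr(children[0], "name", None) in ("b", "strong"):
                          heading = soup.new_tag("h3")
                          heading.string = children[0].get_text(strip=True)
                          p.replace_with(heading)
                  return str(soup)
              ```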

          Edit: it seems they no longer have archives because it’s gotten massive with all the PDFs. I guess I could just grab the site with a recursive wget command instead.

                • Muad'Dibber · 2 years ago

            Ah that’s too bad, and a bit dangerous if they don’t have backups. You might be able to email the site runners and see if they’ll make a torrent for you.

                  • Arsen6331 ☭ · 2 years ago

              Recursive wget seems to be working fine. I just had to add a 500ms request delay. I’ve already got 683 MB of data and I’m only downloading HTML, so it’s definitely massive.
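
              For reference, the invocation was probably something along these lines (the exact flags are a guess on my part, not necessarily what was run):

              ```sh
              # Mirror only the HTML pages, with a 0.5 s delay between requests
              # and without ascending above the start URL.
              wget --recursive --no-parent --wait=0.5 \
                   --accept html,htm --adjust-extension \
                   https://www.marxists.org/
              ```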

                    • Muad'Dibber · 2 years ago

                Nice. Will be interested to see how big everything is after a first pass converting to markdown. Keep me posted.