markdown

glans [it/its]@hexbear.net · 7 months ago

markdown

RedWizard [he/him, comrade/them]@hexbear.net · edit-2 7 months ago

For format transportability and readability I love it. It has its limitations but I’m not designing posters over here.

I’ve been using Logseq recently and it’s really made me love Markdown more. The simplicity of being able to scrape a site using external tools into markdown and getting a knowledge network db out of it is sweet.

For fun the other day I wrote a PowerShell command that scraped the Marxists.org encyclopedia, pulled all its subpages, and ripped each term on the page into its own md file. Could have been cool but it was like 3000 files and Logseq choked on it lol.

There’s probably a better way to do that, maybe a list of terms per alphabetical letter or something as an MD file. Either way, I just like that I can scrape content into an easily readable format and get like a wiki out of it for my own personal use.
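A rough sketch of that per-letter idea, in Python rather than PowerShell and not the actual script: it assumes the scraped terms are already in a dict of term to definition text, and the function names and the "0-9" bucket for non-letter terms are made up for illustration. It writes plain Markdown headings, which may need adjusting for Logseq's outline format.

```python
from collections import defaultdict
from pathlib import Path


def group_terms_by_letter(terms):
    """Map each initial letter to an alphabetically sorted list of (term, definition)."""
    grouped = defaultdict(list)
    for term, definition in terms.items():
        first = term.strip()[:1].upper()
        # Anything that does not start with a letter goes into one shared bucket.
        key = first if first.isalpha() else "0-9"
        grouped[key].append((term, definition))
    return {
        key: sorted(entries, key=lambda entry: entry[0].lower())
        for key, entries in sorted(grouped.items())
    }


def render_letter_page(entries):
    """Render one letter's entries as a Markdown page, one heading per term."""
    return "\n".join(f"## {term}\n\n{definition}\n" for term, definition in entries)


def write_letter_pages(terms, out_dir):
    """Write one .md file per letter and return the paths that were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for letter, entries in group_terms_by_letter(terms).items():
        path = out / f"{letter}.md"
        path.write_text(render_letter_page(entries), encoding="utf-8")
        written.append(path)
    return written
```

That would turn roughly 3000 files into a few dozen, at the cost of each term no longer being its own page.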

glans [it/its]@hexbear.net · 7 months ago

> the Marxism.org encyclopedia

is that the correct website? do you maybe mean marxists not marxism?

RedWizard [he/him, comrade/them]@hexbear.net · 7 months ago

I meant this: https://www.marxists.org/encyclopedia/ my bad!

glans [it/its]@hexbear.net · 7 months ago

that’s what I thought. but maybe you wanted an encyclopedia of every marxist email list that existed in the year 2000.

what was Logseq’s role in the scraping? was the script itself in Logseq or was it converting to md?

RedWizard [he/him, comrade/them]@hexbear.net · edit-2 7 months ago

PowerShell did all the scraping using a module called PSParseHTML. I wrote my own markdown conversion logic for the paragraph elements. It dumped each term into a directory as an md file formatted for Logseq.

Logseq was going to be a means of graphing the relationships between the terms based on where they are mentioned. I never got to that point because the volume of pages was too much for Logseq and it would hard lock on loading the directory.
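For what it's worth, the "which term mentions which" part could be computed outside Logseq first. This is only a sketch in Python under assumptions: the entries are in a dict of term to text, a mention is a case-insensitive whole-word match of the term name, and `build_mention_graph` is a hypothetical name, not something from the original script.

```python
import re


def build_mention_graph(entries):
    """Return {term: sorted list of other terms mentioned in that term's text}."""
    # One compiled whole-word, case-insensitive pattern per term.
    patterns = {
        term: re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in entries
    }
    graph = {}
    for term, text in entries.items():
        graph[term] = sorted(
            other
            for other, pattern in patterns.items()
            if other != term and pattern.search(text)
        )
    return graph


entries = {
    "Commodity": "A commodity has use-value and exchange-value.",
    "Use-value": "The usefulness of a thing.",
    "Exchange-value": "The ratio in which one commodity exchanges for another.",
}
graph = build_mention_graph(entries)
# graph["Commodity"] is ["Exchange-value", "Use-value"]
```

Every text is checked against every term, so a few thousand entries means millions of searches; fine for a one-off run, slow for anything interactive. Logseq builds its graph from `[[Page name]]` links, so the edges would still need to be written back into the files in that form.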

glans [it/its]@hexbear.net · 7 months ago

> volume of pages were to much for logseq and it would hard lock on loading the directory.

maybe you could break them up in chunks somehow.

glans [it/its]@hexbear.net · 7 months ago

OK sorry if I am telling you something you already know, but the classic tool to use to scrape a website is wget. Starting from one URL it can find all linked URLs and download them, or not, based on configuration. I have always found it pretty accessible; it was one of the earliest cli tools I used. You can learn a little bit or a lot. I think for this task you could learn a little. So that would get you a perfect local mirror of the site.

GNU has the comprehensive docs of course https://www.gnu.org/software/wget/. But starting with some random tutorial might be a better choice.

To get it to markdown you’ll have to add another step, perhaps turndown or pandoc.

Other projects to check out are httrack and tangentially archivebox.

RedWizard [he/him, comrade/them]@hexbear.net · 7 months ago

It’s all good. I’ve heard of wget and used it in the past. I use PowerShell at work so I’m more accustomed to using it, and it’s an object-oriented programming language, which I have a good understanding of.

The module is based on AngleSharp, so I could do element queries for the DOM nodes I wanted, and I created some logic to move to the next paragraph under a given header until I reached the next header. Each header + paragraphs were collected then written to the HDD as markdown files.

But thanks for the tool suggestion, it’s good to know of alternatives.
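For anyone curious, the header-then-paragraphs walk described above looks roughly like this. It is a Python sketch using the standard library's `html.parser`, not the PSParseHTML version, and it assumes terms sit in `h1` to `h6` tags with their text in properly closed `<p>` tags after them; the class and function names are made up.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Collect (header text, [paragraph texts]) pairs in document order."""

    HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = []    # list of (header, paragraphs) tuples
        self._capture = None  # "header", "p", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self._capture, self._buffer = "header", []
        elif tag == "p" and self.sections:
            # Paragraphs that appear before the first header are ignored.
            self._capture, self._buffer = "p", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Join the captured text and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if tag in self.HEADER_TAGS and self._capture == "header":
            self.sections.append((text, []))
            self._capture = None
        elif tag == "p" and self._capture == "p":
            if text:
                self.sections[-1][1].append(text)
            self._capture = None


def sections_to_markdown(html):
    """Return {header: markdown text} with one entry per header found."""
    parser = SectionCollector()
    parser.feed(html)
    parser.close()
    return {
        header: f"# {header}\n\n" + "\n\n".join(paragraphs) + "\n"
        for header, paragraphs in parser.sections
    }
```

Each value in the returned dict can be written out as its own md file. Inline tags such as links are flattened to their text, and a repeated header overwrites the earlier entry.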