Thoughts and ideas on FOSS software that needs to be built. Personally, I've been thinking about working on an offline marxists.org (like arch-wiki-docs).

  • Muad'Dibber · 2 years ago

    Converting most works on marxists.org to markdown would be a worthwhile goal, but probably a ton of work, even with HTML -> markdown converters. They do provide backups of the whole site as zips, though, so you wouldn't have to scrape everything. But if all those works were in markdown, anyone could easily create front ends, viewers, epubs, etc. using those files as sources.

    We've already got some open-source comms platforms going (like this one, Matrix, Mastodon), so it'd be good to contribute to those.

      • Arsen6331 ☭ · 2 years ago

        If they provide zip files, it shouldn't be too hard to make a custom HTML to Markdown converter that goes through the archive and converts each file concurrently, creating the same directory structure but containing Markdown files instead of HTML. Markdown is an extremely simple format.
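
        A rough sketch of what that could look like, assuming Python with the third-party html2text package; the source/destination directory names are placeholders, not anything marxists.org actually ships:

        ```python
        # Walk an unpacked archive of HTML files, convert each one to Markdown
        # concurrently, and mirror the directory tree on the output side.
        from concurrent.futures import ThreadPoolExecutor
        from pathlib import Path

        import html2text  # pip install html2text

        SRC = Path("marxists-org-html")  # unpacked site archive (placeholder name)
        DST = Path("marxists-org-md")

        def convert(src_file: Path) -> None:
            dst_file = (DST / src_file.relative_to(SRC)).with_suffix(".md")
            dst_file.parent.mkdir(parents=True, exist_ok=True)
            markdown = html2text.html2text(src_file.read_text(errors="replace"))
            dst_file.write_text(markdown)

        with ThreadPoolExecutor() as pool:
            # list() forces the result iterator so worker exceptions surface here.
            list(pool.map(convert, SRC.rglob("*.htm*")))
        ```

        Swapping in a ProcessPoolExecutor would be a one-line change if the conversion turns out to be CPU-bound rather than I/O-bound.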

          • Muad'Dibber · 2 years ago

        Once you start parsing the marxists.org HTML, you’ll see what I mean. There’s no standardization at all. Half the section headers are just bolds, there’s a ton of weird styling in the HTML itself… no one thought to clean any of that up before throwing it up there.

            • Arsen6331 ☭ · 2 years ago

              I had a look at the HTML using the browser inspector, and I see what you mean, but theoretically that just requires some trial and error while setting up the Markdown conversion filters/rules. The much bigger problem is that I can't actually find the zip archive to take a closer look, and the lack of standardization also applies to links, so the site isn't a very good candidate for a crawler to collect the HTML pages.
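
              For instance, one such rule could promote paragraphs that contain nothing but bold text to real headings before conversion. A sketch assuming BeautifulSoup (h3 is an arbitrary choice of heading level):

              ```python
              # Pre-process a page so a generic HTML -> Markdown converter sees
              # proper headings instead of bold-only paragraphs.
              from bs4 import BeautifulSoup  # pip install beautifulsoup4

              def promote_bold_headers(html: str) -> str:
                  soup = BeautifulSoup(html, "html.parser")
                  for p in soup.find_all("p"):
                      # Ignore whitespace-only text nodes between tags.
                      children = [c for c in p.children
                                  if not (isinstance(c, str) and not c.strip())]
                      # A paragraph whose only child is a <b>/<strong> tag is very
                      # likely a section header in disguise.
                      if len(children) == 1 and getattr(children[0], "name", None) in ("b", "strong"):
                          heading = soup.new_tag("h3")
                          heading.string = children[0].get_text(strip=True)
                          p.replace_with(heading)
                  return str(soup)
              ```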

          Edit: it seems they no longer have archives because it’s gotten massive with all the PDFs. I guess I could just grab the site with a recursive wget command instead.

                • Muad'Dibber · 2 years ago

            Ah that’s too bad, and a bit dangerous if they don’t have backups. You might be able to email the site runners and see if they’ll make a torrent for you.

                  • Arsen6331 ☭ · 2 years ago

              Recursive wget seems to be working fine. I just had to add a 500ms request delay. I’ve already got 683 MB of data and I’m only downloading HTML, so it’s definitely massive.
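
              For reference, the invocation was probably something along these lines (the exact flags are a guess on my part, not necessarily what was run):

              ```sh
              # Mirror only the HTML pages, with a 0.5 s delay between requests
              # and without ascending above the start URL.
              wget --recursive --no-parent --wait=0.5 \
                   --accept html,htm --adjust-extension \
                   https://www.marxists.org/
              ```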

                    • Muad'Dibber · 2 years ago

                Nice. Will be interested to see how big everything is after a first pass converting to markdown. Keep me posted.