Thoughts and ideas on FOSS software that needs to be written. Personally, I’ve been thinking about working on an offline marxists.org (like arch-wiki-docs).

  • Muad'DibberA
    5
    2 years ago

    Once you start parsing the marxists.org HTML, you’ll see what I mean. There’s no standardization at all. Half the section headers are just bold text, there’s a ton of weird styling in the HTML itself… no one thought to clean any of that up before throwing it up there.

    • Arsen6331 ☭
      3
      2 years ago

      I had a look at the HTML using Inspect Element, and I see what you mean, but in theory that just takes some trial and error setting up the Markdown conversion filters/rules. I can’t actually find the zip archive to take a closer look, which is a much bigger problem: the lack of standardization also applies to links, which means the site isn’t a very good candidate for a crawler to collect the HTML pages.

      Edit: it seems they no longer have archives because it’s gotten massive with all the PDFs. I guess I could just grab the site with a recursive wget command instead.
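
To make the "conversion filters/rules" idea concrete, here is a minimal sketch of one such rule, assuming that a paragraph whose only text is bold is meant as a section header. The class name `BoldHeadingConverter`, the helper `html_to_markdown`, and the choice of `##` as the heading level are all made up for illustration; the real pages would almost certainly need more rules than this, found by trial and error.

```python
from html.parser import HTMLParser


class BoldHeadingConverter(HTMLParser):
    """Toy HTML-to-Markdown pass (hypothetical rule set).

    A paragraph whose only text sits inside <b>/<strong> becomes a
    "## " heading; every other paragraph becomes plain text. Inline
    styling attributes are ignored, which drops them from the output.
    """

    def __init__(self):
        super().__init__()
        self.lines = []          # finished Markdown blocks
        self._reset()
        self.bold_depth = 0      # how many <b>/<strong> tags are open

    def _reset(self):
        self.buf = []            # text pieces of the current paragraph
        self.bold_seen = False   # saw non-blank text inside bold
        self.plain_seen = False  # saw non-blank text outside bold

    def _flush(self):
        # Collapse whitespace, then emit the paragraph as heading or text.
        text = " ".join("".join(self.buf).split())
        if text:
            is_heading = self.bold_seen and not self.plain_seen
            self.lines.append("## " + text if is_heading else text)
        self._reset()

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()        # also copes with a missing </p>
        elif tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong"):
            self.bold_depth = max(0, self.bold_depth - 1)
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        self.buf.append(data)
        if data.strip():
            if self.bold_depth:
                self.bold_seen = True
            else:
                self.plain_seen = True


def html_to_markdown(html):
    """Convert an HTML fragment to Markdown using the rule above."""
    conv = BoldHeadingConverter()
    conv.feed(html)
    conv.close()
    conv._flush()                # emit any trailing unclosed paragraph
    return "\n\n".join(conv.lines)
```

Note that bold text inside an otherwise normal paragraph loses its emphasis here; a fuller rule set would presumably keep it as `**bold**`.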

      • Muad'DibberA
        4
        2 years ago

        Ah that’s too bad, and a bit dangerous if they don’t have backups. You might be able to email the site runners and see if they’ll make a torrent for you.

        • Arsen6331 ☭
          5
          2 years ago

          Recursive wget seems to be working fine. I just had to add a 500ms request delay. I’ve already got 683 MB of data and I’m only downloading HTML, so it’s definitely massive.
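
For anyone who would rather script this than use wget, the same idea (follow same-site links, HTML only, pause 500 ms between requests) could look roughly like the sketch below. It is not what was actually run here; `LinkCollector`, `extract_links`, `crawl`, and the `max_pages` cap are hypothetical names and choices, and it uses only the Python standard library.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free, same-host links in page order."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL, drop "#anchor".
        url, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(url).netloc == host and url not in links:
            links.append(url)
    return links


def crawl(start_url, delay=0.5, max_pages=100):
    """Breadth-first crawl of one host, sleeping `delay` s per request."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}                       # url -> decoded HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url) as resp:
                is_html = resp.headers.get_content_type() == "text/html"
                body = resp.read() if is_html else b""
        except OSError:              # URLError/HTTPError are OSError subclasses
            is_html, body = False, b""
        time.sleep(delay)            # be polite: pause after every request
        if not is_html:
            continue                 # skip PDFs, images, failed requests
        html = body.decode("utf-8", errors="replace")
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("https://www.marxists.org/", delay=0.5)` would start from the front page; the inconsistent links mentioned earlier in the thread are the part most likely to need extra handling in `extract_links`.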

          • Muad'DibberA
            4
            2 years ago

            Nice. Will be interested to see how big everything is after a first pass converting to markdown. Keep me posted.