Thoughts and ideas on FOSS software that needs to be built. Personally, I’ve been thinking about working on an offline marxist.org (like arch-wiki-docs)

  • Arsen6331 ☭
    2 years ago

    I had a look at the HTML with the browser’s inspect tool, and I see what you mean, but in theory that just requires some trial and error in setting up the markdown conversion filters/rules. I can’t actually find the zip archive to take a closer look, which is a much bigger problem: the lack of standardization also applies to the links, so the site isn’t a very good candidate for a crawler to collect the HTML pages.

    Edit: it seems they no longer have archives because it’s gotten massive with all the PDFs. I guess I could just grab the site with a recursive wget command instead.
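To make the "conversion filters/rules" idea concrete, here is a minimal sketch using only Python's standard-library `html.parser`. The `RULES` table, the `MarkdownConverter` class, and the `html_to_markdown` helper are hypothetical names invented for this illustration, not anything the site or an existing converter provides. The trial and error would mostly go into growing the rule table as more inconsistent markup turns up.

```python
from html.parser import HTMLParser

# Hypothetical rule table: tag name -> (text emitted on open, text emitted on close).
# Real pages would likely need many more entries (links, lists, tables, ...).
RULES = {
    "h1": ("# ", "\n\n"),
    "h2": ("## ", "\n\n"),
    "p": ("", "\n\n"),
    "em": ("*", "*"),
    "i": ("*", "*"),
    "strong": ("**", "**"),
    "b": ("**", "**"),
}


class MarkdownConverter(HTMLParser):
    """Collects markdown fragments while walking the HTML token stream."""

    def __init__(self, rules=RULES):
        super().__init__()
        self.rules = rules
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tags without a rule are dropped; their text content is still kept.
        if tag in self.rules:
            self.parts.append(self.rules[tag][0])

    def handle_endtag(self, tag):
        if tag in self.rules:
            self.parts.append(self.rules[tag][1])

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html):
    converter = MarkdownConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts).strip()


if __name__ == "__main__":
    print(html_to_markdown("<h1>Title</h1><p>Some <b>bold</b> text.</p>"))
```

Because unknown tags are simply skipped, a page with unusual markup still produces readable text, and each new quirk can be handled by adding one entry to the table.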

    • Muad'Dibber
      2 years ago

      Ah, that’s too bad, and a bit dangerous if they don’t have backups. You might be able to email the site maintainers and see if they’ll make a torrent for you.

      • Arsen6331 ☭
        2 years ago

        Recursive wget seems to be working fine. I just had to add a 500ms request delay. I’ve already got 683 MB of data and I’m only downloading HTML, so it’s definitely massive.
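For anyone wanting to reproduce something similar, here is a rough sketch of what such an invocation might look like, wrapped in Python so the arguments are easy to tweak. The exact command used above is not given in the thread, so the flag selection, the `build_wget_command` helper, the `mirror` output directory, and the example URL are all assumptions. The flags themselves (`--recursive`, `--level`, `--no-parent`, `--wait`, `--accept`, `--directory-prefix`) are standard GNU wget options.

```python
import shlex
import subprocess


def build_wget_command(start_url, delay_seconds=0.5, output_dir="mirror"):
    """Build an argv list for an HTML-only recursive download.

    All defaults are illustrative guesses, not the settings from the thread.
    """
    return [
        "wget",
        "--recursive",               # follow links found in downloaded pages
        "--level=inf",               # assumption: lift wget's default depth limit of 5
        "--no-parent",               # never ascend above the starting directory
        f"--wait={delay_seconds}",   # pause between requests (0.5 s = 500 ms)
        "--accept=html,htm",         # keep only HTML files
        f"--directory-prefix={output_dir}",
        start_url,
    ]


if __name__ == "__main__":
    command = build_wget_command("https://example.org/")  # placeholder URL
    print(shlex.join(command))
    # Uncomment to actually start the download:
    # subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if the start URL contains characters such as `&` or `?`.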

        • Muad'Dibber
          2 years ago

          Nice. Will be interested to see how big everything is after a first pass converting to markdown. Keep me posted.
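One simple way to compare sizes before and after conversion would be to total the bytes on disk per file extension. This is only a sketch: the `total_size_by_suffix` helper and the `mirror` directory name are made up for illustration, and it assumes the converted markdown files live in the same tree as the HTML.

```python
from pathlib import Path


def total_size_by_suffix(root):
    """Return {lowercased suffix: total bytes} for every file under root."""
    totals = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            suffix = path.suffix.lower()
            totals[suffix] = totals.get(suffix, 0) + path.stat().st_size
    return totals


if __name__ == "__main__":
    # "mirror" is a hypothetical directory holding the downloaded pages.
    for suffix, size in sorted(total_size_by_suffix("mirror").items()):
        print(f"{suffix or '(none)'}: {size / 1_000_000:.1f} MB")
```

Comparing the `.html` and `.md` totals would give a quick read on how much the conversion shrinks the archive.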