Thoughts and ideas on FOSS software that needs to be done. Personally, I’ve been thinking about working on an offline marxists.org (like arch-wiki-docs)

  • Arsen6331 ☭
    5
    2 years ago

    If they provide zip files, it shouldn’t be too hard to make a custom HTML-to-Markdown converter that goes through the archive and converts each file concurrently, creating the same directory structure but containing Markdown files instead of HTML. Markdown is an extremely simple format.
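
A minimal sketch of that idea, assuming the archive has already been unpacked into a local directory. The names `TinyMarkdown`, `html_to_markdown`, `convert_file`, and `convert_tree` are made up for illustration, and the converter only understands a handful of tags (it does not skip scripts or styles, and it ignores links, lists, and tables), so a real pass over the site would need many more rules.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from pathlib import Path


class TinyMarkdown(HTMLParser):
    """Deliberately small converter: headings, paragraphs, bold, italics."""

    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
    INLINE = {"b": "**", "strong": "**", "i": "*", "em": "*"}
    BLOCKS = {"p", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.PREFIX:
            self.out.append(self.PREFIX[tag])
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in self.BLOCKS:
            self.out.append("\n\n")  # block boundary

    def handle_data(self, data):
        # Collapse runs of whitespace so only block boundaries contain newlines.
        self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()
    blocks = (b.strip() for b in "".join(parser.out).split("\n\n"))
    return "\n\n".join(b for b in blocks if b) + "\n"


def convert_file(src_file, src_root, dst_root):
    # Same relative path under dst_root, with the extension swapped to .md.
    target = (dst_root / src_file.relative_to(src_root)).with_suffix(".md")
    target.parent.mkdir(parents=True, exist_ok=True)
    html = src_file.read_text(encoding="utf-8", errors="replace")
    target.write_text(html_to_markdown(html), encoding="utf-8")
    return target


def convert_tree(src_root, dst_root, workers=8):
    src_root, dst_root = Path(src_root), Path(dst_root)
    files = [
        p for p in src_root.rglob("*")
        if p.is_file() and p.suffix.lower() in {".htm", ".html"}
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: convert_file(f, src_root, dst_root), files))
```

Threads are used here only to keep the sketch short. Since the parsing itself is CPU-bound, a process pool may turn out to be the better fit for a large mirror.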

    • Muad'Dibber
      5
      2 years ago

      Once you start parsing the marxists.org HTML, you’ll see what I mean. There’s no standardization at all. Half the section headers are just bolds, there’s a ton of weird styling in the HTML itself… no one thought to clean any of that up before throwing it up there.
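
As an illustration of the kind of cleanup rule this would need, here is a rough sketch that promotes a paragraph containing nothing but bold text to a real heading before conversion. The function name `promote_bold_headers` and the choice of `h3` as the default level are assumptions, and a regex like this will miss many of the site's variations.

```python
import re

# A paragraph whose only content is a single <b> or <strong> element.
BOLD_ONLY_PARAGRAPH = re.compile(
    r"<p\b[^>]*>\s*<(b|strong)\b[^>]*>\s*([^<]+?)\s*</\1>\s*</p>",
    re.IGNORECASE,
)


def promote_bold_headers(html, level=3):
    """Rewrite bold-only paragraphs as <hN> headings; leave other text alone."""

    def to_heading(match):
        return f"<h{level}>{match.group(2)}</h{level}>"

    return BOLD_ONLY_PARAGRAPH.sub(to_heading, html)
```

Bold text inside a normal sentence is left untouched, because the pattern only matches when the bold element is the paragraph's entire content.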

      • Arsen6331 ☭
        3
        2 years ago

        I had a look at the HTML using inspect, and I see what you mean, but theoretically that just requires some trial and error with setting up the Markdown conversion filters/rules. I can’t actually find the zip archive to take a closer look, which is a much bigger problem: the lack of standardization also applies to links, which means the site is not a very good candidate for a crawler to collect the HTML pages.

        Edit: it seems they no longer have archives because it’s gotten massive with all the PDFs. I guess I could just grab the site with a recursive wget command instead.

        • Muad'Dibber
          4
          2 years ago

          Ah that’s too bad, and a bit dangerous if they don’t have backups. You might be able to email the site runners and see if they’ll make a torrent for you.

          • Arsen6331 ☭
            5
            2 years ago

            Recursive wget seems to be working fine. I just had to add a 500ms request delay. I’ve already got 683 MB of data and I’m only downloading HTML, so it’s definitely massive.
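
The exact command used is not given in the thread, so the following is only a guess at what such a throttled, HTML-only recursive download could look like, wrapped in Python. The helper names `build_wget_command` and `mirror` are hypothetical. The flags are standard GNU Wget options, with the assumption that the installed version accepts a fractional `--wait` value.

```python
import subprocess


def build_wget_command(start_url, dest_dir, delay_seconds=0.5):
    """Build an argument list for a throttled, HTML-only recursive download."""
    return [
        "wget",
        "--recursive",                     # follow links from the start page
        "--level=inf",                     # wget's default depth limit is 5
        "--no-parent",                     # never climb above the start URL
        "--accept=htm,html",               # keep only HTML pages, skip PDFs etc.
        f"--wait={delay_seconds}",         # pause between requests (500 ms here)
        f"--directory-prefix={dest_dir}",  # where the mirrored tree is written
        start_url,
    ]


def mirror(start_url, dest_dir, delay_seconds=0.5):
    # Requires wget on PATH; raises CalledProcessError on a non-zero exit.
    subprocess.run(build_wget_command(start_url, dest_dir, delay_seconds), check=True)
```

Calling `mirror("https://www.marxists.org/", "mirror")` would start the download. Building the argument list separately makes it easy to print and inspect the command before running anything against the live site.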

            • Muad'Dibber
              4
              2 years ago

              Nice. Will be interested to see how big everything is after a first pass converting to markdown. Keep me posted.