• dRLY [none/use name]@hexbear.net
        6 months ago

        I think the only issue with a txt version would be losing whatever isn’t regular text: scans of originals, handwritten material, and images. Most of the docs will be just text, of course, but it would be easy to lose information. What is your main issue with them being in PDF?

        • SUPAVILLAIN
          6 months ago

          Can’t control-f the names we feel WOULD turn up in this doc, which means about 1200 pages now have to be trawled through entirely by hand.

            • SUPAVILLAIN
              6 months ago

              If there are, I certainly don’t know about 'em. That’s stuff I could’ve used for my textbook epubs last semester.

              • IzyaKatzmann [he/him]@hexbear.net
                6 months ago

                pdf2text and tesseract, i believe pdf2text uses tesseract. i have them installed on an apple silicon mac with homebrew (e.g. brew install tesseract or brew install pdf2text)
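
a rough sketch of what that workflow could look like for scanned pages. big assumption: this does not use pdf2text at all. it swaps in poppler's `pdftoppm` to turn pages into PNGs and then calls the `tesseract` command-line tool on each image (on a mac, `brew install poppler tesseract`). the function names and the 300 DPI default are made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf, prefix, first, last, dpi=300):
    """Build the pdftoppm call that renders pages first..last as PNG files."""
    # -r: resolution in DPI, -f / -l: first and last page, -png: output format
    return ["pdftoppm", "-r", str(dpi), "-f", str(first), "-l", str(last),
            "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    """Build the tesseract call; 'stdout' as the output base prints the text."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_pdf_pages(pdf, first, last, dpi=300):
    """Return a list with the OCR text of each page in first..last (hypothetical helper)."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix, first, last, dpi), check=True)
        # pdftoppm zero-pads the page number, so a plain sort keeps page order
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(ocr_cmd(image), check=True,
                                    capture_output=True, text=True)
            texts.append(result.stdout)
    return texts
```

something like `ocr_pdf_pages("release.pdf", 1, 15)` (file name is a placeholder) would then give back fifteen strings you can search. handwriting will probably come out badly; tesseract is mostly tuned for printed text.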

                could probably use some ai computer vision package (i haven’t checked, i remember looking around before settling on pdf2text) like opencv.

                when i used pdf2text it was with pdf slides my prof provided, they ONLY gave pdfs. something about copyright and IP. super interesting prof, great scientist, great researcher, actually a member of some cool orgs like the Linnean Society, and annoying with her lecture files.

                EDIT: if anyone wants it enough i can try to do a proof-of-concept for like ~15 random pages of a random doc and see how well it goes
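
a small sketch of how that proof-of-concept could be set up, assuming the page text has already been pulled out somehow (by OCR or otherwise) into a dict of page number to text. `sample_pages` and `find_names` are hypothetical helpers, not part of any tool mentioned here. the search ignores case and treats any run of whitespace between name parts as a match, since extracted text tends to break names across lines.

```python
import random
import re


def sample_pages(total_pages, k=15, seed=0):
    """Pick k distinct page numbers (1-based) to spot-check; seeded so it is repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_pages + 1), min(k, total_pages)))


def find_names(page_texts, names):
    """Map each name to the sorted page numbers whose text mentions it.

    page_texts: dict of page number -> extracted text for that page.
    """
    hits = {}
    for name in names:
        parts = name.split()
        if not parts:
            continue  # skip blank entries instead of matching every page
        # allow spaces, tabs, or line breaks between the parts of a name
        pattern = re.compile(r"\s+".join(re.escape(p) for p in parts), re.IGNORECASE)
        hits[name] = sorted(page for page, text in page_texts.items()
                            if pattern.search(text))
    return hits
```

usage would be roughly: `pages = sample_pages(1200)`, extract text for just those pages, then `find_names(texts, ["some name"])` and compare the hits against what you see reading the same pages by eye. exact matching will miss names the OCR garbled, so a miss on a sample page is not proof the name is absent.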