• SUPAVILLAIN
    6 months ago

    Can’t Ctrl-F for the names we feel WOULD turn up in this doc, which means about 1200 pages now have to be trawled through entirely manually.

      • SUPAVILLAIN
        6 months ago

        If there are, I certainly don’t know about 'em. Tools like that are stuff I could’ve used for my textbook epubs last semester.

        • IzyaKatzmann [he/him]@hexbear.net
          6 months ago

          pdf2text and tesseract, i believe pdf2text uses tesseract. i have them installed on an apple silicon mac with homebrew (e.g. brew install tesseract or brew install pdf2text)
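
A rough sketch of the render-then-OCR step for a single page, in case it helps. Assumptions, loudly: this shells out to Poppler's `pdftoppm` to rasterize the page (that is a substitution, not necessarily the tool named above) and to the `tesseract` CLI for recognition; with Homebrew those would come from `brew install poppler tesseract`. The function names `render_cmd`, `ocr_cmd`, and `ocr_page` are made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def render_cmd(pdf, page, out_prefix, dpi=300):
    """Build a pdftoppm command that renders one page to <out_prefix>.png."""
    return [
        "pdftoppm", "-r", str(dpi), "-png", "-singlefile",
        "-f", str(page), "-l", str(page),
        str(pdf), str(out_prefix),
    ]


def ocr_cmd(image, lang="eng"):
    """Build a tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_page(pdf, page, dpi=300, lang="eng"):
    """Rasterize one PDF page and return the OCR'd text (needs both tools on PATH)."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_cmd(pdf, page, prefix, dpi), check=True)
        result = subprocess.run(
            ocr_cmd(prefix.with_suffix(".png"), lang),
            check=True, capture_output=True, text=True,
        )
        return result.stdout
```

Something like `ocr_page("scan.pdf", 42)` would then give the text of page 42 (the file name is hypothetical). 300 dpi is only a common starting point for scans; quality will depend heavily on the source.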

          could probably also use some computer vision package like opencv (i haven’t checked; i remember looking around before settling on pdf2text).

          when i used pdf2text it was on pdf slides my prof provided; she ONLY gave out pdfs, something about copyright and IP. super interesting prof, great scientist, great researcher, actually a member of some cool orgs like the Linnean Society, and annoying with her lecture files.

          EDIT: if anyone wants it enough i can try to do a proof-of-concept for like ~15 random pages of a random doc and see how well it goes
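
For the proof-of-concept idea, here is a minimal sketch of the two non-OCR pieces: picking ~15 random pages, and doing the "Ctrl-F" over whatever text the OCR produced. It assumes you already have a dict mapping page number to OCR text from some other step; `sample_pages`, `normalize`, and `find_names` are hypothetical helper names, and the matching is plain case-insensitive substring search, so OCR misreads of a name would still be missed.

```python
import random
import re


def sample_pages(total_pages, k=15, seed=0):
    """Pick up to k distinct 1-based page numbers, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_pages + 1), min(k, total_pages)))


def normalize(text):
    """Collapse whitespace (including line breaks) and ignore case."""
    return re.sub(r"\s+", " ", text).casefold()


def find_names(page_texts, names):
    """Return {name: [pages where it appears]} for OCR'd page text."""
    hits = {name: [] for name in names}
    for page, text in sorted(page_texts.items()):
        flat = normalize(text)
        for name in names:
            if normalize(name) in flat:
                hits[name].append(page)
    return hits
```

Collapsing whitespace matters because OCR output often breaks a name across two lines, which an exact search would not catch.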