Hi everyone. Need to extract text from a pdf but it’s an image?
Enter … ocrmypdf.
My day job (not for much longer - yay!) is modifying texts for students with reduced or no vision - it’s frustrating using Okular (a viewer for pdfs and lots of other things) only to find you cannot extract text.
Install ocrmypdf - it’s a command line utility. So, for example, say you have a pdf called text.pdf (which really only contains an image of text) and it is in your Downloads folder. Open a terminal and:
cd Downloads
ocrmypdf text.pdf output.pdf
The output.pdf now has a text layer, so you can select and copy the text in Okular! Yay!
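If you have a whole folder of scanned pdfs, a small shell function along these lines might save some typing. This is only a rough sketch: ocr_folder and the ocr subfolder are names I made up for illustration, not part of ocrmypdf. The --skip-text option leaves alone any pages that already contain text, and --sidecar also writes the recognised text to a plain .txt file, which may be all you need.

```shell
# Hypothetical helper (not part of ocrmypdf): OCR every pdf in a folder.
# Results go into an "ocr" subfolder so the originals stay untouched.
ocr_folder() {
    dir="${1:-.}"                      # folder to process, default: current one
    mkdir -p "$dir/ocr"
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue        # nothing matched the *.pdf pattern
        name=$(basename "$f" .pdf)
        # --skip-text: do not OCR pages that already have a text layer
        # --sidecar:   also save the recognised text as a .txt file
        ocrmypdf --skip-text --sidecar "$dir/ocr/$name.txt" \
            "$f" "$dir/ocr/$name.pdf" || echo "failed: $f" >&2
    done
}
```

After pasting that into a terminal, something like ocr_folder ~/Downloads should leave a searchable pdf and a matching .txt for each file in ~/Downloads/ocr.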
DRM pdf? Use LIOS - Linux Intelligent OCR Software.
Just open the pdf in LIOS (it calls them files); the pages that then appear in the left pane are termed ‘images’. Recognise all images from the menu, then in the centre bottom pane just do any deleting that is necessary, select all, copy and paste into your Text Processor - job done.