OCR apps you cannot live without!

swarfendor437 · December 18, 2020, 11:36pm

Hi everyone. Need to extract text from a pdf but it's an image?
Enter ... ocrmypdf.

https://ocrmypdf.readthedocs.io/en/latest/

My day job (not for much longer - yay!) is modifying texts for students with reduced or no vision. It's frustrating to open a pdf in Okular (a viewer for pdf and lots of other formats) only to find you cannot extract the text.

Install ocrmypdf - it's a command line utility. For example, say you have a pdf called text.pdf (which really only contains an image of text) in your Downloads folder. Open a terminal and:

cd Downloads
ocrmypdf text.pdf output.pdf

output.pdf now has a text layer, so you can select and copy the text in Okular! Yay!
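
If you have a whole folder of scanned pdfs, the same command can be wrapped in a loop. This is only a rough sketch, not something from the ocrmypdf docs: the ocr_name and ocr_all helpers and the _ocr suffix are names I made up for illustration, and I am assuming ocrmypdf's --skip-text option, which leaves pages that already have text alone.

```shell
#!/bin/sh
# Sketch: OCR every pdf in the current folder, leaving the originals untouched.

# Hypothetical helper: text.pdf -> text_ocr.pdf
ocr_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: loop over the pdfs and OCR each one
ocr_all() {
    for f in *.pdf; do
        [ -e "$f" ] || continue                     # no pdfs here: the glob stays literal
        case "$f" in *_ocr.pdf) continue ;; esac    # skip results from an earlier run
        # --skip-text: do not OCR pages that already contain text
        ocrmypdf --skip-text "$f" "$(ocr_name "$f")"
    done
}

# Only run the loop if ocrmypdf is actually installed
if command -v ocrmypdf >/dev/null 2>&1; then
    ocr_all
fi
```

Run it from inside the folder holding the pdfs (cd Downloads first). Each result is written next to its original with _ocr added to the name, so nothing gets overwritten.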

DRM pdf? Use LIOS - Linux Intelligent OCR Software.

Just open the pdf in LIOS. It calls the pdfs 'files', and the output that then appears in the left pane is termed 'images'. Choose recognise all images from the menu, then in the centre bottom pane do any deleting that is necessary, select all, copy and paste into your Text Processor - job done.

swarfendor437 · March 13, 2021, 8:11pm

Well, work needed some work doing on editable pdf fields. I had a very old copy of Adobe Acrobat lying around and it would not run in 64-bit Windows 7. I was toying with purchasing Master PDF Editor - free for personal use on Linux but with limited functionality. I then searched 'Alternative to' and found rave reviews for Qoppa (PDF Studio Pro 2020). It is one of the few pdf suites that supports Linux - an absolute gem of a piece of software - and I got it on offer, reduced from $129 to $109.

StarTreker · March 13, 2021, 8:36pm

Hello SWARF!

Please do review that software once you've gone through it all and have used it for a while. I am sure there are others who need a good PDF program as well.

swarfendor437 · March 13, 2021, 8:46pm

When I have time I will take some screenshots and post to imgBB!

swarfendor437 · September 11, 2021, 10:10am

Well, following StarTreker's request for 'a guide' on pdf recovery, and as I had already started a '.pdf' thread, this one is about how to fix a corrupt .pdf file. The tool is pdftk, which stands for pdf tool kit. I thought it was Linux only, but it is also available to Windows users, although I think the Windows version has to be paid for. You should be able to find it in Synaptic Package Manager.
(Having to finish this later as I'm having problems with my root password being accepted!)

Basically, open a terminal, cd to the folder where the dodgy pdf is, then enter:

pdftk broken.pdf output fixed.pdf

broken.pdf - name of dodgy/corrupt pdf
fixed.pdf - name of the repaired copy. You could use the same name as the dodgy one with a letter or number added if you wished, but a short name like fixed.pdf saves typing a long file name and is easy to find afterwards.
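
If you do this a lot, the command can go in a small shell function that names the repaired copy for you. A rough sketch only: fixed_name and repair_pdf are names I made up, and I am assuming pdftk returns a non-zero exit status when it cannot process the file.

```shell
#!/bin/sh
# Sketch: repair a pdf with pdftk and keep the original untouched.

# Hypothetical helper: broken.pdf -> broken_fixed.pdf
fixed_name() {
    printf '%s_fixed.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: run pdftk and report what happened
repair_pdf() {
    src=$1
    out=$(fixed_name "$src")
    # Same command as above: read the dodgy pdf, write a fresh copy
    if pdftk "$src" output "$out"; then
        echo "Repaired copy written to $out"
    else
        echo "pdftk could not repair $src" >&2
        return 1
    fi
}
```

Paste the functions into your terminal (or your .bashrc), then run repair_pdf broken.pdf from the folder where the dodgy pdf is.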

Extensive coverage of what you can do with pdftk here:

https://linux.die.net/man/1/pdftk

In terms of PDF Studio Pro - rather than waste time on loads of screenshots - check out the guides at the official site here:

https://kbpdfstudio.qoppa.com/pdf-studio-user-guide/

The latest version has accessibility options too.