The pdf thread

swarfendor437 · July 28, 2022, 7:42pm

So as not to lose sight of the thread where I posted an item about Okular over Acrobat, I felt it would be useful to list all the PDF applications and tools that work well in GNU/Linux.

Now sadly for me, speech output in Okular does not work under Devuan, but it worked out of the box with Feren OS. You normally have to add speech backends to Okular; Feren OS uses 'flite'.
You cannot use Okular to edit PDFs directly, i.e., modify them, but you can extract text, images and even tables into your favourite text processor - for me SoftMaker's TextMaker 2021 was my daily go-to whilst working from home. What you can do with Okular is add annotations such as little post-it notes (Okular calls these reviewing tools). The downside is that you cannot change the font size on a post-it note annotation. However, the inline text tool does allow you to change the font and increase its size.
When you add annotations to a PDF with the reviewing tools, you need to save it as an Okular archive or it won't keep the annotations you add. You can select an entire PDF and output it to a .txt file. Okular can view PDFs as presentations, and can even open presentation files and run as a presentation tool. I had to modify a Music department PowerPoint presentation and I was able to run it in Okular and play back the embedded music files! It can magnify up to 1,600%!

[Website: Okular - The Universal Document Viewer]

For actual editing I used to use the free version of Master PDF Editor, but as I was unsure how the paid-for version would work out I did some more delving. On the AlternativeTo website, the PDF editing suite with the most favourable views was PDF Studio Pro 2020/21 (I upgraded to 2021). This suite has all the bells and whistles that Adobe Acrobat has for a fifth of the price. Its normal price was $130 - I got a special discounted price at the time of $109, which worked out at about £79 at time of purchase/currency conversion. It was particularly useful for editing a past mock Physics paper.

[Website: PDF Studio - PDF Editor Software for Mac, Windows and Linux]

Often, some PDFs were constructed as images, which required using a command line tool, 'ocrmypdf', to add a searchable text layer. To use it you open a terminal and enter the following (don't type the square brackets; they are used just to mark the different parts of the command line):

ocrmypdf [my.pdf] [output (whatever file name you want).pdf]

In some cases, such as exam papers, the front page might be OCR-ready but the rest of the document is images, in which case you would use:

ocrmypdf --force-ocr [my.pdf] [output (whatever file name you want).pdf]

This worked incredibly well.

[Documentation: Introduction — ocrmypdf 13.6.2.dev13+g21fb6c82 documentation]
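For anyone who wants to save some typing, here is a rough sketch of how the two ocrmypdf commands above could be wrapped in a small shell function. The function names (ocr_output_name, ocr_pdf), the "-ocr" suffix and the example file names are my own inventions for illustration, not anything that ships with ocrmypdf; only the plain call and the --force-ocr flag come from the commands above.

```shell
# Hypothetical helper: derive an output name, e.g. "paper.pdf" -> "paper-ocr.pdf".
ocr_output_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical wrapper around ocrmypdf.
# Usage: ocr_pdf input.pdf [force]
ocr_pdf() {
    src=$1
    dst=$(ocr_output_name "$src")
    # Bail out early if the tool is not installed.
    if ! command -v ocrmypdf >/dev/null 2>&1; then
        echo "ocrmypdf is not installed" >&2
        return 1
    fi
    if [ "${2:-}" = "force" ]; then
        # OCR every page, even pages that already contain text.
        ocrmypdf --force-ocr "$src" "$dst"
    else
        ocrmypdf "$src" "$dst"
    fi
}

# Example calls (placeholder file names):
#   ocr_pdf scanned-paper.pdf        # writes scanned-paper-ocr.pdf
#   ocr_pdf mixed-paper.pdf force    # writes mixed-paper-ocr.pdf
```

Paste the functions into a terminal (or your ~/.bashrc) and then call ocr_pdf with the name of the file you want to process.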

To extract text in one fell swoop, you might want to consider using LIOS (Linux Intelligent OCR Solution). This can read an entire PDF and adds its own page numbers, which you can remove but which are useful for seeing where pages of text start and end. You can select all the output text in the bottom window by using Ctrl+A, then Ctrl+C to copy to the clipboard and Ctrl+V to paste into your text processor. It does not maintain any style or font format of the original, but if you need text extraction quickly it is ideal.

[Linux-Intelligent-Ocr-Solution download | SourceForge.net]

When preparing exam papers for which a Science department had used some, but not all, of the questions from previous papers, I would use 'pdfArranger' to delete the questions that were not in the department's paper, and then change the question numbers using 'PDF Studio Pro 2021'.

[https://www.linuxuprising.com/2018/12/pdfarranger-merge-split-rotate-crop-or.html]
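I did the page deleting in the pdfArranger GUI, but if you prefer the terminal, something similar can probably be done with pdftk's 'cat' operation, which keeps only the page ranges you list. This is a swapped-in alternative and not what I actually used; the keep_pages function, the DRY_RUN switch, the file names and the page ranges below are all made up for illustration.

```shell
# Hypothetical helper: keep only the listed page ranges of a PDF.
# Usage: keep_pages input.pdf output.pdf RANGE...
# Set DRY_RUN=1 to print the pdftk command instead of running it.
keep_pages() {
    src=$1
    dst=$2
    shift 2
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "pdftk $src cat $* output $dst"
        return 0
    fi
    # "cat" and "output" are literal pdftk keywords.
    pdftk "$src" cat "$@" output "$dst"
}

# Example (placeholder names): drop pages 4 and 5 by keeping 1-3 and 6 onwards.
#   keep_pages old-paper.pdf new-paper.pdf 1-3 6-end
```

Running it once with DRY_RUN=1 first lets you check the page ranges before anything is written.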

When PDFs become corrupt, use 'pdftk' (PDF toolkit). Windows users have to pay for the Windows version, which has a GUI. In GNU/Linux it is a command line tool.
Whilst working on a mock Science paper my hard drive was beginning to fail and the pdf I was working on for an upcoming Mock Exam became corrupted. I had to use several tools to restore what I had worked on. The first was to use 'pdftk':

pdftk [broken.pdf] output [fixed.pdf]

Ignore the square brackets (i.e., remove them) when using the command; the word 'output' is typed exactly as shown.

[Repair PDF Files using the PDF Toolkit (pdftk) - Techies Guide]
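If you end up repairing more than one file, the pdftk repair command can be wrapped up along these lines. It is only a sketch: fixed_name and repair_pdf are names I have made up, as is the "-fixed" suffix, and it keeps the original file untouched so you can try other tools if pdftk cannot rescue it.

```shell
# Hypothetical helper: "broken.pdf" -> "broken-fixed.pdf".
fixed_name() {
    printf '%s-fixed.pdf\n' "${1%.pdf}"
}

# Hypothetical wrapper: rewrite a damaged PDF through pdftk into a new file.
# Usage: repair_pdf broken.pdf
repair_pdf() {
    src=$1
    dst=$(fixed_name "$src")
    if ! command -v pdftk >/dev/null 2>&1; then
        echo "pdftk is not installed" >&2
        return 1
    fi
    # "output" is a literal pdftk keyword, not part of the file name.
    pdftk "$src" output "$dst" && echo "wrote $dst"
}

# Example (placeholder name):
#   repair_pdf mock-science-paper.pdf
```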

Once I had repaired the PDF, the text was fine whilst editing in PDF Studio Pro, but the many Inkscape diagrams that I had embedded into the PDF before it became corrupted still needed restoring. Solution? I used LibreOffice Draw to open the PDF and then just copied and pasted the diagrams directly into where they should be in PDF Studio Pro!

Another useful tool is OCRFeeder. Where Okular uses the 'de facto' Tesseract engine, OCRFeeder offers the user 3 additional OCR engines - Cuneiform, GOCR, and Ocrad.

Open a file (Ctrl+O)

Import page from scanner (Shift+Ctrl+I)

Export (Shift+Ctrl+E) - options: ODT (OpenDocument Text, the default), HTML, PDF, Plain text

Recognize document (Shift+Ctrl+D)

Recognize page (Shift+Ctrl+G)

[Website: Apps/OCRFeeder - GNOME Wiki!]