Manipulating PDFs

Grad school means reading through a lot of PDFs. Sometimes they’re already nicely formatted, like a journal article, but sometimes they’re scans from a book or archived material. This tutorial goes through some ways to manually or automatically fix the formatting of your PDF files.

I’m going to cover cropping, OCRing (manually and with an R script), and reading OCRed PDFs into R.

Cropping

If you scan a chapter of a book you’ll usually end up with a file that has two pages per sheet and that you cannot highlight. When taking notes, I like keeping my readings open on one side of my screen and my notes open on the other, so this two-page layout is really inconvenient, as is having to draw lines manually under text rather than highlighting.

Adobe Acrobat Pro or ABBYY FineReader, which are both installed on computers in the Stanford Political Science Lab, can handle both of these issues, but usually I’m on my laptop. After some research, I have settled on a few favorite (free) programs to manage cropping and OCRing my PDFs.

BRISS is a great free application designed to solve this two-pages-per-sheet problem. It scans through your PDF and creates a layered preview of the text so that you can see exactly what you would be cropping. You then click and drag to create zones to crop, in this case the left page and the right page.

BRISS is a Java application, so it works across platforms. Download the latest version from SourceForge, which is linked from the BRISS main page. Unzip the contents and place them in your applications folder.



To install on a PC, use the briss.exe executable file. No further steps are needed, so good thing you’re not on a Mac!

To install on a Mac, use the briss.jar Java file to run the application. The complication on a Mac is that you cannot drag briss.jar directly onto the Dock because it technically is not an executable file; rather, it’s a Java file that is run by the Java program on your computer.

To get around this, create a custom application in Automator. Automator is an amazing program that comes with your Mac and helps automate tasks by stringing together commands. Search for and open Automator on your computer, and create a New Application. You’ll see a bunch of empty grey space on the left. Add the “Get Specified Finder Items” action and give it the path to “briss.jar”, then add the “Open Finder Items” action to run it.

Save this application into the same folder where the BRISS executables live and you can drag it onto the Dock. The application will at first show the default Automator icon. To change it, search Google for a pretty icon and copy it, then right-click on the application and choose “Get Info”, click on the current icon so that it highlights, and paste your chosen icon.
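If you’re comfortable in the terminal, a small shell launcher is another way to start BRISS. This is only a sketch: the jar location `~/Applications/briss/briss.jar` is a hypothetical path chosen for illustration, and `launch_briss` is a made-up helper name, not something that ships with BRISS. It relies on the standard `java -jar` way of running a jar file.

```shell
#!/bin/sh
# Hypothetical default location; point this at wherever you unzipped BRISS.
BRISS_JAR="${BRISS_JAR:-$HOME/Applications/briss/briss.jar}"

# Start BRISS with Java after checking that both the jar and java exist.
launch_briss() {
  if [ ! -f "$BRISS_JAR" ]; then
    echo "briss.jar not found at $BRISS_JAR" >&2
    return 1
  fi
  if ! command -v java >/dev/null 2>&1; then
    echo "java is not on your PATH" >&2
    return 1
  fi
  java -jar "$BRISS_JAR"
}

# Usage: launch_briss
```

The two checks are there so that a wrong path or a missing Java install produces a readable message instead of a silent failure.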

OCRing - Manually

Optical character recognition (OCR) turns the PDF from a collection of images into highlightable text. There’s great open source code to OCR PDFs, and I will walk through the steps below, but the easiest approach is to install PDF-XChange Editor, which wraps OCR in an easy user interface.

For a PC, just download and install the free program.

Once it is installed, run it and click “Open” to select a PDF, and then click “Document” to select the “OCR Pages” option (also control+shift+c). If your goal is just to highlight the PDF, you can select the lowest accuracy setting, which will speed up OCRing significantly. Once it’s done, save the PDF and open it in Adobe Reader, which has more user-friendly commenting options.

I don’t know of any great free OCR program available for a Mac, so we’re going to take advantage of a program called Wine that allows us to run PC programs directly, without dual-booting or virtual environments. This is going to require some initial work in the terminal, but don’t worry, because we’ll also make a little application so that you can use PDF-XChange Editor just like any other program.

First, install Xcode from the App Store; it lets us compile code ourselves rather than relying on prebuilt executables. Then install Homebrew, Homebrew Cask, Java, XQuartz, and Wine from the terminal using the following code, which I found on David Baumgold’s website. These are steps 0 through 4 on his tutorial page.

Enter these lines into the terminal one by one. Each will download and install a large number of files, so just wait until the terminal returns a new line for you to type. The final step, installing Wine, may take up to an hour depending on your computer’s speed and internet connection.

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew doctor
brew tap caskroom/cask
brew cask install java xquartz
brew install wine
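After the installs finish, a quick check that the new commands are actually on your PATH can save some confusion later. The sketch below uses the shell’s built-in `command -v`; the function name `missing_tools` is a hypothetical helper, not part of Homebrew.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on the PATH.
# Prints nothing when all of them are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: missing_tools brew java wine
```

If the usage line prints nothing, all three commands were found; any name it prints is one the terminal cannot locate yet.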

Conveniently, we don’t have to worry about installing PDF-XChange Editor because they make a portable version available. Download the “Editor Portable Version” from the website, being sure not to choose the option with “No OCR”. Unzip this file and move it to your Applications folder.

At this point we could just run the program from the terminal, but you don’t want to have to type the code every time, so let’s make a script instead. In Automator, choose the “Run Shell Script” action and paste the following text, updating the Wine version number to match yours.

/usr/local/Cellar/wine/1.8.3/bin/wine ~/Applications/PDFX_Vwr_Port/PDFXCview.exe

As with BRISS, save this script into your application folder and you can drag it onto the Dock and change its icon if you want.
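Hard-coding the version number means the script breaks whenever Wine is updated. One possible workaround, sketched below, is to look up the binary inside the Cellar folder at run time. `find_wine` is a made-up helper name, and the default folder is simply the one from the path above; if yours differs, pass it as the first argument.

```shell
#!/bin/sh
# Print the path of a wine binary found under a Homebrew Cellar folder.
# With several versions installed, the last one in glob (alphabetical)
# order wins, which is not always the newest release.
find_wine() {
  cellar="${1:-/usr/local/Cellar/wine}"
  found=""
  for candidate in "$cellar"/*/bin/wine; do
    if [ -x "$candidate" ]; then
      found="$candidate"
    fi
  done
  if [ -z "$found" ]; then
    return 1
  fi
  echo "$found"
}

# Usage, with the same viewer path as above:
# "$(find_wine)" ~/Applications/PDFX_Vwr_Port/PDFXCview.exe
```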

Once it is running, click “Open” to select a PDF, and then click “Document” to select the “OCR Pages” option (also control+shift+c). If your goal is just to highlight the PDF, you can select the lowest accuracy setting, which will speed up OCRing significantly. Once it’s done, save the PDF and open it in Adobe Reader, which has more user-friendly commenting options.

OCRing - R Script

We can also avoid PDF-XChange Editor altogether and use open source OCR code. I have had difficulty getting these steps to work in Automator, but you should at least be able to execute them through the shell() (PC) or system() (Mac) function in R. To be clear, these steps are for when you have a PDF file that is only an image but you want to be able to highlight it.

First install ImageMagick and Tesseract from the terminal, after installing Homebrew using the steps above. ImageMagick offers tools for working with images, like splitting a PDF into pages, and Tesseract has the actual OCR functions.

brew install --with-libtiff --with-ghostscript imagemagick
brew install --all-languages tesseract

Then install PDFtk Server from the PDF Labs website. This toolkit offers a lot of tools for working with PDFs; we will be using its page merging function.

Then run the following code in R, which I modeled on code from Ryan Baumann.

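input <- "the path to your pdf file"

#Keep the page images in their own folder so nothing else gets OCRed
workdir <- "ocr_pages"
dir.create(workdir, showWarnings = FALSE)

#Split and convert the PDF with ImageMagick convert
system(paste0("convert -density 300 ", shQuote(input),
              " -type Grayscale -compress lzw -background white +matte -depth 32 ",
              shQuote(file.path(workdir, "page_%05d.tif"))))

setwd(workdir)
pages <- list.files(pattern = "\\.tif$")

#OCR the pages with Tesseract individually
#(arguments: image, output name without extension, then the pdf config)
for (page in pages){
  system(paste("tesseract", page, tools::file_path_sans_ext(page), "pdf"))
}

#Join your individual PDF files into a single, searchable PDF with PDFtk
system("pdftk page_*.pdf cat output merged.pdf")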

The PDFtk documentation also has the steps for outputting this OCRed content as a text file instead of a highlightable PDF.
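If you would rather stay in the terminal, the per-page OCR loop can be sketched as a shell function. Tesseract’s command line takes the image, an output name without extension, and then the `pdf` config, so each page image produces a PDF with the same base name. The helper names `run` and `ocr_pages` and the `DRY_RUN` switch are my own inventions for illustration; with `DRY_RUN=1` the commands are only printed, which is a cheap way to check them before a long OCR job.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# OCR every page_*.tif in a folder; page_00001.tif becomes page_00001.pdf.
ocr_pages() {
  dir="$1"
  for page in "$dir"/page_*.tif; do
    [ -e "$page" ] || continue   # the glob matched nothing
    run tesseract "$page" "${page%.tif}" pdf
  done
}

# Usage: ocr_pages ocr_pages_folder
# Afterwards, merge the pages from inside that folder with:
#   pdftk page_*.pdf cat output merged.pdf
```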

UPDATE

I read a post on the R-bloggers website suggesting that you can install and run tesseract and ImageMagick entirely from inside R. Here’s the link.

https://www.r-bloggers.com/tesseract-update-options-and-languages/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

This post also covers the relevant functions.

https://www.r-bloggers.com/tesseract-and-magick-high-quality-ocr-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

Reading OCRed PDFs into R

Alternatively, you could have a PDF that is already OCRed but that you want to be able to read in R, e.g. to carry out text analysis or to transform its content from a text document into a spreadsheet.

First we need a program called XPDF. On a Mac you can install this via the terminal. After following the steps above to install Homebrew, run these steps to install XPDF. Once it’s installed we’ll never actually have to interact with XPDF again; there’s an R package that will do all the work for us.

$ brew tap homebrew/x11
$ brew install xpdf
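Before moving on to R, it can help to confirm that the install worked by calling XPDF’s `pdftotext` tool once by hand. A minimal sketch, assuming `pdftotext` ended up on your PATH; `pdf_to_text` is a hypothetical helper name, and `-layout` is the pdftotext option that tries to preserve the physical layout of the page.

```shell
#!/bin/sh
# Convert a PDF to text with pdftotext, keeping the page layout.
# The text is written next to the PDF, e.g. paper.pdf -> paper.txt.
pdf_to_text() {
  pdf="$1"
  if [ ! -f "$pdf" ]; then
    echo "no such file: $pdf" >&2
    return 1
  fi
  if ! command -v pdftotext >/dev/null 2>&1; then
    echo "pdftotext is not on your PATH" >&2
    return 1
  fi
  pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
}

# Usage: pdf_to_text path/to/pdf/file.pdf
```

If this produces a readable text file, the R step has what it needs.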

Now to get everything into R, just use the readPDF() function from the tm library. The options are a little confusing so I have included the code here and commented on each of the settings.

library(tm) #the text mining package; readPDF() relies on an external program (XPDF)
pdffile <- "path/to/pdf/file.pdf"  #the path to your pdf, as a quoted string

pdf <- readPDF( #the tm function; it returns a reader function
    engine = "xpdf", #specify your engine
    control = list(text = "-layout"))( #tells it to try and keep the layout
        elem = list(uri = pdffile), #specify your pdf path
        language = "en") #english

pdf$content #returns the text as a character vector, one element per line

First we need a program called XPDF. On a PC we can install this from the internet, using this link. Once it’s installed we’ll never actually have to interact with XPDF again; there’s an R package that will do all the work for us.

Now to get everything into R, just use the readPDF() function from the tm library. The options are a little confusing, so I have included the code here and commented on each of the settings. Also notice that I am changing the working directory to the folder where I have kept the XPDF executable. I think this step would be unnecessary if I could figure out how to set the PATH variable correctly, but I did not have much luck with that and this works fine.

library(tm) #the text mining package; readPDF() relies on an external program (XPDF)
setwd("C:/Users/Vincent/SkyDrive/Documents/R/programs/xpdf") #locate the shell program

#the path to your pdf, as a quoted string; relative paths now start from the folder above
pdffile <- "path/to/pdf/file.pdf"

pdf <- readPDF( #the tm function; it returns a reader function
    engine = "xpdf", #specify your engine
    control = list(text = "-layout"))( #tells it to try and keep the layout
        elem = list(uri = pdffile), #specify your pdf
        language = "en") #english

pdf$content #returns the text as a character vector, one element per line