Working with PDFs in Python

Gaurav Kumar
Jan 26, 2024

Portable Document Format (PDF) files are ubiquitous in document exchange because they are platform independent and preserve formatting consistently. In Python, several libraries let developers manipulate PDFs, extract information from them, and create new PDF files programmatically. In this article, we’ll explore some popular Python libraries for working with PDFs.

How to Work With a PDF in Python

Working with PDFs in Python can be a valuable skill for tasks such as extracting information, manipulating content, or creating new documents. In this guide, we’ll explore the basic steps and some popular Python libraries to help you get started with PDF operations.

1. Understanding PDFs in Python:

Before diving into the code, it’s essential to have a basic understanding of how PDFs work. PDFs, or Portable Document Format files, are a standardized format for document exchange. They can contain text, images, hyperlinks, forms, and more. In Python, several libraries simplify the process of working with PDFs.

2. Installing PDF Libraries:

To begin, you need to install a PDF manipulation library. Two commonly used libraries are PyPDF2 and PyMuPDF. You can install them using the following commands:

pip install PyPDF2
pip install pymupdf
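
Before moving on, it can help to confirm that both packages actually landed in the active environment. The snippet below is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is made up for this illustration and is not part of either library.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing.

    Hypothetical helper for this article; it only wraps importlib.metadata.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the two libraries installed above.
for name in ("PyPDF2", "pymupdf"):
    print(name, installed_version(name) or "not installed")
```

If either line reports "not installed", the pip command was probably run against a different interpreter or virtual environment than the one running your script.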

3. Extracting Text from a PDF:

Using PyPDF2:

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# printing number of pages in pdf file
print(len(pdfReader.pages))

# creating a page object
pageObj = pdfReader.pages[0]

# extracting text from page
print(pageObj.extract_text())

# closing the pdf file object
pdfFileObj.close()
  • Let’s attempt to comprehend the provided code by breaking it down into smaller sections.
pdfFileObj = open('example.pdf', 'rb')
  • We open the “example.pdf” file in binary mode and assign the resulting file object to the variable pdfFileObj.
pdfReader = PyPDF2.PdfReader(pdfFileObj)
  • Here we create a PdfReader object from the PyPDF2 module, passing the PDF file object as its argument.
print(len(pdfReader.pages))
  • The pages attribute is a list-like sequence of the document’s pages, so len(pdfReader.pages) gives the page count. For the sample file used here, this line prints 3.
pageObj = pdfReader.pages[0]
  • The pages attribute supports indexing (starting at 0), so pdfReader.pages[0] returns the first page as an object of the PageObject class from the PyPDF2 module.
print(pageObj.extract_text())
  • The page object provides a method called extract_text() designed for retrieving text content from the PDF page.
pdfFileObj.close()
  • At last, we close the PDF file object.

Note: PDFs are designed for printing and human reading, and their internal complexity makes them hard for software to convert reliably into plain text. As a result, PyPDF2 may extract text inaccurately from some PDFs and may be unable to open others at all. There is often little you can do about this within PyPDF2 itself.
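
Because extraction can fail on individual pages, one defensive option is to process pages one at a time and record a failure instead of stopping. The following is a rough sketch, not an official PyPDF2 recipe: the function names extract_text_safely and read_pdf_text are invented here, and the helper only assumes an object with a pages sequence whose items offer extract_text(), as PdfReader does in the example above.

```python
def extract_text_safely(reader):
    """Return one entry per page: the extracted text, or None if extraction failed.

    `reader` is assumed to behave like PyPDF2.PdfReader (a `pages` sequence
    whose items have an `extract_text()` method).
    """
    results = []
    for page in reader.pages:
        try:
            results.append(page.extract_text())
        except Exception:
            # The exception type depends on what is wrong with the PDF,
            # so this sketch catches broadly and records the failure.
            results.append(None)
    return results


def read_pdf_text(path):
    """Open a PDF with PyPDF2 and extract text page by page (hypothetical wrapper)."""
    import PyPDF2  # imported here so the helper above has no hard dependency

    # The with-statement closes the file even if opening the PDF raises.
    with open(path, "rb") as pdf_file:
        return extract_text_safely(PyPDF2.PdfReader(pdf_file))
```

Calling read_pdf_text('example.pdf') would then give a list in which any None entry marks a page that could not be read, which is easier to act on than a single crash.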

Using PyMuPDF:

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page_number in range(doc.page_count):
        page = doc[page_number]
        text += page.get_text()
    doc.close()
    return text

pdf_path = 'example.pdf'
text = extract_text_from_pdf(pdf_path)
print(text)
  1. Importing PyMuPDF:
  • Importing PyMuPDF, whose importable module is named fitz.
import fitz  # PyMuPDF

2. Defining the Text Extraction Function:

  • Creating a function named extract_text_from_pdf that takes a PDF file path (pdf_path) as a parameter.
def extract_text_from_pdf(pdf_path):

3. Opening the PDF Document:

  • Within the function, opening the specified PDF file using fitz.open(pdf_path).
  • Creating a doc object representing the opened PDF.
doc = fitz.open(pdf_path)

4. Initializing an Empty Text String:

  • Initializing an empty string (text) to store the extracted text.
text = ""

5. Looping Through Pages for Text Extraction:

  • Using a for loop to iterate over each page in the PDF document.
  • Accessing each page using doc[page_number] and assigning it to the page variable.
  • Extracting text from the page and appending it to the text string.
for page_number in range(doc.page_count):
    page = doc[page_number]
    text += page.get_text()

6. Closing the PDF Document:

  • After extracting text from all pages, closing the PDF document using doc.close().
doc.close()

7. Returning the Extracted Text:

  • Returning the accumulated text as the result of the function.
return text

8. Providing Input PDF Path and Extracting Text:

  • Specifying the path of the PDF file (pdf_path) from which text will be extracted.
pdf_path = 'example.pdf'

9. Calling the Text Extraction Function and Printing Result:

  • Calling the extract_text_from_pdf function with the specified PDF path and storing the result in the text variable.
  • Printing the extracted text.
text = extract_text_from_pdf(pdf_path)
print(text)

In summary, this code defines a function to extract text from each page of a PDF using PyMuPDF. It opens the PDF, iterates through each page, extracts text, closes the PDF, and returns the accumulated text. The function is then called with a sample PDF file, and the extracted text is printed.
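
A slightly more compact variation is sketched below. It assumes, as I understand PyMuPDF's behavior, that a document can be used in a with-statement and iterated page by page. The names join_page_texts and extract_text_with_context_manager are made up for this example.

```python
def join_page_texts(pages, separator=""):
    """Join the text of every page in `pages`.

    `pages` is assumed to be an iterable of objects with a `get_text()`
    method, such as an open PyMuPDF document.
    """
    return separator.join(page.get_text() for page in pages)


def extract_text_with_context_manager(pdf_path):
    """Extract all text from a PDF, closing the document automatically."""
    import fitz  # PyMuPDF; imported here so join_page_texts stays dependency-free

    # The with-statement closes the document even if extraction raises.
    with fitz.open(pdf_path) as doc:
        return join_page_texts(doc)
```

With the default empty separator the result matches the loop version above; passing separator="\n" puts a line break between pages, which can make page boundaries easier to spot.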

4. Merging PDFs:

Using PyPDF2:

import PyPDF2

def merge_pdfs(pdf_list, output_path):
    pdf_merger = PyPDF2.PdfMerger()
    for pdf in pdf_list:
        pdf_merger.append(pdf)
    with open(output_path, 'wb') as merged_pdf:
        pdf_merger.write(merged_pdf)

pdf_list = ['file1.pdf', 'file2.pdf', 'file3.pdf']
output_path = 'merged.pdf'
merge_pdfs(pdf_list, output_path)
  1. Importing PyPDF2:
  • Importing the PyPDF2 module to access its functionality.
import PyPDF2

2. Defining the Merge Function:

  • Creating a function named merge_pdfs that takes a list of PDF files (pdf_list) and an output path (output_path) as parameters.
def merge_pdfs(pdf_list, output_path):

3. Initializing the PdfMerger:

  • Within the function, initializing a PdfMerger object from PyPDF2. This object will be used to merge the PDF files.
pdf_merger = PyPDF2.PdfMerger()

4. Looping Through PDFs to Append:

  • Using a for loop to iterate over each PDF file in the provided list (pdf_list).
  • Appending each PDF file to the pdf_merger object.
for pdf in pdf_list:
    pdf_merger.append(pdf)

5. Writing the Merged PDF:

  • Opening the specified output_path in write-binary mode ('wb').
  • Writing the merged content from the pdf_merger object to the newly created file.
with open(output_path, 'wb') as merged_pdf:
    pdf_merger.write(merged_pdf)

6. Providing Input PDF List and Output Path:

  • Defining a list of input PDF files (pdf_list) to be merged.
  • Specifying the desired output path for the merged PDF (output_path).
pdf_list = ['file1.pdf', 'file2.pdf', 'file3.pdf']
output_path = 'merged.pdf'

7. Calling the Merge Function:

  • Invoking the merge_pdfs function with the specified input parameters.
merge_pdfs(pdf_list, output_path)

In summary, this code defines a function to merge multiple PDF files into a single PDF. The function utilizes the PdfMerger class from PyPDF2, iterates through the provided list of PDFs, appends them to the merger object, and then writes the merged content to the specified output file. Finally, the function is called with a sample list of input PDFs and an output path.
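
In practice a merge often fails because one of the input paths is wrong. One way to handle that, sketched here under my own assumptions, is to check the inputs before touching PyPDF2 at all. The helpers split_existing and merge_existing_pdfs are hypothetical names introduced for this illustration.

```python
from pathlib import Path


def split_existing(paths):
    """Split `paths` into (found, missing) lists, preserving the input order."""
    found, missing = [], []
    for path in paths:
        (found if Path(path).is_file() else missing).append(path)
    return found, missing


def merge_existing_pdfs(pdf_list, output_path):
    """Merge PDFs with PyPDF2, refusing to start if any input file is missing."""
    import PyPDF2  # imported here so split_existing works without the library

    found, missing = split_existing(pdf_list)
    if missing:
        raise FileNotFoundError(f"Missing input files: {missing}")

    merger = PyPDF2.PdfMerger()
    try:
        for pdf in found:
            merger.append(pdf)
        with open(output_path, "wb") as merged_pdf:
            merger.write(merged_pdf)
    finally:
        # Release the file handles the merger opened for the inputs.
        merger.close()
```

Failing early keeps a half-written merged.pdf from appearing when, say, file3.pdf turns out not to exist.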

5. Creating a PDF:

Using PyPDF2:

import PyPDF2

def create_pdf(output_path, content):
    pdf_writer = PyPDF2.PdfWriter()
    # copy the first page (index 0) of the existing PDF into the writer
    pdf_writer.add_page(PyPDF2.PdfReader(content).pages[0])
    with open(output_path, 'wb') as new_pdf:
        pdf_writer.write(new_pdf)

content = 'existing_pdf.pdf' # Replace with an existing PDF file
output_path = 'new_pdf.pdf'
create_pdf(output_path, content)
  1. Defining the PDF Creation Function:
  • Creating a function named create_pdf that takes an output path (output_path) and a content source (content) as parameters.
def create_pdf(output_path, content):

2. Initializing the PdfWriter:

  • Within the function, initializing a PdfWriter object from PyPDF2. This object will be used to create the new PDF.
pdf_writer = PyPDF2.PdfWriter()

3. Adding a Page from an Existing PDF:

  • Using the add_page method of the pdf_writer object to add a page from an existing PDF specified by the content parameter. In this case, it adds the first page (indexed at 0).
pdf_writer.add_page(PyPDF2.PdfReader(content).pages[0])

4. Opening the New PDF File for Writing:

  • Opening the specified output_path in write-binary mode ('wb').
with open(output_path, 'wb') as new_pdf:

5. Writing the Content to the New PDF:

  • Writing the content from the pdf_writer object to the newly created PDF file.
    pdf_writer.write(new_pdf)

6. Providing Input Values:

  • Defining the source PDF file (content) from which a page will be extracted.
  • Specifying the desired output path for the new PDF (output_path).
content = 'existing_pdf.pdf'  # Replace with an existing PDF file
output_path = 'new_pdf.pdf'

7. Calling the PDF Creation Function:

  • Invoking the create_pdf function with the specified input parameters.
create_pdf(output_path, content)

In summary, this code defines a function to create a new PDF by extracting a page from an existing PDF. It utilizes the PdfWriter class from PyPDF2, adds a specific page from the source PDF, and then writes the content to the newly created PDF file. The function is called with a sample input PDF and an output path.
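
The same idea extends to copying several chosen pages rather than just the first. The sketch below is only an illustration: copy_pages is a made-up helper, and it assumes a reader with a pages sequence and a writer with an add_page() method, matching the PdfReader and PdfWriter objects used above.

```python
def copy_pages(reader, writer, page_numbers):
    """Copy the given zero-based pages from `reader` into `writer`.

    Every page number is validated before anything is added, so the writer
    is left untouched if any number is out of range. Returns the count copied.
    """
    total = len(reader.pages)
    for number in page_numbers:
        if not 0 <= number < total:
            raise IndexError(f"Page {number} is outside the range 0..{total - 1}")
    for number in page_numbers:
        writer.add_page(reader.pages[number])
    return len(page_numbers)


# Possible usage with PyPDF2 (file names are placeholders):
#
#   import PyPDF2
#   writer = PyPDF2.PdfWriter()
#   copy_pages(PyPDF2.PdfReader('existing_pdf.pdf'), writer, [0, 2])
#   with open('new_pdf.pdf', 'wb') as new_pdf:
#       writer.write(new_pdf)
```

Because the page numbers are given as a list, the same helper can also reorder pages, for example by passing [2, 0].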

6. Rotating PDF pages:

Using PyPDF2:

# importing the required modules
import PyPDF2

def PDFrotate(origFileName, newFileName, rotation):

    # creating a pdf File object of original pdf
    pdfFileObj = open(origFileName, 'rb')

    # creating a pdf Reader object
    pdfReader = PyPDF2.PdfReader(pdfFileObj)

    # creating a pdf writer object for new pdf
    pdfWriter = PyPDF2.PdfWriter()

    # rotating each page
    for page in range(len(pdfReader.pages)):

        # creating rotated page object
        pageObj = pdfReader.pages[page]
        pageObj.rotate(rotation)

        # adding rotated page object to pdf writer
        pdfWriter.add_page(pageObj)

    # new pdf file object
    newFile = open(newFileName, 'wb')

    # writing rotated pages to new file
    pdfWriter.write(newFile)

    # closing the original pdf file object
    pdfFileObj.close()

    # closing the new pdf file object
    newFile.close()


def main():

    # original pdf file name
    origFileName = 'example.pdf'

    # new pdf file name
    newFileName = 'rotated_example.pdf'

    # rotation angle
    rotation = 270

    # calling the PDFrotate function
    PDFrotate(origFileName, newFileName, rotation)

if __name__ == "__main__":
    # calling the main function
    main()

[Image: the first page of rotated_example.pdf after rotation]

Some important points related to the above code:

  • To begin the rotation process, we create a PDF reader object for the original PDF document.
pdfWriter = PyPDF2.PdfWriter()
  • Pages that have been rotated will be saved to a new PDF file. To accomplish this, we employ an instance of the PdfWriter class from the PyPDF2 module for PDF writing.
for page in range(len(pdfReader.pages)):
    pageObj = pdfReader.pages[page]
    pageObj.rotate(rotation)
    pdfWriter.add_page(pageObj)
  • While iterating through each page of the original PDF, we get the page object by indexing the reader’s pages attribute. We then rotate the page with the rotate() method of the page object and add the rotated page to the PDF writer with its add_page() method.
newFile = open(newFileName, 'wb')
pdfWriter.write(newFile)
pdfFileObj.close()
newFile.close()
  • To create the new PDF file, we open a file object in write-binary mode and call the PDF writer’s write() method to write the rotated pages to it. Finally, we close both the original and the new file objects.
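
One detail worth guarding against: to my understanding, PyPDF2's rotate() turns a page clockwise and only accepts multiples of 90 degrees. A small check before calling PDFrotate can catch a bad angle early. The helper below, normalize_rotation, is a hypothetical name for this sketch and not part of PyPDF2.

```python
def normalize_rotation(angle):
    """Validate a rotation angle and reduce it to one of 0, 90, 180 or 270.

    Raises ValueError for angles that are not a multiple of 90 degrees.
    Negative angles (counter-clockwise) map to the equivalent clockwise
    angle, so -90 becomes 270.
    """
    if angle % 90 != 0:
        raise ValueError(f"Rotation must be a multiple of 90 degrees, got {angle}")
    return angle % 360


# Example: a quarter turn counter-clockwise, expressed as a clockwise angle.
rotation = normalize_rotation(-90)
print(rotation)
```

The normalized value can then be passed as the rotation argument in main(), which with -90 gives the same 270 used in the example above.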

Conclusion:

Working with PDFs in Python involves selecting the right library for your task and leveraging its features to manipulate or extract information from PDF documents. Whether you’re extracting text, merging multiple files, or creating new PDFs, these examples provide a foundation for your PDF-related endeavors in Python. Explore the documentation for each library to discover more advanced functionalities and refine your PDF manipulation skills.
