Python PDF 3: Writing With HTML and XML

I have discovered the potent mixture of Jinja, WeasyPrint and Pandas. Mixing these tools with matplotlib and Python image modules yields a way to write PDF documents with relative ease and with the styling help of HTML. It would also be possible to use a tool like xmltopdf to generate PDF files from XML. Previous posts dealt with this using a more complicated tool, PyPDF2.

A Basic HTML Template

In this tutorial, I am using Jinja to create tables. My tables will not have much in the way of styling, but it is also possible to add styles with Jinja or by using a tool such as Django-Tables2. Jinja's template syntax is very similar to that of Django templates, and Django-Tables2 is built for Django itself.

A template is needed in order to generate the HTML pages that are converted to PDF. Jinja marks each insertion point with double curly braces that enclose the name of the variable to be filled in.


<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="UTF-8">
<title>{{ title }}</title>
</head>
<body>
<div>
<h1>Weekly Summary Report</h1>
{{ summary_pivot_table }}
</div>

<div>
<h1>Frequency Report</h1>
{{ frequency_table }}
</div>

<div>
<h1>Weekly Source Reports</h1>
{{ source_pivot_table }}
</div>
</body>
</html>

In this case, there is a title and three reports. It would be easy to add CSS classes to the div tags and give each report a different style. The HTML and its styles will be converted by WeasyPrint later.
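As a rough sketch of the styling idea, Pandas can attach a CSS class to the table it renders, and a small stylesheet can then target that class. The frame contents, the class name report-table and the CSS rules are all made up for illustration.

```python
import pandas

# Hypothetical stand-in for one of the report frames.
frame = pandas.DataFrame({"source": ["a", "b"], "total": [10, 20]})

# to_html accepts extra CSS class names for the <table> element.
table_html = frame.to_html(classes="report-table", index=False)

# A small stylesheet that targets the class; the rules are illustrative only.
css = """
.report-table { border-collapse: collapse; }
.report-table th, .report-table td { padding: 4px 8px; }
"""

print(table_html)
```

The CSS could live in a style tag in the template head or in a separate stylesheet file handed to the converter.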

Writing to the Template

Jinja fills a template from a dictionary whose keys match the variable names used in the template.

adminVars={"title":"Weekly Statistics","frequency_table":freqFrame.to_html(),"summary_pivot_table":sframe.to_html(),"source_pivot_table":tframe.to_html()}
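A minimal sketch of the rendering step, assuming the template is held in a string rather than a file. The template text and the table fragment are placeholders, not the real report.

```python
from jinja2 import Environment

# Autoescape is off by default, so HTML fragments pass through unchanged.
env = Environment()
template = env.from_string(
    "<title>{{ title }}</title><div>{{ frequency_table }}</div>"
)

# Keys must match the names inside the double curly braces.
template_vars = {
    "title": "Weekly Statistics",
    "frequency_table": "<table><tr><td>1</td></tr></table>",
}
html = template.render(template_vars)
print(html)
```

Any key that is missing from the dictionary simply renders as an empty string under Jinja's default settings.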

Generating Data

Generating data is simple with Pandas. This is especially true with databases. One only needs to connect to a database using a SQLAlchemy engine and perform any necessary query. It is also possible to concatenate as many queries as necessary to generate a table.

import sqlalchemy
import pandas

#create alchemy engine
#cfp is a crawleraids Config object, created as in the full example further down

dsn='postgresql+psycopg2://'+cfp.getVar("db","user","string")+":"+cfp.getVar("db","passw","string")+"@"+cfp.getVar("db","host","string")+":"+cfp.getVar("db","port","string")+"/"+cfp.getVar("db","dbname","string")

engine=sqlalchemy.create_engine(dsn)
        
#get totals table
query=cfp.getVar("sql","weekly_totals_complete","string")
tframe=pandas.read_sql_query(query,engine)
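To try the query step without a PostgreSQL server, here is a small sketch that swaps in an in-memory SQLite database from the standard library. The table name, columns and rows are invented for the example.

```python
import sqlite3
import pandas

# In-memory SQLite database standing in for the PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weekly_totals (source TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO weekly_totals VALUES (?, ?)",
    [("siteA", 12), ("siteB", 30), ("siteC", 7)],
)
conn.commit()

# pandas runs the query and returns the result set as a DataFrame.
query = "SELECT source, total FROM weekly_totals ORDER BY total DESC"
tframe = pandas.read_sql_query(query, conn)
print(tframe)
conn.close()
```

The same read_sql_query call works with a SQLAlchemy engine in place of the SQLite connection.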

Concatenation is not difficult either: pass a list of frames to the concat function.

pandas.concat([pandas.read_sql_query(query,engine) for query in tables])

Columns that are missing from one of the frames will be filled with NaN values.
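A small sketch of that NaN behaviour, using two hand-built frames in place of real query results. The column names are hypothetical.

```python
import pandas

# Two hypothetical query results whose columns only partly overlap.
first = pandas.DataFrame({"source": ["siteA"], "total": [12]})
second = pandas.DataFrame({"source": ["siteB"], "errors": [3]})

# ignore_index renumbers the rows instead of repeating the index 0.
combined = pandas.concat([first, second], ignore_index=True)
print(combined)
```

Each row keeps the values it had, and the column it never had shows up as NaN.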

Performing Basic Operations on Dataframes

Performing operations on dataframes is easy with numpy or scipy.

import numpy

#operate on tframe from above
tframe.apply(numpy.average,axis=0)

Dataframes also have their own methods for common operations, and these are built on numpy.

#tframe from above
tframe.mean()

A list of operations is provided in the Pandas documentation.

More complicated operations may require unpacking the values or using generator functions.
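One possible shape for such a generator, sketched with an invented frame: each row is unpacked into plain values and a derived figure is yielded. The column names and the error-rate calculation are assumptions for illustration.

```python
import pandas

# Hypothetical frame; the column names are for illustration only.
tframe = pandas.DataFrame({
    "source": ["siteA", "siteB", "siteC"],
    "hits": [120, 45, 300],
    "errors": [6, 9, 3],
})

def error_rates(frame):
    """Yield (source, error rate) pairs one row at a time."""
    for row in frame.itertuples(index=False):
        # Each row unpacks into its column values in order.
        source, hits, errors = row
        yield source, errors / float(hits)

rates = dict(error_rates(tframe))
print(rates)
```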

Using WeasyPrint

Once the resources and template are prepared, simply call on WeasyPrint to convert the HTML resulting from the template to a PDF.

An extra import, the URL fetcher, is needed to fetch resources such as images from links embedded within the HTML.

Otherwise, generate a Pandas data frame, convert the frame to HTML, place it as the value attached to the appropriate template key in your dictionary, and then convert. The example code uses SQLAlchemy to fetch resources from a PostgreSQL database.

from crawleraids.ConfigVars import Config
from jinja2 import Environment,FileSystemLoader
import pandas
from weasyprint import HTML,default_url_fetcher
import sqlalchemy

def fetchURL(url):
    '''
    Provide a resource obtainer for getting urls to WeasyPrint
    '''
    return default_url_fetcher(url)


def generatePDF(fpath):
    '''
    Generate the pdf.
    '''
    #create alchemy engine
    cfp=Config(fpath)
    dsn='postgresql+psycopg2://'+cfp.getVar("db","user","string")+":"+cfp.getVar("db","passw","string")+"@"+cfp.getVar("db","host","string")+":"+cfp.getVar("db","port","string")+"/"+cfp.getVar("db","dbname","string")
    engine=sqlalchemy.create_engine(dsn)

    #get totals table
    query=cfp.getVar("sql","weekly_totals_complete","string")
    tframe=pandas.read_sql_query(query,engine)

    #get summary stats table
    query=cfp.getVar("sql","weekly_summary_complete","string")
    sframe=pandas.read_sql_query(query,engine)

    #get frequencies
    query=cfp.getVar("sql","weekly_frequency","string")
    freqFrame=pandas.read_sql_query(query,engine)

    #get the resource loader
    env=Environment(loader=FileSystemLoader('.'))
    template=env.get_template("aboveTemplate.html")

    #fill the template (templateVars avoids shadowing the builtin vars)
    templateVars={"title":"Weekly Statistics","frequency_table":freqFrame.to_html(),"summary_pivot_table":sframe.to_html(),"source_pivot_table":tframe.to_html()}
    html=template.render(templateVars)

    #use WeasyPrint to convert to pdf
    #stylesheets expects a list; this assumes the config value is the path or URL of a CSS file
    HTML(string=html,url_fetcher=fetchURL).write_pdf(target="/reports/report.pdf",stylesheets=[cfp.getVar("style","aboveTemplate","string")])

All of the power of Pandas is now at the disposal of the programmer, along with anything that can be fetched from a URL.
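The fetcher is also a place to hook in resources that never touch the disk. The sketch below follows the pattern shown in WeasyPrint's documentation, where a fetcher returns a dictionary holding the bytes and a MIME type. The graph: scheme, the make_fetcher helper and the stub fallback are my own assumptions, and the returned keys should be checked against the installed WeasyPrint version.

```python
def make_fetcher(images, fallback):
    """Build a url_fetcher that serves in-memory images for 'graph:' URLs.

    images maps a name to PNG bytes; fallback handles every other URL
    (in real use this would be weasyprint.default_url_fetcher).
    """
    def fetch(url):
        if url.startswith("graph:"):
            name = url[len("graph:"):]
            # Hand back the raw bytes together with their MIME type.
            return {"string": images[name], "mime_type": "image/png"}
        return fallback(url)
    return fetch

# Placeholder bytes, not a real image; a stub fallback keeps the sketch standalone.
images = {"weekly": b"\x89PNG\r\n\x1a\n..."}
fetcher = make_fetcher(
    images,
    fallback=lambda url: {"string": b"", "mime_type": "text/plain"},
)
result = fetcher("graph:weekly")
print(result["mime_type"])
```

In the template this would pair with an image tag whose src is graph:weekly, and the fetcher would be passed to HTML through its url_fetcher argument.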

Generating and Saving Graphs with Pandas and Matplotlib or PyPlot

It is possible to embed graphs into a pdf by saving them as images.

The folks behind PDF allow most things made of bytes to be placed in objects in a PDF. A PDF is a series of PDF objects, somewhat like the elements of an XML document, and binary data such as an image is stored in a stream object, usually compressed or encoded by a filter. See my magic numbers post and try to parse or write your own image to a PDF if you really want to dive into the subject.
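For a taste of the magic-number idea, here is a small standard-library sketch that recognises a PDF or a PNG from its leading bytes. The sniff helper and the signature table are illustrative names, and only these two formats are covered.

```python
# Leading bytes ("magic numbers") of the two formats discussed here.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
}

def sniff(data):
    """Return the format name for the leading bytes, or None if unknown."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None

print(sniff(b"%PDF-1.4\n"))
print(sniff(b"\x89PNG\r\n\x1a\n\x00\x00"))
```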

Generating graphs is simple with Pandas. Just make sure that the image URL in the template matches the path of the saved graph.

import matplotlib.pyplot as plt
from pandas import DataFrame,Series

#perform operations on the data to transform the graph. Each column is a new plot line.
#Series pads the shorter column with NaN, so the lengths may differ
data={'PlotA':Series([1,3,5,2]),'PlotB':Series([1,3,4])}
df=DataFrame(data)
ax=df.plot() #plot returns the matplotlib axes
fig=ax.get_figure()
fig.savefig('graph.png') #also can save as own pdf to be merged as described in an earlier post

It is possible to do this directly with pyplot as well.
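A sketch of the pyplot route, assuming a headless machine (hence the Agg backend) and the same toy data. It saves the graph to memory and wraps it in a data URI, which is a swapped-in alternative to saving graph.png on disk. WeasyPrint's documentation lists data URIs among the sources it can read, but that is worth verifying against your version.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a server
import matplotlib.pyplot as plt

# The same illustrative data, drawn with pyplot directly.
fig, ax = plt.subplots()
ax.plot([1, 3, 5, 2], label="PlotA")
ax.plot([1, 3, 4], label="PlotB")
ax.legend()

# Save to memory instead of a file on disk.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
png_bytes = buf.getvalue()

# A data URI lets the image travel inside the HTML itself.
data_uri = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
img_tag = '<img src="{0}">'.format(data_uri)
```

The img_tag string can be passed to the template like any other variable, so the template and the image cannot drift apart.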

Using Flask to Create a PDF Web Server

It appears from comments and questions that PDF servers are often requested. The clunkiness of Spring can now be replaced easily with the combination of the mentioned tools and the Flask web framework. These tools allow for the quick and easy creation of a PDF web server. However, asyncore with socket, Spring with Java-based tools, or other tools will still be needed if the plan is to use something akin to the proxy pattern, a sad state of affairs.

To create the server, simply create a function with a decorator specifying the path, much as an annotation would in Spring.

from io import BytesIO

from flask import Flask
from flask import send_file
app = Flask(__name__)

def otherFunc():
    pass

@app.route("/")
def generatePDF():
    #code to generate PDF........
    #pdf is assumed to hold the finished document as bytes
    buf = BytesIO(pdf)
    #download_name is the Flask 2.0+ name; older releases call it attachment_filename
    return send_file(buf, mimetype='application/pdf', download_name='file.pdf')

if __name__ == "__main__":
    app.run()
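A route like this can be checked without starting a server by using Flask's test client. The sketch below is self-contained and returns a Response directly, which sidesteps the send_file argument names that changed between Flask versions. The /report path and the placeholder bytes are assumptions, and the bytes are not a valid PDF.

```python
from flask import Flask, Response

app = Flask(__name__)

# Placeholder bytes standing in for a generated document; not a valid PDF.
FAKE_PDF = b"%PDF-1.4\n% placeholder\n"

@app.route("/report")
def report():
    # Set the MIME type and ask the browser to download the body as file.pdf.
    return Response(
        FAKE_PDF,
        mimetype="application/pdf",
        headers={"Content-Disposition": "attachment; filename=file.pdf"},
    )

# The test client exercises the route without starting a server.
client = app.test_client()
reply = client.get("/report")
print(reply.status_code, reply.mimetype)
```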

Weasyprint also includes a way to incorporate pre-generated pdfs from within the same application.


Python PDF 2: Writing and Manipulating a PDF with PyPDF2 and ReportLab

Note: PdfMiner3K is out and uses a nearly identical API to this one. Fully working code examples are available from my Github account, with Python 3 examples at CrawlerAids3 and Python 2 at CrawlerAids (both currently developed).

In my previous post on pdfMiner, I wrote on how to extract information from a pdf. For completeness, I will discuss how PyPDF2 and reportlab can be used to write a pdf and manipulate an existing pdf. I am learning as I go here. This is some low hanging fruit meant to provide a fuller picture. Also, I am quite busy.

PyPDF and reportlab do not offer the completeness in extraction that pdfMiner offers. However, they offer a way of writing to existing pdfs and reportlab allows for document creation. For Java, try PDFBox.

However, PyPdf is becoming extinct and pyPDF2 has broken pages on its website. The packages are still available from pip, easy_install, and from github. The mixture of reportlab and pypdf is a bit bizarre.


PyPDF2 Documentation

PyPdf, unlike pdfMiner, is well documented. The author of the original PyPdf also wrote an in depth review with code samples. If you are looking for an in depth manual for use of the tool, it is best to start there.

Report Lab Documentation

Report lab documentation is available to build from the bitbucket repositories.

Installing PyPdf and ReportLab

Pypdf2 and reportlab are easy to install. Additionally, PyPDF2 can be installed from the python package site and reportlab can be cloned.

   easy_install pypdf2
   pip install pypdf2
   
   easy_install reportlab
   pip install reportlab

ReportLab Initialization

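The essential part of ReportLab is the canvas object. ReportLab defines several page sizes, such as letter, legal, and A4, along with portrait and landscape helpers that set the orientation. The canvas object is instantiated with a file name (or a file-like object) and a page size.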

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter,portrait

PyPdf2 Initialization

PyPDF2 has a relatively simple setup. A PdfFileWriter is initialized to add to the document, as opposed to the PdfFileReader, which reads from the document. The reader takes a file object as its parameter. The writer takes an output file at write time.

   from PyPDF2 import PdfFileWriter, PdfFileReader
   # a reader
   reader=PdfFileReader(open("fpath",'rb'))
   
   # a writer
   writer=PdfFileWriter()
   outfp=open("outpath",'wb')
   writer.write(outfp)

All of this can be found under the review by the writer of the original pyPdf.

Working with PyPdf2 Pages and Objects

Before writing to a pdf, it is useful to know how to create the structure and add objects. With the PdfFileWriter, it is possible to use the following methods (an IDE or the documentation will give more depth).

  • addBlankPage-create a new page
  • addBookmark-add a bookmark to the pdf
  • addLink- add a link in a specific rectangular area
  • addMetadata-add meta data to the pdf
  • insertPage-adds a page at a specific index
  • insertBlankPage-insert a blank page at a specific index
  • addNamedDestination-add a named destination object to the page
  • addNamedDestinationObject-add a created named destination to the page
  • encrypt-encrypt the pdf (setting use_128bit to True creates 128 bit encryption and False creates 40 bit encryption with a default of 128 bits)
  • removeLinks-removes links by object
  • removeText-removes text by text object
  • setPageMode-set the page mode (e.g. /FullScreen,/UseOutlines,/UseThumbs,/UseNone)
  • setPageLayout-set the layout(e.g. /NoLayout,/SinglePage,/OneColumn,/TwoColumnLeft)
  • getPage-get a page by index
  • getPageLayout-get the layout
  • getPageMode-get the page mode
  • getOutlineRoot-get the root outline

ReportLab Objects

Report lab also contains a set of objects. Documentation can be found here. It appears that postscript or something similar is used for writing documents to a page in report lab. Using ghostscript, it is possible to learn postscript. Postscript is like assembler and involves manipulating a stack to create a page. It was developed at least in part by Adobe Systems, Inc. back in the 1980s and before my time on earth began.

Some canvas methods are:

  • addFont-add a font object
  • addOutlineEntry-add an outline type to the pdf
  • addPostScriptCommand-add postscript to the document
  • addPageLabel-add a page label to the document canvas
  • arc-draw an arc in a postscript like manner
  • beginText-create a text element
  • bezier-create a postscript like bezier curve
  • drawString-draw a string
  • drawText-draw a text object
  • drawPath-draw a postscript like path
  • drawAlignedString-draw a string on a pivot character
  • drawImage-draw an image
  • ellipse-draw an ellipse on a bounding box
  • circle-draw a circle
  • rect-draw a rectangle

Write a String to a PDF

Two things dominate the writing of PDF files: writing images and writing strings to the document. This is handled entirely in

Here, I have added some text and a circle to a pdf.

from io import BytesIO

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

def writeString():
    fpath="C:/Users/andy/Documents/temp.pdf"
    packet=BytesIO()
    cv=canvas.Canvas(packet, pagesize=letter)

    #create a string
    cv.drawString(0, 500, "Hello World!")
    #close the first page so that the circle is drawn on a new page
    cv.showPage()
    #a circle
    cv.circle(50, 250, 20, stroke=1, fill=0)

    #save to the in-memory buffer
    cv.save()

    #get back to 0
    packet.seek(0)

    #write to a file
    with open(fpath,'wb') as fp:
        fp.write(packet.getvalue())

The output of the above code:
Page 1: [image: page1.png]
Page 2: [image: page2.png]

Unfortunately, an element added after a page has been closed is drawn on a new page. In ReportLab, the canvas’ showPage method closes the current page, and the save method closes the last page and writes out the document. A much larger slice of documentation by reportlab goes over writing a document in more detail, including alignment and other factors. Alignments are provided when adding an object to a page.

Manipulating a PDF
Manipulation can occur with PyPDF2. PyPDF2 allows for deletion of pages, insertion of pages, and creation of blank pages. The author of pyPDF goes over this in depth in his review.

This code repeats the previous pages twice in a new pdf. It is also possible to merge (overlay) pdf pages.

    from PyPDF2 import PdfFileWriter,PdfFileReader

    # PDFs are binary files, so open them in 'rb' mode
    pdf1=PdfFileReader(open("C:/Users/andy/Documents/temp.pdf","rb"))
    pdf2=PdfFileReader(open("C:/Users/andy/Documents/temp.pdf","rb"))
    writer = PdfFileWriter()
    
    # add the page to itself
    for i in range(0,pdf1.getNumPages()):
         writer.addPage(pdf1.getPage(i))
    
    for i in range(0,pdf2.getNumPages()):
         writer.addPage(pdf2.getPage(i))
    
    # write to file (open replaces the Python 2 only file builtin)
    with open("destination.pdf", "wb") as outfp:
        writer.write(outfp)

Overall Feedback
Overall, PyPDF is useful for merging existing documents and changing the way they look, and reportlab is useful for creating documents from scratch. PyPDF deals mainly with the objects quickly and effectively and reportlab allows for in depth pdf creation. In combination, these tools rival others such as Java’s PDFBox and even exceed it in ways. However, pdfMiner is a better extraction tool.

