Morning Joe: Normalization with Climate Data Streams


So I want to gather climate data and make some predictions of my own using a variety of factors and an Arduino Nano programmed in assembler. That requires storing the data I collect and ensuring that it can stream and remain accountable. Which normalization level should I use?

  1. Only reduces horizontal redundancy, so no.
  2. Only reduces vertical redundancy, so no.
  3. Closer. Everything relates to the key. BCNF is even closer, since the key explains everything and all candidate keys are separated.
  4. Splits out multivalued redundancies and further reduces data. So weather data can be separated by sensor, or snow-water equivalent by area and layer.
  5. Accounts for more business-like rules. Is this overdoing it? It is semantic. Do I know enough to use it?
  6. Breaks out every set of related values so they must be recombined with a join. It is good for temporal data.


My data is meant to persist once it is inserted. It must be separated for easy mathematical calculations. Finally, it deals with nature, so relationships should probably not be rule-defined. In particular, it deals with a side of nature that no one really knows much about, and I want to preserve all possible relationships. Therefore, 5NF is a bit much.

I do need to relate things to keys so I can grab by specific area, day, weather, or type of phenomenon; whatever else I need. I also need to separate attributes into easy-to-grab attributes with an appropriate impact. The goal is prediction and calculation.

I am going to use 4NF; a rough sketch of the split is below. Check back for more on this project.
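As a loose illustration of what that 4NF split might look like, here is a minimal sketch in Python; every table and column name below is hypothetical. The point is that independent multivalued facts live in separate relations that can still be joined on shared keys.

# Hypothetical 4NF-style split: each independent multivalued fact gets
# its own relation instead of being crammed into one wide table.
sensor_readings = [
    # (area_id, day, sensor_id, reading)
    ("A1", "2016-01-02", "temp-01", -3.5),
    ("A1", "2016-01-02", "wind-01", 12.0),
]

snow_water_equivalent = [
    # (area_id, layer, swe_mm)
    ("A1", 1, 42.0),
    ("A1", 2, 17.5),
]

# Facts recombine with a join on the shared key (area_id here).
combined = [(r, s) for r in sensor_readings
            for s in snow_water_equivalent if r[0] == s[0]]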

Getting Started Extracting Tables With PDFMiner

PDFMiner has evolved into a terrific tool. It allows direct control of PDF files at the lowest level, allowing for fine-grained manipulation of documents and extraction of data. Combined with document writer, recognition, and image manipulation tools, plus a little math magic, the power of commercial tools can be had for all but the most complex tasks. I plan on writing about the use of OCR, Harris corner detection, and contour analysis in OpenCV, homebrew code, and Tesseract later.

However, there is little in the way of documentation beyond basic extraction, and no full Python package listing. Basically, the methods are discoverable but not listed in full. In fact, existing documentation consists mainly of examples, despite the many different modules and classes designed to complete a multitude of tasks. The aim of this article, one in a hopefully two-part series, is to help with the extraction of information. The next step is the creation of a PDF document using a tool such as pisa or ReportLab, since PDFMiner performs extraction.

The Imports
There are several imports that will nearly always be used for document extraction. All are under the main pdfminer package. The imports can get quite large.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTTextLineHorizontal, LTTextBoxHorizontal
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

Some imports perform extraction, while others exist to check for and support the extraction of different types.

Visual Representation of Outputs

The following diagram, taken from pdfminer's limited documentation, shows how the parser, document, interpreter, and device classes fit together.

[Diagram of PDFMiner's class structure. Source: Carleton University]

Imports for Extraction
The following list describes the imports that perform actions on the document.

PDFParser: The parser class, normally passed to the PDFDocument, that obtains elements from the underlying file.
PDFResourceManager: Helps with aggregation; performs some tasks between the interpreter and the device.
PDFPageInterpreter: Obtains PDF objects for extraction. Together with the resource manager, it turns page objects into instructions for the device.
PDFPageAggregator: Takes in LAParams and a PDFResourceManager and gets text from individual pages.
PDFDocument: Holds the parser and allows direct actions to be taken on the document.
PDFDevice: Writes instructions to the document.
LAParams: Layout analysis parameters that guide extraction.

Imports that act as Types
Other imports act as a means of checking against types and utilizing the other classes' properties. The PDFDocument contains a variety of PDF objects, which hold their own information. That information includes the type, the coordinates, and the text displayed in the document. Images can also be handled.

The objects include LTTextBox, LTTextLine, LTTextBoxHorizontal, LTTextLineHorizontal, LTFigure, LTRect, and LTImage.

These types are useful for pulling information from tables, as explained by Julian Todd.

Creating the Document
Creating the document requires instantiating each of the parts present in the diagram above. The order for setting up the document is to create a parser and then the document with that parser. The resource manager and LAParams are accepted as arguments by the PDFPageAggregator device used for this task. The PDFPageInterpreter accepts the aggregator and the resource manager. This code is typical of all parsers and appears as part of the PDF writers as well.

Reading the file into an in-memory buffer with cStringIO makes extraction run more quickly, since cStringIO is implemented in C.

        # StringIO below comes from cStringIO (from cStringIO import StringIO).
        # Read the whole file into an in-memory buffer.
        cstr = StringIO()
        with open(fpath, 'rb') as fp:
            cstr.write(fp.read())
        cstr.seek(0)

        # Wire the parser, document, resource manager, and device together.
        doc = PDFDocument(PDFParser(cstr))
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            # get_result() returns the layout of the page just processed.
            layout = device.get_result()
            self.parsePage(layout)

The get_result() method returns the layout for the page that was just processed; the result is passed to the parsePage method. For pure text extraction, a TextConverter device writing to a string buffer can be used instead, with the text read back via getvalue() (see the final example).

The PDF Text Based Objects
The layout received from get_result() parses the strings into separate objects. These objects have several key components: the type, the coordinates (starting x, starting y, ending x, ending y), and the content.

The type can be obtained using type(object) and compared directly to a layout class (e.g. type(object) == LTRect). For a rectangle object, a comparison to LTRect returns True.
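As a short sketch, assuming layout came from device.get_result() as above (LTRect is imported from pdfminer.layout alongside the text classes):

for obj in layout._objs:
    # Compare the concrete type directly against the layout classes.
    if type(obj) == LTTextLineHorizontal:
        print obj.get_text()
    elif type(obj) == LTRect:
        print obj.bbox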

Getting the Output
Output is obtained and parsed through a series of method calls. The following example shows how to extract content.

        # Reverse the layout objects so they pop in document order.
        objstack = list(reversed(layout._objs))
        tcols = []

        while objstack:
            b = objstack.pop()
            if type(b) in [LTFigure, LTTextBox, LTTextLine, LTTextBoxHorizontal]:
                # Containers: push their children back onto the stack.
                objstack.extend(reversed(b._objs))
            elif type(b) == LTTextLineHorizontal:
                # Leaf text lines carry the actual content.
                tcols.append(b)

This code treats a plain Python list as a stack; a list already provides the pop() method, so despite Python's collections package offering other data structures, nothing more specialized is needed.

This example is a modification of Julian Todd's code, since I could not find solid documentation for pdfminer. It takes the objects from the layout, reverses them (they are placed in the layout as if it were a stack), and then iterates down the stack, expanding anything that contains text and appending text lines to the list that stores them.

The resulting list (tcols) looks much like other pure extractions that can be performed in a variety of tools, including Java's PDFBox, pyPdf, and even pdfminer itself. However, each object carries its bounding box coordinates in bbox and its text via .get_text().
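For example, a quick sketch that dumps each line's coordinates and text from the tcols list built above:

for line in tcols:
    # bbox is (x0, y0, x1, y1); get_text() returns the line's content.
    x0, y0, x1, y1 = line.bbox
    print "%.1f %.1f %.1f %.1f %s" % (x0, y0, x1, y1, line.get_text().strip())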

Images

Images are handled using the LTImage type, which has a few additional attributes on top of coordinates and data. The image contains bits, colorspace, height, imagemask, name, srcsize, stream, and width.

Extracting an image works as follows:

if type(b) == LTImage:
     imgbits=b.bits

PDFMiner only seems to extract JPEG objects. However, xpdf extracts all images.
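Since the extractable images are typically JPEGs, one workable sketch, assuming the object's stream holds DCT-encoded (JPEG) data, is to write the raw stream bytes straight to a .jpg file:

if type(b) == LTImage and b.stream is not None:
    # For DCTDecode (JPEG) streams, the raw bytes already form a valid JPEG file.
    rawdata = b.stream.get_rawdata()
    if rawdata:
        with open(b.name + ".jpg", "wb") as out:
            out.write(rawdata)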

A more automated and open source solution is to use subprocess.Popen() to call a Java program that extracts images to a specific or provided folder, using code such as this (read the full article).

import shlex
import subprocess

# The jar name and paths here are placeholders from the original example.
cmd = "java -jar myPDFBoxExtractorCompiledJar /home/username/Documents/readable.pdf /home/username/Documents/output.png"
pipe = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE)
pipe.wait()

Handling the Output
Handling the output is fairly simple and forms the crux of this article's benefits, besides combining a variety of resources in a single place.

Just iterate down the stack and pull out the objects as needed. It is possible to recover the entire structure using the coordinates. The bounding box method allows objects to be placed into a new data structure as they appear in any PDF document. With some analysis, generic algorithms are possible, although given the lack of documentation it may be wise to experiment with some code first.

The following extracts specific columns of an existing PDF. The bounding box list/array is set up as follows: bbox[0] is the starting x coordinate, bbox[1] is the starting y coordinate, bbox[2] is the ending x coordinate, and bbox[3] is the ending y coordinate.

records, cases, dates, times, types, locations, attorneys = self.convertToDict([
    [x for x in tcols if float(x.bbox[0]) <= 21.0 and "Name\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 176.0 and "Case\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 257.0 and "Date\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 307.0 and "Time\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 354.0 and "Type\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 607.0 and "Location\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 645.0 and "Attorney\n" not in x.get_text()]])

This code uses Python's list comprehensions. The reason for the inequality is that slight variations exist in the placement of objects. The newline escape character represents an underline in this case.
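The convertToDict helper is not shown in this article; as a minimal sketch, it might order each column top to bottom and keep only the text values (the sorting behavior here is an assumption):

def convertToDict(self, columns):
    # Hypothetical helper: sort each column top-to-bottom (PDF y grows
    # upward, so sort by descending y) and keep only the text content.
    return [[line.get_text().strip()
             for line in sorted(col, key=lambda l: -l.bbox[1])]
            for col in columns]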

Pure Text Extraction

To see how pure text extraction is performed, and to move toward a better understanding of the code, analyze the following example.

        # Requires: from cStringIO import StringIO
        # and: from pdfminer.converter import TextConverter
        with open(fpath, 'rb') as fp:
            doc = PDFDocument(PDFParser(fp))
            rsrcmgr = PDFResourceManager()
            retstr = StringIO()
            laparams = LAParams()
            codec = 'utf-8'
            # The TextConverter device decodes text and writes it into retstr.
            device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)
            # retstr accumulates text across pages, so read it once at the end.
            lines = retstr.getvalue()
            return lines
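Since the block above ends with a return, it is meant to live inside a function; assuming it is wrapped in one called extractText (a name I am choosing here), usage is a one-liner:

text = extractText("/home/username/Documents/readable.pdf")
print text[:200]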

Conclusion
PDFMiner is a useful tool that can read PDFs and expose their actual formatting. The tool is flexible and makes it easy to work with strings. Extracting data this way is much easier than some full-text analyses, which can produce garbled and misplaced lines. Not all PDFs are made equal.