Fluff Stuff: Better Governments, Better Processes, Simplr Insites

Cities are heading toward bankruptcy. The credit rating of Stockton, CA was downgraded. Harrisburg, PA is actually bankrupt. It is only a matter of time before Chicago implodes. Since 1995, city debt has risen by between $1.3 trillion and $1.8 trillion. While a large chunk of this cost comes from waste, more of it is the result of relying on intuition over data when tackling infrastructure and new projects. Think of your city government as the boss who likes football more than his job, so he builds a football stadium despite your company being in the submarine industry.

This is not an unsolvable nightmare.

Take an effective use case in which technologies and government processes were outsourced. As costs rose in Sandy Springs, GA, the city outsourced and achieved more streamlined processes, better technology, and lower costs. Today, without raising taxes, the city is in the black. While Sandy Springs residents are wealthy, even poorer cities can learn from this experience.

Running city projects scientifically requires an immense amount of clean, quality, well-managed data, organized into individual projects. With an average of $8,000 spent per employee on technology each year, and with significant effort going into hiring analysts and upgrading infrastructure, cities are struggling to modernize.

It is my opinion, one I am basing a company on, that providing quality data management, sharing and collaboration tools, IT infrastructure, and powerful project and statistical management systems in a single SaaS tool can greatly reduce that $8,000-per-employee cost and improve budgets. These systems can even reduce the amount of administrative staff required, allowing money to flow to where it is needed most.

How can a collaborative tool tackle the cost problem? Through:

  • collaborative knowledge sharing of working, ongoing, and even failed solutions
  • public-facing project blogs and information on organizations, projects, statistical models, results, and solutions that allow even non-mathematical members of an organization to learn about a project
  • a security-minded institutional resource manager (an IRM is better thought of as a large, securable, shared file storage system) capable of expanding into the petabytes while maintaining compliance with FERPA, HIPAA, and state and local regulations
  • the ability to share data externally, keep it internal, or keep it completely private, with obfuscation of names and other protected information
  • complexity analysis (graph based analysis) systems for people, projects, and organizations clustered for comparison
  • strong comparison tools
  • potent, validity-minded aggregation systems handling everything from streamed sensor and internet data to surveys and uploads
  • powerful drag and drop integration and ETL with mostly automated standardization
  • deep diving upload, data set, project, and model exploration using natural language searching
  • integration with everything from a phone to a tablet to a powerful desktop
  • access controls for sharing the bare minimum amount of information
  • outsourced IT infrastructure including infrastructure for large model building
  • validation using proven algorithms and increased awareness of what that actually means
  • anomaly detection
  • organization of models, data sets, people, and statistical elements into a single space for learning
  • connectors to popular visualizers such as Tableau and Vantara with a customizable dashboard for general reporting
  • downloadable sets, with two-entity verification if required, that can be streamed or used in Python and R

Tools such as ours can reduce the cost of IT by as much as 65%. We eliminate much of the waste in the data science pipeline while trying to keep things as simple as possible.

We should consider empowering and streamlining the companies, non-profits, and government entities such as schools and planning departments that perform vital services before our own lives are greatly affected. Debt and credit are not solutions to complex problems.

Take a look, or don’t. This is a fluff piece on something I am passionately building. Contact us if you are interested in a beta test.

Demystifying Sparse Matrix Generation for NLP

Hashing: it's simple, it's powerful, and when it is fairly unique it can cut significant time off generating sparse matrices from text. There are multiple ways to build a sparse matrix, but do you really know how SciPy comes up with its fast solution, or how the powers at some of the big companies develop matrices for NLP? The simple hashing trick has proven immensely powerful and quite fast in aiding sparse matrix generation. Check out my GitHub for a sample, where I also have C99 and text tiling tokenizer algorithms written.

The Sparse Matrix

Sparse matrices are great tools for data with a large number of dimensions but many, many gaps. When an m x n matrix would become quite large (millions or more data points) and yet coverage is much smaller (say even 50 percent), a sparse matrix can fit everything into memory and allow testing of new or improved algorithms without blowing memory or needing to deploy to a distributed framework. This is also useful when dealing with results from frameworks like Spark or Hadoop.

Sparse matrices are composed of index and data pairings. Two common forms are the Compressed Sparse Row (CSR) matrix and the Compressed Sparse Column (CSC) matrix, which store an index array pointing to where each vector (row or column) ends and the data array that those indices point to.
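
As a quick illustration of that layout, here is a minimal SciPy sketch (the small dense matrix is just an arbitrary example). The indptr array is the index array described above, marking where each row's entries start and end in the data array.

    import numpy as np
    from scipy.sparse import csr_matrix

    # A 3 x 4 matrix with only four non-zero entries.
    dense = np.array([[0, 2, 0, 0],
                      [1, 0, 0, 3],
                      [0, 0, 4, 0]])
    sparse = csr_matrix(dense)

    print(sparse.data)     # [2 1 3 4]   the non-zero values
    print(sparse.indices)  # [1 0 3 2]   column index of each value
    print(sparse.indptr)   # [0 1 3 4]   where each row starts/ends in data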

Which Hash for NLP

Text hashes such as Horner's method have been around for some time. For a matrix, we want to limit collisions. That is no small feat, since a hash is guaranteed to produce collisions once the number of distinct tokens exceeds the number of buckets. Keeping the number of attributes (buckets) large helps but is not a complete solution. It is also important to use a hash that distributes tokens fairly uniformly.

In my TextPreprocessor software on GitHub, I used MurmurHash. There are 32-bit and 128-bit versions of the current MurmurHash3 implementation available under scala.util.hashing, and even a 64-bit optimized version. MurmurHash, with its large number of bytes and its ability to affect the low end of the digit range, combined with filtering, helps generate a fairly unique hash. scikit-learn uses a similar variant.

Even MurmurHash may not always be enough. A single-bit hashing function has proven effective in cancelling out much of the effect of collisions: take a second hash, and if its bit is 1, add 1; if not, subtract 1. Using a modulus or another function may prove useful, but testing is needed. In any case, the expected mean of each column is now 0, since there is a 50 percent chance the increment is high and a 50 percent chance it is low.

Building with the Hashing Trick

Building the matrix is fairly straightforward. First, generate an index array and a row- or column-oriented data array.

Then, get the feature counts. Take in text, split it into sentences (lines) and then lines into words, and for each line hash the words and count their instances. Each sentence is a vector. It is highly advisable to remove the most frequent words, by which I mean those whose occurrence is 1.5 or more standard deviations beyond the typical occurrence, and to remove common stop words like "a", "the", most conjunctions, and some others. Stemming is another powerful tool, which removes endings like -ing. It is possible to use the power of multiple cores by returning a map of hash-to-count key-value pairs per line. From there, simply iterate line by line, appending to the index array (array[n] = array[n - 1] + count if n > 0, or array[0] = count if n = 0) and appending the features to the data array. If the maximum number of features per line is known, it is possible to add the lines in a synchronized method. Again, this is all on my GitHub.

val future = getFeatures(getSentences(doc).map(x => splitWords(x))).flatMap(x => addToMatrix(x))

The reason I mention splitting the task is that it can be faster, especially when using a tool like Akka. It is also advisable to assign a number to each document and insert each row based on that number so you can track which English-language sentence it belongs to.
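
For readers who prefer Python over the Scala on my GitHub, here is a minimal sketch of the same hashing trick. It assumes the third-party mmh3 package for MurmurHash3, uses a crude split()-based tokenizer, and skips stop-word removal and stemming for brevity.

    import mmh3  # third-party MurmurHash3 bindings (assumed installed)
    import numpy as np
    from scipy.sparse import csr_matrix

    N_FEATURES = 2 ** 18  # keep the number of buckets large to limit collisions

    def vectorize(sentences):
        """Build a CSR matrix where each sentence is a row of hashed word counts."""
        data, indices, indptr = [], [], [0]
        for sentence in sentences:
            counts = {}
            for word in sentence.lower().split():
                h = mmh3.hash(word)            # signed 32-bit hash
                col = abs(h) % N_FEATURES      # bucket index
                sign = 1 if h >= 0 else -1     # single-bit sign hash: expected column mean is 0
                counts[col] = counts.get(col, 0) + sign
            indices.extend(counts.keys())
            data.extend(counts.values())
            indptr.append(len(indices))        # this row ends here in the data array
        return csr_matrix((data, indices, indptr), shape=(len(sentences), N_FEATURES))

    matrix = vectorize(["the quick brown fox", "the lazy dog"])
    print(matrix.shape, matrix.nnz)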

Thanks for sticking with me as I explore a simple yet potent concept.

Open Source Data Science, the Great Resource No One Knows About

There is a growing industry of online technology courses that is starting to gain traction among many who may have been in school when fields like data science were still the plaything of graduate students and PhDs in computer science, statistics, and even, to a degree, biology. However, these online courses will never match the pool of knowledge one could drink from by taking even an undergraduate computer science or mathematics class at a middling state school today (I would encourage everyone to avoid business schools like the plague for technology).

In an industry that is constantly transforming itself, and especially where the field of data will provide long-term work, these courses may appear quite appealing. However, they are often too shallow to provide much breadth, and thinking that they let you pick up and understand the depth of the 1,000-page thesis that led to the stochastic approach to matrix operations and eventually to Spark is ridiculous. We are all forgetting about the greatest resources available today. The internet, open source code, and a search engine can add layers of depth to what would otherwise be an education unable to provide enough grounding for employment.

Do Take the Online Courses

First off, the online courses from Coursera are great. They can provide a basic overview of some of the field. Urbana offers a great data science course, and I am constantly stumbling into blogs presenting concepts from it. However, what can someone fit into 2-3 hours per week for six weeks, in a field that may encompass 2-3 years of undergraduate coursework and even some masters-level topics before one begins to become expert-level?

You may learn a basic k-means or deploy some subset of algorithms, but can you optimize them, and do you really know more than the Bayesian probabilities you likely also learned in a statistics class?

Where Open Source Fits In

Luckily, many of the advanced concepts and a ton of research are actually available online for free. The culmination of decades of research is available at your fingertips in open source projects.

Sparse Matrix research, edge detection algorithms, information theory, text tiling, hashing, vectorizing, and more are all available to anyone willing to put in the time to learn them adequately.

Resources

Documentation is widely available, and much of it lives on GitHub alongside the code.

These GitHub repositories also contain useful links to websites explaining the code, further documentation (Javadocs), and some conceptual depth and further research opportunities.

The vast majority of the conceptual literature can be found with a simple search.

Sit down, read the conceptual literature. Find books on topics like numerical analysis, and apply what you spent tens or even hundreds of thousands of dollars to learn in school.

Smoothing: When and Which?

Smoothing is everywhere. It is preprocessing for signal processing, it makes text segmentation work well, and it is used in a variety of programs to cull noise. However, there are a wide variety of ways to smooth data to achieve more appropriate predictions and lines of fit. This overview should help get you started.

When to Use Smoothing

The rather simple art of smoothing data is best performed when making predictions from temporal data coming from one or more potentially noisy sources (think of a weather station whose wind speed monitor is a bit loose or only partially reliable), and when dealing with tasks such as converting digital sound to analogue waves for study. When a source appears capable of making decent predictions but is relatively noisy, smoothing helps. It is even used in fitting smoothing splines.

Types of Smoothing to Consider

Now that the rather blunt explanation is out of the way, the following list is based on my own use with text mining and some sound data. I have tried to include the best resources I could for these.

  • Rectangular or Triangular: Best used for non-temporal, fairly well-fitting data where more than just past values matter (text segmentation is an example).
  • Simple Exponential: Creates a moving-average style smoothing that considers past events. It is not a terrific smoothing tool, but it is quick and works well when data is correlated (it could work with sound, which may be better served by a Hamming or Hanning window). Unlike double and triple exponential smoothing, the algorithm requires no past experience or parameter discovery to do correctly, for better or worse.
  • Double and Triple Exponential Smoothing: Work well with time series data. Peaks and valleys are better preserved. Triple exponential smoothing works well with seasonal data. Both require some manual training, or an algorithm relying on previous experience, to arrive at a good alpha value. Again, past events are more heavily weighted.
  • Hanning and Hamming Windows: Periodic data (waveforms) may work well with this type of smoothing. The windows are based on the cosine function, and no past experience is needed. For really noisy data, try the more intensive Savitzky-Golay filter.
  • Savitzky-Golay: This smoothing works fairly well and preserves peaks and valleys within the broader scope of the data, so it is not ideal when a truly smooth curve is wanted. However, if some of the noise structure is really important, this is a great method. Its publication was actually considered one of the most important by Analytical Chemistry for spectral analysis. It uses a localized least squares technique to accomplish its feat. A short SciPy sketch of a few of these smoothers follows this list.

However, do not rely on the matrix-based (normal equation) calculation for OLS as the most efficient route; on large data sets gradient descent is clearly the winner, and few self-respecting programmers will use a full matrix implementation there. Spark contains an optimized gradient descent algorithm for distributed and even single-node programming. The algorithm is tail recursive and seeks to minimize a cost function.

  • Distribution Hint

    For the programmer or developer looking to distribute a Savitzky-Golay calculation without using Spark's gradient descent, mapPartitions works well on the local areas. It also works well when smoothing many splines or for the Hanning and Hamming window based smoothing.
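
Below is a minimal sketch of a few of the smoothers above using NumPy and SciPy; the noisy sine wave, the window length of 11, and the alpha of 0.3 are arbitrary choices for illustration.

    import numpy as np
    from scipy.signal import savgol_filter

    # A noisy sine wave to smooth.
    x = np.linspace(0, 4 * np.pi, 200)
    noisy = np.sin(x) + np.random.normal(0, 0.25, x.size)

    # Rectangular (moving average) window.
    rect = np.convolve(noisy, np.ones(11) / 11, mode='same')

    # Hanning window: weights from the cosine function, no training required.
    win = np.hanning(11)
    hann = np.convolve(noisy, win / win.sum(), mode='same')

    # Simple exponential smoothing: a moving average that weights the past.
    alpha = 0.3
    expo = np.zeros_like(noisy)
    expo[0] = noisy[0]
    for i in range(1, noisy.size):
        expo[i] = alpha * noisy[i] + (1 - alpha) * expo[i - 1]

    # Savitzky-Golay: localized least squares that preserves peaks and valleys.
    savgol = savgol_filter(noisy, window_length=11, polyorder=3)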

    NotePad: A NickName Recognition Idea

    So this is just an idea. Nicknames can apparently be attached to multiple formal names. That means we could have a graph of nicknames, and also that a direct lookup list may be tricky. Here is a quick thought on how to approach a solution.

     

    A. Graph the nicknames and their known connections to formal names.

    B. Attach probability data to each node specifying which characteristics trigger that resolution (Markovian, sort of).

    C. Start a Server accepting requests

    D. When a request comes in, first look for all potential connections from the nickname to a formal name; if there are none, save it as a new node. The search can be performed depth-first or breadth-first, and could even be split across random points on the graph in a more stochastic manner such as a random walk (I will need to study these more). This could also be done by just choosing multiple random search points.

    E. If there is one node, return the formal name

    F. If there are multiple names, take one of several approaches:

    1. Calculate Bayesian probabilities given the specified characteristics and return the strongest match (this is the weakest solution).

    2. Train a neural net with the appropriate kernel (RBF, linear, etc.) and return the result from this net. (This is slow, as having a million neural nets in storage seems like a bad idea.)

    When generating stand-alone nodes, it may be possible to use Levenshtein distance and other characteristics to attach nickname nodes to formal nodes based on a threshold. A clustering algorithm could use the formal name averages as cluster centers, and a hard cutoff (e.g. a Levenshtein distance of 1 or 2) could solidify the connection and throw out Type I and Type II errors.
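
    A minimal Python sketch of steps A through F, minus the probabilistic and neural-net tie-breaking; the toy nickname graph and the cutoff of 2 are placeholders, not real data.

        def levenshtein(a, b):
            """Classic dynamic-programming edit distance."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
                prev = cur
            return prev[-1]

        # Toy nickname graph: nickname -> set of formal names it connects to.
        nickname_graph = {
            "bill": {"William"},
            "liz": {"Elizabeth"},
            "bob": {"Robert"},
        }

        def resolve(name, max_distance=2):
            """Steps D-F above, without the probabilistic tie-breaking."""
            key = name.lower()
            if key in nickname_graph and nickname_graph[key]:
                formals = nickname_graph[key]
                return next(iter(formals)) if len(formals) == 1 else sorted(formals)
            # No known connection: attach to the closest formal name under a Levenshtein cutoff.
            candidates = {f for formals in nickname_graph.values() for f in formals}
            closest = min(candidates, key=lambda f: levenshtein(key, f.lower()))
            if levenshtein(key, closest.lower()) <= max_distance:
                nickname_graph[key] = {closest}
                return closest
            nickname_graph[key] = set()  # stand-alone node awaiting more evidence
            return None

        print(resolve("Liz"))      # Elizabeth
        print(resolve("Robbert"))  # close enough to Robert under the cutoff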

    Stay tuned while I flesh this post out with actual code. It may just be a proposal for a while.

    A Cosine Cutoff Value

    I recently discovered this but cannot remember where. The article related to ultrasounds and mapping them, although I was surprised that fuzzy c-means was not mentioned. I have deployed it with some decent effect in the text mining algorithms I am writing.

    threshold = sqrt((1 / matrix.shape[0]) * Sum_i((x_ij - column_mean_j)^2))
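
    Reading x_ij as the entries of the similarity (or feature) matrix, a minimal NumPy version might look like the following, producing one threshold per column. This is my interpretation of the formula, not code from the original article.

        import numpy as np

        def cosine_cutoff(matrix):
            """Per-column threshold: root mean squared deviation from the column mean."""
            column_means = matrix.mean(axis=0)
            return np.sqrt(((matrix - column_means) ** 2).mean(axis=0))

        similarities = np.array([[0.9, 0.1],
                                 [0.4, 0.3],
                                 [0.2, 0.8]])
        print(cosine_cutoff(similarities))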

     

    Please comment if you have found something better or know where this came from. Cheers!

     

    Mixing Java XML and Scala XML for Maximum Speed

    Let's face it: ETL and data tasks are expected to be finished quickly. However, the amount of crud pushed through XML can be overwhelming. XML manipulation in Java, while relatively easy, can take a while to write because of the need to loop through every attribute one wishes to create or read. On the other hand, Scala, despite being a terrifically complete language based on the JVM, suffers from some of the issues of a functional language: only Scales XML offers a way to actually perform in-place changes on XML. It is not, however, necessary to download an entirely new dependency on top of the existing Java and Scala libraries already available when using Scala.

    Scala XML

    There are definite advantages to Scala XML. Imported under scala.xml._ and other relevant subpackages, creation of XML is much faster and easier than with Java. Nodes and attributes are created from string literals and interpolated values, making Scala XML a breeze to pick up.

       
       val txtVal="myVal"
       val otherTxtVal="Hello World"
       val node = <tag attr={txtVal} attr2="other"> {otherTxtVal} </tag>
    

    In order to generate child nodes, simply create an iterable with the nodes.

       val attrs: List[String] = List("1", "2", "3") //because we rarely know what the actual values are, these will be mapped
       val node = <property name="list"><list>{attrs.map(li => <value>{li}</value>)}</list></property> // a Spring list
    

    Another extremely useful addition, for producing readable Scala XML output, is the pretty printer under scala.xml.PrettyPrinter(width: Int, step: Int).

         val printer = new PrettyPrinter(250,5) //will create lines of width 250 and a tab space of 5 chars at each depth level
         val output = printer.format(xml)
    

    There is one serious issue with Scala XML: it uses immutable structures, which means nodes are difficult to append to. Rewrite rules and transformations are available and are particularly useful for filtering and deleting, but not necessarily good at appending.

       class Filter extends RewriteRule {
             // Drop <bean id="myId"> elements and pass everything else through unchanged.
             override def transform(n: Node): Seq[Node] = n match {
                case e: Elem if e.label == "bean" && (e \ "@id").text == "myId" => NodeSeq.Empty
                case e: Elem => e
                case other => other
             }
       }
    
       val newXML = new RuleTransformer(new Filter()).transform(xml) //returns filtered xml
    

    Such code can become cumbersome when writing a ton of appendage rules, each with a special case. Looking at the example from the GitHub page makes this abundantly clear. Another drawback is the speed impact a large number of cases can generate, since the XML is re-written with every iteration. Scales XML (https://github.com/chris-twiner/scalesXml) overcomes this burden but has a much less Scala-esque API and requires an additional download. Memory will be impacted as well, since copies of the original information are stored.

    Java Document

    Unlike Scala XML, Java has had a built-in, in-place XML editor available since Java 1.4. Importing this library is not outside the standard uses of Scala. Perhaps the Scala developers intended their library mainly for parsing: there is a mkString within the native library, but a developer or programmer must download sbt with sbt.IO for a Scala file system API that itself wraps Java's File class. There is also the lack of networking tools, leading to third-party tools, and a JSON library that did not quite excel at re-writing JSON in older Scala versions (and is non-existent in 2.11.0+). While it would be nice to see this change, there is no guarantee that it will. Instead, Java fills in the gaps.

    Here the javax.xml libraries come in particularly handy. The parsers and transformers can read and generate strings, making interoperability a breeze. XPath expressions may not be as simple as in Scala, where xml \\ "xpath" or xml \ "xpath" will return the nodes, but they are painless.

       import java.io.ByteArrayInputStream
       import javax.xml.parsers._ //quick and dirty
       import javax.xml.xpath._
       import org.w3c.dom._

       val xmlstr = <tag>My Text Value</tag>
       val doc: Document = DocumentBuilderFactory.newInstance.newDocumentBuilder.parse(new ByteArrayInputStream(xmlstr.toString.getBytes))
       val xpath = XPathFactory.newInstance().newXPath()
       val nodes: NodeList = xpath.compile("//tag").evaluate(doc, XPathConstants.NODESET).asInstanceOf[NodeList]
    

    With a NodeList, one can call getAttributes(), setAttribute(key, value), appendChild(node), insertBefore(newNode, refNode), and other useful methods, such as getNextSibling() or getPreviousSibling(), from the API. Just be careful to use the document element when making calls on the nodes where necessary.

    Inter-operability

    Inter-operability between Scala and Java is quite simple. Use a builder with an input stream obtained from the string to read an XML document into Java, and a transformer to write that XML back to something Scala enjoys.

    //an example getting a string with PrettyPrinter from an existing document
       val xmlstr = <tag>My Text Value</tag>
       val doc: Document = DocumentBuilderFactory.newInstance.newDocumentBuilder.parse(new ByteArrayInputStream(xmlstr.toString.getBytes))
       val sw = new java.io.StringWriter()
       javax.xml.transform.TransformerFactory.newInstance().newTransformer().transform(new javax.xml.transform.dom.DOMSource(doc), new javax.xml.transform.stream.StreamResult(sw))
       val printer = new scala.xml.PrettyPrinter(250, 5)
       val xml = printer.format(scala.xml.XML.loadString(sw.toString))
    

    Morning Joe: Categorizing Text, Which Algorithms are Best in What Situations

    There is a ton of buzz around data mining and, as always, many new names being injected into the same topics, often showing a lack of study. While the buzz exists, knowing when to deploy an algorithm can be tricky. Based on a recent deep dive into the subject and a dire need to program these algorithms in Java, I present a brief overview of tools, with benchmarks and examples to hopefully follow later. Basic concepts are presented first, followed by some algorithms that I have not really honed yet (benchmarks are not feasible for now but will be soon).

    To read this thoroughly, I highly recommend following the links. This is a starting point and a summary of what I have found so far. Basically, the goal is to help you avoid the hours it takes to find information on each method and to do some of the stumbling for you.

    A word of caution: effective matrix-based solutions use a large amount of data. Other fuzzy algorithms exist for discovering relatability between small sets of items. For strings, there is distance matching such as Jaro-Winkler or Levenshtein, with rule-based comparisons and lookup tables to minimize error (say, between Adama and Obama). Statistics can enhance this process if there is a need to take the best rating: train a model to test the hypothesis that two entities' distance makes them the same entity, against the null hypothesis that it does not, after filtering out some common issues.

    The Matrix

    Linear algebra is a critical foundation of text mining. Different elements are thought of as equations: when we have different documents or images, each document or image is often considered to form an equation. These equations can then be arranged in a matrix, a simple and easy way to avoid thinking in really complex terms.

    You may have heard of the vaunted differential equation. If not, some reading is in order from a site I used in college when there was not time for the book. A large portion of differential equations can be written in matrix form. This is important because of the eigenvector and eigenvalue. To be sure, these concepts are crucial for solving matrices to find equations that explain a set of models. Drexel University's eigenfaces tutorial provides a fairly solid understanding of the way a matrix is used in most text mining. However, for most tasks similarity ratings are used to compare documents rather than a covariance matrix.

    The end result of studying these methods is the ability to look under the hood at today’s hottest text mining technologies.

    The Vector Space and TF-IDF

    Understanding vectors and vector operations is another crucial step in understanding the mining process. A basic vector is a set of points representing a position in a plane or higher-dimensional space. Vectors can be added, subtracted, multiplied, and, most importantly, stuffed into a matrix, where their points can be used with basic linear algebra to find relationships.

    Vectors have magnitude and direction, so their angles and distances can be compared. Note that, while some data loss may occur in choosing the right magnitude and direction, and the units used should be the same (it would be a terrible idea to think in terms of, say, millisecond-meter-documents-ice cream cones), this provides a sane basis for considering data. It is up to the miner to choose the most representative points for use in the vector.

    In text mining, the term frequency-inverse document frequency (tf-idf) rating is used in many commercial algorithms, including search engines. If the name is not enough of a hint, it is basically the frequency of a term within a document weighted by the inverse of the term's frequency across documents. It works best on more than one document, and a 0.5 offset for term frequency helps dampen the effect of large documents by bumping up the rating of less frequent terms. Inverse document frequency uses a logarithm, so terms that appear in nearly every document are weighted down toward 0.

    Multiply the two parts together to find the result, as described by Wikipedia: tf(t, d) = 0.5 + 0.5 * f(t, d) / max{f(w, d) : w in d} and idf(t, D) = log(N / |{d in D : t in d}|), giving tfidf(t, d, D) = tf(t, d) * idf(t, D).
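
    A small Python sketch of those two formulas on a toy corpus (the documents are made up, and real implementations such as scikit-learn's TfidfVectorizer use slightly different normalization):

        import math
        from collections import Counter

        docs = [
            "the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs are pets",
        ]
        tokenized = [d.split() for d in docs]
        N = len(tokenized)

        def tf(term, doc_tokens):
            counts = Counter(doc_tokens)
            # Double normalization with a 0.5 offset to dampen the effect of long documents.
            return 0.5 + 0.5 * counts[term] / max(counts.values())

        def idf(term):
            df = sum(1 for doc in tokenized if term in doc)
            return math.log(N / df) if df else 0.0

        def tfidf(term, doc_tokens):
            return tf(term, doc_tokens) * idf(term)

        print(tfidf("cat", tokenized[0]))  # distinctive term, higher weight
        print(tfidf("the", tokenized[0]))  # appears in most documents, lower weight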

    Similarity Ratings

    No matter what you do, similarity ratings are the key to making the process work. There are several that can be used. If the data can be represented fairly well, covariance is an option. However, text data is not that well suited to covariance, due to the varying styles that represent the same object and, most importantly, issues with quantization. Natural language is naturally fuzzy. Therefore, cosine similarity usually offers a much better solution.

    The cosine equation takes the dot product of two vectors and divides it by the product of their norms. It follows from vector algebra. The result is the cosine of the angle between the vectors, representing the 'degree' of similarity, which can be used for comparison.
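
    In code, the cosine similarity of two (for example, tf-idf) vectors is just that dot product over the product of the norms; a minimal NumPy sketch with made-up vectors:

        import numpy as np

        def cosine_similarity(a, b):
            """Dot product of the two vectors divided by the product of their norms."""
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

        doc_a = np.array([1.0, 2.0, 0.0, 3.0])  # e.g. tf-idf weights for document A
        doc_b = np.array([1.0, 1.0, 0.0, 2.0])
        print(cosine_similarity(doc_a, doc_b))  # close to 1.0 means very similar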

    WordNet, Disambiguation, and Stemming

    The processes of disambiguation and stemming are crucial to text mining. There are many sentence processing methods, as NLTK shows. At their core are WordNet and other dictionaries. WordNet is a freely available graph of an English dictionary. Most tools work with WordNet for finding root words, disambiguation, and cleaning.

    Part of speech (POS) tagging is involved in both disambiguation and stemming. Maximum entropy models are used to discover a part of speech based on common usage.

    Disambiguation attempts to resolve words with multiple meanings to their most probable meaning. The most basic algorithm is the original Lesk, which involves only the use of WordNet; accuracy hovers around 50 percent. Simplified Lesk achieves better results. Lesk finds overlapping words and frequencies to determine the best synonym to replace an ambiguous word. Better algorithms use clustering and Bayesian methods for word sense discovery. Cosine similarity may be used to improve Lesk as well.
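
    For a quick experiment, NLTK ships a simplified Lesk implementation in nltk.wsd; a small sketch, assuming the WordNet corpus has been downloaded and using an arbitrary example sentence:

        import nltk
        from nltk.wsd import lesk

        nltk.download('wordnet', quiet=True)  # one-time download of the WordNet corpus

        context = "I went to the bank to deposit my paycheck".split()
        sense = lesk(context, 'bank')  # picks the synset whose gloss overlaps the context most
        print(sense, '-', sense.definition() if sense else 'no sense found')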

    Stemming reduces words to their roots. Most WordNet tools use existing classifications with POS tagging to achieve this result.

    A Note on Regression Models

    Let's be clear: prediction is not well suited to categorization. Changes in word choice across a large number of documents, and decisions about importance, do not always mean the same thing, so regression models tend to work poorly. The data is not likely continuous, either. Think of writing like a magnetic field with eddy currents. Predicting the effect of an encounter with these currents is really, really difficult; basically, run into an eddy current and you are going to have a really, really bad day. That is not to say that an equation cannot be created that fits most of the data with respect to the location of a point, basically a differential equation. It will likely not be generic, though, and will be incredibly difficult to find.

    Regression works well on continuous and more natural events.

    Classification Tree and a Random Forest

    Another often poor performer in the categorization of text data is the classification tree. Trees are only as good as the number of rules you are willing to create. However, they may be combined with multinomial Bayes for writing that is uniform and professional (say, a legal document) to achieve some success. They are particularly useful after filtering data using LSA/HDP or multinomial Bayes, with decisions that work like a Bayesian model when thinking about the bigger picture.

    Basically, a classification tree uses probabilities within groupings to ascertain an outcome, moving down to the appropriate left or right child node based on a yes or no answer to the question "do you belong?"

    This process works well with well-defined data when there is a good degree of knowledge about a subject (say, gene mapping), but text mining often uses fuzzy data with multiple possible meanings, and disambiguation is not entirely accurate; Lesk's original algorithm only achieved 50 percent accuracy, and an LSA model hovers between 80-90%. Improving the quality can be done with multiple trees, or possibly by training on an extremely large set using cosines instead of raw frequencies.

    There are multiple methods for tree creation; two are random forests and bagging. Bagging takes multiple trees and averages the probabilities for decisions, using this average for their respective nodes. Random forests choose random subsets of features at each split, find probabilities based on them, and select the stronger predictor for a node. The latter approach is best with a much larger set of known features. The number of features considered at each split is usually the square root of the total number of features.

    Again, the features must be known and fairly homogeneous. Text data is often not.
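
    For completeness, a small scikit-learn sketch of a random forest over word counts; the four toy documents and labels are stand-ins, and max_features="sqrt" applies the square-root-of-features heuristic mentioned above at each split.

        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_extraction.text import CountVectorizer

        texts = [
            "the court finds the defendant liable for damages",
            "party a shall indemnify party b under this agreement",
            "win a free cruise now, click here to claim your prize",
            "limited time offer, claim your free gift today",
        ]
        labels = ["legal", "legal", "spam", "spam"]

        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(texts)

        clf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
        clf.fit(X, labels)

        print(clf.predict(vectorizer.transform(["claim your free prize now"])))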

    Multinomial Bayesian Classifier

    Multinomial Bayesian classification is a method that classifies data based on the frequency of words in different categories and their probabilities of occurrence. It is fairly straightforward: find the frequencies, or train a set of frequencies, word by word or gram by gram (a gram being a sequence of n words, hence n-gram), find probabilities by sentence, and take the best one.

    MNB works well when writing differs starkly, say with subject matters that differ greatly. It is good for tasks such as separating spam from policies and code in HTML data when large amounts of training data are present.
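
    A minimal scikit-learn sketch of multinomial Bayes over unigrams and 2-grams; the training texts and labels are toy stand-ins, not a real corpus.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        train_texts = [
            "click here for cheap pills and free money",
            "you have won a lottery prize, send your details",
            "meeting moved to 3pm, agenda attached",
            "please review the quarterly budget report",
        ]
        train_labels = ["spam", "spam", "work", "work"]

        # ngram_range=(1, 2) counts single words and 2-grams, as described above.
        model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
        model.fit(train_texts, train_labels)

        print(model.predict(["free prize, click here"]))        # likely spam
        print(model.predict_proba(["budget review meeting"]))   # per-class probabilities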

    Clustering with LSA or HDP

    Clustering works well when something is known about the data but categorization is not easily done manually. Most algorithms avoid affinity propagation, which usually ends up with roughly the square root of the total inputs as the number of clusters anyway. Matrices are used heavily here, as eigenvalues and eigenvectors derive an equation that can be used to find relatability between documents.

    LSA uses raw frequencies, or more effectively cosines, in the same manner as eigenfaces to compare vectors. The end result, however, is an equation representing a category. By matrix decomposition and multiplication, all elements are compared; in this case each (i, j) entry in the matrix is a cosine or a frequency of a word in a document. HDP (hierarchical Dirichlet process) is similar but attempts to learn more about the results and improve on the process. It takes much longer than LSA and is experimental.

    If trying to discover new information about text or trying to find the best fit of a number of categories, these methods are useful.
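
    One common way to run LSA in practice is tf-idf features reduced with a truncated SVD (the decomposition step) and compared with cosines; a sketch with scikit-learn on a toy corpus (Gensim's LsiModel and HdpModel are alternatives):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.metrics.pairwise import cosine_similarity

        docs = [
            "the spacecraft entered orbit around the planet",
            "nasa launched a new satellite into orbit",
            "the recipe calls for flour, sugar, and butter",
            "bake the cake for thirty minutes at 350 degrees",
        ]

        tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
        lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

        # Documents close together in the reduced space fall into the same latent topic.
        print(cosine_similarity(lsa))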

    Maximum Entropy Models

    Maximum entropy models work well on heterogeneous data in a manner similar to Bayes. Gensim's sentence tagger classifies sentences from non-sentences in this way. The models find entropy using the maxent principle, which uses frequencies and likelihoods of occurrence to find outcomes. It works quite well with the correct training sets.

    If conditional independence cannot be assumed and little is known about a set, this model is useful. Categories should be known beforehand.

    Tools

    Java

    Python

    Common Resources

    Python PDF 2: Writing and Manipulating a PDF with PyPDF2 and ReportLab

    Note: PdfMiner3K is out and uses a nearly identical API to this one. Fully working code examples are available from my GitHub account, with Python 3 examples at CrawlerAids3 and Python 2 at CrawlerAids (both currently developed).

    In my previous post on pdfMiner, I wrote about how to extract information from a pdf. For completeness, I will discuss how PyPDF2 and ReportLab can be used to write a pdf and manipulate an existing pdf. I am learning as I go here. This is some low-hanging fruit meant to provide a fuller picture. Also, I am quite busy.

    PyPDF2 and ReportLab do not offer the completeness in extraction that pdfMiner offers. However, they offer a way of writing to existing pdfs, and ReportLab allows for document creation. For Java, try PDFBox.

    However, PyPdf is becoming extinct and PyPDF2 has broken pages on its website. The packages are still available from pip, easy_install, and GitHub. The mixture of ReportLab and PyPDF2 is a bit bizarre.


    PyPDF2 Documentation

    PyPdf, unlike pdfMiner, is well documented. The author of the original PyPdf also wrote an in-depth review with code samples. If you are looking for an in-depth manual for the tool, it is best to start there.

    Report Lab Documentation

    Report lab documentation is available to build from the bitbucket repositories.

    Installing PyPdf and ReportLab

    PyPDF2 and ReportLab are easy to install. PyPDF2 can be installed from the Python package index and ReportLab can be cloned.

       easy_install pypdf2
       pip install pypdf2
       
       easy_install reportlab
       pip install reportlab
    

    ReportLab Initialization

    The necessary part of ReportLab is the canvas object. ReportLab has several page sizes, including letter and legal, plus portrait and landscape orientation helpers. The canvas object is instantiated with a filename (or file-like object) and a size.

    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import letter
    

    PyPdf2 Initialization

    PyPDF2 has a relatively simple setup. A PdfFileWriter is initialized to add to the document, as opposed to the PdfFileReader, which reads from the document. The reader takes a file object as its parameter. The writer takes an output file at write time.

       from PyPDF2 import PdfFileWriter, PdfFileReader

       # a reader
       reader = PdfFileReader(open("fpath", 'rb'))

       # a writer
       writer = PdfFileWriter()
       outfp = open("outpath", 'wb')
       writer.write(outfp)
    

    All of this can be found under the review by the writer of the original pyPdf.

    Working with PyPdf2 Pages and Objects

    Before writing to a pdf, it is useful to know how to create the structure and add objects. With the PdfFileWriter, it is possible to use the following methods (an IDE or the documentation will give more depth).

    • addBlankPage-create a new page
    • addBookmark-add a bookmark to the pdf
    • addLink- add a link in a specific rectangular area
    • addMetaData- add meta data to the pdf
    • insertPage-adds a page at a specific index
    • insertBlankPage-insert a blank page at a specific index
    • addNamedDestination-add a named destination object to the page
    • addNamedDestinationObject-add a created named destination to the page
    • encrypt-encrypt the pdf (setting use_128bit to True creates 128 bit encryption and False creates 40 bit encryption with a default of 128 bits)
    • removeLinks-removes links by object
    • removeText-removes text by text object
    • setPageMode-set the page mode (e.g. /FullScreen, /UseOutlines, /UseThumbs, /UseNone)
    • setPageLayout-set the layout(e.g. /NoLayout,/SinglePage,/OneColumn,/TwoColumnLeft)
    • getPage-get a page by index
    • getLayout-get the layout
    • getPageMode-get the page mode
    • getOutlineRoot-get the root outline

    ReportLab Objects

    Report lab also contains a set of objects. Documentation can be found here. It appears that postscript or something similar is used for writing documents to a page in report lab. Using ghostscript, it is possible to learn postscript. Postscript is like assembler and involves manipulating a stack to create a page. It was developed at least in part by Adobe Systems, Inc. back in the 1980s and before my time on earth began.

    Some canvas methods are:

    • addFont-add a font object
    • addOutlineEntry-add an outline type to the pdf
    • addPostscriptCommand-add postscript to the document
    • addPageLabel-add a page label to the document canvas
    • arc-draw an arc in a postscript like manner
    • beginText-create a text element
    • bezier-create a postscript like bezier curve
    • drawString-draw a string
    • drawText-draw a text object
    • drawPath-draw a postscript like path
    • drawAlignedString-draw a string on a pivot character
    • drawImage-draw an image
    • ellipse-draw an ellipse on a bounding box
    • circle-draw a circle
    • rect-draw a rectangle

    Write a String to a PDF

    There are two things that dominate the way of writing pdf files: writing images and writing strings to the document. This is handled entirely through the canvas object.

    Here, I have added some text and a circle to a pdf.

    import StringIO  # Python 2; use io.BytesIO in Python 3
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import letter

    def writeString():
        fpath = "C:/Users/andy/Documents/temp.pdf"
        packet = StringIO.StringIO()
        cv = canvas.Canvas(packet, pagesize=letter)

        # create a string
        cv.drawString(0, 500, "Hello World!")
        # a circle. Do not add another string. This draws on a new page.
        cv.circle(50, 250, 20, stroke=1, fill=0)

        # save to the string buffer
        cv.save()

        # rewind to the start of the buffer
        packet.seek(0)

        # write to a file
        with open(fpath, 'wb') as fp:
            fp.write(packet.getvalue())
    

    The output of the above code:
    Page 1 (page1.png) and Page 2 (page2.png): screenshots in the original post.

    Unfortunately, adding a new element occurs on a new page after calling the canvas’ save method. Luckily the “closure” of the pdf just creates a new page object. A much larger slice of documentation by reportlab goes over writing a document in more detail. The documentation includes alignment and other factors. Alignments are provided when adding an object to a page.

    Manipulating a PDF
    Manipulation can occur with PyPDF2, which allows for deletion of pages, insertion of pages, and creation of blank pages. The author of pyPDF goes over this in depth in his review.

    This code repeats the previous pages twice in a new pdf. It is also possible to merge (overlay) pdf pages, as sketched after the code below.

        from PyPDF2 import PdfFileWriter, PdfFileReader

        pdf1 = PdfFileReader(open("C:/Users/andy/Documents/temp.pdf", "rb"))
        pdf2 = PdfFileReader(open("C:/Users/andy/Documents/temp.pdf", "rb"))
        writer = PdfFileWriter()

        # add the pages to the output twice
        for i in range(0, pdf1.getNumPages()):
             writer.addPage(pdf1.getPage(i))

        for i in range(0, pdf2.getNumPages()):
             writer.addPage(pdf2.getPage(i))

        # write to file
        with open("destination.pdf", "wb") as outfp:
            writer.write(outfp)
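
    As for merging (overlaying) pages, PyPDF2's page objects expose mergePage; a minimal sketch, assuming hypothetical source.pdf and watermark.pdf files exist:

        from PyPDF2 import PdfFileReader, PdfFileWriter

        source = PdfFileReader(open("source.pdf", "rb"))      # hypothetical input files
        overlay = PdfFileReader(open("watermark.pdf", "rb"))
        writer = PdfFileWriter()

        stamp = overlay.getPage(0)
        for i in range(source.getNumPages()):
            page = source.getPage(i)
            page.mergePage(stamp)   # draw the overlay page on top of the original page
            writer.addPage(page)

        with open("stamped.pdf", "wb") as outfp:
            writer.write(outfp)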
    

    Overall Feedback
    Overall, PyPDF2 is useful for merging and changing existing documents in terms of the way they look, and ReportLab is useful for creating documents from scratch. PyPDF2 deals with document objects quickly and effectively, and ReportLab allows for in-depth pdf creation. In combination, these tools rival others such as Java's PDFBox and even exceed it in some ways. However, pdfMiner is a better extraction tool.

