How Can the Tools of Big Data Detect Malicious Activity?

With Apple in the news, security a growing concern, and companies trying new ways to protect their online presence, finding malicious activity has become an exploding topic. The tools of big data offer some deeper insights into just how to discover users with bad intentions before data is lost. This article deals with protecting an online presence.

Detection can go well beyond knowing when a bad credit card hits the system or a blocked IP address attempts to access a website.

Similarity: The Neural Net or Cluster

The neural net has become an enormous topic. Today it is used to discern categories in fields ranging from biology to dating or even terrorist activity. Similarity-based algorithms have come into their own since their inception, largely in the Cold War intelligence game. Yet how different is picking political discussions out of conversational data captured at the Soviet embassy, or discovering a sleeper cell in Berlin, from finding a hacker? Not terribly different at the procedural level, actually. Find the appropriate vectors, train the neural net or clustering algorithm, and look for clusters representing those who aim to steal your data. These are your state secrets. With Fuzzy C-Means, K-Means, and RBF neural nets, the line between good and bad doesn’t even need to look like a middle school dance.
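The procedure above (shape vectors, train, look for clusters) can be sketched with a plain K-Means loop. This is only an illustration under stated assumptions: the three traits per session and all of the numbers are hypothetical, and a real system would scale the traits and use a tested library rather than this hand-rolled version.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster equal-length numeric vectors into k groups (plain Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical sessions: (requests per second, malformed requests, seconds per page)
sessions = [
    (0.2, 0, 45.0), (0.3, 1, 30.0), (0.1, 0, 60.0),     # ordinary browsing
    (25.0, 40, 0.2), (30.0, 55, 0.1), (28.0, 48, 0.3),  # scripted probing
]
labels, centroids = kmeans(sessions, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning. Which cluster is the "bad" one has to be decided by checking where sessions from previous, known attacks end up.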

Here is just a sampling of the traits that could be used in similarity algorithms, which require shaping a vector to train on. Used in conjunction with data taken from previous hacking attempts, they should make it fairly easy to flag the riff raff.

Traits that Can be Useful

Useful traits come in a variety of forms. They can be encoded as a 1 or 0 for a Boolean value such as known malicious IP (always block these). They could be a Levenshtein distance on that IP. Perhaps a frequency for number of requests per second is important. They may even be a probability or weight describing likelihood of belonging to one category or another based on content. Whichever they are, they should be informative to your case with an eye towards general trends.

  • Types of Items purchased : Are they trivial like a stick of gum?
  • Number of Pages Accessed while skipping a level of depth on a website : Do they attempt to skip pages despite a viewstate or a typical usage pattern?
  • Number of Malformed Requests : Are they sending bad headers?
  • Number of Each type of Error Sent from the Server : Are there a lot of malformed errors?
  • Frequency of Requests to your website : Does it look like a DoS attack?
  • Time spent on each Page : Is it too brief to be human?
  • Number of Recent Purchases : Perhaps they appear to be window shopping
  • Spam or another derived level usually sent from an IP address: Perhaps a common proxy is being used?
  • Validity or threat of a given email address : Is it a known spam address or even real?
  • Validity of user information : Do they seem real or do they live at 123 Main Street and are named Rhinoceros?
  • Frequencies of words used that Represent Code: Is the user always using the word var or curly braces and semi-colons?
  • Bayesian probability of belonging to one category or another based on word frequencies: Do words like var keep appearing?
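A few of the encodings described above (a Boolean flag, a Levenshtein distance on the IP, a request frequency) can be combined into one vector roughly as follows. The session field names, the blocked-IP set, and the example addresses are all hypothetical and only serve the illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def build_vector(session, blocked_ips):
    """Turn one session record into a numeric trait vector.

    Assumes blocked_ips is non-empty and session uses the hypothetical keys below.
    """
    ip = session["ip"]
    nearest = min(levenshtein(ip, bad) for bad in blocked_ips)
    duration = max(session["duration_seconds"], 1e-9)  # avoid division by zero
    return [
        1.0 if ip in blocked_ips else 0.0,     # Boolean trait: known malicious IP
        float(nearest),                        # distance to the closest blocked IP
        session["request_count"] / duration,   # requests per second
        float(session["malformed_requests"]),  # simple count trait
    ]


blocked = {"203.0.113.7"}
example = {"ip": "203.0.113.9", "duration_seconds": 60.0,
           "request_count": 120, "malformed_requests": 4}
vector = build_vector(example, blocked)
```

Vectors built this way can be fed to whichever clustering or neural net implementation is in use, usually after scaling so that no single trait dominates the distance.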

Traits that May not Be Useful

People looking for your data will try to appear normal, either accessing your site periodically or attempting an attack in one fell swoop. Some traits may be less informative, and all traits depend on your particular activity. The following traits may in fact be representative, but likely are not.

  • Commonality of User Name : Not extremely informative but good to study
  • Validity of user information: Perhaps your users actually value their secrecy and your plans to get to know them are ill-advised

Do not Immediately Discount Traits and Always Test

Not all traits that seem discountable are. Perhaps users value their privacy and provide fake credentials. However, what credentials are provided can be key. More often, such information could provide a slight degree of similarity with a bad cluster or just enough of an edge toward an activation equation to tip the scales from good to bad or vice versa. A confusion matrix and test data should always be used in discerning whether the traits you picked are actually informative.
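As a rough sketch of that check, the confusion matrix for a two-class problem can be counted by hand. The labels "bad" and "good" and the six test records are made up for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive="bad"):
    """Count true/false positives and negatives for a two-class test set."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def precision_recall(counts):
    """Precision: share of flagged users that were bad. Recall: share of bad users flagged."""
    flagged = counts["tp"] + counts["fp"]
    relevant = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / relevant if relevant else 0.0
    return precision, recall


# Hypothetical held-out test data and the model's guesses for it.
actual = ["bad", "good", "good", "bad", "good", "bad"]
predicted = ["bad", "good", "bad", "bad", "good", "good"]
counts = confusion_matrix(actual, predicted)
precision, recall = precision_recall(counts)
print(dict(counts), precision, recall)
```

Running the same test set with and without a given trait in the vector shows whether that trait actually moves these numbers.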

Bayes, Cosines, and Text Content

Not all attacks can be detected by behaviour. Perhaps a vulnerability is already known. In this case, it is useful to look at Bayesian probabilities and perhaps cosine similarities. Even obfuscated code contains specific key words. For example, variables in JavaScript are commonly declared with var (or let and const in newer code), most programming languages use semi-colons, and obfuscated code is often a one line mess. Bayesian probability states that the presence of one item followed by another, when compared to frequencies from various categories, yields a certain probability of belonging to a category.
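A minimal naive Bayes sketch of this idea follows. It treats tokens as independent, which is a simplification of "one item followed by another", and the tokenizer, the two category names, and the four training samples are hypothetical.

```python
import math
import re
from collections import Counter

# Words plus a handful of punctuation marks that are typical of code.
TOKEN = re.compile(r"[A-Za-z_]+|[;={}()]")


def tokenize(text):
    return TOKEN.findall(text.lower())


def train(samples):
    """samples: iterable of (text, label). Returns per-label token counts and document counts."""
    token_counts, doc_counts = {}, Counter()
    for text, label in samples:
        token_counts.setdefault(label, Counter()).update(tokenize(text))
        doc_counts[label] += 1
    return token_counts, doc_counts


def classify(text, token_counts, doc_counts):
    """Return the label with the highest log posterior, using Laplace smoothing."""
    vocabulary = set().union(*token_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in token_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # prior
        for token in tokenize(text):
            score += math.log((counts[token] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


training = [
    ("var x = 1; eval(x);", "code"),
    ("function f() { return document.cookie; }", "code"),
    ("great product, fast shipping", "text"),
    ("please update my delivery address", "text"),
]
token_counts, doc_counts = train(training)
print(classify("var a = document; send(a);", token_counts, doc_counts))
```

With real traffic, the training samples would come from form fields and request bodies already labelled as harmless or as injection attempts.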

If Bayes is failing, then perhaps similarity is useful. Words like e and var and characters such as ; or = may be more important in code.
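Cosine similarity over token counts is one way to act on that. The sketch below compares a submitted string against a known code sample. The token pattern and the three example strings are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Words plus the ; and = characters mentioned above.
TOKEN = re.compile(r"[A-Za-z_]+|[;=]")


def profile(text):
    """Count words and code-like characters in a string."""
    return Counter(TOKEN.findall(text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


known_code = profile("var a = 1; var b = a;")
candidate = profile("var q=7;var r=q;")
prose = profile("I would like to return my order")

print(cosine_similarity(known_code, candidate))
print(cosine_similarity(known_code, prose))
```

A threshold on the score then decides what gets flagged, and that threshold is best picked from test data rather than guessed.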

Getting Started Extracting Tables With PDFMiner

PDFMiner has evolved into a terrific tool. It allows direct access to pdf files at the lowest level, allowing for fine-grained control over the extraction of data. Combined with document writer, recognition, and image manipulation tools, as well as a little math magic, the power of commercial tools can be had for all but the most complex tasks. I plan on writing about the use of OCR, Harris corner detection, and contour analysis in OpenCV, homebrew code, and tesseract later.

However, there is little in the way of documentation beyond basic extraction and no python package listing. Basically, the methods are discoverable but not listed in full. In fact, existing documentation consists mainly of examples despite the many different modules and classes designed to complete a multitude of tasks. The aim of this article, hopefully the first in a two-part series, is to help with the extraction of information. The next step is the creation of a pdf document using a tool such as pisa or reportlab, since PdfMiner only performs extraction.

The Imports
There are several imports that will nearly always be used for document extraction. All are under the main pdfminer import. The imports can get quite large.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import pdfminer.layout
from pdfminer.layout import LAParams,LTTextBox,LTTextLine,LTFigure,LTTextLineHorizontal,LTTextBoxHorizontal,LTRect,LTImage
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

Some imports are meant to perform extraction and others are meant to check for and support the extraction of different types.

Visual Representation of Outputs

The following image is taken from pdfminer’s limited documentation.

Source: Carleton University

Imports for Extraction
The following table goes over the imports that perform actions on the document.

Import              Description
PDFParser           The parser class, normally passed to the PDFDocument, that helps obtain elements from the document.
PDFResourceManager  Helps with aggregation. Performs some tasks between the interpreter and device.
PDFPageInterpreter  Obtains pdf objects for extraction. With the PDFResourceManager, it changes page objects to instructions for the device.
PDFPageAggregator   Takes in LAParams and PDFResourceManager for getting text from individual pages.
PDFDocument         Holds the parser and allows for direct actions to be taken.
PDFDevice           Turns the interpreter's instructions into output.
LAParams            Holds the layout analysis parameters used during extraction.

Imports that act as Types
Other imports act as a means of checking against types and utilizing the other classes' properties. The PDFDocument contains a variety of pdf objects which hold their own information. That information includes the type, the coordinates, and the text displayed in the document. Images can also be handled.

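The objects include the layout types imported above (LTTextBox, LTTextLine, LTFigure, LTTextLineHorizontal, and LTTextBoxHorizontal) as well as LTRect and LTImage, which appear later in this article.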

These types are useful for pulling information from tables as explained by Julian Todd.

Creating the Document
Creating the document requires instantiating each of the parts present in the diagram above. The order for setting up the document is to create a parser and then the document with the parser. The resource manager and LAParams are accepted as arguments by the PDFPageAggregator device used for this task. The PDFPageInterpreter accepts the aggregator and the resource manager. This set-up is typical of all parsers and also forms part of the pdf writers.

Writing to a StringIO buffer will make extraction run more quickly, since the underlying object is implemented in C.

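        with open(fpath,'rb') as fp:
            doc = PDFDocument(PDFParser(fp))  # parser first, then the document
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)  # interpreter: the PDFPageInterpreter described above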

The aggregator's get_result() method returns the layout of the page just processed. The results are passed to the ParsePage definition. For pure text extraction, a StringIO buffer and its .getvalue() method can be used instead.
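Put together, the set-up described in this section might look like the sketch below. It assumes Python 3 with the pdfminer.six package installed. The function name iter_page_layouts is my own and is not part of pdfminer.

```python
def iter_page_layouts(fpath):
    """Yield one layout object per page of the PDF at fpath."""
    # Imports live inside the function; they assume pdfminer.six is installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(fpath, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))    # parser first, then the document
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)  # the interpreter drives the device
            yield device.get_result()       # layout of the page just processed
```

Each yielded layout can then be handed to whatever page-parsing routine is in use, for example with a loop such as for layout in iter_page_layouts(fpath).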

The PDF Text Based Objects
The layout received from get_result() parses the strings into separate objects. These objects have several key components. They are the type, the coordinates (startingx, startingy, endingx, endingy), and the content.

The type can be found using type(object) and compared to the direct type (e.g. type(object)==LTRect). For a rectangle object, a comparison to LTRect returns True.

Getting the Output
Output is obtained and parsed through a series of method calls. The following example shows how to extract content.

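        objstack = list(reversed(list(layout)))  # layout comes from get_result()
        tcols = []  # collected text lines
        while objstack:
            b = objstack.pop()
            if type(b) in [LTFigure, LTTextBox, LTTextLine, LTTextBoxHorizontal]:
                objstack.extend(reversed(list(b)))  # expand containers in place
            elif type(b) == LTTextLineHorizontal:
                tcols.append(b)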

This code treats the object stack as a plain list, which already provides the pop method. Python does have a collections package (import collections) with further data structures such as a deque, but the built-in list is flexible enough here.

This example is a modification of Julian Todd’s code since I could not find solid documentation for pdfminer. It takes the objects from the layout, reverses them since they are placed in the layout as if it were a stack, and then iterates down the stack, finding anything with text and expanding it or taking text lines and adding them to the list that stores them.

The resulting list (tcols) looks much like the pure extractions that can be performed with a variety of tools, including Java's pdfbox, pypdf, and even pdfminer. However, each object carries its position in bbox (the bounding box coordinate list) and its text, accessible from .get_text().


Images are handled using the LTImage type, which has a few attributes in addition to coordinates and data. The image contains bits, colorspace, height, imagemask, name, srcsize, stream, and width.

Extracting an image works as follows:

if type(b) == LTImage:

PDFMiner only seems to extract jpeg objects. However, xpdf extracts all images.
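One way to write such an image to disk is sketched below. It assumes the image's stream attribute offers get_rawdata(), as pdfminer's stream objects do in the releases I am aware of, and that a jpeg's raw stream can be saved unchanged. The helper name save_image and the ".bin" fallback are my own choices.

```python
import os


def save_image(lt_image, out_dir):
    """Write the raw stream of an LTImage-like object to out_dir and return the path."""
    data = lt_image.stream.get_rawdata()
    # Assumption: data starting with the jpeg magic bytes is a complete jpeg file.
    ext = ".jpg" if data[:2] == b"\xff\xd8" else ".bin"
    path = os.path.join(out_dir, lt_image.name + ext)
    with open(path, "wb") as out:
        out.write(data)
    return path
```

Inside the stack loop this would be called as save_image(b, folder) whenever type(b) == LTImage. Anything saved as ".bin" is still encoded and needs another tool to decode.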

A more automated and open source solution would be to use subprocess.Popen() to call a java program that extracts images to a specific or provided folder using code such as this (read the full article).

import shlex
import subprocess
pipe=subprocess.Popen(shlex.split("java -jar myPDFBoxExtractorCompiledJar /home/username/Documents/readable.pdf  /home/username/Documents/output.png"),stdout=subprocess.PIPE,stderr=subprocess.STDOUT)  # stderr is merged into the captured stdout

Handling the Output
Handling the output is fairly simple and forms the crux of this article's benefits, besides combining a variety of resources in a single place.

Just iterate down the stack and pull out the objects as needed. It is possible to form the entire structure using the coordinates. The bounding box method allows for objects to be input in a new data structure as they appear in any pdf document. With some analysis, generic algorithms are possible. Given the lack of documentation, it may be a good idea to write some exploratory code first.
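As a sketch of rebuilding structure from coordinates, the function below groups text objects into rows by their top edge and orders each row left to right. It relies only on the .bbox and .get_text() members described earlier. The tolerance value is a guess, and the rounding-based bucketing is crude: two lines that sit just either side of a bucket boundary will be split into separate rows.

```python
from collections import defaultdict


def group_into_rows(text_lines, tolerance=2.0):
    """Group objects exposing .bbox and .get_text() into rows, top to bottom."""
    rows = defaultdict(list)
    for obj in text_lines:
        # Bucket by the top edge (bbox[3]); the tolerance absorbs small offsets.
        key = round(obj.bbox[3] / tolerance)
        rows[key].append(obj)
    ordered = []
    for key in sorted(rows, reverse=True):                   # pdf y grows upwards
        cells = sorted(rows[key], key=lambda o: o.bbox[0])   # left to right
        ordered.append([c.get_text().strip() for c in cells])
    return ordered
```

Calling group_into_rows(tcols) on the list built earlier would return one list of cell strings per table row, which is often enough to spot the column positions worth filtering on.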

The following extracts specific columns of an existing pdf. The bounding box list/array is set up as follows. bbox[0] is the starting x coordinate, bbox[1] is the starting y coordinate, bbox[2] is the ending x coordinate, and bbox[3] is the ending y coordinate.

records,cases,dates,times,types,locations,attorneys=self.convertToDict([
    [x for x in tcols if float(x.bbox[0]) <= 21.0 and "Name\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 176.0 and "Case\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 257.0 and "Date\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 307.0 and "Time\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 354.0 and "Type\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 607.0 and "Location\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] == 645.0 and "Attorney\n" not in x.get_text()],
])

This code uses python's list comprehension. The inequality is used because slight differences exist in the placement of objects. The newline escape character represents an underline in this case.
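Because placement varies slightly, a small tolerance around each starting x coordinate may be more forgiving than a fixed comparison. The sketch below is one way to express that. The x positions and header strings are the ones used above, while the helper names and the tolerance of 1.0 are my own assumptions.

```python
def select_column(text_lines, x_start, header, tolerance=1.0):
    """Pick objects whose left edge lies within tolerance of x_start, skipping the header."""
    return [
        obj for obj in text_lines
        if abs(float(obj.bbox[0]) - x_start) <= tolerance
        and header not in obj.get_text()
    ]


# Column name -> (starting x coordinate, header text to skip)
COLUMNS = {
    "cases": (176.0, "Case\n"),
    "dates": (257.0, "Date\n"),
    "times": (307.0, "Time\n"),
    "types": (354.0, "Type\n"),
    "locations": (607.0, "Location\n"),
    "attorneys": (645.0, "Attorney\n"),
}


def split_columns(text_lines, columns=COLUMNS, tolerance=1.0):
    """Return a dict mapping each column name to its matching text objects."""
    return {
        name: select_column(text_lines, x_start, header, tolerance)
        for name, (x_start, header) in columns.items()
    }
```

The tolerance should stay smaller than the gap between neighbouring columns, otherwise an object could be claimed by two columns at once.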

Pure Text Extraction

In order to see how to perform pure text extraction and move to a better understanding of the code, analyze the following code.

       with open(fpath,'rb') as fp:
            for page in PDFPage.create_pages(doc):
                if len(rstr.strip()) >0:
            return lines
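The fragment above omits the set-up. A fuller sketch of pure text extraction, under the assumption that Python 3 and pdfminer.six are in use, could look like this. The names nonblank_lines and extract_lines are mine. TextConverter is pdfminer's text device, which was not among the imports listed earlier.

```python
from io import StringIO


def nonblank_lines(text):
    """Split extracted text into stripped lines, dropping the empty ones."""
    return [line.strip() for line in text.split("\n") if len(line.strip()) > 0]


def extract_lines(fpath):
    """Return the non-empty text lines of every page in the PDF at fpath."""
    # Imports live inside the function; they assume pdfminer.six is installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(fpath, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)  # page text is appended to retstr
    text = retstr.getvalue()                # read the buffer before closing the device
    device.close()
    return nonblank_lines(text)
```

Compared with the aggregator approach, this gives up the coordinates and returns only strings, which is why it suits plain text better than tables.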

PdfMiner is a useful tool that can read pdfs and their actual formatting. The tool is flexible and can easily control strings. Extracting data is made much easier compared to some full text analysis, which can produce garbled and misplaced lines. Not all pdfs are made equal.