How Can the Tools of Big Data Detect Malicious Activity?

With Apple in the news, security becoming a large concern, and companies trying new ways to protect their online presence, finding malicious activity has become an exploding topic. The tools of big data offer some deeper insights into how to discover users with bad intentions before data is lost. This article deals with protecting an online presence.

Detection can go well beyond knowing when a bad credit card hits the system or when a blocked IP address attempts to access a website.

Similarity: The Neural Net or Cluster

The neural net has become an enormous topic. Today it is used to discern categories in fields ranging from biology to dating or even terrorist activity. Similarity-based algorithms have come into their own since their inception, largely in the Cold War intelligence game. Yet how different is finding political discussions in conversational data captured at the Soviet embassy, or discovering a sleeper cell in Berlin, from finding a hacker? Not terribly different at the procedural level, actually. Find the appropriate vectors, train the neural net or clustering algorithm, and try to find clusters representing those with an aim to steal your data. These are your state secrets. With Fuzzy C Means, K Means, and RBF neural nets, the line between good and bad doesn’t even need to look like a middle school dance, with each group lined up neatly on its own side of the room.
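The clustering step can be sketched in a few lines. What follows is a minimal, pure Python K Means, not a production implementation. The session vectors and the three traits in them are made up for illustration, and the first k points are used as starting centroids only to keep the sketch deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K Means on a list of equal-length numeric vectors."""
    # Deterministic start for this sketch: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Hypothetical sessions: [requests per second, malformed requests, seconds per page]
sessions = [
    [0.2, 0, 40.0],
    [45.0, 12, 0.3],
    [0.3, 1, 35.0],
    [50.0, 9, 0.2],
    [0.1, 0, 55.0],
]
centroids, labels = kmeans(sessions, k=2)
```

In practice the traits would be scaled to comparable ranges first, and the resulting clusters would be compared against sessions from known attacks to decide which cluster is the bad one.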

Here is just a sampling of the traits that could be used in similarity algorithms, which require shaping a vector to train on. Using them in conjunction with data taken from previous hacking attempts, it shouldn’t be extremely difficult to flag the riff raff.

Traits that Can be Useful

Useful traits come in a variety of forms. They can be encoded as a 1 or 0 for a Boolean value such as known malicious IP (always block these). They could be a Levenshtein distance on that IP. Perhaps a frequency for number of requests per second is important. They may even be a probability or weight describing likelihood of belonging to one category or another based on content. Whichever they are, they should be informative to your case with an eye towards general trends.

  • Types of Items purchased : Are they trivial like a stick of gum?
  • Number of Pages Accessed while skipping a level of depth on a website : Do they attempt to skip pages despite a viewstate or a typical usage pattern?
  • Number of Malformed Requests : Are they sending bad headers?
  • Number of Each type of Error Sent from the Server : Are there a lot of malformed errors?
  • Frequency of Requests to your website : Does it look like a DoS attack?
  • Time spent on each Page : Is it too brief to be human?
  • Number of Recent Purchases : Perhaps they appear to be window shopping
  • Spam or another derived level usually sent from an IP address: Perhaps a common proxy is being used?
  • Validity or threat of a given email address : Is it a known spam address or even real?
  • Validity of user information : Do they seem real or do they live at 123 Main Street and are named Rhinoceros?
  • Frequencies of words used that Represent Code: Is the user always using the word var or curly braces and semi-colons?
  • Bayesian probability of belonging to one category or another based on word frequencies: Are words like var appearing?
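As a rough sketch of how a few of these traits might be shaped into a vector, the following encodes a Boolean, a Levenshtein distance, and two rates for one session. The blocklist, the field names in the session record, and the choice of traits are all hypothetical and only serve the illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical blocklist (documentation-only address ranges).
BLOCKED_IPS = {"203.0.113.7", "198.51.100.23"}

def session_vector(session):
    """Turn one session record into a numeric trait vector."""
    ip = session["ip"]
    duration = max(session["duration_seconds"], 1e-9)  # avoid dividing by zero
    return [
        1.0 if ip in BLOCKED_IPS else 0.0,                          # known malicious IP
        float(min(levenshtein(ip, bad) for bad in BLOCKED_IPS)),    # distance to nearest blocked IP
        session["requests"] / duration,                             # requests per second
        float(session["malformed_requests"]),                       # malformed request count
        session["duration_seconds"] / max(session["pages"], 1),     # seconds spent per page
    ]

example = {"ip": "203.0.113.9", "requests": 120, "duration_seconds": 60.0,
           "malformed_requests": 4, "pages": 30}
vector = session_vector(example)
```

Each position in the vector has to mean the same thing for every session, otherwise the distances a clustering algorithm computes are meaningless.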

Traits that May not Be Useful

People looking for your data will try to appear normal, either accessing your site periodically or attempting an attack in one fell swoop. Some traits may be less informative. All traits depend on your particular activity. The following traits may, in fact, be representative but likely are not.

  • Commonality of User Name : Not extremely informative but good to study
  • Validity of user information: Perhaps your users actually value their secrecy and your plans to get to know them are ill-advised

Do not Immediately Discount Traits and Always Test

Not all traits that seem discountable are. Perhaps users value their privacy and provide fake credentials. However, what credentials are provided can be key. More often, such information could provide a slight degree of similarity with a bad cluster or just enough of an edge toward an activation equation to tip the scales from good to bad or vice versa. A confusion matrix and test data should always be used in discerning whether the traits you picked are actually informative.
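A confusion matrix is simple to compute once labeled test data exists. The sketch below assumes two labels, "good" and "bad", and uses invented test labels and predictions. It is only meant to show what gets counted.

```python
def confusion_matrix(actual, predicted, positive="bad"):
    """Count true/false positives and negatives for a two-class problem."""
    matrix = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            matrix["tp" if truth == positive else "fp"] += 1
        else:
            matrix["fn" if truth == positive else "tn"] += 1
    return matrix

def precision_recall(matrix):
    """Precision: how many flagged users were bad. Recall: how many bad users were flagged."""
    flagged = matrix["tp"] + matrix["fp"]
    truly_bad = matrix["tp"] + matrix["fn"]
    precision = matrix["tp"] / flagged if flagged else 0.0
    recall = matrix["tp"] / truly_bad if truly_bad else 0.0
    return precision, recall

# Hypothetical held-out test labels and model output.
actual    = ["bad", "bad",  "good", "good", "good", "bad", "good", "good"]
predicted = ["bad", "good", "good", "bad",  "good", "bad", "good", "good"]

matrix = confusion_matrix(actual, predicted)
precision, recall = precision_recall(matrix)
```

Rerunning this with a trait added or removed gives a quick read on whether that trait actually helps.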

Bayes, Cosines, and Text Content

Not all attacks can be detected by behaviour. Perhaps a vulnerability is already known. In this case, it is useful to look at Bayesian probabilities and perhaps cosine similarities. Even obfuscated code contains specific key words. For example, variables in JavaScript are often declared with var, most programming languages use semi-colons, and obfuscated code is often a one line mess. Bayesian probability would state that the presence of one item followed by another, when compared to frequencies from various categories, yields a certain probability of belonging to a category.
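One simple way to put this into practice is a naive Bayes classifier over tokens. The sketch below is a simplification: it treats tokens as independent rather than looking at one item followed by another, it uses add-one smoothing, and the tokenizer, category names, and training texts are all invented for illustration.

```python
import math
import re
from collections import Counter

# Words plus a handful of characters that are common in code.
TOKEN = re.compile(r"[A-Za-z_]+|[;={}()]")

def tokenize(text):
    return TOKEN.findall(text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # category -> Counter of tokens
        self.doc_counts = Counter()  # category -> number of training texts
        self.vocabulary = set()

    def train(self, text, category):
        tokens = tokenize(text)
        self.word_counts.setdefault(category, Counter()).update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[category] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for token in tokenize(text):
                score += math.log((counts[token] + 1) / denominator)
            scores[category] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

model = NaiveBayes()
model.train("var a = 1; var b = a;", "code")
model.train("function f() { var x = y; }", "code")
model.train("I love this product and the fast shipping", "prose")
model.train("Great price, would buy again", "prose")

injected = model.classify("var z = q;")
review = model.classify("I would buy this again")
```

Working in log space avoids underflow when the submitted text is long.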

If Bayes is failing, then perhaps similarity is useful. Words like e and var and characters such as ; or = may be more important in code.
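As a sketch of the similarity approach, token counts can be compared with a cosine. The reference code profile and the two submitted strings below are made up, and the tokenizer is an assumption that keeps words and a few code-like characters.

```python
import math
import re
from collections import Counter

# Words plus a handful of characters that are common in code.
TOKEN = re.compile(r"[A-Za-z_]+|[;={}()]")

def token_counts(text):
    return Counter(TOKEN.findall(text.lower()))

def cosine(a, b):
    """Cosine similarity between two token count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical profile built from known code samples.
code_profile = token_counts("var a = 1; var b = a; function f() { var x = y; }")

suspicious = token_counts("var e = document; e = e;")
harmless = token_counts("Do you ship to Canada and how long does it take")

suspicious_score = cosine(code_profile, suspicious)
harmless_score = cosine(code_profile, harmless)
```

A threshold on the score, tuned against test data, then decides what gets flagged.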

The Very Sad and Disturbing State of JVM Based Sparse Matrix Packages

Big data is the rage, distribution is the rage, and so too is the growth of streaming data. The faster a company is, the better. Such speed requires, no, demands solid matrix performance. Worse yet, big data is inherently sparse, and testing and implementing new algorithms requires sparse matrices (CSR, CSC, COO, and the like). Sadly, Java is not up to the task.

Let’s revisit some facts. Java is faster than Python at its core. Many tasks require looping over data in ways numpy or scipy simply do not support. A recent benchmark of Python3 v. Java highlights this. Worse, the standard interpreter for both Python2 and Python3 uses the global interpreter lock (GIL), making attempts at speed through threading often slower than a single thread and forcing developers to use the dreaded multiprocessing (large clunky programs using only as many processes as cores). Still, multiprogramming and other modern operating systems concepts are helping Python achieve better, albeit still quite slow, speeds.

That said, Numpy and Scipy are the opposite of any of these things. They run on the slower CPython interpreter, but their cores are written in compiled languages such as C, so they perform blazingly fast and leave all Java matrix libraries in the dust. In fact, in attempting to implement some basic machine learning tasks, I found myself not just writing things like text tiling, which I fully expected to do, but also starting down the path of creating a full-fledged sparse matrix library with a hashing library.

The following is the sad state of my Matrix tests.

The Libraries

The following libraries were used in the test: Breeze, UJMP, MTJ, La4j, and BidMach.

The Cosines Test

An intensive test of a common use case is the calculation of the dot product (a dot b, or a * b.t in matrix form). Taking this result and dividing by norm(a)*norm(b) yields the cosine of theta, the angle between the two vectors.

This simple test includes multiplication, transposing, and mapping division across all active values.
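To make those three operations concrete, here is a small pure Python sketch of the same calculation on a sparse matrix stored as one {column: value} dict per row. It is not the benchmark code, and the storage layout and function names are assumptions chosen for readability rather than speed.

```python
import math

def transpose(rows):
    """Transpose a sparse matrix stored as a list of {column: value} dicts."""
    columns = {}
    for i, row in enumerate(rows):
        for j, value in row.items():
            columns.setdefault(j, {})[i] = value
    return columns

def pairwise_cosines(rows):
    """Return the stored entries of (A * A.t) / (norm(a_i) * norm(a_k))."""
    norms = [math.sqrt(sum(v * v for v in row.values())) for row in rows]
    columns = transpose(rows)
    result = [dict() for _ in rows]
    # Multiplication: accumulate A * A.t one shared column at a time.
    for i, row in enumerate(rows):
        for j, value in row.items():
            for k, other in columns[j].items():
                result[i][k] = result[i].get(k, 0.0) + value * other
    # Mapped division: scale only the active (stored) entries.
    for i, entries in enumerate(result):
        for k in entries:
            entries[k] /= norms[i] * norms[k]
    return result

# Rows 0 and 1 point in the same direction; row 2 shares no column with them.
matrix = [
    {0: 1.0, 2: 2.0},
    {0: 2.0, 2: 4.0},
    {1: 3.0},
]
cosines = pairwise_cosines(matrix)
```

Pairs of rows with no column in common never produce an entry, which is exactly the work a sparse library is supposed to skip.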

The Machine and Test Specs

The following machine specifications held:

  • CPU : Core i3 2.33 GHz
  • RAM : 8 GB (upgraded on laptop)
  • Environment: Eclipse
  • Test: Cosine Analysis
  • Languages: Scala and Python(scipy and numpy only)
  • Iterations: 32
  • Allotted Memory: 7 GB, either with x64 Python3 or -Xmx7g
  • Average: Strict non-smoothed average

The Scipy/Numpy Gold Standard

Not many open source libraries can claim the speed and power of the almighty Scipy and Numpy. Together they can handle sparse matrices with m*n well over 1,000,000. More telling, they are fast. The calculation of cosines is an extremely common practice in NLP and is a valuable similarity metric in most circumstances.

from scipy import sparse
import scipy.sparse.linalg

# 1000 x 50000 random sparse matrix with a density of 0.15
mat = sparse.rand(1000, 50000, 0.15)

Result : 5.13 seconds

The Results

The following resulted from each library:

  • Breeze : Crashes with Out of Memory Error (developer notified) [mat * mat.t]
  • UJMP : 128.73 seconds
  • MTJ : 285.13 seconds
  • La4j : 420.2 seconds
  • BidMach : Crashes (sprand(1000,50000,0.15))

Here is what the author of Breeze had to say. Rest assured, Numpy has been stable for over a decade now with constant improvements.


Java libraries are slow and seemingly undeserving of praise. Perhaps, given the potential benefits of not necessarily distributing every calculation, they are simply not production grade. Promising libraries such as Nd4J/Nd4s still do not have sparse matrix support, although they have claimed for some time to have it in the works. The alternatives are to use Python or program millions of lines of C code. While I find C fun, it is not a fast language to develop in. Perhaps, for now, Python will do. After all, PySpark is a little over 6 months old.

Open Source Data Science, the Great Resource No One Knows About

There is a growing industry for online technology courses that is starting to gain traction among many who may have been in school when fields like data science were still the plaything of graduate students and PhDs in computer science, statistics, and even, to a degree, biology. However, these online courses will never match the pool of knowledge one could drink from by taking even an undergraduate computer science or mathematics class at a middling state school today (I would encourage everyone to avoid business schools like the plague for technology).

In an industry that is constantly transforming itself, and especially where the field of data will provide long-term work, these courses may appear quite appealing. However, they are often too shallow to provide much breadth. Thinking that it is possible to simply pick up and understand the depth of the 1000 page thesis that led to the stochastic approach to matrix operations and eventually Spark is ridiculous. We are all forgetting about the greatest resources available today. The internet, open source code, and a search engine can add layers of depth to what would otherwise be an education not able to provide enough grounding for employment.

Do Take the Online Courses

First off, the online courses from Coursera are great. They can provide a basic overview of some of the field. Urbana offers a great data science course and I am constantly stumbling into blogs presenting concepts from it. However, what can someone fit into 2-3 hours per week for six weeks, in a field where becoming expert-level may take 2-3 years of undergraduate coursework and even some masters level topics?

You may learn a basic K Means or deploy some subset of algorithms, but can you optimize them? Do you really know more than the Bayesian probabilities that you likely also learned in a statistics class?

Where Open Source Fits In

Luckily, many of the advanced concepts and a ton of research are actually available online for free. The culmination of decades of research is available at your fingertips in open source projects.

Sparse Matrix research, edge detection algorithms, information theory, text tiling, hashing, vectorizing, and more are all available to anyone willing to put in the time to learn them adequately.


Documentation is widely available and often on GitHub for:

These GitHub accounts also contain useful links to websites explaining the code, containing further documentation (javadocs), and giving some conceptual depth and further research opportunities.

The vast majority of conceptual literature can be found with a simple search.

Sit down, read the conceptual literature. Find books on topics like numerical analysis, and apply what you spent tens or even hundreds of thousands of dollars to learn in school.