2019/2020 – The Years of Data Engineering (Opinion)


Photo by Flickr user Stuck in Customs/Creative Commons

The new year brings new hopes for all of the hottest technologies built over the past few years. The last few years were filled with visualization tools and frameworks. Tableau is now a household name, Salesforce is a workhorse for analytics, SAS continues to grow through jmp, and small players such as Panoply are well funded.

This underlies an important and missing component in data. Data management tools and frameworks are severely deficient. Many merely perform materialization.

That is changing this year and it means that data engineering will be an important term over the next few years. Automation will become a reality.

The Data Engineering Problem

Data engineers create pipelines. This means automating the handling of data from aggregation and ingestion to the modeling and reporting process.

As they cover the entire pipeline for your data and often implement analytics in a repeatable manner, data engineering is a broad task. Terms such as ETL, ELT, verification, testing, reporting, materialization, standardization, normalization, distributed programing, crontab, Kubernetes, microservices, Docker, Akka, Spark, AWS, REST, Postgres, Kafka, and statistics are commonly slung with ease by data engineers.

Until 2019, integrating systems often meant combing a variety of tools into a cluttered wreck. A company might deploy python scripts for visualization, Vantara (formerly Pentaho) for ETL, use a variety of aggregation tools combined in Kafka, have data warehouses in PostgreSQL, and may even still use Microsoft Excel to Store data.

The typical company spends $4000 – $8000 per employ providing these pipelines. This cost is unacceptable and  can be avoided in the coming years.

ELT will Not Kill Data Engineers

ELT applications promise to get rid of data engineers but that is pure nonsense meant to attract an ignorant investors money:

  • ELT is often performed on data sources that already underwent ETL by the companies it was purchased from such as Axciom, Nasdaq, and TransUnion
  • ELT eats resources in a significant way and often limits its use to small data sets
  • ELT ignores issues related to streaming from surveys and other sources which greatly benefit from the requirements analysis and transformations of ETL
  • ELT is horrible for integration tasks where data standards differ or are not existant
  • You cannot run good AI or build models on poorly or non-standardized data

This means that ETL will continue to be an important part of a data engineers job.

Of course, since data engineers translate business analyst requirements into reality, the job will continue to be secure. Coding may become less important as new products are released but will never go away in the most efficient organizations.

Limits of Python and GoLang Benefits the Data Engineering Stack

Many people point to Python as a means for making data engineers redundant. This is simply false.

Python is limited and GoLang is only about 30 times faster than Python. This means that jvm will rise in popularity with data scientist and even analysts as companies want to make money on the backs of their algorithms. This benefits data engineers who are typically proficient in at least Java or Scala.

Python works for developers, analysts, and data scientists who want to control tools written in a more powerful language such as C++ or Java. Pentaho dabbled in this before being bought by Hitachi. However, being 60 times slower than the jvm and often requiring three times the resources,  it is not an enterprise grade language.

Python does not provide power. It is not great at parallelism and is single threaded. Any language can achieve parallelism. Python uses heavy OS threads to perform anything asynchronously. This is horrendous.

Consider the case of using Python’s Celery versus Akka, a Scala and Java based tool. Celery and Akka perform the same tasks across a distributed system.

Parsing millions of records in celery can quickly eat up more than fifty percent of a typical servers resources with a mere ten processes. RabbitMQ, the messaging framework behind Celery, can only parse 50 million messages per second on a cluster. Depending on the use case, Celery may also require Redis to run effectively. This means that an 18 logical core server with 31 gigabytes of RAM can be severely bogged down simply processing tasks.

Akka, on the other hand, is the basis for Apache Spark. It is lightweight and all inclusive. 50 million messages per second is attainable  with 10 million actors running concurrently at much less than fifty percent of a typical servers resources. With not every use case requiring spark, even in data engineering, this is an outstanding difference. Not requiring a message routing and results backend means that less skill is required for deployment as well.

The Rise, Fall, and Rise of Scala and Streamlining

When I started programming in Scala, the language was fairly unheard of. Many co-workers merely looked at this potent language as a curiosity. Eventually, Scala’s popularity started to wain as java developers were still focused on websites and ignored creating the same frameworks for Scala that exist in Python.

That is changing. With the rise of R, whose syntax is incredibly similar to Scala, mathematicians and analysts are becoming incredibly used to programming in increasingly complex languages.

Perhaps due to this, Scala is making it back into the lexicon of developers. The power of Python was greatly reduced in 2017 as non-existent or previously non-production level tools were released for the jvm.

Consider what is now at least version 1.0 in Scala:

  • Nd4j and Nd4s: A Scala and Java based non-dimensional array framework that boasts speeds faster than Numpy
  • Dl4J: Skymind is a terrific company producing tools comparable to torch
  • Tensor Flow: Contains APIs for both Java and Scala
  • Neanderthal: A clojure based linear algebra system that is blazing fast
  • OpenNLP: A new framework that, unlike the Stanford NLP tools, is actively developed and includes named entity recognition and other powerful transformative tools
  • Bytedeco: This project is filled with angels (I actually think they came from heaven) whose innovative and nearly automated JNI creator has created links to everything from Python code to Torch, libpostal, and OpenCV
  • Akka: Lightbend continues to produce distributed tools for Scala with now open sourced split brain resolvers that go well beyond my majority resolver
  • MongoDB connectors: Python’s MongoDB connectors are resource intensive due to the rather terrible nature of Python byte code
  • Spring Boot: Scala and Java are interoperable but benchmarks of Spring Boot show at least a 10000 request per second improvement over Django
  • Apereo CAS: A single sign on system that adds terrific security to disparate applications

Many of these frameworks are available in Java.  Since Scala runs any Java programs, the languages are interoperable. Scala is cleaner, functional, highly modular, and requires much less code than Java which puts these tools in the reach of analysts.

What does this mean for a data engineer?

It means attaining 1000 times the speed on a single machine with significant cost reduction and up to a 33 percent code reduction over Java. Some developers report a 20 percent speed boost over Java but that is likely due to poor coding practices.

It also means moving from millions of moving parts to a steadfast system.

The result is clear. My own company is switching off of Python everywhere except for our responsive and front end heavy web application for a fifty percent cost reduction in hardware.

Putting it all Together to Unclutter a Mess

With everything that Scala and the jvm offers, Data Engineers now have a potent tool for automation. These valuable employees may not be creating the algorithms but they will be transforming data in smart ways that produce real value.

Companies no longer have to rely on archaic languages to produce messy systems and this will translate directly into value. Data engineers will be behind this increase in value as they can more easily combine tools into a coherent and flexible whole.


The continued rise of jvm backed tools starting in 2018 will make data pipeline automation a significant part of a companies IT cost. Data engineers will be behind the evolution of data pipelines from disparate systems to a streamlined whole backed by custom code and new products.

2019 and 2020 will be the years of data engineering. After this, we may just be seeing the creation of skynet.

Fluff Stuff: Better Governments, Better Processes, Simplr Insites

Cities are heading towards bankruptcy. The credit rating of Stockton, CA was downgraded. Harrisburg, PA is actually bankrupt.  It is only a matter of time before Chicago implodes. Since 1995, city debt rose by between $1.3 and $1.8 trillion. While a large chunk of this cost is from waste, more is the result of using intuition over data when tackling infrastructure and new projects. Think of your city government as the boss who likes football more than his job so he builds a football stadium despite your company being in the submarine industry.

This is not an unsolvable nightmare.

Take an effective use case where technologies and government processes were outsourced. As costs rose in Sandy Sprints, GA, the city outsourced and achieved more streamlined processes, better technology, and lower costs. Today, without raising taxes, the city is in the green. While Sandy Springs residents are wealth, even poorer cities can learn from this experience.

Cities run projects in an extremely scientific manner and require an immense amount of clean, quality, well-managed data isolated into individual projects to run appropriately. With an average of $8000 spent per employee on technology each year and with an immense effort spent in acquiring analysts and modernizing infrastructure, cities are struggling to modernize.

It is my opinion, one I am basing a company on, that the provision of quality data management, sharing and collaboration tools, IT infrastructure, and powerful project and statistical management systems in a single SAAS tool can greatly reduce the $8000 per employee cost and improve budgets. These systems can even reduce the amount of administrative staff, allowing money to flow to where it is needed most.

How can a collaborative tool tackle the cost problem. Through:

  • collaborative knowledge sharing of working, ongoing, and even failed solutions
  • public facing project blogs and information on organizations, projects, statistical models, results, and solutions that allow even non-mathematical members of an organization to learn about a project
  • a security minded institutional resource manager (IRM better thought of as a large, securable, shared file storage system) capable of expanding into the petabytes while maintaining FERPA, HIPPA, and state and local regulations
  • the possibility to share data externally, keep it internal, or keep the information completely private while obfuscating names and other protected information
  • complexity analysis (graph based analysis) systems for people, projects, and organizations clustered for comparison
  • strong comparison tools
  • potent and learned aggregation systems with validity in mind ranging from streamed data from sensors and the internet to surveys to uploads
  • powerful drag and drop integration and ETL with mostly automated standardization
  • deep diving upload, data set, project, and model exploration using natural language searching
  • integration with everything from a phone to a tablet to a powerful desktop
  • access controls for sharing the bare minimum amount of information
  • outsourced IT infrastructure including infrastructure for large model building
  • validation using proven algorithms and increased awareness of what that actually means
  • anomaly detection
  • organization of models, data sets, people, and statistical elements into a single space for learning
  • connectors to popular visualizers such as Tableau and Vantara with a customize-able dashboard for general reporting
  • downloadable sets with two entity verification if required that can be streamed or used in Python and R

Tools such as ours significantly reduce the cost of IT by as much as 65%. We eliminate much of the waste in the data science pipeline while trying to be as simple as possible.

We should consider empowering and streamlining the companies, non-profits, and government entities such as schools and planning departments that perform vital services before our own lives are greatly effected. Debt and credit are not solutions to complex problems.

Take a look, or don’t. This is a fluff piece on something I am passionately building. Contact us if you are interested in a beta test.

The Case for Using an IRM to Scale Data Intake

Among many, there are three major problems faced by an analyst before data is useful:

  • data aggregation and storage
  • data security and access
  • data wrangling (ETL/ELT)

This article deals with data security and access using an information resource management system, IRM. My own company, Simplr Insites LLC, is writing such a system alongside a file storage solution in an effort to modernize the research process.


One significant problem faced in research and cooperation is the attainment of clean and useful data. Obtaining this data often means gaining access to systems, forming legal agreements, obfuscating certain data, and embarking on the painful process of data wrangling.

While ETL and ELT are critical steps, just obtaining sensitive data, even from within an organization, is tricky. Consider the following cases related directly to access:

  • data sets include confidential information
  • data sets are ensnared in legal agreements regarding who can access data
  • users want to control access to data to ensure it is not misused
  • external users are allowed varying degrees of access

IRM as a Solution

Oracle generated a solution that attempts to tackle the data security issue. The Oracle IRM documentation provides a rather informative graphical overview of their tool:


In this system, an external user accesses a load balanced IRM server application which controls rights and access to different resources and files. Several firewalls help to improve security along with authentication, access grants, and encryption. Web services  and internal users utilize the IRM server as well.

Beyond the visible components, tokens can be used to instantly manage resources and propagate access changes.

Most file systems also offer the capability to pull the date when a resource was created or modified and various permissions information. This is useful for logging purposes.

Setting Up an IRM

It is not necessary to rely on Oracle for an IRM solution. In fact, the Oracle IRM only works with Microsoft Windows.

Each component can be paired with a reliable tool, most of which I have blogged about. A set of pairings might include

Base Application and Resource Management Django with Secure Login
REST API Resource Access Django OAuth Toolkit
Access Management Django Oauth Toolkit and a Database System
Individual Resource Tokens Randomly Generated and Hashed Key
File Storage GlusterFS or an Encrpytable File System
Encryption of Resources PyCrypto or a Similar Tool
Firewalls IP Tables or another firewall
Two Step Verification through SMS Twilio
Key Storage Stack Exchange Blackbox
VPN Access Firefox
Logging and Anomaly Detection Elastic APM and the ElkStack


Logging is critical to security. Logs allow administrators to spot harmful activity, generate statistical models based on usage, and aid in auditing the system.


Tokens are a perfect solution for controlling document access in the system. They allow a user to gain access to a document, offer scopes for access, and often contain scopes that grant levels of access to a resource.

A user should be required to log in to the application to retrieve a token which refreshes on a regular schedule. These tokens can be revoked and changed by a resource owner or administrator much like using a file system.

Fernet Encryption

While RSA encryption is useful for two way encryption, Fernet encryption is stronger and more useful for storing files. If a system does not offer encryption, tools such as PyCrypto offer Fernet encryption.

Storing Keys

Keys should not be stored in the open. If compromised, it is extremely easy to gain access to a key stored in plain text. Instead, tools such as Stack Exchange’s Blackbox store keys in a system backed by a GPG key ring.

Two Step Downloading for Extra Security

Downloading a file in a secure manner might require extra protection, particularly when an external but trusted user desires access to a resource. To avoid spoofing and avoid a compromised computer from gaining access to a resource, two step verification is a recommended step.

In this process the external user provides an access token to obtain a document which is verified. On verification, a text message containing an access code is sent to the external user and the internal user is notified of the access. The external user enters the code and, if required, the resource owner or admin approves the download.

This type of process is not difficult to implement through desktop or web applications using push notifications or persistent storage.


Secured yet accessible storage is a critical problem for any data analyst or scientist. Using an established IRM or implementing a similar tool helps secure access and empower analytics.

PostgreSQL for Converting NoSQL to SQL

In need of a solution to prep for ETL and recognizing that drill may not be as comprehensive as I can build with PGPLSQL as it only goes one layer deep, it was time to find a way to move from dynamically created Jsonb in PostgreSQL to PostgreSQL relational tables.

The solution was this little function. This can be built to use jsonb_split_array and other functions to easily and quickly build up functions that delve deeper than drill. Add the future master replication and seemingly improving distriution and threading to Enterprise DB’s growing set of accomplishments with Postgres and why use drill.

breakoutNoSQL(inputTable text,outputTable text,jsonColumn text,otherColumns text[],condition text)

Only othercolumns and condition can be null.


CREATE OR REPLACE FUNCTION breakoutNoSQL(inputTable text,outputTable text,jsonColumn text,otherColumns text[],condition text,splitData boolean) RETURNS text[] AS
    k text;
    keys text[];
    stmt text;
    insertKeys text;
    IF outputTable IS NULL THEN
    END IF;

    if inputTable IS NULL THEN
    END IF;

    --get the initial keys
    if condition IS NOT NULL THEN
       IF splitData IS TRUE THEN
	  execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys(jsonb_array_elements('||jsonColumn||'::jsonb))) as key FROM '||inputTable||') as q1 WHERE '||condition into keys;
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys('||jsonColumn||'::jsonb)) as key FROM '||inputTable||') as q1 WHERE '||condition into keys;
       END IF;
       IF splitData IS TRUE THEN
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys(jsonb_array_elements('||jsonColumn||'::jsonb))) as key FROM '||inputTable||') as q1' into keys;
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys('||jsonColumn||'::jsonb)) as key FROM '||inputTable||') as q1' into keys;
       END IF;
    end if;

    IF keys IS NULL OR array_length(keys,1) = 0 THEN
    END IF;

    --build the statement
    stmt = 'CREATE TABLE '||outputTable||' AS SELECT ';

    --build the insert keys statement 
    insertKeys = NULL;
      if insertKeys IS NULL THEN
         insertKeys = '';
         insertKeys = insertKeys||',';
      end if;
      insertKeys = insertKeys||'btrim(cast('||'j'||jsonColumn||'::jsonb->'||''''||k||''''||'as text),''"'') as '||k;

    if otherColumns IS NOT NULL THEN
           if insertKeys IS NULL THEN
            insertKeys = '';
             insertKeys = insertKeys||',';
           end if;  
           insertKeys = insertKeys||k;
       END LOOP;
     END IF;
    --concat to make full statement
    stmt = stmt||' '||insertKeys||' FROM '||' (SELECT *,';
    IF splitData IS TRUE THEN
      stmt = stmt||'jsonb_array_elements('||jsonColumn||'::jsonb) as j'||jsonColumn||' FROM '||inputTable||') as q1';
      stmt = stmt||jsonColumn||' as j'||jsonColumn||' FROM '||inputTable||') as q1';
    END IF;

    RAISE NOTICE 'QUERY: %',stmt;
    --execute and print statement
    RAISE NOTICE 'QUERY: %',stmt;
    execute stmt;
    --return the keys from json
    return keys;
Language plpgsql;

Headless Testing and Scraping with Java FX

There is a lot of JavaScript in the world today and there is a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool needs to exist that does not leak memory as much as Selenium. Until recently, Selenium was really the only option for webkit, JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools. These tools can be used to automate scraping and testing where network calls for HTML, Json, CSVs, pdfs, or what not are more tedious and difficult.

The FX Package

FX is much better than the television channel with some exceptions. Java created a sleeker version of Chromium based on webkit. While webkit suffers from some serious setbacks, Java FX also incorporates nearly any part of the java.net framework. Setting SSL Handlers, proxies, and the like works the same as with java.net. Therefore, FX can be used to intercept traffic (e.g. directly stream images that are incoming to a file named by URL without making more network calls), present a nifty front end controlled by JavaScript and querying for components,


Ui4j is as equally nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on command line or using system properties after jdk1.8.0_20. Oracle removed Monocle from the jdk after this release and forced the programs using the server to OpenMonocle. However, xvfb-run -a works equally well. The -a option automatically chooses a server number. The github site does claim compatibility with Monocle though.

On top of headless mode, the authors have made working with FX simple. Run JavaScript as needed, incorporate interceptors with ease, run javascript, and avoid nasty waitFor calls and Selanese (this is an entire language within your existing language).


There is an alternative to Ui4j in TestFX. It is geared towards testing. Rather than using an Assert after calling or with ((String) page.executeScript(“document.documentElement.innerHTML”)), methods such as verifyThat exist. Combine with Scala and have a wonderfully compact day. The authors have also managed to get a workaround for the Monocle problem.

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies. Java and Scala for that matter set one proxy per JVM. Luckily, both Java and Scala have subprocess modules. The lovely data friendly language that is Scala makes this task as simple as Process(“java -jar myjar.jar -p my:proxy”).!. Simply run the command which returns the exit status and blocks until complete (see Futures to make this a better version of non-blocking) and use tools like Scopt to get the proxy and set it in a new Browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don’t pass it as command line). RMI would probably be a bit better for large code but it may be possible to better secure a file than compiled code using checksums.


Throw out Selenium, get rid of the extra Selanese parsing and get Ui4J or TestFX for webkit testing. Sadly, it does not work with Gecko so Chromium is needed to replace these tests and obtain such terrific options as –ignore-certificate-errors. There are cases where fonts in the SSL will wreak havoc before you can even handle the incoming text no matter how low level you write your connections. For simple page pulls, stick to Apache HTTP Components which contains a fairly fast, somewhat mid-tier RAM usage asynchronous thread pool useable in Java or Scala. Sorry for the brevity folks but I tried to answer a question or two that was not in tutorials or documentation. Busy!

Lack of Nested and Implicit Data Support in Drill, PostgreSQL and Pentaho when Working with Json Data

JSon is great. It can contain a variety of data types in an expected format. It is becoming easier and easier to work with Json in existing formats as well making it a future workhorse for NoSQL based ETL. However, and not in the least because NoSQL ingestion needs to result in relational tables using SQL standards, there is still one bug to work out. Ingestion with Json will not break out nested tables and requires direct knowledge of data to complete tasks.

This may seem petty but when millions of recods are being read, it clearly is not.

In drill, this could potentially be overcome by creating a table for every single submap we wish to analyze but CREATE TABLE from the tool itself will not work. Instead, it is necessary to limit use cases to the data we want to use.

In PostgreSQL, it is possible to concatenate JSon data using a query whose individual results can then be saved. It is also possible to ‘pop’ keys that are unneeded. However, this approach requires many different tables at one per somewhat normalized form. It also requires recombining data.

SELECT row_to_json(r.*) FROM (SELECT nonJson AS nonJson, ((((data->'level1')::jsonb - 'bkey1')::jsonb - 'bkey2')::jsonb -'bkey3')::jsonb AS jdata FROM table WHERE data::jsonb ?| array['mustHaveKey'] AND data::jsonb ?| array['notHaveKey'] IS FALSE) r

Drill is still much more ahead of the game than Pentaho and PostgreSQL in terms of conversion though. Postgresql can guess types but has no function to attempt to automatically generate tables. Pentaho requires explicit conversion as well.

Of course, if one already knows every key that will be present, this is not a problem. That, however, means more maintenance as it is then impossible to write programs to automatically handle changes to data. Perhaps implicit conversion will happen soon but any argument as to data type issues should really look at the depth of the standards and conform to them instead of complaining.

How Can the Tools of Big Data Detect Malicious Activity?

With Apple in the news and security becoming a large concern and even as companies try new ways to protect their online presence, finding malicious activity has become an exploding topic. Another area offers some deeper insights into just how to discover users with bad intentions before data is lost. This article deals with protecting an online presence.

Detection can go well beyond knowing when a bad credit card hits the system or a certain blocked IP Address attempts to access a website.

Similarity: The Neural Net or Cluster

The neural net has become an enormous topic. Today it is used to discern categories in fields ranging from biology to dating or even terrorist activity. Similarity based algorithms have come into their own since their inception largely in the cold war intelligence game. Yet, how different is finding political discussions from conversational data captured at the Soviet embassy or discovering a sleeper cell in Berlin from finding a hacker. Not terribly different at the procedural level actually. Find the appropriate vectors, train the neural net or clustering algorithm, and try to find clusters representing those with an aim to steal your data. These are your state secrets. With Fuzzy C Means, K Means, and RBF neural nets, the line between good and bad doesn’t even need to look like a middle school dance.

Here are just a sampling of the traits that could be used in similarity algorithms which require shaping a vector to train on. Using them in conjunction with data taken from previous hacking attempts, it shouldn’t be extremely difficult to flag the riff raff.

Traits that Can be Useful

Useful traits come in a variety of forms. They can be encoded as a 1 or 0 for a Boolean value such as known malicious IP (always block these). They could be a Levenshtein distance on that IP. Perhaps a frequency for number of requests per second is important. They may even be a probability or weight describing likelihood of belonging to one category or another based on content. Whichever they are, they should be informative to your case with an eye towards general trends.

  • Types of Items purchased : Are they trivial like a stick of gum?
  • Number of Pages Accessed while skipping a level of depth on a website : Do they attempt to skip pages despite a viewstate or a typical usage pattern?
  • Number of Malformed Requests : Are they sending bad headers?
  • Number of Each type of Error Sent from the Server : Are there a lot of malformed errors?
  • Frequency of Requests to your website : Does it look like a DNS attack?
  • Time spent on each Page : Is it too brief to be human?
  • Number of Recent Purchases : Perhaps they appear to be window shopping
  • Spam or another derived level usually sent from an IP address: Perhaps a common proxy is being used?
  • Validity or threat of a given email address : Is it a known spam address or even real?
  • Validity of user information : Do they seem real or do they live at 123 Main Street and are named Rhinoceros?
  • Frequencies of words used that Represent Code: Is the user always using the word var or curly braces and semi-colons?
  • Bayesian belonging to one category or another based on word frequencies: Are words appearing like var?

Traits that May not Be Useful

People looking for your data will be looking to appear normal, periodically looking to access your site or attempting an attack in one fell swoop. Some traits may be less informative. All traits depend on your particular activity. These traits may, in fact be representative but are likely not.

  • Commonality of User Name : Not extremely informative but good to study
  • Validity of user information: Perhaps your users are actually value their secrecy and your plans to get to know them are ill-advised

Do not Immediately Discount Traits and Always Test

Not all traits that seem discountable are. Perhaps users value their privacy and provide fake credentials. However, what credentials are provided can be key. More often, such information could provide a slight degree of similarity with a bad cluster or just enough of an edge toward an activation equation to tip the scales from good to bad or vice versa. A confusion matrix and test data should always be used in discerning whether the traits you picked are actually informative.

Bayes, Cosines, and Text Content

Not all attacks can be detected by behaviour. Perhaps a vulnerability is already known. In this case, it is useful to look at Bayesian probabilities and perhaps cosine similarities. Even obfuscated code contains specific key words. For example, variables in javascript are always declared with var, most code languages use semi-colons, and obfuscated code is often a one line mess. Bayesian probability would state that the presence of one item followed by another when compared to frequencies from various categories yields a certain probability of belonging to a category.

If Bayes is failing, then perhaps similarity is useful. Words like e and var and characters such as ; or = may be more important in code.

Open Source Data Science, the Great Resource No One Knows About

There is a growing industry for online technology courses that is starting to gain traction among many who may have been in school when certain fields like data science were still the plaything of graduate students and phds in Computer Science, statistics, and even, to a degree, biology. However, these online courses will never match the pool of knowledge one could drink from by even taking an undergraduate Computer Science or mathematics class at a middling state school today (I would encourage everyone to avoid business schools like the plague for technology).

In an industry that is constantly transforming itself and especially where the field of data will provide long-term work, these courses may appear quite appealing. However, they are often too shallow to provide much breadth and just thinking that it is possible to pick up and understand the depth of the 1000 page thesis that led to the stochastic approach to matrix operations and eventually Spark is ridiculous. We are all forgetting about the greatest resources available today. The internet, open source code, and a search engine can add layers of depth to what would otherwise be an education not able to provide enough grounding for employment.

Do Take the Online Courses

First off, the online courses from Courses from Coursera are great. They can provide a basic overview of some of the field. Urbana offers a great data science course and I am constantly stumbling into blogs presenting concepts from them. However, what can someone fit into 2-3 hours per week for six weeks in a field that may encompass 2-3 years of undergraduate coursework and even some masters level topics to begin to become expert-level.

You may learn a basic K Means or deploy some subset of algorithms but can you optimize them and do you really know more than Bayesian probabilities that you likely also learned in a statistics class.

Where Open Source Fits In

Luckily, many of the advanced concepts and a ton of research is actually available online for free. The culmination of decades of research is available at your fingertips in open source projects.

Sparse Matrix research, edge detection algorithms, information theory, text tiling, hashing, vectorizing, and more are all available to anyone willing to put in the time to learn them adequately.


Documentation is widely available and often on github for:

These github accounts also contain useful links to websites explaining the code, containing further documentation (javadocs), and giving some conceptual depth and further research opportunities.

A wide majority of conceptual literature can be found with a simple search.

Sit down, read the conceptual literature. Find books on topics like numerical analysis, and apply what you spent tens or even hundreds of thousands of dollars to learn in school.

Smoothing: When and Which?

Smoothing is everywhere. It is preprocessing for signal processing, it makes text segmentation work well, and it is used in a variety of programs to cull noise. However, there are a wide variety of ways to smooth data to achieve more appropriate predictions and lines of fit. This overview should help get you started.

When to Use Smoothing

The rather simple art of smoothing data is best performed when making predictions with temporal data where data comes from one or more potentially noisy sources (think a weather station with a wind speed monitor that is a bit loose or only partially unreliable) and dealing with tasks such converting digital sound to analogue waves for study. When a source appears capable of making decent predictions but is relatively noisy, smoothing helps. It is even used in fitting smoothing splines.

Types of Smoothing to Consider

Now that the rather blunt explanation is out of the way, the proceeding list is based on my own use with text mining and some sound data. I have tried to include the best resources I could for these.

  • Rectangular or Triangular: Best used for non-temporal and fairly well fitting data where more than past concerns are important (text segmentation is an example).
  • Simple Exponential: Creates a moving average smoothing and considers past events. Is not a terrific smoothing tool but is quick and works well when data is correlated (could work well with sound which may be better with a Hamming or Hanning Window). Unlike double and triple exponential smoothing, the algorithm requires no past experience or discovery to do correctly, for better or worse.
  • Double and Triple Exponential Smoothing: Works well with time series data. Peaks and valleys are more preserved. Triple exponential Smoothing works well with seasonal data. They require some manual training or an algorithm relying on previous experience to generate an alpha value to perfect. Again, past events are more heavily weighted.
  • Hanning and Hamming WindowsPeriodic data may work well with this type of smoothing (wave forms). They are based on the cosine function. Past experience is not needed. For really noisy data, try the more intensive Savitsky-Golay filter.
  • Savitzky–Golay: This smoothing works fairly well but preserves peaks and valleys within the broader scope of the data. Savitsky-Golay is not ideal for a truly smooth curve. However, if some noise is really important, this is a great method. Its publication was actually considered one of the most important by Analytical Chemistry for spectral analysis. It uses a localized least squares technique to accomplish its feat.

    However, do not rely on the matrix based calculation for OLS as the most efficient as gradient descent is clearly the winner. No self-respecting programmer will use a matrix implementation on large data sets. Spark contains an optimized gradient descent algorithm for distributed and even single node programming. The algorithm is tail recursive and seeks to minimize a cost function.

  • Distribution Hint

    For the programmer or developer looking to distribute a Savitsky-Golay calculation and not using Spark gradient descent. Map partitions works well on the local areas. It also works well when smoothing many splines or for the Hanning and Hamming Window based smoothing.

    NotePad: A NickName Recognition Idea

    So this is just an idea. Nicknames can apparently be attached to multiple names. That means that we could have a graph of nicknames and also that a direct list may be tricky. A quick thought on how to approach a solution to this.


    A. Graph the Names and their known connections to formal names.

    B. Attach probability data to each Node specifying which characteristics trigger that solution (Markovian sort of)

    C. Start a Server accepting requests

    D. When a request comes in first look for all potential connections from a nickname to a name, if none save this as a new node. Search can be performed as depth first or breadth first and even split into random points on the graph in a more stochastic mannersuch as a random walk (will need to study these more). This could also be done by just choosing multiple random search points.

    E. If there is one node, return the formal name

    F. If there are multiple names take one of several approaches:

    1. Approach is to calculate Bayesian probabilities given specified characteristics and return the strongest match (this is the weakest solution).

    2. Approach is to train a neural net with the appropriate kernel (RBF, linear; etc.) and return the result from this net. (This is slow as having a million neural nets in storage seems like a bad idea)

    When generating stand alone nodes, it may be possible to use a Levenshtein distance and other characteristics to attach nickname nodes to formal nodes based on a threshold. A clustering algorithm could use the formal name averages as cluster centers and a hard cutoff could specify (e.g. Levenshtein of 1 or 2) could solidify the connection and throw out Type I and Type II error.

    Stay tuned while I flesh this post out in this post with actual code. It may just be a proposal for a while.