The Case for Using an IRM to Scale Data Intake

Among many others, an analyst faces three major problems before data becomes useful:

  • data aggregation and storage
  • data security and access
  • data wrangling (ETL/ELT)

This article deals with data security and access using an information resource management (IRM) system. My own company, Simplr Insites LLC, is writing such a system alongside a file storage solution in an effort to modernize the research process.

Problem

One significant problem faced in research and cooperation is the attainment of clean and useful data. Obtaining this data often means gaining access to systems, forming legal agreements, obfuscating certain data, and embarking on the painful process of data wrangling.

While ETL and ELT are critical steps, just obtaining sensitive data, even from within an organization, is tricky. Consider the following cases related directly to access:

  • data sets include confidential information
  • data sets are ensnared in legal agreements regarding who can access data
  • users want to control access to data to ensure it is not misused
  • external users are allowed varying degrees of access

IRM as a Solution

Oracle developed a solution that attempts to tackle the data security issue. The Oracle IRM documentation provides a rather informative graphical overview of their tool:

[Figure: Oracle IRM architecture overview]

In this system, an external user accesses a load-balanced IRM server application that controls rights and access to different resources and files. Several firewalls, along with authentication, access grants, and encryption, help to improve security. Web services and internal users utilize the IRM server as well.

Beyond the visible components, tokens can be used to manage resources and propagate access changes almost instantly.

Most file systems also expose when a resource was created or modified, along with its permissions, which is useful for logging purposes.
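
As a small illustration, a few lines of Python can pull this metadata from a stored resource using only the standard library; the path below is a made-up example.

import os
import stat
import time

# Hypothetical path to a stored resource.
info = os.stat("/srv/irm/resources/report.pdf")

print("modified:", time.ctime(info.st_mtime))
print("metadata changed:", time.ctime(info.st_ctime))
print("permissions:", stat.filemode(info.st_mode))   # e.g. '-rw-r-----'
print("owner uid:", info.st_uid)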

Setting Up an IRM

It is not necessary to rely on Oracle for an IRM solution. In fact, the Oracle IRM only works with Microsoft Windows.

Each component can be paired with a reliable tool, most of which I have blogged about. A set of pairings might include:

  • Base application and resource management: Django with secure login
  • REST API resource access: Django OAuth Toolkit
  • Access management: Django OAuth Toolkit and a database system
  • Individual resource tokens: randomly generated and hashed keys
  • File storage: GlusterFS or an encryptable file system
  • Encryption of resources: PyCrypto or a similar tool
  • Firewalls: iptables or another firewall
  • Two-step verification through SMS: Twilio
  • Key storage: Stack Exchange Blackbox
  • VPN access: Firefox
  • Logging and anomaly detection: Elastic APM and the ELK Stack

Logging

Logging is critical to security. Logs allow administrators to spot harmful activity, generate statistical models based on usage, and aid in auditing the system.
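
As a small, hedged example, Python's standard logging module can record access events in a form that is easy to ship to the ELK Stack later; the field names and values here are made up.

import logging

logging.basicConfig(
    filename="access.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def log_access(user, resource, action, granted):
    # One line per access decision keeps parsing and anomaly detection simple.
    logging.info("user=%s resource=%s action=%s granted=%s",
                 user, resource, action, granted)

log_access("external_partner_7", "study_42/results.csv", "download", True)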

Tokens

Tokens are a perfect fit for controlling document access in the system. They allow a user to gain access to a document and typically carry scopes that grant different levels of access to a resource.

A user should be required to log in to the application to retrieve a token, which refreshes on a regular schedule. These tokens can be revoked and changed by a resource owner or administrator, much as permissions are changed on a file system.
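
A minimal sketch of the token idea, not tied to any particular framework: issue a random per-resource token and store only its hash, so a leaked database does not leak usable tokens. The function names are illustrative.

import hashlib
import secrets

def issue_token():
    token = secrets.token_urlsafe(32)                      # handed to the user once
    digest = hashlib.sha256(token.encode()).hexdigest()    # only this is stored
    return token, digest

def verify_token(presented, stored_digest):
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(candidate, stored_digest)

token, digest = issue_token()
print(verify_token(token, digest))   # True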

Fernet Encryption

While RSA is useful for asymmetric, two-party exchange, symmetric Fernet encryption is faster and better suited to encrypting files at rest. If a system does not offer encryption on its own, the Python cryptography package provides a Fernet implementation.
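
A minimal sketch using the Python cryptography package's Fernet recipe; the file names are illustrative, and the key would live in something like Blackbox rather than in code.

from cryptography.fernet import Fernet

key = Fernet.generate_key()        # store this in a key store, never alongside the data
fernet = Fernet(key)

with open("results.csv", "rb") as fh:
    ciphertext = fernet.encrypt(fh.read())

with open("results.csv.enc", "wb") as fh:
    fh.write(ciphertext)

# Later, with the same key:
plaintext = fernet.decrypt(ciphertext)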

Storing Keys

Keys should not be stored in the open; a key stored in plain text is trivially compromised. Instead, tools such as Stack Exchange's Blackbox store keys encrypted in version control, backed by a GPG keyring.

Two Step Downloading for Extra Security

Downloading a file in a secure manner might require extra protection, particularly when an external but trusted user requests access to a resource. To guard against spoofing and to prevent a compromised computer from gaining access to a resource, two-step verification is recommended.

In this process, the external user presents an access token for the document, which is verified. On verification, a text message containing an access code is sent to the external user and the internal owner is notified of the access. The external user enters the code and, if required, the resource owner or an administrator approves the download.

This type of process is not difficult to implement through desktop or web applications using push notifications or persistent storage.
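
A hedged sketch of the code-verification step, assuming the SMS itself is delivered through a provider such as Twilio; the send_sms callable below is a hypothetical stand-in for that integration.

import secrets

def start_download(user_phone, send_sms):
    code = f"{secrets.randbelow(1_000_000):06d}"   # six-digit one-time code
    send_sms(user_phone, f"Your download code is {code}")
    return code

def confirm_download(expected_code, entered_code, owner_approved=True):
    # The owner_approved flag models the optional manual approval step.
    return secrets.compare_digest(expected_code, entered_code) and owner_approved

code = start_download("+15551234567", lambda phone, body: print(body))
print(confirm_download(code, code))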

Conclusion

Secured yet accessible storage is a critical problem for any data analyst or scientist. Using an established IRM or implementing a similar tool helps secure access and empower analytics.

PostgreSQL for Converting NoSQL to SQL

Needing a solution to prepare for ETL, and recognizing that Drill only goes one layer deep and so is not as comprehensive as what I can build with PL/pgSQL, it was time to find a way to move from dynamically created jsonb in PostgreSQL to PostgreSQL relational tables.

The solution was this little function. It can be extended with jsonb_array_elements and other functions to quickly build routines that delve deeper than Drill. Add upcoming master replication and steadily improving distribution and threading to EnterpriseDB's growing set of accomplishments with Postgres, and there is little reason to use Drill here at all.

breakoutNoSQL(inputTable text, outputTable text, jsonColumn text, otherColumns text[], condition text, splitData boolean)

Only otherColumns and condition may be NULL.

Code

CREATE OR REPLACE FUNCTION breakoutNoSQL(inputTable text, outputTable text, jsonColumn text, otherColumns text[], condition text, splitData boolean) RETURNS text[] AS
$BODY$
DECLARE
    k text;
    keys text[];
    stmt text;
    insertKeys text;
BEGIN
    IF outputTable IS NULL THEN
        RAISE EXCEPTION 'OUTPUT TABLE CANNOT BE NULL';
    END IF;

    IF inputTable IS NULL THEN
        RAISE EXCEPTION 'INPUT TABLE CANNOT BE NULL';
    END IF;

    -- discover the distinct top-level keys, optionally unnesting json arrays first
    IF condition IS NOT NULL THEN
        IF splitData IS TRUE THEN
            EXECUTE 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys(jsonb_array_elements('||jsonColumn||'::jsonb))) as key FROM '||inputTable||') as q1 WHERE '||condition INTO keys;
        ELSE
            EXECUTE 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys('||jsonColumn||'::jsonb)) as key FROM '||inputTable||') as q1 WHERE '||condition INTO keys;
        END IF;
    ELSE
        IF splitData IS TRUE THEN
            EXECUTE 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys(jsonb_array_elements('||jsonColumn||'::jsonb))) as key FROM '||inputTable||') as q1' INTO keys;
        ELSE
            EXECUTE 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys('||jsonColumn||'::jsonb)) as key FROM '||inputTable||') as q1' INTO keys;
        END IF;
    END IF;

    IF keys IS NULL OR array_length(keys,1) = 0 THEN
        RAISE EXCEPTION 'NUMBER OF DISCOVERED KEYS WAS 0';
    END IF;

    -- start the CREATE TABLE statement
    stmt = 'CREATE TABLE '||outputTable||' AS SELECT ';

    -- build the column list: one text column per discovered key
    insertKeys = NULL;
    FOREACH k IN ARRAY keys LOOP
        IF insertKeys IS NULL THEN
            insertKeys = '';
        ELSE
            insertKeys = insertKeys||',';
        END IF;
        insertKeys = insertKeys||'btrim(cast(j'||jsonColumn||'::jsonb->'''||k||''' as text),''"'') as '||k;
    END LOOP;

    -- carry through any extra non-json columns
    IF otherColumns IS NOT NULL THEN
        FOREACH k IN ARRAY otherColumns LOOP
            IF insertKeys IS NULL THEN
                insertKeys = '';
            ELSE
                insertKeys = insertKeys||',';
            END IF;
            insertKeys = insertKeys||k;
        END LOOP;
    END IF;

    -- concatenate into the full statement
    stmt = stmt||' '||insertKeys||' FROM (SELECT *,';
    IF splitData IS TRUE THEN
        stmt = stmt||'jsonb_array_elements('||jsonColumn||'::jsonb) as j'||jsonColumn||' FROM '||inputTable||') as q1';
    ELSE
        stmt = stmt||jsonColumn||' as j'||jsonColumn||' FROM '||inputTable||') as q1';
    END IF;

    -- print and execute the statement
    RAISE NOTICE 'QUERY: %', stmt;
    EXECUTE stmt;

    -- return the discovered json keys
    RETURN keys;
END;
$BODY$
LANGUAGE plpgsql;
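
A hedged usage sketch from Python with psycopg2, assuming a staging table named staging_events with a jsonb column named payload and an id column to carry through; all of these names and the connection string are illustrative.

import psycopg2

conn = psycopg2.connect("dbname=staging user=etl")
cur = conn.cursor()

# Flatten staging_events.payload into staging_events_flat, keeping id alongside
# the extracted keys; no WHERE condition, and the json column is not an array.
cur.execute(
    "SELECT breakoutNoSQL(%s, %s, %s, %s, %s, %s)",
    ("staging_events", "staging_events_flat", "payload", ["id"], None, False),
)
print("extracted keys:", cur.fetchone()[0])

conn.commit()
cur.close()
conn.close()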

Headless Testing and Scraping with Java FX

There is a lot of JavaScript in the world today, and there is a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool needs to exist that does not leak memory the way Selenium does. Until recently, Selenium was really the only option for WebKit; JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools, which can be used to automate scraping and testing where raw network calls for HTML, JSON, CSVs, PDFs, and the like are more tedious and difficult.

The FX Package

FX is much better than the television channel, with some exceptions. Java created a sleeker, WebKit-based alternative to embedding Chromium. While WebKit suffers from some serious setbacks, JavaFX also incorporates nearly all of the java.net framework: setting SSL handlers, proxies, and the like works the same as with java.net. Therefore, FX can be used to intercept traffic (e.g., streaming incoming images directly to a file named by URL without making more network calls), present a nifty front end controlled by JavaScript, and query the page for components.

Ui4J

Ui4j is equally as nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on the command line or using system properties after jdk1.8.0_20. Oracle removed Monocle from the JDK after this release and forced programs using it to OpenMonocle. However, xvfb-run -a works equally well; the -a option automatically chooses a server number. The GitHub site does claim compatibility with Monocle, though.

On top of headless mode, the authors have made working with FX simple: run JavaScript as needed, incorporate interceptors with ease, and avoid nasty waitFor calls and Selenese (an entire language within your existing language).

TestFX

There is an alternative to Ui4j in TestFX, which is geared towards testing. Rather than asserting against the result of a call like (String) page.executeScript("document.documentElement.innerHTML"), methods such as verifyThat exist. Combine with Scala and have a wonderfully compact day. The authors have also managed to work around the Monocle problem.

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies: Java, and Scala for that matter, set one proxy per JVM. Luckily, both languages have subprocess modules, and the lovely data-friendly language that is Scala makes the task as simple as Process("java -jar myjar.jar -p my:proxy").!. Simply run the command, which returns the exit status and blocks until complete (see Futures for a non-blocking version), and use tools like Scopt to read the proxy and set it in a new Browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don't pass it on the command line). RMI would probably be a bit better for large code, but it may be possible to secure a file better than compiled code using checksums.

Conclusion

Throw out Selenium, get rid of the extra Selenese parsing, and use Ui4j or TestFX for WebKit testing. Sadly, it does not work with Gecko, so Chromium is needed to replace those tests and obtain such terrific options as --ignore-certificate-errors. There are cases where fonts delivered over SSL will wreak havoc before you can even handle the incoming text, no matter how low level you write your connections. For simple page pulls, stick to Apache HttpComponents, which contains a fairly fast, mid-tier-RAM asynchronous thread pool usable in Java or Scala. Sorry for the brevity, folks, but I tried to answer a question or two that was not in the tutorials or documentation. Busy!

Lack of Nested and Implicit Data Support in Drill, PostgreSQL and Pentaho when Working with Json Data

JSON is great. It can contain a variety of data types in an expected format, and it is becoming easier and easier to work with in existing tools, making it a future workhorse for NoSQL-based ETL. However, not least because NoSQL ingestion needs to result in relational tables that follow SQL standards, there is still one kink to work out: ingestion of JSON will not break out nested tables and requires direct knowledge of the data to complete tasks.

This may seem petty, but when millions of records are being read, it clearly is not.

In Drill, this could potentially be overcome by creating a table for every single sub-map we wish to analyze, but CREATE TABLE from the tool itself will not work. Instead, it is necessary to limit use cases to the data we want to use.

In PostgreSQL, it is possible to concatenate JSON data using a query whose individual results can then be saved. It is also possible to 'pop' keys that are unneeded. However, this approach requires many different tables, roughly one per somewhat-normalized form, and it requires recombining the data afterwards.


SELECT row_to_json(r.*)
FROM (
    SELECT nonJson AS nonJson,
           ((((data->'level1')::jsonb - 'bkey1')::jsonb - 'bkey2')::jsonb - 'bkey3')::jsonb AS jdata
    FROM table
    WHERE data::jsonb ?| array['mustHaveKey']
      AND data::jsonb ?| array['notHaveKey'] IS FALSE
) r

Drill is still much further ahead of the game than Pentaho and PostgreSQL in terms of conversion, though. PostgreSQL can guess types but has no function to attempt to automatically generate tables, and Pentaho requires explicit conversion as well.

Of course, if one already knows every key that will be present, this is not a problem. That, however, means more maintenance, since it becomes impossible to write programs that automatically handle changes to the data. Perhaps implicit conversion will happen soon, but any argument about data type issues should really look at the depth of the standards and conform to them instead of complaining.

How Can the Tools of Big Data Detect Malicious Activity?

With Apple in the news, security becoming a large concern, and companies trying new ways to protect their online presence, finding malicious activity has become an exploding topic. Other fields offer some deeper insight into how to discover users with bad intentions before data is lost. This article deals with protecting an online presence.

Detection can go well beyond knowing when a bad credit card hits the system or a certain blocked IP Address attempts to access a website.

Similarity: The Neural Net or Cluster

The neural net has become an enormous topic. Today it is used to discern categories in fields ranging from biology to dating, or even terrorist activity. Similarity-based algorithms have come into their own since their inception, largely in the Cold War intelligence game. Yet how different is finding political discussions in conversational data captured at the Soviet embassy, or discovering a sleeper cell in Berlin, from finding a hacker? Not terribly different at the procedural level, actually. Find the appropriate vectors, train the neural net or clustering algorithm, and try to find clusters representing those with an aim to steal your data; these are your state secrets. With fuzzy c-means, k-means, and RBF neural nets, the line between good and bad doesn't even need to look like a middle school dance.

Below is just a sampling of the traits that could be used in similarity algorithms, which require shaping a vector to train on. Used in conjunction with data taken from previous hacking attempts, they should make it possible to flag the riff-raff without much difficulty; a rough sketch of turning such traits into vectors for clustering follows the list.

Traits that Can be Useful

Useful traits come in a variety of forms. They can be encoded as a 1 or 0 for a Boolean value such as a known malicious IP (always block these). They could be a Levenshtein distance on that IP, a frequency such as the number of requests per second, or even a probability or weight describing the likelihood of belonging to one category or another based on content. Whichever form they take, they should be informative for your case, with an eye toward general trends.

  • Types of items purchased: are they trivial, like a stick of gum?
  • Number of pages accessed while skipping a level of depth on a website: do they skip pages despite a viewstate or a typical usage pattern?
  • Number of malformed requests: are they sending bad headers?
  • Number of each type of error sent from the server: are there a lot of errors from malformed requests?
  • Frequency of requests to your website: does it look like a denial-of-service attack?
  • Time spent on each page: is it too brief to be human?
  • Number of recent purchases: perhaps they appear to be window shopping?
  • Spam score or another derived rating attached to an IP address: perhaps a common proxy is being used?
  • Validity or threat level of a given email address: is it a known spam address, or even real at all?
  • Validity of user information: do they seem real, or do they live at 123 Main Street and go by Rhinoceros?
  • Frequencies of words that represent code: is the user always using the word var, or curly braces and semi-colons?
  • Bayesian likelihood of belonging to one category or another based on word frequencies: are words like var appearing?
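
As a hedged sketch of the clustering idea, and not a production detector, the snippet below turns a handful of the traits above into numeric vectors and clusters them with scikit-learn; the feature values, the chosen traits, and the two-cluster assumption are all illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is one session: [known_bad_ip, requests_per_second,
# malformed_requests, seconds_per_page, code_word_frequency].
# The numbers are made up for illustration.
sessions = np.array([
    [0, 0.2,  0, 45.0, 0.00],
    [0, 0.1,  1, 60.0, 0.01],
    [1, 9.5, 14,  0.4, 0.30],
    [0, 8.7, 11,  0.6, 0.25],
])

X = StandardScaler().fit_transform(sessions)   # put traits on a comparable scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # sessions sharing a label fall into the same behavioral cluster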

Traits that May not Be Useful

People looking for your data will try to appear normal, periodically probing your site or attempting an attack in one fell swoop. Some traits may be less informative than others, and all traits depend on your particular activity. The traits below may in fact be representative, but they likely are not.

  • Commonality of user name: not extremely informative, but good to study
  • Validity of user information: perhaps your users actually value their secrecy, and your plans to get to know them are ill-advised

Do not Immediately Discount Traits and Always Test

Not all traits that seem discountable are. Perhaps users value their privacy and provide fake credentials; however, which credentials they provide can be key. More often, such information provides a slight degree of similarity with a bad cluster, or just enough of an edge in an activation equation to tip the scales from good to bad or vice versa. A confusion matrix and test data should always be used to discern whether the traits you picked are actually informative.

Bayes, Cosines, and Text Content

Not all attacks can be detected by behaviour. Perhaps a vulnerability is already known. In this case, it is useful to look at Bayesian probabilities and perhaps cosine similarities. Even obfuscated code contains specific key words: variables in JavaScript are always declared with var, most code languages use semi-colons, and obfuscated code is often a one-line mess. Bayesian probability states that the presence of one term after another, compared to frequencies from various categories, yields a certain probability of belonging to each category.

If Bayes is failing, then perhaps similarity is useful. Words like e and var, and characters such as ; or =, may be more important in code.
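
As a hedged illustration of both approaches, the sketch below trains a naive Bayes classifier on a tiny, made-up set of labeled request bodies and also computes a cosine similarity against a known obfuscated sample; none of the text or labels come from real traffic.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB

# Tiny, made-up corpus: 1 = code-like payload, 0 = ordinary text.
docs = [
    "please update my shipping address",
    "question about my last invoice",
    "var a=eval(atob('...')); document.write(a);",
    "var _0x1=function(){}; ; window['eval'](_0x1);",
]
labels = [0, 0, 1, 1]

# Treat ; = ( ) { } [ ] as tokens so code punctuation counts as evidence.
vec = CountVectorizer(token_pattern=r"\w+|[;={}()\[\]]")
X = vec.fit_transform(docs)

clf = MultinomialNB().fit(X, labels)
new = vec.transform(["var x=1;;eval(x);"])
print("P(code-like):", clf.predict_proba(new)[0][1])

# Fallback: cosine similarity against a known obfuscated sample.
print("cosine vs sample:", cosine_similarity(new, X[2])[0][0])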

Open Source Data Science, the Great Resource No One Knows About

There is a growing industry of online technology courses that is starting to gain traction among many who may have been in school when fields like data science were still the plaything of graduate students and PhDs in computer science, statistics, and even, to a degree, biology. However, these online courses will never match the pool of knowledge one could drink from by taking even an undergraduate computer science or mathematics class at a middling state school today (I would encourage everyone to avoid business schools like the plague for technology).

In an industry that is constantly transforming itself, and especially where the field of data will provide long-term work, these courses may appear quite appealing. However, they are often too shallow to provide much breadth, and thinking it is possible to pick up and understand the depth of the thousand-page thesis that led to the stochastic approach to matrix operations, and eventually to Spark, is ridiculous. We are all forgetting about the greatest resources available today: the internet, open source code, and a search engine can add layers of depth to what would otherwise be an education unable to provide enough grounding for employment.

Do Take the Online Courses

First off, the online courses from Coursera are great. They can provide a basic overview of parts of the field. Urbana offers a great data science course, and I am constantly stumbling into blogs presenting concepts from it. However, what can someone fit into 2-3 hours per week for six weeks, in a field that may encompass 2-3 years of undergraduate coursework and even some masters-level topics before one begins to become expert?

You may learn basic k-means or deploy some subset of algorithms, but can you optimize them, and do you really know more than the Bayesian probabilities you likely also learned in a statistics class?

Where Open Source Fits In

Luckily, many of the advanced concepts and a ton of research are actually available online for free. The culmination of decades of research is available at your fingertips in open source projects.

Sparse Matrix research, edge detection algorithms, information theory, text tiling, hashing, vectorizing, and more are all available to anyone willing to put in the time to learn them adequately.

Resources

Documentation is widely available, and often on GitHub, for the major projects in this space.

These GitHub accounts also contain useful links to websites explaining the code, further documentation (Javadocs), and some conceptual depth along with further research opportunities.

A wide majority of conceptual literature can be found with a simple search.

Sit down, read the conceptual literature. Find books on topics like numerical analysis, and apply what you spent tens or even hundreds of thousands of dollars to learn in school.

Smoothing: When and Which?

Smoothing is everywhere. It is preprocessing for signal processing, it makes text segmentation work well, and it is used in a variety of programs to cull noise. However, there are a wide variety of ways to smooth data to achieve more appropriate predictions and lines of fit. This overview should help get you started.

When to Use Smoothing

The rather simple art of smoothing data is best performed when making predictions from temporal data that comes from one or more potentially noisy sources (think of a weather station whose wind speed monitor is a bit loose or otherwise partially unreliable), and when dealing with tasks such as converting digital sound to analogue waves for study. When a source appears capable of supporting decent predictions but is relatively noisy, smoothing helps. It is even used in fitting smoothing splines.

Types of Smoothing to Consider

Now that the rather blunt explanation is out of the way, the following list is based on my own use with text mining and some sound data. I have tried to include the best resources I could for these, and a short code sketch of a few of the filters follows the list.

  • Rectangular or Triangular: best used for non-temporal, fairly well-fitting data where more than past concerns are important (text segmentation is an example).
  • Simple Exponential: creates a moving-average style of smoothing that considers past events. It is not a terrific smoothing tool, but it is quick and works well when data is correlated (it could work with sound, which may be better served by a Hamming or Hanning window). Unlike double and triple exponential smoothing, the algorithm requires no past experience or discovery to apply correctly, for better or worse.
  • Double and Triple Exponential Smoothing: works well with time series data, and peaks and valleys are better preserved. Triple exponential smoothing works well with seasonal data. Both require some manual training, or an algorithm relying on previous experience, to choose a good alpha value. Again, past events are more heavily weighted.
  • Hanning and Hamming Windows: periodic data (wave forms) may work well with this type of smoothing. They are based on the cosine function, and past experience is not needed. For really noisy data, try the more intensive Savitzky-Golay filter.
  • Savitzky-Golay: this smoothing works fairly well but preserves peaks and valleys within the broader scope of the data, so it is not ideal for a truly smooth curve. However, if some of the noise is really important, this is a great method. Its publication was actually considered one of the most important in Analytical Chemistry's history for spectral analysis. It uses a localized least squares fit to accomplish its feat.

    However, do not rely on the matrix-based calculation of ordinary least squares as the most efficient approach; gradient descent is clearly the winner, and no self-respecting programmer will use a plain matrix implementation on large data sets. Spark contains an optimized gradient descent algorithm for distributed and even single-node programs. The algorithm is tail recursive and seeks to minimize a cost function.

  • Distribution Hint

    For the programmer or developer looking to distribute a Savitzky-Golay calculation without using Spark's gradient descent, mapPartitions works well on the local areas. It also works well when smoothing many splines or for the Hanning and Hamming window based smoothing.
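
As promised above, a hedged sketch of a few of these filters in Python (numpy and scipy); the signal and the window sizes are arbitrary examples.

import numpy as np
from scipy.signal import savgol_filter

# A noisy example signal.
rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 200)
noisy = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Rectangular (moving average) smoothing.
rect = np.convolve(noisy, np.ones(11) / 11, mode="same")

# Hanning window smoothing.
win = np.hanning(11)
hann = np.convolve(noisy, win / win.sum(), mode="same")

# Savitzky-Golay: window length 11, polynomial order 3, preserves peaks better.
sg = savgol_filter(noisy, window_length=11, polyorder=3)

print(rect[:3], hann[:3], sg[:3])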

NotePad: A NickName Recognition Idea

So this is just an idea. Nicknames can apparently be attached to multiple formal names, which means we could have a graph of nicknames and that a direct list may be tricky. Here is a quick thought on how to approach a solution.

A. Graph the nicknames and their known connections to formal names.

B. Attach probability data to each node specifying which characteristics trigger that resolution (Markovian, sort of).

C. Start a server accepting requests.

D. When a request comes in, first look for all potential connections from the nickname to a formal name; if there are none, save the nickname as a new node. The search can be performed depth first or breadth first, and could even be split across random points on the graph in a more stochastic manner such as a random walk (I will need to study these more). This could also be done by simply choosing multiple random search points.

E. If there is exactly one connected node, return the formal name.

F. If there are multiple names, take one of several approaches:

1. Calculate Bayesian probabilities given the specified characteristics and return the strongest match (this is the weakest solution).

2. Train a neural net with the appropriate kernel (RBF, linear, etc.) and return the result from that net (this is slow, and keeping a million neural nets in storage seems like a bad idea).

When generating stand-alone nodes, it may be possible to use a Levenshtein distance and other characteristics to attach nickname nodes to formal nodes based on a threshold. A clustering algorithm could use the formal-name averages as cluster centers, and a hard cutoff (e.g. a Levenshtein distance of 1 or 2) could solidify the connection and throw out Type I and Type II error. A rough sketch of the lookup idea follows.
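
A hedged, minimal sketch of steps A, D, and E under the assumptions above; the nickname graph, the edit-distance helper, and the cutoff are illustrative rather than a finished matcher.

# Known nickname -> formal-name connections (step A); the entries are examples.
NICKNAMES = {
    "bob": {"Robert"},
    "bill": {"William"},
    "liz": {"Elizabeth"},
}

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def resolve(nickname, max_distance=2):
    """Steps D and E: direct graph lookup, then a thresholded fuzzy attach."""
    key = nickname.lower()
    if key in NICKNAMES and NICKNAMES[key]:
        candidates = NICKNAMES[key]
        return next(iter(candidates)) if len(candidates) == 1 else sorted(candidates)
    # No connection yet: attach to the closest formal name within the cutoff,
    # otherwise store the nickname as a new, unconnected node.
    formals = {f for names in NICKNAMES.values() for f in names}
    best = min(formals, key=lambda f: levenshtein(key, f.lower()))
    if levenshtein(key, best.lower()) <= max_distance:
        NICKNAMES.setdefault(key, set()).add(best)
        return best
    NICKNAMES[key] = set()
    return None

print(resolve("bob"))      # Robert
print(resolve("willam"))   # close enough to William, so it attaches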

Stay tuned while I flesh this post out with actual code. It may just be a proposal for a while.

A Cosine Cutoff Value

I recently discovered this cutoff but cannot remember where. The article related to ultrasounds and mapping them, although I was surprised that fuzzy c-means was not mentioned. I have deployed it with some decent effect in text mining algorithms I am writing.

threshold = sqrt( (1 / n) * Σ_i (x_ij − mean_j) ), where n = matrix.shape[0] and mean_j is the mean of column j
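
A minimal numpy reading of the formula, assuming the deviations are meant to be squared (summed raw deviations cancel to zero), which would make the threshold the root-mean-square deviation of each column:

import numpy as np

# Example matrix; rows might be documents and columns terms or scores.
matrix = np.array([
    [0.1, 0.4, 0.0],
    [0.3, 0.2, 0.1],
    [0.0, 0.5, 0.2],
])

n = matrix.shape[0]
column_mean = matrix.mean(axis=0)

# Squared deviations are an assumption; drop the ** 2 to follow the formula as written.
threshold = np.sqrt((1.0 / n) * ((matrix - column_mean) ** 2).sum(axis=0))
print(threshold)   # one candidate cutoff per column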

     

Please comment if you have found something better or know where this came from. Cheers!