The Case for Using an IRM to Scale Data Intake

Among many, there are three major problems faced by an analyst before data is useful:

  • data aggregation and storage
  • data security and access
  • data wrangling (ETL/ELT)

This article deals with data security and access using an information resource management system, IRM. My own company, Simplr Insites LLC, is writing such a system alongside a file storage solution in an effort to modernize the research process.

Problem

One significant problem faced in research and cooperation is the attainment of clean and useful data. Obtaining this data often means gaining access to systems, forming legal agreements, obfuscating certain data, and embarking on the painful process of data wrangling.

While ETL and ELT are critical steps, just obtaining sensitive data, even from within an organization, is tricky. Consider the following cases related directly to access:

  • data sets include confidential information
  • data sets are ensnared in legal agreements regarding who can access data
  • users want to control access to data to ensure it is not misused
  • external users are allowed varying degrees of access

IRM as a Solution

Oracle generated a solution that attempts to tackle the data security issue. The Oracle IRM documentation provides a rather informative graphical overview of their tool:

irm

In this system, an external user accesses a load balanced IRM server application which controls rights and access to different resources and files. Several firewalls help to improve security along with authentication, access grants, and encryption. Web services  and internal users utilize the IRM server as well.

Beyond the visible components, tokens can be used to instantly manage resources and propagate access changes.

Most file systems also offer the capability to pull the date when a resource was created or modified and various permissions information. This is useful for logging purposes.

Setting Up an IRM

It is not necessary to rely on Oracle for an IRM solution. In fact, the Oracle IRM only works with Microsoft Windows.

Each component can be paired with a reliable tool, most of which I have blogged about. A set of pairings might include

Base Application and Resource Management Django with Secure Login
REST API Resource Access Django OAuth Toolkit
Access Management Django Oauth Toolkit and a Database System
Individual Resource Tokens Randomly Generated and Hashed Key
File Storage GlusterFS or an Encrpytable File System
Encryption of Resources PyCrypto or a Similar Tool
Firewalls IP Tables or another firewall
Two Step Verification through SMS Twilio
Key Storage Stack Exchange Blackbox
VPN Access Firefox
Logging and Anomaly Detection Elastic APM and the ElkStack

Logging

Logging is critical to security. Logs allow administrators to spot harmful activity, generate statistical models based on usage, and aid in auditing the system.

Tokens

Tokens are a perfect solution for controlling document access in the system. They allow a user to gain access to a document, offer scopes for access, and often contain scopes that grant levels of access to a resource.

A user should be required to log in to the application to retrieve a token which refreshes on a regular schedule. These tokens can be revoked and changed by a resource owner or administrator much like using a file system.

Fernet Encryption

While RSA encryption is useful for two way encryption, Fernet encryption is stronger and more useful for storing files. If a system does not offer encryption, tools such as PyCrypto offer Fernet encryption.

Storing Keys

Keys should not be stored in the open. If compromised, it is extremely easy to gain access to a key stored in plain text. Instead, tools such as Stack Exchange’s Blackbox store keys in a system backed by a GPG key ring.

Two Step Downloading for Extra Security

Downloading a file in a secure manner might require extra protection, particularly when an external but trusted user desires access to a resource. To avoid spoofing and avoid a compromised computer from gaining access to a resource, two step verification is a recommended step.

In this process the external user provides an access token to obtain a document which is verified. On verification, a text message containing an access code is sent to the external user and the internal user is notified of the access. The external user enters the code and, if required, the resource owner or admin approves the download.

This type of process is not difficult to implement through desktop or web applications using push notifications or persistent storage.

Conclusion

Secured yet accessible storage is a critical problem for any data analyst or scientist. Using an established IRM or implementing a similar tool helps secure access and empower analytics.

Why Use Comps when We Live in an Age of Data Driven LSTMS and Analytics?

weather-climate-cover

Retail is an odd topic for this blog but I have a part time job. Interestingly, besides the fact you can make $20 – 25 per hour in ways I will not reveal, stores are stuck using comps and outdated mechanisms to determine success. In other words, mid-level managers are stuck in the dark ages.

Comps are horrible in multiple ways:

  • they fail to take into account the current corporate climate
  • they refuse to take into account sales from a previous year
  • they fail to take into account shortages in supply, price increases, and other factors
  • they are generally inaccurate and about as useful as the customer rating scale recently proven ineffective
  • an entire book worth of problems

Take into account a store in a chain where business is down 10.5 percent, that just lost a major sponsor, and recently saw a relatively poor general manager create staffing and customer service issues. Comps do not take into consideration any of these factors.

There are much better ways to examine whether specific practices are providing useful results and whether a store is gaining ground, remaining the same, or giving up.

Time Series Analysis

Time series analysis is a much more capable tool in retail. Stock investors already perform this type of analysis to predict when a chain will succeed. Why can’t the mid-level management receive the same information?

A time series analysis is climate driven. It allows managers to predict what sales should be for a given day and time frame and then examine whether that day was an anomaly.

Variable Selection

One area where retail fails is in variable selection. Just accounting for sales is really not enough to make a good prediction.

Stores should consider:

  • the day of the week
  • the month
  • whether the day was special (e.g. sponsored football game, holiday)
  • price of goods and deltas for the price of goods
  • price of raw materials and the price of raw materials
  • general demand
  • types of products being offered
  • any shortage of raw material
  • any shortage of staff

Better Linear Regression Based Decision Making

Unfortunately, data collection is often poor in the retail space. A company may keep track of comps and sales without using any other relevant variables or information. The company may not even store information beyond a certain time frame.

In this instance, powerful tools such as the venerable LSTM based neural network may not be feasible. However, it may be possible to use a linear regression model to predict sales.

Linear regression models are useful in both predicting sales and determining the number of standard deviations the actual result was from the reported result. Anyone with a passing grade and an undergraduate level of mathematics learned to create a solid model and trim variables for the most accurate results using more than intuition.

Still, such models do not change based on prior performance. They also require keep track of more variables than just sales data to be most accurate.

Even more problematic is the use of multiple factorizable variables. Using too many factorized variables will lead to poorly performing models. Poorly performing models lead to inappropriate decisions. Inappropriate decisions will destroy your company.

Power Up Through LSTMS

LSTMS are powerful devices capable of tracking variables over time while avoiding much of the factorization problem. Through a Bayesian approach, they predict information based on events from the past.

These models take into account patterns over time and are influenced by events from a previous day. They are useful in the same way as regression analysis but are impacted by current results.

Being Bayesian, an LSTM can be built in chunks and updated in real time, providing less need for maintenance and increasingly better performance.

Marketing Use Case as an Example

Predictive analytics and reporting are extremely useful in developing a marketing strategy, something often overlooked today.

By combining predictive algorithms with sales, promotions, and strategies, it is possible to ascertain whether there was an actual impact from using an algorithm. For instance, did a certain promotion generate more revenue or sales?

These questions posed over time (more than 32 days would be best), can prove the effectiveness of a program. They can reveal where to advertise to, how to advertise, and where to place the creative efforts of marketing and sales to best generate revenue.

When managers are given effective graphics and explanations for numbers based on these algorithms, they gain the power to determine optimal marketing plans. Remember, there is a reason business and marketing are considered a little scientific.

Conclusion

Comps suck. Stop using them to gauge success. They are illogical oddities from an era where money was easy and simple changes brought in revenue (pre 2008).

Companies should look to analytics and data science to drive sales and prove their results.

 

The Case for Microservices, Where To Segment

micro

There is a growing need for microservices and shared services in the increasingly complex and vibrant set of technologies a true IT firm runs. Licensing, authentication, database services, ETL, reporting, analytics, information management, and the plethora of tasks being completed on the backend are impossible to accomplish in only a single application.

This article examines boundaries discovered through my own company’s experience in building microservice related applications.

Related Articles:

Discovering Sharable Resources in a Microservices Environment

Security in a Microservices Environment

Segment On Need and Resource Usage

To be fair, where segmentation of systems occurs depends on the need for each service. Clients may need a core set of tasks to be completed in one area or another. Where those needs diverge is a perfect boundary for establishing a service.

For instance, our clients need ETL, secured cloud file storage, data sharing, text management, FERPA/HIPP and legally compliant storage of data, analytics, data streaming, surveying, and reporting. Each of these areas encompasses one company or another but is cheaper done under a single roof to the tune of $7000 in savings per employee per year at a small to medium sized company.

Our boundaries are specified directly around needs, security, and resource costs. ETL encompasses one boundary due to computation costs, cloud storage another for security reasons, logging for legal compliance another, analytics takes up another service due to computational costs, stream and survey intake and initial analysis comprises another more vulnerable piece, and reporting yet another. Overlapping everything is a service for authorization and the authentication of access rights through oauth2.

The different services were chosen for one of the following factors:

  • resource cost
  • shared tasks and resources
  • legal compliance and security

Segmenting for Security

The modern world is growing increasingly security and privacy conscious. Including authentication systems and the storage of information on the same system as a web server is not recommended.

Microservices allow for individual applications to be separated and controlled. Access can be granted to specific clusters based on a firewall and authentication. Even user access control is easier to maintain. Hardware boundaries can be easily established between vulnerable pieces of a system.

Essentially, never stick a vulnerable frontend, streaming, or survey application on the same hardware as your potentially identifying initial file storage and always have some sort of authentication and access rights mechanism.

Results

Our boundaries are helping us scale. Simplr Insites, LLC dedicates individual resources as needed to each service. It also allows the company to offer a pricing scheme offering variable levels of services tailored to a customers needs more easily.

Some clients do not need an ETL system and only want case note management. That is possible. At the same time, granting GPU resources to the analytics cluster while giving our reporting cluster more RAM is as well.

In essence, Simplr Insites was able to reduce the cost of running systems in a 42 U shared space, possibly by as much as $5000 per month for our small company, while remaining more secure and delivering faster and tailored results based on the needs of clients through a single web frontend based SAAS application.

Conclusion

Discovering where to place microservice boundaries is critical to the success of an application. It relies on many factors ranging from resource cost, to the ability to share resources, and even legal compliance and security. Appropriate splitting of services can reduce cost and increase speed.