Why Use Comps When We Live in an Age of Data-Driven LSTMs and Analytics?


Retail is an odd topic for this blog, but I have a part-time job. Interestingly, besides the fact that you can make $20–25 per hour in ways I will not reveal, stores are stuck using comps and other outdated mechanisms to determine success. In other words, mid-level managers are stuck in the dark ages.

Comps are horrible in multiple ways:

  • they fail to take into account the current corporate climate
  • they refuse to take into account sales from a previous year
  • they fail to take into account shortages in supply, price increases, and other factors
  • they are generally inaccurate and about as useful as the customer rating scale recently proven ineffective
  • an entire book's worth of other problems

Consider a store in a chain where business is down 10.5 percent, which just lost a major sponsor, and which recently saw a relatively poor general manager create staffing and customer service issues. Comps take none of these factors into consideration.

There are much better ways to examine whether specific practices are producing useful results and whether a store is gaining ground, holding steady, or falling behind.

Time Series Analysis

Time series analysis is a much more capable tool in retail. Stock investors already perform this type of analysis to predict when a chain will succeed. Why can't mid-level management receive the same information?

A time series analysis is driven by the current climate. It allows managers to predict what sales should be for a given day or time frame and then examine whether the actual result was an anomaly.
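As a minimal sketch (the daily sales series below is synthetic; a real store would plug in its own numbers), a rolling mean and standard deviation can flag days whose sales fall unusually far from what recent history predicts:

import numpy as np
import pandas as pd

# Hypothetical daily sales series (replace with real store data)
rng = pd.date_range("2015-01-01", periods=365, freq="D")
sales = pd.Series(1000 + 50 * np.sin(np.arange(365) / 7.0) + np.random.normal(0, 25, 365), index=rng)

# Predict each day from the previous 28 days and measure the deviation
rolling_mean = sales.rolling(28).mean().shift(1)
rolling_std = sales.rolling(28).std().shift(1)
z_scores = (sales - rolling_mean) / rolling_std

# Days more than two standard deviations from the rolling prediction
anomalies = z_scores[z_scores.abs() > 2]
print(anomalies)

A proper seasonal model (for example, statsmodels' seasonal_decompose) would capture weekly and monthly patterns better, but even this crude baseline beats comparing a raw number against last year.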

Variable Selection

One area where retail fails is variable selection. Accounting for sales alone is really not enough to make a good prediction.

Stores should consider the following (a sketch of encoding these as model features appears after the list):

  • the day of the week
  • the month
  • whether the day was special (e.g. sponsored football game, holiday)
  • price of goods and deltas for the price of goods
  • price of raw materials and deltas for the price of raw materials
  • general demand
  • types of products being offered
  • any shortage of raw material
  • any shortage of staff
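As a rough sketch (the data frame and column names here are hypothetical), most of these variables can be encoded with a few lines of pandas before being handed to any model:

import pandas as pd

# Hypothetical daily data; column names are illustrative only
df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=90, freq="D"),
    "sales": 1000.0,
    "goods_price": 4.50,
    "raw_material_price": 1.25,
    "staff_on_shift": 8,
})

df["day_of_week"] = df["date"].dt.dayofweek        # 0 = Monday
df["month"] = df["date"].dt.month
df["special_day"] = df["date"].isin(pd.to_datetime(["2015-02-01"]))  # e.g. a sponsored game
df["goods_price_delta"] = df["goods_price"].diff().fillna(0)
df["raw_material_delta"] = df["raw_material_price"].diff().fillna(0)

# One-hot encode the categorical columns so a regression can use them
features = pd.get_dummies(df, columns=["day_of_week", "month"], drop_first=True)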

Better Linear Regression Based Decision Making

Unfortunately, data collection is often poor in the retail space. A company may keep track of comps and sales without using any other relevant variables or information. The company may not even store information beyond a certain time frame.

In this instance, powerful tools such as the venerable LSTM-based neural network may not be feasible. However, it may still be possible to use a linear regression model to predict sales.

Linear regression models are useful both for predicting sales and for determining how many standard deviations the actual result was from the predicted result. Anyone with a passing grade in undergraduate-level mathematics has learned how to create a solid model and trim variables for the most accurate results using more than intuition.
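A minimal sketch of that idea (the arrays below are random placeholders for real feature rows and sales figures): fit a regression on historical days, predict the day in question, and express the miss in standard deviations of the residuals.

import numpy as np
from sklearn.linear_model import LinearRegression

# X: historical feature rows, y: historical daily sales (hypothetical data)
X = np.random.rand(200, 6)
y = 1000 + X @ np.array([50, -30, 20, 10, 5, -15]) + np.random.normal(0, 25, 200)

model = LinearRegression().fit(X, y)
residual_std = np.std(y - model.predict(X))

# How surprising was today's actual figure?
today_features = np.random.rand(1, 6)
predicted = model.predict(today_features)[0]
actual = 900.0
print((actual - predicted) / residual_std)  # deviations from the prediction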

Still, such models do not adapt based on prior performance. They also require keeping track of more variables than just sales data to be most accurate.

Even more problematic is the use of many categorical (factor) variables. Encoding too many of them blows up the number of model terms and leads to poorly performing models. Poorly performing models lead to inappropriate decisions, and inappropriate decisions will destroy your company.

Power Up Through LSTMs

LSTMs are powerful models capable of tracking variables over time while avoiding much of the categorical-explosion problem. They learn from sequences of past events and use that history to predict what comes next.

These models take into account patterns over time and are influenced by events from previous days. They are useful in the same way as regression analysis but are also shaped by the most recent results.

An LSTM can also be trained incrementally and updated as new data arrives, reducing the need for maintenance and improving performance over time.
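As a minimal sketch (assuming Keras/TensorFlow is available; the shapes and arrays are illustrative placeholders), a small LSTM that predicts the next day's sales from a 28-day window of features might look like this:

import numpy as np
from tensorflow import keras

# Hypothetical training data: 500 windows of 28 days, 6 features each
X = np.random.rand(500, 28, 6).astype("float32")
y = np.random.rand(500, 1).astype("float32")   # next-day sales (scaled)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(28, 6)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# Incremental update: fit again on the newest windows as they arrive
model.fit(X[-32:], y[-32:], epochs=1, verbose=0)

Whether this beats a well-specified regression depends entirely on how much clean history the store actually has.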

Marketing Use Case as an Example

Predictive analytics and reporting are extremely useful in developing a marketing strategy, something often overlooked today.

By combining predictive algorithms with sales, promotions, and strategies, it is possible to ascertain whether there was an actual impact from using an algorithm. For instance, did a certain promotion generate more revenue or sales?
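One simple way to test that question (a hedged sketch, not a full attribution model; the figures below are made up) is to compare sales on promotion days against comparable non-promotion days:

import numpy as np
from scipy import stats

# Hypothetical daily sales for the same weekdays, with and without the promotion
promo_sales = np.array([1180, 1225, 1160, 1300, 1275, 1190, 1240])
baseline_sales = np.array([1020, 1075, 990, 1105, 1060, 1015, 1080])

t_stat, p_value = stats.ttest_ind(promo_sales, baseline_sales, equal_var=False)
print(f"mean lift: {promo_sales.mean() - baseline_sales.mean():.1f}, p-value: {p_value:.4f}")

A longer window and controls for seasonality strengthen the conclusion considerably.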

These questions, posed over time (more than 32 days would be best), can prove the effectiveness of a program. They can reveal where to advertise, how to advertise, and where to place the creative efforts of marketing and sales to best generate revenue.

When managers are given effective graphics and explanations for numbers based on these algorithms, they gain the power to determine optimal marketing plans. Remember, there is a reason business and marketing are considered at least somewhat scientific.

Conclusion

Comps suck. Stop using them to gauge success. They are illogical oddities from an era when money was easy and simple changes brought in revenue (pre-2008).

Companies should look to analytics and data science to drive sales and prove their results.

 

The Very Sad and Disturbing State of JVM Based Sparse Matrix Packages

Big data is the rage, distribution is the rage, and so too is the growth of streaming data. The faster a company is, the better. Such speed requires, no, demands, solid matrix performance. Worse yet, big data is inherently sparse, and testing and implementing new algorithms requires sparse matrices (CSR, CSC, COO, and the like). Sadly, Java is not up to the task.

Let's revisit some facts. Java is faster than Python at its core. Many tasks require looping over data in ways numpy or scipy simply do not support. A recent benchmark of Python 3 v. Java highlights this. Worse, Python 2 and Python 3 use the global interpreter lock (GIL), making attempts at speed through concurrency often slower than single threading and forcing developers to use the dreaded multiprocessing (large, clunky programs using only as many processes as cores). Still, multiprogramming and other modern operating-systems concepts are helping Python achieve better, albeit still quite slow, speeds.

That said, Numpy and Scipy are the opposite of all of this. Although they are glued together with the slower Python and Cython, their cores are written in C, perform blazingly fast, and leave every Java matrix library in the dust. In fact, in attempting to implement some basic machine learning tasks, I found myself not just writing things like text tiling, which I fully expected to do, but also starting down the path of creating a full-fledged sparse matrix library with a hashing library.

The following is the sad state of my Matrix tests.

The Libraries

The following libraries were used in the test: Breeze (Scala), UJMP, MTJ, la4j, and BidMach, with Scipy/Numpy serving as the baseline.

The Cosines Test

An intensive test of a common use case is the calculation of the dot product (a · b, or a * b.t). Taking this result and dividing by norm(a) * norm(b) yields the cosine of theta.

This simple test includes multiplication, transposing, and mapping division across all active values.

The Machine and Test Specs

The following machine specifications held:

  • CPU: Core i3, 2.33 GHz
  • RAM: 8 GB (upgraded on laptop)
  • Environment: Eclipse
  • Test: Cosine Analysis
  • Languages: Scala and Python (scipy and numpy only)
  • Iterations: 32
  • Allotted Memory: 7 GB, via x64 Python 3 or -Xmx7g
  • Average: Strict non-smoothed average

The Scipy/Numpy Gold Standard

Not many open source libraries can claim the speed and power of the almighty Scipy and Numpy. These libraries can handle sparse matrices with m*n well over 1,000,000. More telling, they are fast. The calculation of cosines is an extremely common practice in NLP and is a valuable similarity metric in most circumstances.

import scipy.sparse as sparse
import scipy.sparse.linalg as splinalg

# 1000 x 50000 random sparse matrix with 15% density
mat = sparse.rand(1000, 50000, 0.15)

# mat * mat.T gives the dot products; divide by the squared Frobenius norm
print((mat * mat.T) / pow(splinalg.norm(mat), 2))

Result : 5.13 seconds

The Results

The following resulted from each library:

  • Breeze : Crashes with Out of Memory Error (developer notified) [mat * mat.t]
  • UJMP : 128.73 seconds
  • MTJ : 285.13 seconds
  • La4j : 420.2 seconds
  • BidMach : Crashes (sprand(1000,50000,0.15))

Here is what the author of Breeze had to say. Rest assured, Numpy has been stable for over a decade now with constant improvements.

Conclusion

Java libraries are slow and seemingly undeserving of praise. Even accounting for the potential benefit of not having to distribute every calculation, they are not production grade. Promising libraries such as Nd4J/Nd4s still do not have a sparse matrix implementation, though one has reportedly been in the works for some time. The alternatives are to use Python or to write millions of lines of C. While I find C fun, it is not a fast language to develop in. Perhaps, for now, Python will do. After all, PySpark is a little over six months old.

Open Source Data Science, the Great Resource No One Knows About

There is a growing industry of online technology courses that is starting to gain traction among many who were in school when fields like data science were still the plaything of graduate students and PhDs in computer science, statistics, and even, to a degree, biology. However, these online courses will never match the pool of knowledge one could drink from by taking even an undergraduate computer science or mathematics class at a middling state school today (I would encourage everyone to avoid business schools like the plague for technology).

In an industry that is constantly transforming itself, and especially one where the field of data will provide long-term work, these courses may appear quite appealing. However, they are often too shallow, offering breadth without depth, and thinking that they make it possible to pick up and understand the depth of the 1000-page thesis that led to the stochastic approach to matrix operations, and eventually to Spark, is ridiculous. We are all forgetting about the greatest resources available today: the internet, open source code, and a search engine can add layers of depth to what would otherwise be an education unable to provide enough grounding for employment.

Do Take the Online Courses

First off, the online courses from Coursera are great. They can provide a basic overview of parts of the field. Urbana offers a great data science course, and I am constantly stumbling into blogs presenting concepts from it. However, how much can someone fit into 2–3 hours per week for six weeks, in a field that may encompass 2–3 years of undergraduate coursework and even some master's-level topics before one begins to approach expert level?

You may learn a basic k-means or deploy some subset of algorithms, but can you optimize them, and do you really know more than the Bayesian probabilities you likely also learned in a statistics class?

Where Open Source Fits In

Luckily, many of the advanced concepts and a great deal of research are actually available online for free. The culmination of decades of research is available at your fingertips in open source projects.

Sparse Matrix research, edge detection algorithms, information theory, text tiling, hashing, vectorizing, and more are all available to anyone willing to put in the time to learn them adequately.

Resources

Documentation is widely available and often on github for:

These github accounts also contain useful links to websites explaining the code, containing further documentation (javadocs), and giving some conceptual depth and further research opportunities.

The vast majority of the conceptual literature can be found with a simple search.

Sit down, read the conceptual literature. Find books on topics like numerical analysis, and apply what you spent tens or even hundreds of thousands of dollars to learn in school.

NotePad: A NickName Recognition Idea

So this is just an idea. Nicknames can apparently be attached to multiple formal names. That means we could have a graph of nicknames, and also that a direct lookup list may be tricky. Here is a quick thought on how to approach a solution.

 

A. Graph the Names and their known connections to formal names.

B. Attach probability data to each node specifying which characteristics trigger that resolution (Markovian, sort of)

C. Start a Server accepting requests

D. When a request comes in, first look for all potential connections from a nickname to a name; if none exist, save this as a new node. The search can be performed depth first or breadth first, and could even be split across random points on the graph in a more stochastic manner such as a random walk (I will need to study these more). This could also be done by just choosing multiple random search points.

E. If there is one node, return the formal name

F. If there are multiple names take one of several approaches:

1. Calculate Bayesian probabilities given the specified characteristics and return the strongest match (this is the weakest solution).

2. Train a neural net with the appropriate kernel (RBF, linear, etc.) and return the result from that net. (This is slow, and keeping a million neural nets in storage seems like a bad idea.)

When generating standalone nodes, it may be possible to use Levenshtein distance and other characteristics to attach nickname nodes to formal nodes based on a threshold. A clustering algorithm could use the formal name averages as cluster centers, and a hard cutoff (e.g. a Levenshtein distance of 1 or 2) could solidify the connection and throw out Type I and Type II errors.
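A very rough sketch of the graph (step A) plus the Levenshtein attachment rule, assuming a plain in-memory dictionary as the graph (all names, thresholds, and the helper function are hypothetical):

# Minimal sketch: nickname graph as a dict plus a Levenshtein-based attach rule
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Known connections from nicknames to formal names (step A)
graph = {"bill": {"william"}, "will": {"william"}, "bob": {"robert"}}
formal_names = {"william", "robert", "katherine"}

def resolve(nickname: str, cutoff: int = 2) -> set:
    nickname = nickname.lower()
    if nickname in graph:                     # steps E/F: known nickname
        return graph[nickname]
    # New node: attach to any formal name within the Levenshtein cutoff
    matches = {f for f in formal_names if levenshtein(nickname, f) <= cutoff}
    graph[nickname] = matches                 # store the new node for next time
    return matches

print(resolve("willian"))   # {'william'}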

Stay tuned while I flesh this post out with actual code. It may just be a proposal for a while.

A Cosine Cutoff Value

I recently discovered this but cannot remember where. The article related to ultrasounds and mapping them, although I was surprised that fuzzy c-means was not mentioned. I have deployed it to some decent effect in text mining algorithms I am writing.

threshold = sqrt((1 / m) * sum_i(x_ij - column_mean_j)), where m = matrix.shape[0] (the number of rows)

 

Please comment if you have found something better or know where this came from. Cheers!