Why Use Comps When We Live in an Age of Data-Driven LSTMs and Analytics?


Retail is an odd topic for this blog, but I have a part-time job. Interestingly, besides the fact that you can make $20–25 per hour in ways I will not reveal, stores are stuck using comps and other outdated mechanisms to determine success. In other words, mid-level managers are stuck in the dark ages.

Comps are horrible in multiple ways:

  • they fail to take into account the current corporate climate
  • they refuse to take into account sales from a previous year
  • they fail to take into account shortages in supply, price increases, and other factors
  • they are generally inaccurate and about as useful as the customer rating scales recently shown to be ineffective
  • an entire book's worth of other problems

Consider a store in a chain where business is down 10.5 percent, that just lost a major sponsor, and that recently saw a relatively poor general manager create staffing and customer service issues. Comps take none of these factors into consideration.

There are much better ways to examine whether specific practices are producing useful results and whether a store is gaining ground, holding steady, or falling behind.

Time Series Analysis

Time series analysis is a much more capable tool for retail. Stock investors already perform this type of analysis to predict when a chain will succeed. Why can't mid-level management receive the same information?

Time series analysis is driven by the current climate rather than by a year-old comparison. It allows managers to predict what sales should be for a given day or time frame and then examine whether that day was an anomaly.
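As a concrete illustration, here is a minimal pandas sketch of that kind of check: predict each day from a rolling window of prior days and flag days that fall unusually far from the prediction. The file name, column names, and window length are assumptions for the example.

    import pandas as pd

    # A daily sales series indexed by date (file and column names are assumed).
    sales = pd.read_csv("daily_sales.csv", parse_dates=["date"], index_col="date")["sales"]

    # Expected sales for each day: a 28-day rolling mean shifted by one day so a
    # prediction only uses prior days, plus the matching rolling spread.
    expected = sales.rolling(window=28).mean().shift(1)
    spread = sales.rolling(window=28).std().shift(1)

    # Flag days that land more than two standard deviations from expectation.
    z = (sales - expected) / spread
    print(sales[z.abs() > 2])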

Variable Selection

One area where retail fails is variable selection. Accounting for sales alone is simply not enough to make a good prediction.

Stores should consider (a rough sketch of turning these into model features follows the list):

  • the day of the week
  • the month
  • whether the day was special (e.g. sponsored football game, holiday)
  • price of goods and deltas for the price of goods
  • price of raw materials and deltas for the price of raw materials
  • general demand
  • types of products being offered
  • any shortage of raw material
  • any shortage of staff
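Here is one hedged sketch, in pandas, of what encoding those variables as features might look like. Every column name, file name, and special date below is a stand-in; real values would come from the point-of-sale system and a hand-maintained calendar of special events.

    import pandas as pd

    # Hypothetical daily records (all column names are assumptions).
    df = pd.read_csv("daily_sales.csv", parse_dates=["date"])

    df["day_of_week"] = df["date"].dt.dayofweek            # 0 = Monday
    df["month"] = df["date"].dt.month
    special_days = pd.to_datetime(["2019-09-07", "2019-11-28"])   # e.g. sponsored game, holiday
    df["is_special_day"] = df["date"].isin(special_days).astype(int)

    # Day-over-day deltas for the price of goods and raw materials.
    df["goods_price_delta"] = df["goods_price"].diff()
    df["materials_price_delta"] = df["materials_price"].diff()

    feature_cols = [
        "day_of_week", "month", "is_special_day",
        "goods_price", "goods_price_delta",
        "materials_price", "materials_price_delta",
        "demand_index", "material_shortage", "staff_shortage",
    ]
    X = df[feature_cols].fillna(0)
    y = df["sales"]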

Better Linear Regression Based Decision Making

Unfortunately, data collection is often poor in the retail space. A company may keep track of comps and sales without using any other relevant variables or information. The company may not even store information beyond a certain time frame.

In this instance, powerful tools such as the venerable LSTM-based neural network may not be feasible. However, it may still be possible to use a linear regression model to predict sales.

Linear regression models are useful both for predicting sales and for determining how many standard deviations the actual result fell from the predicted one. Anyone who passed an undergraduate-level mathematics course has learned how to build a solid model and trim variables for the most accurate results using more than intuition.
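A minimal scikit-learn sketch of that idea, reusing the feature table X and sales series y from the sketch above; the 30-day holdout and the two standard deviation cutoff are arbitrary choices.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hold out the last 30 days to check against.
    X_arr, y_arr = X.to_numpy(dtype=float), y.to_numpy(dtype=float)
    X_train, y_train = X_arr[:-30], y_arr[:-30]
    X_test, y_test = X_arr[-30:], y_arr[-30:]

    linreg = LinearRegression().fit(X_train, y_train)

    # How far did each actual day land from its prediction, measured in
    # residual standard deviations?
    residual_sd = np.std(y_train - linreg.predict(X_train))
    z_scores = (y_test - linreg.predict(X_test)) / residual_sd
    print(z_scores[np.abs(z_scores) > 2])   # days that deserve a closer look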

Still, such models do not change based on prior performance. They also require tracking more variables than just sales data to be most accurate.

Even more problematic is the use of multiple factorizable (categorical) variables. Factorizing too many of them leads to poorly performing models. Poorly performing models lead to inappropriate decisions. Inappropriate decisions will destroy your company.
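To make the factorization problem concrete, here is a small, self-contained pandas example of how quickly dummy columns multiply; the column name and cardinality are invented.

    import pandas as pd

    # One high-cardinality factor column becomes hundreds of sparse dummy columns,
    # far more than a store's modest daily dataset can support.
    products = pd.DataFrame({"product_type": [f"sku_{i % 500}" for i in range(1000)]})
    dummies = pd.get_dummies(products["product_type"])
    print(dummies.shape)   # (1000, 500): 500 columns from a single variable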

Power Up Through LSTMs

LSTMs are powerful models capable of tracking variables over time while avoiding much of the factorization problem. They use events from the past to predict what should come next.

These models take into account patterns over time and are influenced by events from previous days. They are useful in the same way as regression analysis but are also shaped by the most recent results.

Because an LSTM can be trained incrementally, it can be built in chunks and updated as new data arrives, which means less maintenance and steadily better performance.
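A minimal Keras sketch of such a model, reusing the feature table X and sales series y from earlier; the 14-day window, layer size, and training settings are arbitrary, and tensorflow is assumed to be available.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    # Turn the feature rows into overlapping 14-day windows, each predicting
    # the following day's sales.
    def make_windows(features, target, window=14):
        xs, ys = [], []
        for i in range(len(features) - window):
            xs.append(features[i:i + window])
            ys.append(target[i + window])
        return np.array(xs), np.array(ys)

    X_seq, y_seq = make_windows(X.to_numpy(dtype="float32"), y.to_numpy(dtype="float32"))

    lstm_model = Sequential([
        LSTM(32, input_shape=(X_seq.shape[1], X_seq.shape[2])),  # (timesteps, features)
        Dense(1),                                                 # next-day sales
    ])
    lstm_model.compile(optimizer="adam", loss="mse")
    lstm_model.fit(X_seq, y_seq, epochs=20, batch_size=16, verbose=0)

    # As new days arrive, append fresh windows and call fit() again to update the
    # existing weights instead of retraining from scratch.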

Marketing Use Case as an Example

Predictive analytics and reporting are extremely useful in developing a marketing strategy, something often overlooked today.

By combining predictive algorithms with data on sales, promotions, and strategies, it is possible to ascertain whether a given effort had an actual impact. For instance, did a certain promotion generate more revenue or more sales?

These questions, posed over time (more than 32 days would be best), can prove the effectiveness of a program. They can reveal where to advertise, how to advertise, and where to place the creative efforts of marketing and sales to best generate revenue.
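One way to pose that question numerically, sketched below with scipy and the regression from earlier: compare the model's residuals (actual minus predicted sales) on promotion days against ordinary days. The is_special_day column stands in for a proper promotion flag from the marketing calendar.

    import numpy as np
    from scipy import stats

    # Residuals from the regression sketch above; positive means the day beat the model.
    residuals = y_arr - linreg.predict(X_arr)
    promo = df["is_special_day"].to_numpy().astype(bool)

    lift = residuals[promo].mean() - residuals[~promo].mean()
    t_stat, p_value = stats.ttest_ind(residuals[promo], residuals[~promo], equal_var=False)
    print(f"average lift per promotion day: {lift:.1f}, p-value: {p_value:.3f}")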

When managers are given effective graphics and explanations for numbers based on these algorithms, they gain the power to determine optimal marketing plans. Remember, there is a reason business and marketing are considered at least a little scientific.

Conclusion

Comps suck. Stop using them to gauge success. They are illogical oddities from an era when money was easy and simple changes brought in revenue (pre-2008).

Companies should look to analytics and data science to drive sales and prove their results.

 

Smoothing: When and Which?

Smoothing is everywhere. It is a preprocessing step in signal processing, it makes text segmentation work well, and it is used in a variety of programs to cull noise. However, there is a wide variety of ways to smooth data to achieve more appropriate predictions and lines of fit. This overview should help get you started.

When to Use Smoothing

The rather simple art of smoothing data is best performed when making predictions from temporal data that comes from one or more potentially noisy sources (think of a weather station whose wind speed monitor is a bit loose or intermittently unreliable), or when dealing with tasks such as converting digital sound to analogue waves for study. When a source appears capable of producing decent predictions but is relatively noisy, smoothing helps. It is even used in fitting smoothing splines.

Types of Smoothing to Consider

Now that the rather blunt explanation is out of the way, the following list is based on my own use with text mining and some sound data. I have tried to include the best resources I could for each, and a short combined code sketch follows the list.

  • Rectangular or Triangular: Best used for non-temporal, fairly well-fitting data where more than just past values matter (text segmentation is an example).
  • Simple Exponential: Produces a moving-average style of smoothing that weights past events. It is not a terrific smoothing tool, but it is quick and works well when data is correlated (it could work for sound, though sound may be better served by a Hamming or Hanning window). Unlike double and triple exponential smoothing, the algorithm requires no past experience or parameter discovery to apply correctly, for better or worse.
  • Double and Triple Exponential Smoothing: Works well with time series data, and peaks and valleys are better preserved. Triple exponential smoothing works well with seasonal data. Both require some manual tuning, or an algorithm that relies on previous experience, to find a good alpha value. Again, past events are more heavily weighted.
  • Hanning and Hamming Windows: Periodic data (wave forms) may work well with this type of smoothing. Both are based on the cosine function, and past experience is not needed. For really noisy data, try the more intensive Savitzky-Golay filter.
  • Savitzky-Golay: This smoothing works fairly well and preserves peaks and valleys within the broader scope of the data. It is not ideal when a truly smooth curve is required; however, if some of the noise is actually important, this is a great method. Its publication is actually considered one of the most important in the history of Analytical Chemistry for spectral analysis. It uses a localized least squares technique to accomplish its feat.

    However, do not rely on the matrix-based (normal equation) calculation for OLS as the most efficient option; on large data sets gradient descent is clearly the winner, and no self-respecting programmer will invert matrices at that scale. Spark contains an optimized gradient descent implementation for distributed and even single-node programming. The algorithm is tail recursive and seeks to minimize a cost function.

  • Distribution Hint

    For the programmer or developer looking to distribute a Savitzky-Golay calculation without relying on Spark's gradient descent, mapPartitions works well on the local windows. It also works well when smoothing many splines or for Hanning and Hamming window based smoothing.
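To make the options above concrete, here is the combined sketch mentioned before the list: simple exponential smoothing written by hand, a Hanning-window smooth via convolution, and scipy's Savitzky-Golay filter, all applied to a synthetic noisy sine wave. The window lengths, alpha, and polynomial order are arbitrary choices for illustration.

    import numpy as np
    from scipy.signal import savgol_filter

    # A synthetic noisy signal standing in for real sensor or text-score data.
    rng = np.random.default_rng(0)
    signal = np.sin(np.linspace(0, 6 * np.pi, 300)) + rng.normal(0, 0.3, 300)

    # Simple exponential smoothing: mix the current value with the previous
    # smoothed value, weighted by alpha.
    def exponential_smooth(x, alpha=0.3):
        out = np.empty_like(x)
        out[0] = x[0]
        for i in range(1, len(x)):
            out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
        return out

    exp_smoothed = exponential_smooth(signal)

    # Hanning-window smoothing: convolve with a normalized cosine-based window.
    window = np.hanning(21)
    hann_smoothed = np.convolve(signal, window / window.sum(), mode="same")

    # Savitzky-Golay: localized least-squares polynomial fits that preserve peaks.
    sg_smoothed = savgol_filter(signal, window_length=21, polyorder=3)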

A Proposed Improvement to Decision Trees, Sort of a Neural Net

Decision trees remove a lot of power from themselves by having a single root node. They are trees, after all, and who is to say that one starting point that works for most data holds for all data? There may be an alternate way to handle them, one I will test when I have time against an appropriate dataset such as legal case data. The proposed solution is simple: change the structure. Don't use a decision tree, use a decision graph. Every node has the ability to be chosen as the entry point. Also, don't create just one vertex, create many vertices.

Feasibility

This is actually more feasible than it appears. By training a neural net to choose the correct node to start with and specifying correct paths from each node, the old n*(n-1) edge-to-node relationship does not need to become a reality, and the structure retains a decision tree like shape. Basically, a decision graph.
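As a very rough, untested sketch of that idea using scikit-learn parts: several small trees act as alternative entry nodes and a neural net learns which one to start from. The toy data, feature subsets, and routing rule below are all invented for illustration; this is not a tested implementation of the proposal.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    # Toy data standing in for a real dataset such as legal case records.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(500, 10))
    labels = (features[:, 0] + features[:, 3] > 0).astype(int)

    # Several small trees act as alternative entry nodes, each trained on a
    # different feature subset instead of sharing one root.
    subsets = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
    trees = [DecisionTreeClassifier(max_depth=3).fit(features[:, cols], labels)
             for cols in subsets]

    # The router's training target is the first entry node that classifies each
    # example correctly (falling back to node 0 when none do).
    preds = np.stack([t.predict(features[:, cols]) for t, cols in zip(trees, subsets)], axis=1)
    entry_target = (preds == labels[:, None]).argmax(axis=1)
    router = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(features, entry_target)

    def predict_one(x):
        i = int(router.predict(x.reshape(1, -1))[0])
        return trees[i].predict(x[subsets[i]].reshape(1, -1))[0]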

Positives

We now have a more manual hybrid between a neural net, a decision tree, and a graph that still needs to be tested. However, we also have many different possible outcomes without relying on one root node: a much better set of starting points and many more than just a single path.

Negatives

This is still a manual process whose best outcome is probably connecting every node to every other node and training outcomes on the best path. It can also be extremely calculation intensive to get right.

Conclusion

For now, the takeaway is simply that decision trees are limited and not really the ideal solution in most circumstances.

A Cosine Cutoff Value

I recently discovered this but cannot remember where. The article related to ultrasounds and mapping them, although I was surprised that fuzzy c-means was not mentioned. I have deployed it to some decent effect in the text mining algorithms I am writing.

    threshold = sqrt((1 / matrix.shape[0]) * Sum_ij((x_ij - column_mean_j)^2))
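A direct numpy reading of that formula, with one assumption made explicit: the deviation is squared, since summing raw deviations from a column mean gives zero. The example usage at the end is only a guess at how the cutoff would be applied to a matrix of cosine similarities.

    import numpy as np

    def cosine_cutoff(matrix):
        deviations = matrix - matrix.mean(axis=0)        # x_ij - column_mean_j
        # Squaring is an assumption; without it the inner sum collapses to zero.
        return np.sqrt((deviations ** 2).sum() / matrix.shape[0])

    # Guessed usage: keep only cosine similarities at or above the cutoff.
    scores = np.array([[0.1, 0.8], [0.4, 0.9], [0.2, 0.7]])
    cutoff = cosine_cutoff(scores)
    print(scores[scores >= cutoff])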

     

Please comment if you have found something better or know where this came from. Cheers!