Why Use Comps When We Live in an Age of Data-Driven LSTMs and Analytics?


Retail is an odd topic for this blog, but I have a part-time job. Interestingly, besides the fact that you can make $20–25 per hour in ways I will not reveal, stores are stuck using comps and other outdated mechanisms to determine success. In other words, mid-level managers are stuck in the dark ages.

Comps are horrible in multiple ways:

  • they fail to take into account the current corporate climate
  • they ignore sales trends beyond a single prior-year comparison
  • they fail to take into account supply shortages, price increases, and other factors
  • they are generally inaccurate and about as useful as the customer rating scale recently proven ineffective
  • an entire book's worth of problems

Consider a store in a chain where business is down 10.5 percent, that just lost a major sponsor, and that recently saw a relatively poor general manager create staffing and customer service issues. Comps do not take any of these factors into consideration.

There are much better ways to examine whether specific practices are producing useful results and whether a store is gaining ground, holding steady, or losing it.

Time Series Analysis

Time series analysis is a much more capable tool for retail. Stock investors already perform this type of analysis to predict when a chain will succeed. Why can't mid-level managers receive the same information?

A time series analysis is climate driven. It allows managers to predict what sales should be for a given day and time frame and then examine whether that day was an anomaly.
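As a minimal sketch of the anomaly check described above (the daily sales figures and the window and threshold values are invented for illustration), a rolling mean and standard deviation are enough to flag a day that falls outside the expected range:

```python
def rolling_anomalies(sales, window=7, threshold=2.0):
    """Flag days whose sales fall more than `threshold` standard
    deviations from the mean of the preceding `window` days."""
    flagged = []
    for i in range(window, len(sales)):
        history = sales[i - window:i]
        mean = sum(history) / window
        var = sum((x - mean) ** 2 for x in history) / window
        std = var ** 0.5
        if std > 0 and abs(sales[i] - mean) / std > threshold:
            flagged.append(i)
    return flagged

# nine ordinary days followed by a crash
daily_sales = [100, 102, 98, 101, 99, 103, 97, 100, 102, 40]
print(rolling_anomalies(daily_sales))  # [9]
```

A real deployment would replace the rolling window with a proper time series model, but the question asked is the same: was this day an anomaly?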

Variable Selection

One area where retail fails is in variable selection. Just accounting for sales is really not enough to make a good prediction.

Stores should consider:

  • the day of the week
  • the month
  • whether the day was special (e.g. sponsored football game, holiday)
  • price of goods and deltas for the price of goods
  • price of raw materials and deltas for the price of raw materials
  • general demand
  • types of products being offered
  • any shortage of raw material
  • any shortage of staff

Better Linear Regression Based Decision Making

Unfortunately, data collection is often poor in the retail space. A company may keep track of comps and sales without using any other relevant variables or information. The company may not even store information beyond a certain time frame.

In this instance, powerful tools such as LSTM-based neural networks may not be feasible. However, it may be possible to use a linear regression model to predict sales.

Linear regression models are useful both for predicting sales and for determining how many standard deviations the actual result fell from the predicted result. Anyone who passed an undergraduate mathematics course learned to build a solid model and trim variables for the most accurate results using more than intuition.

Still, such models do not change based on prior performance. They also require keeping track of more variables than just sales data to be most accurate.

Even more problematic is the use of multiple factorized (categorical) variables. Using too many factorized variables leads to poorly performing models. Poorly performing models lead to inappropriate decisions. Inappropriate decisions will destroy your company.
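A minimal sketch of the regression idea (ordinary least squares on a single invented predictor; real models would use many of the variables listed earlier) shows both the prediction and the deviation check in one place:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

def deviation_in_sigmas(xs, ys, x_new, y_actual):
    """How many residual standard deviations y_actual sits from the prediction."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    predicted = a + b * x_new
    return (y_actual - predicted) / sigma

# week number vs. sales, invented for illustration
xs = [1, 2, 3, 4, 5]
ys = [10.0, 12.1, 13.9, 16.2, 18.0]
print(deviation_in_sigmas(xs, ys, 6, 15.0))  # strongly negative: flag it
```

A result many sigmas below the prediction is exactly the kind of signal comps never surface.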

Power Up Through LSTMs

LSTMs are powerful models capable of tracking variables over time while avoiding much of the factorization problem. As recurrent networks, they predict based on patterns learned from past sequences of events.

These models take into account patterns over time and are influenced by events from previous days. They are useful in the same way as regression analysis but are also shaped by current results.

Because they can be trained incrementally, an LSTM can be built in chunks and updated as new data arrives, reducing maintenance while performance steadily improves.
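To make the "tracking variables over time" claim concrete, here is a single LSTM cell forward pass in NumPy; the weights are randomly initialized and the dimensions invented, so this is only a sketch of the mechanics (a real model would be trained with a framework such as Keras):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the input, recurrent, and bias
    parameters for the input (i), forget (f), output (o), and candidate (g)
    gates, stacked as 4 blocks of hidden_size rows."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:hidden])
    f = sigmoid(z[hidden:2 * hidden])
    o = sigmoid(z[2 * hidden:3 * hidden])
    g = np.tanh(z[3 * hidden:4 * hidden])
    c = f * c_prev + i * g          # cell state carries long-term memory
    h = o * np.tanh(c)              # hidden state is the step's output
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(size=(4 * hidden_size, input_size))
U = rng.normal(size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # a 5-step input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

The cell state `c` is what lets the model carry yesterday's conditions into today's prediction, which is precisely what comps and plain regression cannot do.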

Marketing Use Case as an Example

Predictive analytics and reporting are extremely useful in developing a marketing strategy, something often overlooked today.

By combining predictive algorithms with sales, promotions, and strategies, it is possible to ascertain whether using an algorithm had an actual impact. For instance, did a certain promotion generate more revenue or sales?

These questions, posed over time (more than 32 days would be best), can prove the effectiveness of a program. They can reveal where to advertise, how to advertise, and where to place the creative efforts of marketing and sales to best generate revenue.

When managers are given effective graphics and explanations for numbers based on these algorithms, they gain the power to determine optimal marketing plans. Remember, there is a reason business and marketing are considered at least a little scientific.


Comps suck. Stop using them to gauge success. They are illogical oddities from an era when money was easy and simple changes brought in revenue (pre-2008).

Companies should look to analytics and data science to drive sales and prove their results.


Murmurhash3 and the Bloom Filter

There is often a need to speed things up by testing membership with a Bloom filter backed by a fast, fairly unique hash. Bloom filters are arrays of bits to which objects are mapped using a set of hashing functions.

Using some knowledge from creating a hashing trick algorithm, I came across the idea of deploying murmurhash3 in Scala to improve performance and generate a fairly unique key. My immediate use case is to avoid collecting millions of duplicate pages from a large online scrape of a single source.

More information is available from this great blog article, as I am a bit too busy to get into details.
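Since the details live in the linked article, here is only a minimal Python sketch of the idea. Note the assumptions: hashlib's SHA-256 stands in for murmurhash3, the k bit positions are derived by double hashing from one digest, and the sizes are invented:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _indexes(self, item):
        # derive k bit positions via double hashing over a single digest
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present
        return all(self.bits[idx] for idx in self._indexes(item))

seen = BloomFilter()
url = "http://example.com/page1"
print(seen.might_contain(url))  # False
seen.add(url)
print(seen.might_contain(url))  # True
```

For the scraping case, a "possibly present" answer means skip the page; a false positive occasionally drops a unique page, but no duplicate ever gets through.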

Demystifying Sparse Matrix Generation for NLP

Hashing: it's simple, it's powerful, and when it is fairly unique it can cut time off generating sparse matrices from text. There are multiple ways to build a sparse matrix, but do you really know how scipy comes up with its fast solution, or how the powers at some of the big companies develop matrices for NLP? The simple hashing trick has proven immensely powerful and quite fast in aiding sparse matrix generation. Check out my github for a sample, where I also have C99 and text tiling tokenizer algorithms written.

The Sparse Matrix

Sparse matrices are great tools for data with large numbers of dimensions but many, many gaps. When an m x n matrix would become quite large (millions of data points or more) and yet coverage is much smaller (say even 50 percent), a sparse matrix can fit everything into memory and allow testing of new or improved algorithms without blowing memory or needing to deploy to a distributed framework. This is also useful when dealing with results from frameworks like Spark or Hadoop.

Sparse matrices are composed of index and data pairings. Two common forms are Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC) matrices, which store an index-pointer array marking where each vector (row or column) ends along with the data array those pointers reference.
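To make the layout concrete, here is a small sketch (plain Python, no scipy, with an invented 3x3 matrix) that builds the three CSR arrays from a dense matrix:

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) into CSR arrays:
    data (non-zero values), indices (their column positions), and
    indptr (where each row's slice of data/indices ends)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr

dense = [[5, 0, 0],
         [0, 0, 7],
         [0, 2, 0]]
print(to_csr(dense))  # ([5, 7, 2], [0, 2, 1], [0, 1, 2, 3])
```

Row i of the matrix is exactly `data[indptr[i]:indptr[i+1]]` paired with the matching slice of `indices`, which is why row slicing on a CSR matrix is so fast.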

Which Hash for NLP

Text hashes such as Horner's method have been around for some time. For a matrix, we want to limit collisions. That is no small feat, as hashes mapping a large input space to a smaller range are guaranteed to generate collisions eventually. Keeping the number of attributes large is helpful but not a complete solution. It is also important to generate a fairly unique hash.

In my TextPreprocessor software on github, I used murmurhash. There are 32 bit and 128 bit versions of the hash in the current murmurhash3 implementation available under scala.util.hashing, and even a 64 bit optimized version. Murmurhash, with its large number of bytes and its ability to affect the low end of the digit range, combined with filtering, helps generate a fairly unique hash. scikit-learn uses a similar variant.

Even murmurhash may not always work. A single-bit hashing function has proven effective in eliminating much of the damage from collisions. If the bit is 1, add the count; if not, subtract it. Using modulus or another function may prove useful, but testing is needed. In any case, the expected mean of each column is now 0, since there is a 50 percent chance the sign is positive and a 50 percent chance it is negative.
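A small sketch of this signed hashing trick (Python, with hashlib standing in for murmurhash; the 16-column width is invented and real systems use far more):

```python
import hashlib

def signed_feature_vector(words, num_columns=16):
    """Hash each word to a column; a second hash bit decides whether the
    count is added or subtracted, so colliding words tend to cancel out
    instead of inflating a column."""
    vec = [0] * num_columns
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        column = int.from_bytes(digest[:8], "big") % num_columns
        sign = 1 if digest[8] & 1 else -1
        vec[column] += sign
    return vec

vec = signed_feature_vector("the cat sat on the mat".split())
print(len(vec))  # 16
```

Because each collision is equally likely to add or subtract, the expected contribution of collisions to any column is zero, which is the property the paragraph above describes.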

Building with the Hashing Trick

Building the matrix is fairly straightforward. First, generate an index array and a row or column array.

Then, get the feature counts. Take in text, split it into sentences and then sentences into words, and for each line hash the words and count their instances. Each sentence is a vector. It is highly advisable to remove the most frequent words, by which I mean those with an occurrence 1.5 or more standard deviations beyond the typical occurrence, and to remove common words (stop words) like a, the, most conjunctions, and some others. Stemming is another powerful tool, which removes endings like -ing. It is possible to use the power of multiple cores by returning a map with hash-to-count key value pairs. From here, simply iterate line by line, adding an entry to the index array as array[n] = array[n - 1] + count if n > 0, or array[0] = count if n = 0, and appending the features to the data array. If we know the maximum number of features per line, it is possible to add the lines in a synchronized method. Again, this is all on my github.

val future = getFeatures(getSentences(doc).map({x => splitWords(x)})).flatMap({x => addToMatrix(x)})

The reason I mention splitting the task is that it can be faster, especially if using a tool like Akka. It would also be advisable to assign a number to each document and insert each row based on that number, to be able to track which English-language sentence it belongs to.
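As a hedged Python sketch of the whole pipeline (hashlib in place of murmurhash, a tiny invented corpus, and no stop-word removal or stemming), each sentence becomes one CSR row:

```python
import hashlib
from collections import Counter

NUM_FEATURES = 32  # invented; real systems use 2**18 or more columns

def hash_column(word):
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_FEATURES

def build_csr(sentences):
    """Each sentence becomes one row; words are hashed to columns and counted."""
    data, indices, indptr = [], [], [0]
    for sentence in sentences:
        counts = Counter(hash_column(w) for w in sentence.lower().split())
        for column in sorted(counts):
            indices.append(column)
            data.append(counts[column])
        indptr.append(len(data))
    return data, indices, indptr

data, indices, indptr = build_csr(["the cat sat", "the dog ran fast"])
print(indptr)  # row boundaries; indptr[-1] equals the total non-zero count
```

No vocabulary dictionary is ever built, which is the whole appeal of the hashing trick: the column for a word is computable from the word alone, on any core or machine.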

Thanks for sticking with me as I explore a simple yet potent concept.

Headless Testing and Scraping with Java FX

There is a lot of JavaScript in the world today, and there is a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool needs to exist that does not leak memory as much as Selenium. Until recently, Selenium was really the only option for webkit; JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools. These tools can be used to automate scraping and testing where network calls for HTML, JSON, CSVs, PDFs, or whatnot are more tedious and difficult.

The FX Package

FX is much better than the television channel, with some exceptions. Java created a sleeker version of Chromium based on webkit. While webkit suffers from some serious setbacks, JavaFX also incorporates nearly any part of the java.net framework. Setting SSL handlers, proxies, and the like works the same as with java.net. Therefore, FX can be used to intercept traffic (e.g. directly stream incoming images to a file named by URL without making more network calls), present a nifty front end controlled by JavaScript, and query for components.


Ui4j is every bit as nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on the command line or using system properties after jdk1.8.0_20. Oracle removed Monocle from the JDK after this release, forcing programs that use the server to OpenMonocle. However, xvfb-run -a works equally well. The -a option automatically chooses a server number. The github site does claim compatibility with Monocle, though.

On top of headless mode, the authors have made working with FX simple. Run JavaScript as needed, incorporate interceptors with ease, and avoid nasty waitFor calls and Selenese (an entire language within your existing language).


There is an alternative to Ui4j in TestFX. It is geared towards testing. Rather than using an Assert after calling or with ((String) page.executeScript("document.documentElement.innerHTML")), methods such as verifyThat exist. Combine with Scala and have a wonderfully compact day. The authors have also managed a workaround for the Monocle problem.

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies. Java, and Scala for that matter, set one proxy per JVM. Luckily, both Java and Scala have subprocess modules. The lovely, data-friendly language that is Scala makes this task as simple as Process("java -jar myjar.jar -p my:proxy").!. Simply run the command, which returns the exit status and blocks until complete (see Futures to make this a better, non-blocking version), and use tools like Scopt to get the proxy and set it in a new Browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don't pass it on the command line). RMI would probably be a bit better for large code, but it may be possible to better secure a file than compiled code using checksums.


Throw out Selenium, get rid of the extra Selenese parsing, and get Ui4j or TestFX for webkit testing. Sadly, it does not work with Gecko, so Chromium is needed to replace those tests and obtain such terrific options as --ignore-certificate-errors. There are cases where fonts in the SSL will wreak havoc before you can even handle the incoming text, no matter how low level you write your connections. For simple page pulls, stick to Apache HTTP Components, which contains a fairly fast, somewhat mid-tier-RAM asynchronous thread pool usable in Java or Scala. Sorry for the brevity, folks, but I tried to answer a question or two that was not in tutorials or documentation. Busy!

How Can the Tools of Big Data Detect Malicious Activity?

With Apple in the news, security becoming a large concern, and companies trying new ways to protect their online presence, finding malicious activity has become an exploding topic. Data science offers some deeper insights into just how to discover users with bad intentions before data is lost. This article deals with protecting an online presence.

Detection can go well beyond knowing when a bad credit card hits the system or a certain blocked IP Address attempts to access a website.

Similarity: The Neural Net or Cluster

The neural net has become an enormous topic. Today it is used to discern categories in fields ranging from biology to dating and even terrorist activity. Similarity-based algorithms have come into their own since their inception, largely in the Cold War intelligence game. Yet how different is finding political discussions in conversational data captured at the Soviet embassy, or discovering a sleeper cell in Berlin, from finding a hacker? Not terribly different at the procedural level, actually. Find the appropriate vectors, train the neural net or clustering algorithm, and try to find clusters representing those with an aim to steal your data. These are your state secrets. With Fuzzy C-Means, K-Means, and RBF neural nets, the line between good and bad doesn't even need to look like a middle school dance.

Here is just a sampling of the traits that could be used in similarity algorithms, which require shaping a vector to train on. Using them in conjunction with data taken from previous hacking attempts, it shouldn't be extremely difficult to flag the riff raff.

Traits that Can be Useful

Useful traits come in a variety of forms. They can be encoded as a 1 or 0 for a Boolean value such as known malicious IP (always block these). They could be a Levenshtein distance on that IP. Perhaps a frequency for number of requests per second is important. They may even be a probability or weight describing likelihood of belonging to one category or another based on content. Whichever they are, they should be informative to your case with an eye towards general trends.

  • Types of Items purchased : Are they trivial like a stick of gum?
  • Number of Pages Accessed while skipping a level of depth on a website : Do they attempt to skip pages despite a viewstate or a typical usage pattern?
  • Number of Malformed Requests : Are they sending bad headers?
  • Number of Each type of Error Sent from the Server : Are there a lot of malformed errors?
  • Frequency of Requests to your website : Does it look like a DoS attack?
  • Time spent on each Page : Is it too brief to be human?
  • Number of Recent Purchases : Perhaps they appear to be window shopping
  • Spam or another derived level usually sent from an IP address: Perhaps a common proxy is being used?
  • Validity or threat of a given email address : Is it a known spam address or even real?
  • Validity of user information : Do they seem real or do they live at 123 Main Street and are named Rhinoceros?
  • Frequencies of words used that Represent Code: Is the user always using the word var or curly braces and semi-colons?
  • Bayesian belonging to one category or another based on word frequencies: Are words appearing like var?
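A hedged sketch of turning a few of these traits into a vector and scoring it against known-good and known-bad centroids (every number here is invented; a real system would learn the centroids with K-Means or train a net on labeled past attempts):

```python
def to_vector(session):
    """Encode a handful of the traits above as floats in [0, 1]."""
    return [
        1.0 if session["known_bad_ip"] else 0.0,
        min(session["requests_per_second"] / 100.0, 1.0),
        min(session["malformed_requests"] / 10.0, 1.0),
        min(session["code_word_frequency"], 1.0),
    ]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# invented centroids, as if produced by clustering past traffic
BENIGN = [0.0, 0.05, 0.0, 0.01]
MALICIOUS = [0.4, 0.8, 0.6, 0.5]

def classify(session):
    v = to_vector(session)
    return "malicious" if distance(v, MALICIOUS) < distance(v, BENIGN) else "benign"

scraper = {"known_bad_ip": True, "requests_per_second": 90,
           "malformed_requests": 7, "code_word_frequency": 0.4}
print(classify(scraper))  # malicious
```

The nearest-centroid rule is the crudest member of the family described above; swapping in Fuzzy C-Means membership or an RBF net changes the decision boundary, not the shape of the pipeline.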

Traits that May not Be Useful

People looking for your data will try to appear normal, periodically probing your site or attempting an attack in one fell swoop. Some traits may be less informative. All traits depend on your particular activity. These traits may in fact be representative, but are likely not.

  • Commonality of User Name : Not extremely informative but good to study
  • Validity of user information: Perhaps your users actually value their secrecy and your plans to get to know them are ill-advised

Do not Immediately Discount Traits and Always Test

Not all traits that seem discountable are. Perhaps users value their privacy and provide fake credentials. However, what credentials are provided can be key. More often, such information could provide a slight degree of similarity with a bad cluster, or just enough of an edge in an activation function to tip the scales from good to bad or vice versa. A confusion matrix and test data should always be used to discern whether the traits you picked are actually informative.

Bayes, Cosines, and Text Content

Not all attacks can be detected by behaviour. Perhaps a vulnerability is already known. In this case, it is useful to look at Bayesian probabilities and perhaps cosine similarities. Even obfuscated code contains specific keywords. For example, variables in JavaScript are declared with var, most code languages use semi-colons, and obfuscated code is often a one-line mess. Bayesian probability states that the presence of one item, compared to the frequencies seen in various categories, yields a certain probability of belonging to each category.

If Bayes is failing, then perhaps similarity is useful. Words like e and var and characters such as ; or = may be more important in code.
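A minimal sketch of the cosine-similarity idea on token frequencies (pure Python; the "code profile" and sample token lists are invented):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two token-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

code_profile = "var ; = var ; { } function".split()
suspicious = "var x = 1 ; var y = 2 ;".split()
chatty = "hello I love this store so much".split()

print(cosine_similarity(code_profile, suspicious))  # clearly above zero
print(cosine_similarity(code_profile, chatty))      # no shared tokens: 0.0
```

Input that looks more like the code profile than like normal customer text is exactly the signal the paragraph above is after.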

Smoothing: When and Which?

Smoothing is everywhere. It is preprocessing for signal processing, it makes text segmentation work well, and it is used in a variety of programs to cull noise. However, there are a wide variety of ways to smooth data to achieve more appropriate predictions and lines of fit. This overview should help get you started.

When to Use Smoothing

The rather simple art of smoothing data is best performed when making predictions with temporal data from one or more potentially noisy sources (think of a weather station with a wind speed monitor that is a bit loose or only partially reliable) and when dealing with tasks such as converting digital sound to analogue waves for study. When a source appears capable of making decent predictions but is relatively noisy, smoothing helps. It is even used in fitting smoothing splines.

Types of Smoothing to Consider

Now that the rather blunt explanation is out of the way, the following list is based on my own use with text mining and some sound data. I have tried to include the best resources I could for these.

  • Rectangular or Triangular: Best used for non-temporal and fairly well fitting data where more than past concerns are important (text segmentation is an example).
  • Simple Exponential: Creates a moving average smoothing and considers past events. Is not a terrific smoothing tool but is quick and works well when data is correlated (could work well with sound which may be better with a Hamming or Hanning Window). Unlike double and triple exponential smoothing, the algorithm requires no past experience or discovery to do correctly, for better or worse.
  • Double and Triple Exponential Smoothing: Works well with time series data. Peaks and valleys are more preserved. Triple exponential Smoothing works well with seasonal data. They require some manual training or an algorithm relying on previous experience to generate an alpha value to perfect. Again, past events are more heavily weighted.
  • Hanning and Hamming Windows: Periodic data (wave forms) may work well with this type of smoothing. They are based on the cosine function. Past experience is not needed. For really noisy data, try the more intensive Savitzky-Golay filter.
  • Savitzky-Golay: This smoothing works fairly well and preserves peaks and valleys within the broader scope of the data. Savitzky-Golay is not ideal for a truly smooth curve. However, if some noise is really important, this is a great method. Its publication was actually considered one of the most important by Analytical Chemistry for spectral analysis. It uses a localized least squares technique to accomplish its feat.

    However, do not rely on the matrix-based calculation for OLS as the most efficient, since gradient descent is clearly the winner on large data sets. No self-respecting programmer will use a naive matrix implementation there. Spark contains an optimized gradient descent algorithm for distributed and even single-node programming. The algorithm is tail recursive and seeks to minimize a cost function.

  • Distribution Hint

    For the programmer or developer looking to distribute a Savitzky-Golay calculation without Spark gradient descent, mapPartitions works well on the local areas. It also works well when smoothing many splines or for the Hanning and Hamming Window based smoothing.

    Shooting the Gap Statistic with Map Reduce

    Finding the right k in KMeans, n in LSA, n in LSI, or whichever grouping algorithm you use is fuzzy, especially with text. However, Tibshirani, Walther, and Hastie developed a superb algorithm to deal with this, much better than finding an inflection point in your data, as it handles abnormalities and ensures a proper distribution.

    An easier understanding of the algorithm can be found through a blog post from the Data Science Lab.

    This algorithm is intensive but, luckily, map-reduce exists. Splitting the procedure among a large number of machines is most helpful. Python and Scala offer map-reduce and concurrent frameworks that can handle fairly significant amounts of data either directly or via a framework such as Celery.

    Scala's callback mechanism is a personal favorite of mine, although the tools for experimentation are not as readily available as they are with Python. Tools such as gensim and scikit-learn are readily available and easy to test in Python.

    Using map-reduce for this algorithm is almost a necessity for anything more than a small data set, especially for text data. However, unlike the Pham et al. algorithm, Tibshirani's does not rely on the previous output at all. In this way it can be spread over a large number of machines to obtain a comparable result (use Pham for smaller datasets, such as fewer than 30,000 text documents that fit on a single machine).

    The following is an example of the fastest way I ran this algorithm using Python. Whereas a single-threaded version was running 20 tests in 10 minutes, the following tests were running 10.4 tests per minute on a 3.5 GHz Core i7 with 15 GB of RAM.

    The proper tradeoffs must be made between speed and memory.

    Firstly, Python offers a multiprocessing version of map. The map function is applied using a multiprocessing pool which is easy to establish.

    from multiprocessing import Pool
    import psutil


    You will need to fiddle with an equation that works with Pool’s processors function. It may not be the best for your data set.

    This pool can be used (please read the Data Science Lab article first) to find Wk, run multiple Kmeans and split up among other processes.
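A minimal sketch of the pool setup (the worker function and process count here are stand-ins; the real code maps performTest and performWK over candidate clusterings):

```python
from multiprocessing import Pool, cpu_count

def expensive_step(x):
    # stand-in for one reference-set test in the gap statistic
    return x * x

def run_parallel(values):
    # one worker per core; tune the count for your data set and RAM
    with Pool(cpu_count()) as pool:
        return pool.map(expensive_step, values)

if __name__ == "__main__":
    print(run_parallel([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

Because each reference-set test in the gap statistic is independent, pool.map distributes them with no coordination beyond collecting the results.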

    Since I am text mining, my Wk and distance functions now become the following. I will not be releasing my text based KMeans here, sorry but note that it involves using a special function for finding clusters based on the dissimilarity rating.

    def Wk(X):
        if X.shape[0]>1:
                mp=map(lambda x: float(mul*angulareDistance(x[0].todense(),mvect)),X)
        elif X.shape[0] == 1:
        return res
    def calculateQueuedCosines(x):
        for j in xrange(len(x[1])):
        return d.index(max(d))
    def performWK(X):
        """This function performs WK for a map and requires an input matrix, predictions from KMeans, and a k value."""
        for currk in range(X[2]):
            for i in range(X[0].shape[0]):
                if preds[i] == currk:
                    if set is None:     
                if set is not None:
                    if set.shape[0]>1:
                            mp=map(lambda x: float(angulareDistance(x[0].todense(),mvect)),set)
                    elif set.shape[0] == 1:
            except Exception,e:
                print str(e)
            del gc.garbage[:]
        return results
    def performTest(X):
        """For performing tests in separate processes.
        Requires a tuple with"""
        return preds
    def mapWKResults(X):
        return numpy.log(sum(numpy.nan_to_num(X)))

    I then use these functions to split up intense processes among multiple cores. Keep in mind that this is looking to work with text data. The distance formula becomes the angular distance formula with cosines (1 - 2*acos(cosine(a, b))/pi).

    def discoverCategories(self,file,refs=20,maxTime=600,ks=range(1,500),mx=20):
            """A 1 to n estimator of the best categories using cosines, to be used as a backup.
            This is a sci-kit learn/python/scipy learning experience as well.
            Reference: https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/"""
            vectorizer=CountVectorizer(analyzer='word', binary=False, decode_error='strict',dtype=numpy.int64, encoding='utf-8', input='content',lowercase=True, max_df=1.0, max_features=None, min_df=1,ngram_range=(1, 1), preprocessor=None, stop_words=None,strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',tokenizer=None, vocabulary=None)
            logging.info("Starting KMN "+datetime.datetime.fromtimestamp(time.time()).strftime("%m/%d/%Y"))
            with open(file,'rb') as fp:
            if len(lines)>0:
                del tfidf
                del vectorizer
                del lines
                del gc.garbage[:]
                (ymin, xmin), (ymax, xmax)=self.bounding_box(mtx)
                for k in ks:
                    #get the overall WK
                    print "Testing at Cluster "+str(k)
                    for currk in range(k):
                        for i in range(mtx.shape[0]):
                            if preds[i]==currk:
                                if set is None:     
                        if set.shape[0]>0:
                    del set
                    del gc.garbage[:]
                    #generate individual sets and calculate Gap(k)
                    for i in range(refs):
                        print "Setting Up Test "+str(i)
                        if len(mres) == mx or (i+1 == refs and len(mres)>0):
                            print "Performing Async Tests" #if we were to create our own distributed framework with pyHFS,asyncore, and a db
                            preds=pool.map(performTest,[[k,x] for x in mres])
                            res=pool.map(performWK,[[mres[j],preds[j],k] for j in range(len(mres))])
                            del gc.garbage[:]
                        del ref
                        del gc.garbage[:]

    The results take a while, but there is a basic rule based on non-zero values in a matrix for text data. This can be used to determine the maximum number of clusters to run, ensuring some buffer room.

      import numpy

    The code splits up the most intense function into map and reduce tasks. Map-reduce could be used in each case (please see the sum functions attached to the final result).

    preds=pool.map(performTest,[[k,x] for x in mres])
    res=pool.map(performWK,[[mres[j],preds[j],k] for j in range(len(mres))])

    These are the KMeans functions, the reduction across all vectors, and a slight performance boost from performing the reduction on the result to obtain our Wk value for the overall equation.

    Ideally, this process can be used to find the best number of categories or groups in a variety of algorithms such as fuzzy clustering with membership matrices or LSA.

    Installing Hadoop on Windows 8.1 with Visual Studio 2010 Professional

    Looking for a matrix system that works like gensim in Python, I discovered Mahout. Wanting to test this against the Universal Java Matrix Package, I decided to give the install a try. That, unfortunately, was a long, side-tracked road going well beyond the requirements listed in the install file.

    In the end, I found a way that did not require Cygwin and went smoothly without requiring building packages in the Visual Studio IDE.

    Installation instructions follow.

    It is possible to download the hadoop install from a variety of mirrors.

    The x64 Platform

    Before starting, understand that hadoop builds as an x64 system. x86-64 works as well, and using 32-bit installations of CMake and the JDK will not harm the project. However, it is a 64-bit program and requires a Visual Studio 10 2010 Win64 generator to compile the hdfs project files.

    Uninstall Visual Studio 2010 Express and Distributables
    Visual Studio 2010 Express uses a C++ distributable that will cause the command prompt for the Windows SDK to fail and will also conflict with some of the build using the Visual Studio Command Prompt.


    The following requirements are necessary in no particular order.

    1. Microsoft Visual Studio 2010 Professional with C++
    2. Install the .Net 4.0 framework
    3. Zlib
    4. Most recent Maven
    5. MSBuild
    6. CMake
    7. Protoc
    8. Java JDK 1.7

    Path Variables

    The following must be in your path. The only ordering concern: if you have Cygwin and wish to use it, place MS Visual Studio before Cygwin to shadow a copy of cmake that will not work for this task. It is better to just delete cmake from Cygwin and use the Windows version if this is the path you choose.

    1. MSBuild
    2. Cmake
    3. Visual Studio 2010
    4. Zlib
    5. protoc
    6. java

    Environment Variables

    The following can be set for an individual instance of command prompt.

    1. JAVA_HOME=path to jdk
    2. M2_HOME=path to maven
    3. VCTargetsPath=set to MSBuild/Microsoft.CPP/4.0 or other valid path to the CPP properties file
    4. Platform=x64

    Run the Build

    Open up a Visual Studio 2010 Win 64 command prompt and type the following command.

    mvn package -Pdist,native-win -DskipTests -Dtar

    Resulting Files

    The following files should appear in your unzipped hadoop folder under hadoop-dist/target.

    1. hadoop-2.6.X.tar
    2. hadoop-dist-2.6.X.jar
    3. dist-layout-stitching.sh
    4. dist-tar-stitching.sh

    Special Thanks

    Special thanks to the IT admin and security professional contractor at Hygenics Data LLC for the copy of Microsoft Visual Studio 2010.

    Happy hadooping or being a Mahout.

    Morning Joe: Categorizing Text, Which Algorithms are Best in What Situations

    There is a ton of buzz around data mining and, as always, many new names being injected into the same topics, showing a lack of study. While the buzz exists, knowing when to deploy an algorithm can be tricky. Based on a recent deep dive into the subject and a dire need to program these algorithms in Java, I present a brief overview of tools, with benchmarks and examples to hopefully follow later. Basic concepts are presented first, followed by some algorithms that I have not really honed yet (benchmarks are not feasible for now but will be soon).

    To read this thoroughly, I highly recommend following the links. This is a starting point and a summary of what I have found so far. Basically, the goal is to help you avoid the hours of stumbling it takes to find information on each method by reading this document instead.

    A word of caution: effective matrix based solutions use a large amount of data. Other fuzzy algorithms exist for discovering relatability between small sets of items. For strings, there is distance matching such as Jaro-Winkler or Levenshtein, with rule based comparisons and lookup tables to minimize error (say between Adama and Obama). Statistics can enhance this process if there is a need to take the best rating. Train a model to test the hypothesis that the distance between two entities makes them the same against the null hypothesis that it does not, after filtering out some common issues.
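    To make the distance-matching idea concrete, here is a minimal Levenshtein edit distance sketch (a toy implementation, not a tuned library) applied to the Adama/Obama example above:

    ```python
    def levenshtein(a, b):
        """Edit distance: the minimum number of insertions, deletions,
        and substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                ))
            prev = curr
        return prev[-1]

    print(levenshtein("Adama", "Obama"))  # 2: just two substitutions apart
    ```

    A rule based comparison might flag pairs like this, where the raw distance is small but the entities are clearly different, for a lookup table.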

    The Matrix

    Linear algebra is a critical foundation of text mining. When we have different documents or images, each document or image is often considered to form an equation. These equations can then be arranged in a matrix, a simple and easy way to avoid thinking in really complex terms.

    You may have heard of the vaunted differential equation. If not, some reading is in order from a site I used in college when there was not time for the book. The basis of a large portion of differential equations can be written as a matrix. This is important due to the eigenvector and eigenvalue; these concepts are crucial for solving matrices to find equations that explain a set of models. Drexel University's eigenfaces tutorial provides a fairly solid understanding of the way a matrix is used in most text mining. However, similarity ratings are used to compare documents rather than a covariance matrix for most tasks.

    The end result of studying these methods is the ability to look under the hood at today’s hottest text mining technologies.

    The Vector Space and TF-IDF

    Understanding vectors and vector operations is another crucial step in understanding the mining process. A basic vector is a set of points representing a position in a plane. Vectors can be added, subtracted, multiplied, and, most importantly, stuffed into a matrix where their points can be used with basic linear algebra to find relationships.

    Vectors have magnitude and direction. The angles and distances between them can be compared. Note that, while data loss may be present in finding the right magnitude and direction, the units used should be the same (it would be a terrible idea to think in terms of, say, millisecond-meter-document-ice cream cone units); this provides a sane basis for comparing data. It is up to the miner to choose the most representative points for use in the vector.

    In text mining, the term frequency-inverse document frequency (tf-idf) rating is used in many commercial algorithms, including search engines. If the name is not enough, it is basically a rating that weighs how often a term appears in one document against how many documents contain it. It works best on more than one document, and a 0.5 offset on term frequency helps dampen the effect of large documents. Inverse document frequency uses a logarithm to shrink the weight of terms that appear in nearly every document.

    Multiply the following equations together to find the result, as described by Wikipedia:

    tf(t, d) = 0.5 + 0.5 * f(t, d) / max{ f(t', d) : t' in d }
    idf(t, D) = log( N / |{ d in D : t in d }| )
    tfidf(t, d, D) = tf(t, d) * idf(t, D)
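    A quick sketch of the augmented tf-idf formulas in Python (the three documents below are made up for illustration; this is a toy, not a production weighting scheme):

    ```python
    import math

    def tf(term, doc):
        # Augmented term frequency with the 0.5 offset to dampen long documents.
        counts = {t: doc.count(t) for t in set(doc)}
        return 0.5 + 0.5 * counts.get(term, 0) / max(counts.values())

    def idf(term, docs):
        # Log-dampened inverse document frequency.
        containing = sum(1 for d in docs if term in d)
        return math.log(len(docs) / containing) if containing else 0.0

    docs = [
        "the cat sat on the mat".split(),
        "the dog ate the cat".split(),
        "dogs and cats are pets".split(),
    ]
    score = tf("cat", docs[0]) * idf("cat", docs)
    print(round(score, 3))  # 0.304
    ```

    Note how "the", which appears in most documents, would receive a much lower idf than "cat".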

    Similarity Ratings

    No matter what you do, similarity ratings are the key to making the process work. There are several that can be used. If the data can be represented fairly well, covariance is an option. However, text data is not well suited to covariance. This is due to varying styles that represent the same object and, most importantly, issues with quantization. Natural language is naturally fuzzy. Therefore, cosine similarity usually offers a much better solution.

    The cosine similarity equation takes the dot product of two vectors and divides it by the product of their norms. It follows from vector algebra: the result is the cosine of the angle between the vectors, representing the 'degree' of similarity. This can be used for comparison.
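    The formula as a minimal sketch, here applied to two invented term-count vectors over a shared vocabulary:

    ```python
    import math

    def cosine(u, v):
        """Cosine similarity: dot product over the product of the norms."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Term counts for two short documents over the same four-word vocabulary.
    doc_a = [2, 1, 0, 1]
    doc_b = [1, 1, 1, 0]
    print(round(cosine(doc_a, doc_b), 3))  # 0.707
    ```

    A value of 1 means the documents point in the same direction (identical term proportions); 0 means they share no terms at all.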

    Word Net, Disambiguation, and Stemming

    The processes of disambiguation and stemming are crucial to text mining. There are many sentence processing methods, as NLTK shows. At their core are WordNet and other dictionaries. WordNet is a freely available graph of an English dictionary. Most tools work with WordNet for finding root words, disambiguation, and cleaning.

    Part of speech (POS) tagging is involved in both disambiguation and stemming. Maximum entropy models are used to discover a part of speech based on common usage.

    Disambiguation attempts to resolve words with multiple meanings to their most probable meaning. The worst-performing algorithm is the original Lesk, which relies only on WordNet; accuracy hovers around 50 percent. Simplified Lesk achieves better results. Lesk finds overlapping words and frequencies to determine the best synonym to replace an ambiguous word. The better algorithms use clustering and Bayesian methods for word sense discovery. Cosines may be used to improve Lesk as well.
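    A rough sketch of the simplified Lesk idea: count the overlap between each sense's dictionary gloss and the word's surrounding context. The two-sense gloss dictionary below is invented for illustration; a real system would pull glosses from WordNet.

    ```python
    def simplified_lesk(word, context, glosses):
        """Pick the sense whose gloss shares the most words with the context."""
        context_words = set(context.lower().split())
        best_sense, best_overlap = None, -1
        for sense, gloss in glosses[word].items():
            overlap = len(context_words & set(gloss.lower().split()))
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    # Hypothetical two-sense entry for "bank".
    glosses = {"bank": {
        "finance": "an institution that accepts deposits and lends money",
        "river":   "sloping land beside a body of water such as a river",
    }}
    print(simplified_lesk("bank", "he sat on the bank of the river fishing",
                          glosses))  # river
    ```

    Replacing the raw overlap count with a cosine over gloss and context vectors is one way to improve on this baseline.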

    Stemming reduces words to their roots. Most WordNet tools use existing classifications with POS tagging to achieve this result.

    A Note on Regression Models

    Let's be certain: prediction is not well suited to categorization. Changes in word choice across a large number of documents and decisions on importance do not always mean the same thing. Therefore, regression models tend to work poorly. The data is not likely to be continuous either. Think of writing like a magnetic field with eddy currents. Predicting the effect of an encounter with these currents is really, really difficult. Basically, run into an eddy current and you are going to have a really, really bad day. That is not to say that an equation cannot be created that fits most of the data with respect to the location of a point, basically a differential equation. It will likely not be generic and will be incredibly difficult to find.

    Regression works well on continuous and more natural events.

    Classification Tree and a Random Forest

    Another often poor performer in categorization of text data is the classification tree. Trees are only as good as the number of rules you are willing to create. However, they may be combined with multinomial Bayes for writing that is uniform and professional (say a legal document) to achieve some success. They are particularly useful after filtering data using LSA/HDP or multinomial Bayes, with decisions that work like a Bayesian model when thinking about the bigger picture.

    Basically, a classification tree uses probabilities within groupings to ascertain an outcome, moving down to the appropriate left or right child node based on a yes or no answer to the question 'do you belong?'

    This process works well with defined data when there is a good degree of knowledge about a subject (say, gene mapping), but text mining often uses fuzzy data with multiple possible meanings, and disambiguation is not entirely accurate; Lesk's original algorithm only achieved 50 percent accuracy, and an LSA model hovers between 80 and 90 percent. Quality can be improved with multiple trees or possibly by training on an extremely large set using cosines instead of raw frequencies.

    There are multiple methods for tree creation; two are bagging and random forests. Bagging takes multiple trees and averages the probabilities for decisions, using this average for their respective nodes. Random forests take random subsets of features, find probabilities based on them, and select the strongest predictor for a node. The latter approach is best with a much larger set of known features. The number of features considered at each split is usually the square root of the total number of features.
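    The bagging step amounts to averaging per-tree probabilities and taking the strongest class. A toy sketch, with the per-tree probability outputs invented for illustration:

    ```python
    def bag_predict(tree_probs):
        """Average class-probability dicts from several trees, pick the max."""
        classes = tree_probs[0].keys()
        avg = {c: sum(p[c] for p in tree_probs) / len(tree_probs)
               for c in classes}
        return max(avg, key=avg.get), avg

    # Hypothetical outputs from three trees classifying a legal document.
    trees = [
        {"legal": 0.7, "other": 0.3},
        {"legal": 0.4, "other": 0.6},
        {"legal": 0.9, "other": 0.1},
    ]
    label, avg = bag_predict(trees)
    print(label, round(avg["legal"], 3))  # legal 0.667
    ```

    Notice that the middle tree alone would have misclassified the document; the averaged ensemble does not.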

    Again, the features must be known and fairly homogeneous. Text data is often not.

    Multinomial Bayesian Classifier

    Multinomial Bayesian classification is a method that classifies data based on the frequency of words in different categories and their probabilities of occurrence. It is fairly straightforward: find the frequencies, or train a set of frequencies, word by word or gram by gram (a gram being an n-pairing, and thus an n-gram, of words), find probabilities by sentence, and take the best one.

    MNB works well when writing differs starkly, say with subject matters that differ greatly. It is good for tasks such as separating spam from policies and code in html data when large amounts of training data are present.
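    A minimal sketch of the mechanics with Laplace smoothing. Everything below, including the toy training sentences, is invented for illustration; a real system would train on far more text:

    ```python
    import math
    from collections import Counter, defaultdict

    class MultinomialNB:
        def fit(self, docs, labels):
            self.word_counts = defaultdict(Counter)  # per-class word frequencies
            self.class_counts = Counter(labels)
            for doc, label in zip(docs, labels):
                self.word_counts[label].update(doc.split())
            self.vocab = {w for c in self.word_counts.values() for w in c}
            return self

        def predict(self, doc):
            scores = {}
            total_docs = sum(self.class_counts.values())
            for label in self.class_counts:
                total = sum(self.word_counts[label].values())
                # Log prior plus smoothed log likelihood of each word.
                score = math.log(self.class_counts[label] / total_docs)
                for word in doc.split():
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total + len(self.vocab)))
                scores[label] = score
            return max(scores, key=scores.get)

    model = MultinomialNB().fit(
        ["win free money now", "claim your free prize",
         "meeting agenda for monday"],
        ["spam", "spam", "work"],
    )
    print(model.predict("free money prize"))  # spam
    ```

    The add-one (Laplace) smoothing keeps unseen words from zeroing out a whole class, which matters with the small training sets retail and scraping projects often start with.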

    Clustering with LSA or HDP

    Clustering works well when something is known about the data but categorization is not well done manually. Most algorithms avoid affinity propagation, which usually uses the square root of total inputs as the number of clusters anyway. Matrices are used heavily here, as eigenvalues and eigenvectors derive an equation that can be used to find relatability between documents.

    LSA uses raw frequencies, or more effectively cosines, in the same manner as eigenfaces to compare vectors. The end result, however, is an equation representing a category. By matrix inversion and multiplication, all elements are compared. In this case each ij entry in the matrix is a cosine or frequency of a word in a document. HDP (hierarchical Dirichlet process) is similar but attempts to learn more about the results and improve on the process. It takes much longer than LSA and is experimental.
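    For reference, the decomposition at the heart of LSA is the truncated singular value decomposition of that term-document matrix:

    ```latex
    X \approx U_k \, \Sigma_k \, V_k^{T}
    ```

    Here X is the m-by-n matrix whose ij entry is the frequency or cosine weight above, U_k holds the top k left singular vectors (term directions), Sigma_k the top k singular values, and V_k^T the document coordinates in the reduced space. Documents are then compared with cosines between columns of Sigma_k V_k^T rather than in the full, noisy vocabulary space.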

    If trying to discover new information about text or trying to find the best fit of a number of categories, these methods are useful.

    Maximum Entropy Models

    Maximum entropy models work well on heterogeneous data in a manner similar to Bayes. Gensim's sentence tagger classifies sentences from non-sentences in this way. The models find entropy using the maxent principle, which uses frequencies and likelihoods of occurrence to find outcomes. It works quite well with the correct training sets.

    If conditional independence is not assumable and nothing is known about a set, this model is useful. Categories should be known beforehand.





    A New Project: Is a Distance Based Regular Expression Method Feasible?

    So, I would like to find specific data from scraped web pages, PDFs, and just about anything under the sun without taking a lot of time. After looking over various fuzzy logic algorithms such as Jaro-Winkler, Metaphone, and Levenshtein and finding that no single one had an incredibly wide application, I decided that developing a regular expression based distance algorithm may be more feasible.

    The idea is simple, start with a regular expression, build a probability distribution across a good and known data set or multiple data sets, and test for the appropriate expression across every web page. The best score across multiple columns would be the clear winner in this case.

    Building out the expression would include taking known good data and finding a combination between the base pattern and the data that works, or building an entirely new one. Patterns that appear across a large proportion of the set should be combined. If [A-Z]+[\s]+[0-9]+[A-Z] and [A-Z]+[\s]+[0-9]+ appear often in the same or an equivalent place, or even [a-z]+[\s]+[0-9]+, then the combined pattern should likely be [A-Z\s0-9a-z]+, if the set is similarly structured. Since the goal is to save time in programming regular expressions to further parse XPath or other regular expression results, this is useful.
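    A very rough sketch of the merging step described above. Everything here, including the positional merge rule and its restriction to simple sequences of character classes, is a hypothetical starting point rather than the finished algorithm:

    ```python
    import re

    def merge_patterns(patterns):
        """Union the character classes of patterns that match in the same place.
        Assumes each pattern is a plain sequence of [...]+ character classes."""
        split = [re.findall(r"\[([^\]]+)\]\+", p) for p in patterns]
        if len({len(s) for s in split}) != 1:
            raise ValueError("patterns must have the same number of classes")
        merged = []
        for classes in zip(*split):
            # Keep each distinct class seen at this position, in order.
            merged.append("[" + "".join(dict.fromkeys(classes)) + "]+")
        return "".join(merged)

    pats = [r"[A-Z]+[\s]+[0-9]+", r"[a-z]+[\s]+[0-9]+"]
    print(merge_patterns(pats))  # [A-Za-z]+[\s]+[0-9]+
    ```

    A real version would also need to decide when two patterns occupy an "equivalent place" and when a merge is too permissive, which is exactly where the similarity score below comes in.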

    The tricky part of the project will be designing a similarity score that adequately compares the expressions without too many outliers. Whether this is done with a simple difference test resulting in a statistical distribution or a straightforward score needs to be tested.

    In all likelihood, re-occurring words should be used to break ties or bolster weak scores.

    The new project will hopefully be available on Source Forge for data integration and pre-curation processes.