Demystifying Sparse Matrix Generation for NLP

Hashing is simple, it is powerful, and when the hashes it produces are close to unique, it can shave serious time off generating sparse matrices from text. There are multiple ways to build a sparse matrix, but do you really know how scipy arrives at its fast solution, or how engineers at some of the big companies build matrices for NLP? The simple hashing trick has proven immensely powerful and quite fast for sparse matrix generation. Check out my github for a sample, where I also have the C99 and TextTiling tokenizer algorithms written.

The Sparse Matrix

Sparse matrices are great tools for data with a large number of dimensions but many, many gaps. When an m x n matrix would grow quite large (millions of data points or more) and yet its fill rate is much smaller (say even 50 percent), a sparse matrix can fit everything into memory and allow testing of new or improved algorithms without blowing memory or needing to deploy to a distributed framework. This is also useful when dealing with results from frameworks like Spark or Hadoop.

Sparse matrices are composed of index and data pairings. Two common forms are the Compressed Sparse Row (CSR) matrix and the Compressed Sparse Column (CSC) matrix. Both store a pointer array marking where each vector (row or column) ends, an indices array giving each entry's position within its vector, and the data array holding the values they point to.
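To make that layout concrete, here is a minimal pure-Python sketch of converting a dense matrix into the three CSR arrays. The names mirror what `scipy.sparse.csr_matrix` exposes as `indptr`, `indices`, and `data`; the helper itself is just for illustration.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix into CSR's three arrays.
    The slice indptr[i]:indptr[i+1] delimits row i within indices/data."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                indices.append(j)  # column position of this nonzero
                data.append(value)
        indptr.append(len(data))   # cumulative count of nonzeros so far
    return indptr, indices, data

m = [[0, 2, 0],
     [3, 0, 1],
     [0, 0, 0]]
print(to_csr(m))  # ([0, 1, 3, 3], [1, 0, 2], [2, 3, 1])
```

Note that the all-zero third row costs nothing beyond one repeated pointer entry, which is exactly why the format wins on gappy data.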

Which Hash for NLP

Text hashes such as Horner's method have been around for some time. For a matrix, we want to limit collisions. That is no small feat: any hash that maps a large vocabulary into a smaller range is guaranteed by the pigeonhole principle to produce collisions eventually. Keeping the number of attributes (hash buckets) large is helpful but not a complete solution. It is also important to use a hash whose outputs are spread fairly uniformly.

In my TextPreprocessor software on github, I used murmurhash. The current MurmurHash3 comes in 32 bit and 128 bit versions, including a 64 bit optimized variant, and a 32 bit implementation ships with Scala under scala.util.hashing. Murmurhash, with its wide output range and strong mixing that reaches even the low end of the digit range, combined with filtering, helps generate a fairly unique hash. scikit-learn uses a similar variant in its feature hashing.
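For readers who want to see what the algorithm actually does, here is a pure-Python sketch of the 32-bit (x86) MurmurHash3 variant, following the published reference design. In practice you would call `scala.util.hashing.MurmurHash3` or scikit-learn's built-in rather than rolling your own; this is only to show the block/tail/finalization structure.

```python
def murmurhash3_32(data: bytes, seed: int = 0) -> int:
    """Sketch of MurmurHash3's 32-bit x86 variant (reference constants)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data) & ~3  # bytes covered by whole 4-byte blocks
    # Body: mix each 4-byte little-endian block into the state.
    for i in range(0, n, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotate left 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotate left 13
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: fold in the remaining 1-3 bytes, if any.
    k, tail = 0, data[n:]
    if len(tail) == 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization: avalanche so nearby inputs land far apart.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```

The finalization ("avalanche") step is what gives the low-order bits the mixing quality the post relies on when taking a modulus for bucket selection.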

Even murmurhash may not always suffice. A single-bit signed hashing trick has proven effective in cancelling out much of the damage from collisions: take one bit of the hash, and if it is 1, add 1; if not, subtract 1. Using a modulus or another function may prove useful, but testing is needed. In any case, the expected mean of each column is now 0, since any given token has a 50 percent chance of incrementing and a 50 percent chance of decrementing.
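A minimal sketch of that sign trick, with `zlib.crc32` standing in for murmurhash just to keep the example stdlib-only (the function names here are hypothetical, not from any library):

```python
import zlib

def signed_bucket(word, n_features=1024):
    """One hash supplies both the column and a +/-1 sign.
    crc32 stands in for murmurhash in this sketch."""
    h = zlib.crc32(word.encode())
    bucket = (h >> 1) % n_features  # upper bits pick the column
    sign = 1 if h & 1 else -1       # lowest bit decides add vs subtract
    return bucket, sign

def signed_counts(words, n_features=1024):
    """Accumulate signed counts; colliding tokens cancel in expectation."""
    vec = {}
    for w in words:
        b, s = signed_bucket(w, n_features)
        vec[b] = vec.get(b, 0) + s
    return vec
```

Because the sign bit is independent of the bucket choice, two colliding tokens are as likely to cancel as to reinforce, which is what drives each column's expected mean to zero.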

Building with the Hashing Trick

Building the matrix is fairly straightforward. First, generate an index (pointer) array and a row or column array.

Then, get the feature counts. Take in text, split it into sentences and each sentence into words, and for each sentence hash the words and count their occurrences. Each sentence becomes a vector. It is highly advisable to remove the most frequent words, by which I mean those whose occurrence is 1.5 or more standard deviations above the typical occurrence, and to remove common words (stop words) like a, the, most conjunctions, and some others. Stemming is another powerful tool, which strips endings like -ing so variants of a word share a feature. You can harness multiple cores by returning a map of hash-to-count key value pairs per sentence. From here, simply iterate line by line, append the features to the data array, and add an entry to the index array as array[n - 1] + count if n > 0, or array[0] = count if n = 0. If we know the maximum number of features per line, it is possible to add the lines in a synchronized method. Again, this is all on my github.
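The steps above can be sketched end to end in a few lines. This is not the author's actual implementation: `zlib.crc32` stands in for murmurhash, the function name is hypothetical, and stop-word and frequency filtering are omitted for brevity.

```python
import zlib

def build_hashed_csr(sentences, n_features=16):
    """Hashing-trick CSR builder: each sentence becomes one row whose
    columns are hash buckets and whose values are token counts."""
    indptr, indices, data = [0], [], []
    for sentence in sentences:
        counts = {}
        for word in sentence.lower().split():
            bucket = zlib.crc32(word.encode()) % n_features
            counts[bucket] = counts.get(bucket, 0) + 1
        for bucket in sorted(counts):
            indices.append(bucket)
            data.append(counts[bucket])
        # Cumulative pointer: previous end plus this row's feature count.
        indptr.append(len(indices))
    return indptr, indices, data
```

The per-sentence `counts` map is exactly the hash-to-count pairing mentioned above, so each sentence could be counted on a separate core and the rows merged in order afterwards.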

val future = getFeatures(getSentences(doc).map(splitWords)).flatMap(addToMatrix)

The reason I mention splitting the task is that it can be faster, especially with a tool like Akka. It is also advisable to assign a number to each document and insert each row under that number, so you can trace any row back to the English sentence it came from.

Thanks for sticking with me as I explore a simple yet potent concept.