Big data is the rage, distribution is the rage, and so too is the growth of streaming data. The faster a company is, the better. Such speed requires, no, demands, solid matrix performance. Worse yet, big data is inherently sparse, and testing and implementing new algorithms requires sparse matrices (CSR, CSC, COO, and the like). Sadly, Java is not up to the task.
Let’s revisit some facts. Java is faster than Python at its core. Many tasks require looping over data in ways Numpy or Scipy simply do not support. A recent benchmark on Python3 v. Java highlights this. Worse, Python2 and Python3 both use the global interpreter lock (GIL), which often makes attempts at speed through threading slower than single-threaded code and forces developers to use the dreaded multiprocessing (large, clunky programs using only as many processes as cores). Still, multiprogramming and other modern operating system concepts are helping Python achieve better, albeit still quite slow, speeds.
That said, Numpy and Scipy are the opposite of all of this. They run on the slower CPython interpreter, but their cores are written in C, perform blazingly fast, and leave all Java matrix libraries in the dust. In fact, in attempting to implement some basic machine learning tasks, I found myself not just writing things like text tiling, which I fully expected to do, but also starting down the path of creating a full-fledged sparse matrix library along with a hashing library.
The following is the sad state of my Matrix tests.
The following libraries were used in the test:
The Cosines Test
An intensive test of a common use case is the calculation of the dot product (a dot b, or a * b.T in matrix form). Dividing this result by norm(a)*norm(b) yields the cosine of theta, the angle between a and b.
This simple test includes multiplication, transposing, and mapping division across all active values.
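As a rough illustration of the three steps named above (multiplication, transposing, and dividing by the norms), here is a minimal sketch in Scipy. The function name row_cosines and the tiny example matrix are hypothetical and only for illustration. Note one assumption: this version normalizes row by row, so each entry is the cosine between two rows, whereas the benchmark snippet in this post divides by a single norm for the whole matrix.

```python
import numpy as np
import scipy.sparse as sparse


def row_cosines(mat):
    """Pairwise cosine similarity between the rows of a sparse matrix."""
    mat = sparse.csr_matrix(mat, dtype=np.float64)
    # L2 norm of each row: square root of the sum of squared entries.
    norms = np.sqrt(np.asarray(mat.multiply(mat).sum(axis=1)).ravel())
    # Leave all-zero rows untouched instead of dividing by zero.
    norms[norms == 0.0] = 1.0
    # Scale every row to unit length, then multiply by the transpose.
    unit = sparse.diags(1.0 / norms).dot(mat)
    return unit.dot(unit.T)


# Hypothetical 3 x 3 example: rows 0 and 2 are parallel, row 1 is orthogonal.
a = sparse.csr_matrix([[1.0, 0.0, 2.0],
                       [0.0, 3.0, 0.0],
                       [2.0, 0.0, 4.0]])
cos = row_cosines(a).toarray()
print(cos)
```

Scaling the rows first keeps everything sparse; dividing the dense product by an outer product of norms would give the same cosines but would materialize a full matrix.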
The Machine and Test Specs
The following machine specifications held:
- CPU : Core i3 2.33 GHz
- RAM : 8 GB (upgraded on laptop)
- Environment: Eclipse
- Test: Cosine Analysis
- Languages: Scala and Python (Scipy and Numpy only)
- Iterations: 32
- Allotted Memory: 7 GB, either with x64 Python3 or -Xmx7g
- Average: Strict non-smoothed average
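For readers who want to reproduce the Python side, a timing loop along these lines matches the specs listed above: a fixed number of iterations and a plain arithmetic mean with no smoothing. This is a sketch, not the exact harness used for the numbers in this post. The function name time_cosine_test is hypothetical, and whether matrix generation was included in the timed region is an assumption (here it is excluded).

```python
import time

import scipy.sparse as sparse
import scipy.sparse.linalg as linalg


def time_cosine_test(rows, cols, density, iterations=32):
    """Return the plain (non-smoothed) mean runtime in seconds."""
    # Assumption: the matrix is built once, outside the timed region.
    mat = sparse.rand(rows, cols, density=density, format="csr")
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        # mat * mat.T scaled by the squared Frobenius norm of mat.
        result = mat.dot(mat.T) / (linalg.norm(mat) ** 2)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


# Small sizes so the sketch runs quickly; the post used 1000 x 50000 at 0.15.
mean_seconds = time_cosine_test(200, 2000, 0.05, iterations=4)
print("mean seconds per iteration:", mean_seconds)
```

Calling time_cosine_test(1000, 50000, 0.15) would correspond to the matrix size used in this post, though it needs considerably more time and memory.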
The Scipy/Numpy Gold Standard
Not many open source libraries can claim the speed and power of the almighty Scipy and Numpy. The library can handle sparse matrices with m*n well over 1,000,000. More telling, it is fast. The calculation of the cosines is an extremely common practice in NLP and is a valuable similarity metric in most circumstances.
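import scipy.sparse as sparse
import scipy.sparse.linalg as linalg

# 1000 x 50000 random sparse matrix with 15% of its entries non-zero
mat = sparse.rand(1000, 50000, 0.15)
# mat * mat.T, divided by the squared (Frobenius) norm of mat
print(mat.dot(mat.T) / pow(linalg.norm(mat), 2))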
Result : 5.13 seconds
The following resulted from each library:
- Breeze : Crashes with Out of Memory Error (developer notified) [mat * mat.t]
- UJMP : 128.73 seconds
- MTJ : 285.13 seconds
- La4j : 420.2 seconds
- BIDMach : Crashes (sprand(1000,50000,0.15))
Here is what the author of Breeze had to say. Rest assured, Numpy has been stable for over a decade now with constant improvements.
Java libraries are slow and seemingly undeserving of praise. Perhaps, given the potential benefits of not necessarily distributing every calculation, they are simply not production grade. Promising libraries such as Nd4J/Nd4s still do not have a sparse matrix library, despite having claimed for some time to have one in the works. The alternatives are to use Python or to program millions of lines of C code. While I find C fun, it is not a fast language to implement in. Perhaps, for now, Python will do. After all, PySpark is a little over 6 months old.