At least in its current form, Spark is built on an outdated architecture. That is not to say it is dead, since a stated goal is better iterative single-node processing. In 2008, when I first heard about distributed systems like Hadoop from a relative working at IBM, machines were pitiful. 64 GB was a decent hard disk size, let alone a pipe dream for RAM. RAID was only just becoming a common term. Storage was a big part of the tech research game. Now, a common desktop box can have as much as 128 GB of RAM and a RAID array. What does that have to do with Spark? Spark relies on reshuffling data across a Hadoop backbone built for those puny devices of the 2000s.
The major issue here is network latency. A few years ago, expanding calculations to large data sets meant accepting network latency as an unavoidable problem. We would like the rate of convergence to dictate speed, but with bus speeds, processor speeds, network speeds, and RAM size being major considerations in practice, this ideal is never the case. Network latency with Spark's shuffling system is horrific, and it is an issue that tools like TensorFlow seek to address. A small test in a previous article, running calculations with NumPy, BIDMach, Spark, and some other systems on a 10000 x 10000 matrix, revealed just how bad this latency becomes. NumPy handily destroyed the distributed system, which needed to reshuffle data, pull data in from other sources, perform the calculations, and push the results back to the master.
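The single-node side of that comparison is easy to reproduce. Here is a minimal sketch; the matrix-vector workload and timing approach are my own illustration, not the exact calculations from the earlier article:

```python
import time
import numpy as np

# One 10,000 x 10,000 matrix of doubles is roughly 800 MB --
# it fits comfortably in RAM on an ordinary modern workstation.
n = 10_000
a = np.random.rand(n, n)
x = np.random.rand(n)

start = time.perf_counter()
y = a @ x  # a single matrix-vector product, entirely in local memory
elapsed = time.perf_counter() - start

print(f"{n}x{n} matrix-vector product: {elapsed:.3f}s, zero network hops")
```

A distributed system performing the same operation pays for serialization, shuffling, and network round trips before a single floating-point operation happens, which is the gap the benchmark exposed.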
Thanks to Moore’s law and companies such as Intel and Nvidia, whole data sets that were distributed not long ago now fit easily in memory. Terabyte-sized hard drives are available to the commoner, and again, 128 GB of RAM with a RAID 5 is not just a capability of top-of-the-line motherboards anymore.
A Future in Better-Grouped Sets of Tensors
These large boxes are not the death knell for distributed systems in general. However, distributed systems need to make better use of the hardware stack in today’s world.
Tensors are like vectors but more generalized; think of a multi-dimensional NumPy-like array that can grow and shrink. A system with the following features, including the use of tensors, would improve modern distributed systems well beyond what hardware such as GPUs alone can achieve.
- Grouping of tensors based on how commonly they are used together (by pid, task id, or another factor, or using a predictive construct)
- User-specified data structures for declaring initial groups of calculations that belong together
- Distribution of commonly grouped tensors so that latency between the machines holding a group's data is reduced, or so the tensors reside on a single system
- Support for storage systems such as a SAN or RAID
- Better use of the resources of large boxes
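To make the first and third bullets concrete, here is a toy sketch of co-usage-based grouping. Every name in it (`TensorGroupTracker`, `record_task`, `propose_groups`) is hypothetical, not an existing API; it only illustrates the idea of tracking which tensors a task touches together and proposing co-location groups:

```python
from collections import Counter, defaultdict
from itertools import combinations

class TensorGroupTracker:
    """Counts how often tensors are used by the same task, then
    proposes co-location groups for placement decisions."""

    def __init__(self):
        self.co_use = Counter()  # (tensor_a, tensor_b) -> joint use count

    def record_task(self, task_tensors):
        # Every pair of tensors touched by the same task or pid
        # counts as one co-occurrence.
        for pair in combinations(sorted(task_tensors), 2):
            self.co_use[pair] += 1

    def propose_groups(self, min_joint_uses=2):
        # Greedy union-find grouping: tensors whose joint-use count
        # crosses the threshold should live on the same node.
        parent = {}
        def find(t):
            parent.setdefault(t, t)
            while parent[t] != t:
                parent[t] = parent[parent[t]]  # path compression
                t = parent[t]
            return t
        for (a, b), count in self.co_use.items():
            if count >= min_joint_uses:
                parent[find(a)] = find(b)
        groups = defaultdict(set)
        for t in parent:
            groups[find(t)].add(t)
        return list(groups.values())

tracker = TensorGroupTracker()
tracker.record_task({"weights", "gradients"})
tracker.record_task({"weights", "gradients"})
tracker.record_task({"inputs"})
print(tracker.propose_groups())  # weights and gradients end up together
```

A real scheduler would feed these groups into its placement logic, so that tensors frequently used together land on the same machine or on machines with low latency between them.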
Think of tomorrow’s system as a distribution of these tensors, somewhat like TensorFlow but sitting between TensorFlow and Spark: capable of splitting equations and performing operations on extremely large sets while making the most of today’s beefy hardware.
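"Splitting equations" can be made concrete with a toy block-partitioned matrix multiply (my own illustration, not Spark or TensorFlow code). Each block product is an independent task that a scheduler could place on whichever node already holds the relevant tensor group:

```python
import numpy as np

def blocked_matmul(a, b, block=2):
    """Compute a @ b by splitting it into block-row / block-column
    pieces -- each inner product of blocks is an independent task
    that could run on the node holding those tensors."""
    n, m = a.shape[0], b.shape[1]
    out = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for k in range(0, a.shape[1], block):
                out[i:i+block, j:j+block] += (
                    a[i:i+block, k:k+block] @ b[k:k+block, j:j+block]
                )
    return out

a = np.arange(16.0).reshape(4, 4)
b = np.eye(4)
assert np.allclose(blocked_matmul(a, b), a @ b)
```

The point is not the arithmetic but the decomposition: once an operation is expressed as independent block tasks, placement can follow the tensor groups instead of forcing a global shuffle.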
Remember, technology grows. Today’s hot system is tomorrow’s Commodore.