Let's face it: BI and Data Integration platforms flat out suck. They eat resources, wreck your productivity if you are not using a decent development machine (Pentaho eats 2.5 GB just on testing and crashes my Lenovo), and overall will cost you way too much money. They will improve, but for now, the better alternative is some mixture with Python or, even better, Java.
Why reach all the way down to the language level? Simple: it works. It's effective, especially when paired with a JDBC driver, and it sips resources at the developer's will. Yes, will and thought do have a role in processing those massive amounts of strings and primitives and running our infamously quadratic algorithms. In fact, combining Java with Spring, I was able to save close to 30 minutes of pure download time at a distance from the server, and nearly tie Pentaho after configuring Java 8's ForkJoinPool. They add flexibility and ease too if you know XML; Gradle may be less user-friendly here.
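To make the ForkJoinPool point concrete, here is a minimal divide-and-conquer sketch in the spirit of that page-processing job. The "pages" are stub strings and the split threshold is an invented placeholder, not the setup from my actual test:

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal ForkJoinPool sketch: recursively split a batch of "pages"
// (stub strings standing in for real parse work) and sum their sizes.
public class PageSum extends RecursiveTask<Long> {
    static final int THRESHOLD = 1_000; // made-up cutoff for splitting
    private final String[] pages;
    private final int lo, hi;

    PageSum(String[] pages, int lo, int hi) {
        this.pages = pages; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long total = 0;
            for (int i = lo; i < hi; i++) total += pages[i].length();
            return total;
        }
        int mid = (lo + hi) >>> 1;
        PageSum left = new PageSum(pages, lo, mid);
        PageSum right = new PageSum(pages, mid, hi);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // work the right half, then join
    }

    public static void main(String[] args) {
        String[] pages = new String[65_000];
        Arrays.fill(pages, "<html>stub page</html>");
        long total = new ForkJoinPool().invoke(new PageSum(pages, 0, pages.length));
        System.out.println(total);
    }
}
```

The pool steals work between threads, which is why it scales well on the kind of embarrassingly splittable parsing described above.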
If it is ease of manipulation you want, try Python. Comprehensions, many open source tools, easily integrated databases, and much less code make this language perfect if speed isn't your game.
On top of this, both languages have some version of map-reduce, and the ForkJoinPool can work miracles for data integration, collection, and even analysis.
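In Java, the built-in map-reduce is the Stream API, and `parallelStream()` runs it on the common ForkJoinPool behind the scenes. A sketch with made-up records:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of Java 8's built-in map-reduce: map each record to a measure,
// then reduce in parallel on the common ForkJoinPool.
public class MapReduceSketch {
    static long totalChars(List<String> records) {
        return records.parallelStream()         // fan out over the ForkJoinPool
                      .mapToLong(String::length) // map: record -> size
                      .sum();                    // reduce: add the sizes up
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("alpha", "beta", "gamma");
        System.out.println(totalChars(records)); // 5 + 4 + 5 = 14
    }
}
```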
Consider the following numbers from parsing and normalizing large pages. Parsing requires large amounts of memory, decent memory management, and quadratic algorithms.
- 65,000 pages in Java with ForkJoinPool: 5 minutes
- 65,000 pages in Pentaho with Fork Join: 20 minutes from afar, 5 minutes close by
- Python: too large
- 65,000 pages in Java with commits batched at 10,000: 610 MB, 15% CPU
- 65,000 pages in Pentaho with commit at 100: 2.5 GB, 20% CPU
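The commit interval is where that 610 MB vs 2.5 GB gap comes from: flushing every N rows instead of holding everything (or committing constantly) keeps the heap bounded. A hedged sketch of batched commits over JDBC; the connection URL, table, and column are placeholders, not a real schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Sketch of batched commits over JDBC: execute and commit every
// COMMIT_EVERY rows so the driver never buffers the whole load.
public class BatchLoader {
    static final int COMMIT_EVERY = 10_000; // the commit interval from the test above

    // Pure helper so the commit cadence is easy to reason about (and test).
    static boolean shouldCommit(int pending) {
        return pending % COMMIT_EVERY == 0;
    }

    static void load(Connection conn, Iterable<String> pages) throws Exception {
        conn.setAutoCommit(false); // take control of commit points
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO pages(body) VALUES (?)")) {
            int pending = 0;
            for (String page : pages) {
                ps.setString(1, page);
                ps.addBatch();
                if (shouldCommit(++pending)) {
                    ps.executeBatch(); // push the batch to the server
                    conn.commit();     // and release the memory held for it
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
        }
    }
}
```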
Another useful test is GC overhead across different sources. Python and Java are stable, since the garbage collector can be hinted (in Java's case) or told outright (in Python's case) when to run. Pentaho, on the other hand, is not only slower than Java (when Java is used with the appropriate knowledge and threading) but at millions of records (4,009,008 in my source) produces GC overhead errors even with commits at 100 records.
Enough venting, though; Pentaho just reloaded.