Tired of Pentaho not doing what it is supposed to? Tired of paying $35,000 for an only somewhat decent database tool? Then I have a solution for you. Downloading my code from my GitHub account gives you a full-fledged ability to parse even the largest documents and leverage Pentaho properly. Using Java's ForkJoinPool and other threading capabilities, this tool achieves high, tunable throughput.
All that is required is a computer or server. These tools work well in conjunction with each other. Parsing 1.5 million web pages and pre-grabbed PDFs, or 3 million API rows, can take just an hour on 1 GB of RAM and a 2 GHz dual-core processor at 20% power. Plans for version 1 include API and REST tools as well as silhouette detection and much more.
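To give a rough idea of what "tunable" means with a Fork/Join pool, here is a minimal sketch. It is not the code from my repository: the class name ParallelParseSketch, the parseDocument helper (which only counts words as a stand-in for real parsing), and the THRESHOLD value are all made up for illustration. The parallelism argument is the knob you would turn to trade speed against CPU usage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ParallelParseSketch {

    // Hypothetical stand-in for real parsing work:
    // counts the whitespace-separated tokens in one document.
    static int parseDocument(String doc) {
        String trimmed = doc.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Splits the document list in half until the pieces are small,
    // then parses each small piece sequentially.
    static class ParseTask extends RecursiveTask<Long> {
        // Illustrative cut-off; a real tool would tune this.
        private static final int THRESHOLD = 2;
        private final List<String> docs;

        ParseTask(List<String> docs) {
            this.docs = docs;
        }

        @Override
        protected Long compute() {
            if (docs.size() <= THRESHOLD) {
                long total = 0;
                for (String d : docs) {
                    total += parseDocument(d);
                }
                return total;
            }
            int mid = docs.size() / 2;
            ParseTask left = new ParseTask(docs.subList(0, mid));
            ParseTask right = new ParseTask(docs.subList(mid, docs.size()));
            left.fork();                        // run the left half asynchronously
            long rightResult = right.compute(); // do the right half in this thread
            return left.join() + rightResult;
        }
    }

    // parallelism is the tunable part: the target number of worker threads.
    public static long parseAll(List<String> docs, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ParseTask(docs));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "first sample page", "second page", "", "a b c d");
        System.out.println(parseAll(docs, 2)); // prints 9
    }
}
```

Raising or lowering the second argument to parseAll changes how many worker threads the pool aims to keep busy, which is how a tool like this can be throttled on a small machine.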
Java 8 (Java 7 works, but the Fork/Join pool is broken)
Apache Commons Lang (StringUtils)