Installing Hadoop on Windows 8.1 with Visual Studio 2010 Professional

Looking for a matrix system that works like gensim in Python, I discovered Mahout. Wanting to test it against the Universal Java Matrix Package, I decided to give the install a try. That, unfortunately, turned into a long, side-tracked road going well beyond the requirements listed in the install file.

In the end, I found a way that did not require Cygwin and went smoothly without having to build packages in the Visual Studio IDE.

Installation instructions follow.

The Hadoop source can be downloaded from a variety of Apache mirrors.
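
For instance, the 2.6.0 source archive is available from the Apache archive; a rough sketch of grabbing it from the command line is below, where the release, mirror URL, and C:\hdp staging directory are all assumptions to adjust to taste.

REM Example only: fetch and stage the Hadoop source (any Apache mirror and any directory will do).
mkdir C:\hdp
bitsadmin /transfer hadoopsrc /download /priority normal ^
    https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz ^
    C:\hdp\hadoop-2.6.0-src.tar.gz
REM Unpack the .tar.gz (7-Zip or similar) so the source ends up in C:\hdp\hadoop-2.6.0-src.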

The x64 Platform

Before starting, understand that Hadoop targets an x64 system. An x86-64 machine works fine, and using 32-bit installations of CMake and the JDK will not harm the project. However, the native code is 64-bit and requires the Visual Studio 10 2010 Win64 CMake generator to compile the hdfs project files.
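
As a quick sanity check, assuming cmake.exe is already on the PATH, the installed CMake should list that 64-bit generator; the sketch below simply prints the generator list.

REM cmake --help ends with the list of available generators.
REM Look for "Visual Studio 10 2010 Win64" (older CMake releases name it "Visual Studio 10 Win64").
cmake --help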

Uninstall Visual Studio 2010 Express and Its Redistributables
Visual Studio 2010 Express installs a C++ redistributable that will cause the Windows SDK command prompt to fail and will also conflict with parts of the build when using the Visual Studio Command Prompt.

Requirements

The following requirements are necessary, listed in no particular order; a quick way to confirm that each one resolves from the command prompt follows the list.

  1. Microsoft Visual Studio 2010 Professional with C++
  2. .NET 4.0 Framework
  3. Zlib
  4. Most recent Maven
  5. MSBuild
  6. CMake
  7. Protoc (Protocol Buffers 2.5.0, the version the Hadoop 2.6 build expects)
  8. Java JDK 1.7
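
A minimal way to confirm that each tool resolves from a fresh command prompt is sketched below; the version banners you see will depend on what you installed.

REM Each command should print a version banner, not "is not recognized as an internal or external command".
msbuild /version
cmake --version
mvn -version
protoc --version
java -version
REM cl only resolves from a Visual Studio command prompt; it prints the C++ compiler version.
cl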

Path Variables

The following must be on your PATH. Order only matters if you have Cygwin installed and want to keep it: place MS Visual Studio (and its CMake) before Cygwin so the build does not pick up Cygwin's copy of cmake, which will not work for this task. Better yet, if that is the route you choose, simply delete Cygwin's cmake and use the Windows version. An example PATH setup follows the list.

  1. MSBuild
  2. CMake
  3. Visual Studio 2010
  4. Zlib
  5. Protoc
  6. Java
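
A per-session sketch is below; every directory is an assumption, so point each entry at wherever you actually installed or unpacked the tool, and keep these entries ahead of Cygwin if you left it installed.

REM Prepend the build tools to PATH for this command prompt only (all locations below are assumptions).
set PATH=C:\Windows\Microsoft.NET\Framework64\v4.0.30319;%PATH%
set PATH=C:\Program Files (x86)\CMake 2.8\bin;%PATH%
set PATH=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64;%PATH%
set PATH=C:\zlib128;%PATH%
set PATH=C:\protoc-2.5.0-win32;%PATH%
set PATH=C:\Program Files\Java\jdk1.7.0_79\bin;%PATH%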

Environment Variables

The following can be set for an individual command prompt session; an example is sketched after the list.

  1. JAVA_HOME=path to jdk
  2. M2_HOME=path to maven
  3. VCTargetsPath=path to MSBuild\Microsoft.Cpp\v4.0 (or another valid location of the C++ properties files)
  4. Platform=x64
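
A sketch of setting these for one session follows; the directories are assumptions to adapt to your machine. Two details that commonly trip up the Hadoop build: JAVA_HOME should not contain spaces (hence the 8.3 short name for Program Files below), and the Platform variable name is case sensitive.

REM Per-session environment for the build (all paths are assumptions).
REM PROGRA~1 is the 8.3 short name for "Program Files", keeping JAVA_HOME free of spaces.
set JAVA_HOME=C:\PROGRA~1\Java\jdk1.7.0_79
set M2_HOME=C:\apache-maven-3.2.5
set VCTargetsPath=C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\
set Platform=x64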

Run the Build

Open a Visual Studio 2010 Win64 command prompt and run the following command from the root of the Hadoop source tree.

mvn package -Pdist,native-win -DskipTests -Dtar
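
Putting it together, a typical session looks roughly like the sketch below; the source directory name is an assumption, and the PATH and environment variables from the previous sections are assumed to already be set in this prompt.

REM From the Visual Studio 2010 Win64 command prompt, with PATH and environment set as above.
cd C:\hdp\hadoop-2.6.0-src
mvn package -Pdist,native-win -DskipTests -Dtar

The native-win profile is what pulls in the Windows native pieces, and -Dtar is what produces the tarball listed in the next section.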

Resulting Files

The following files should appear in your unpacked Hadoop source directory under hadoop-dist/target.

  1. hadoop-2.6.X.tar
  2. hadoop-dist-2.6.X.jar
  3. dist-layout-stitching.sh
  4. dist-tar-stitching.sh

Special Thanks

Special thanks to the IT admin and security professional contractor at Hygenics Data LLC for the copy of Microsoft Visual Studio 2010.

Happy hadooping or being a Mahout.


Morning Joe: Big Data Tools for Review

Big data is a big topic, and many tools are starting to spring up. However, implementations of the old SQL standard are falling far behind these tools. My recent post on using a Fork Join Pool for inserting and pulling data can help if you are using PostgreSQL, but it also destroys bandwidth. It causes headaches for everyone whenever I need to do some serious checking of my ETL, parsing, normalization, data work, or other processes outside of our co-location. Although this is nothing new, I’ve compiled a list of tools to go forth and make little rocks from big rocks all day.

I’ve found some companies and a tool for a bigger review later on; this is my pre-work post while my IDE starts up. So far, I have found technologies for SQL databases that use Hadoop, fractal trees, and Cassandra to speed up the process. They are not focused specifically on speed but can help create faster database access and lower coding time.

What I’ve found so far:

  • Cassandra (open source): promises scalability and availability alongside a plethora of features (maybe for implementing other tools)
  • Oracle Data Integration Adapter for Hadoop: promises the speed of Hadoop connected to an Oracle database
  • BigSQL (open source): promises to combine Cassandra, PostgreSQL, and Hadoop into a blazing fast package for analysis
  • MapR Technologies (somewhat open source): offers a wide variety of products to improve speed in querying and analysis, from Hive and map reduce to actual Hadoop
  • Fractal Tree Indexing (open source): Tokutek’s fractal tree indexing speeds up insertions using buffers on each tree node
  • Alteryx: a tool for quicker data processing, though not quite as fast as the others (good if your budget does not allow clusters but allows something better than Pentaho)
  • MongoDB (open source): combines map reduce and other technologies with large databases; Tokutek tested its fractal tree indexes on MongoDB
  • Pentaho (open source): the open source version of Alteryx

Many of these tools are already implemented in others, such as Pentaho. Personally, I would like to see a SQL-like language that uses these tools alongside a query processor. It would make the tasks even faster to write; think Java v. Python. You could have 10 lines of map reduce code, 5 minutes of click and drag, or a 1-line easy-going query that writes as you think.

To be clear, I am not ranking these, only marking them for future review since this is what has piqued my interest today. Cheers!