Having spent the better part of a week reading through blog posts and documentation, I found that none of them covered the process in full detail, at least not for the software versions I intended to use.
UPDATE (Sept. 5, 2011): I no longer have this system running (I switched to Ubuntu) and will most likely not be able to answer questions about the setup. I recommend asking your questions on the hadoop-users mailing list instead; you will find information on how to subscribe and post to the list on the Hadoop website.
UPDATE (May 25, 2011): If you are using this guide, remember to have a look at the comments; some of them concern version updates and other related issues.
UPDATE (Nov. 1, 2010): I've noticed some errors arising when using Hadoop 0.21.0 with HBase 0.20.6, and have gone back to Hadoop 0.20.2 instead, as it does not produce the same errors. If you intend to use HBase together with Hadoop, I recommend setting up Hadoop 0.20.2 instead; the installation is more or less identical.
You will additionally need ZooKeeper 3.3.1 in order to get HBase to run properly.
Throughout this guide I will assume that your Cygwin install path is c:\cygwin and that Hadoop, ZooKeeper and HBase will be installed in c:\cygwin\usr\local (/usr/local/); this is, however, something you can choose yourself. If you choose to install Cygwin elsewhere, I recommend using folder names without whitespace or other non-standard characters.
The only prerequisite for this guide is that you have Java installed and added to your %PATH% variable (which is usually done automatically).
- Step 1 - Download software
- Step 2 - Install and configure Cygwin
- Step 3 - Install and configure Hadoop
- Step 4 - Install and configure ZooKeeper
- Step 5 - Install and configure HBase
- Step 6 - Start your cluster
Download each software bundle and put it somewhere where you'll easily find it later.
If you've never used Cygwin (or Linux/Unix/etc), you should perhaps get familiar with those environments first. If you still want to continue, read on.
Throughout the Cygwin section - if you find yourself lost, please follow Vlad Korolev's guide on how to get Cygwin up and running for Hadoop and make sure to additionally install tcp_wrappers and diffutils when choosing packages. Follow steps 2 to 4 in the guide and then continue with the Hadoop installation guide below.
If you're familiar with Cygwin you just need to make sure you have these packages installed:
Additionally, you will have to configure sshd to start as a service and enable passwordless logins. To do this, fire up a Cygwin terminal window after you've completed the installation and do the following:
When asked about privilege separation answer no
When asked if sshd should be installed as a service answer yes
When asked about the CYGWIN environment variable, enter ntsec
Now go to the Services and Applications tool in Windows, locate the CYGWIN sshd service and start it.
Next Cygwin step is to set up the passwordless login. Go to your Cygwin terminal and enter
ssh-keygen
Do not use a passphrase and accept all default values. After the process is finished do the following
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
This will add your identification key to the set of authorized keys, i.e. those that are allowed to login without entering a password.
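If ssh still prompts for a password after this, sshd may be rejecting the key because of loose file permissions on the key files; tightening them is a common fix (the exact requirements depend on your sshd configuration):

```shell
# Ensure ~/.ssh and authorized_keys are not group/world-writable,
# otherwise sshd may ignore the key and fall back to password login.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```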
Try connecting to localhost to see whether it works
ssh localhost
Doing this the first time should prompt you with a warning; type yes and press enter. Now try issuing the same command again; this time there should be no warning and no need to enter a password.
Since Vlad's guide is written for Hadoop 0.19.0, some of the configuration details specified in it no longer apply (or have moved to other files); what follows is an updated version of what you'll find in his guide.
First - copy the downloaded tar.gz file to c:\cygwin\usr\local (which corresponds to /usr/local in the Cygwin environment). When this is done, it's time to extract the package, this is done by issuing
tar xvzf hadoop-0.21.0.tar.gz
This command extracts the content of the downloaded hadoop file into c:\cygwin\usr\local\hadoop-0.21.0 (/usr/local/hadoop-0.21.0).
Hadoop requires some configuration, the configuration files are located in c:\cygwin\usr\local\hadoop-0.21.0\conf.
The files that need to be altered are core-site.xml, mapred-site.xml and hdfs-site.xml.

In core-site.xml, add:

<property>
  <name>fs.default.name</name>
  <value>hdfs://127.0.0.1:9100</value>
</property>

In mapred-site.xml, add:

<property>
  <name>mapred.job.tracker</name>
  <value>127.0.0.1:9101</value>
</property>

In hdfs-site.xml, add:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
Only Hadoop 0.21.0: Next, one line has to be added to the hadoop-config.sh file in hadoop-0.21.0/bin
CLASSPATH=`cygpath -wp "$CLASSPATH"`
Add this line before the line containing
The reason for this is that in order for the CLASSPATH to be built with all the Hadoop jars (lines ~120 to ~200) the paths need to be in the Cygwin format (/cygdrive/c/cygwin/usr/local/hadoop...), but in order for Java to use the classpath, it needs to be in the Windows format (c:\cygwin\usr\local\hadoop...). This line transforms the Cygwin-built classpath into one that is understood by Windows.
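To see the kind of transformation involved, here is a rough stand-in for what `cygpath -wp` does, emulated with sed so it can run outside Cygwin (illustration only; the sample classpath is made up, and in the actual script you should use cygpath itself):

```shell
# Emulate cygpath -wp on a sample two-entry Unix-style classpath:
# ':' separators become ';', /cygdrive/<letter>/ becomes <letter>:/,
# and forward slashes become backslashes.
unix_cp='/cygdrive/c/cygwin/usr/local/hadoop-0.21.0:/cygdrive/c/cygwin/usr/local/hadoop-0.21.0/lib'
win_cp=$(printf '%s' "$unix_cp" \
  | sed -e 's|:/cygdrive/|;/cygdrive/|g' \
        -e 's|/cygdrive/\(.\)/|\1:/|g' \
        -e 's|/|\\|g')
printf '%s\n' "$win_cp"
```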
This should be enough for Hadoop to run, test the installation by issuing these commands in a Cygwin window:
cd /usr/local/hadoop-0.21.0
mkdir logs
bin/hadoop namenode -format
The last command will take some seconds to finish and should produce about 20 lines of output during the creation of the namenode filesystem.
The final step of the Hadoop setup is to start it and test it.
To start it issue the following commands in a Cygwin window:
cd /usr/local/hadoop-0.21.0
bin/start-dfs.sh
bin/start-mapred.sh
Provided no error messages are printed, Hadoop should now be running. This can be checked by opening http://localhost:50070 and http://localhost:50030 in a browser (the default web UI ports for the NameNode and the JobTracker). The first link should provide information about the NameNode; make sure that the Live Nodes count is 1. The second link provides information about the cluster.
Now it's time to run a little job on the cluster to see whether or not the installation was successful.
First, copy some files to the node:
cd /usr/local/hadoop-0.21.0
bin/hadoop fs -mkdir input
bin/hadoop fs -put conf/*.xml input
bin/hadoop jar hadoop-*examples*.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -cat 'output/*'
Provided there were no errors, you've just run your first Hadoop process.
This step, it seems, is only necessary if you're installing the setup on 64 bit Windows.
The problem seems to be that the ZooKeeper server which comes bundled with HBase does not work correctly, and thus a standalone one needs to be set up.
Luckily the ZooKeeper install and configuration is quite easy.
First, copy the downloaded zookeeper-3.3.1.tar.gz file to your c:\cygwin\usr\local directory, open a Cygwin window and issue the following commands:
cd /usr/local/
tar xvzf zookeeper-3.3.1.tar.gz
ZooKeeper's configuration file (zoo.cfg) is located in /usr/local/zookeeper-3.3.1/conf (c:\cygwin\usr\local\zookeeper-3.3.1\conf).
Open the file and paste the following content into it, overwriting the original config:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/tmp/zookeeper/data
# the port at which the clients will connect
clientPort=2181
Make sure to create the /tmp/zookeeper/data directory and make it writable for everyone (chmod 777).
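Creating the data directory and opening up its permissions can be done from the same Cygwin window, for example:

```shell
# Create the ZooKeeper data directory and make it writable for everyone.
mkdir -p /tmp/zookeeper/data
chmod -R 777 /tmp/zookeeper
```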
ZooKeeper is started by typing:
cd /usr/local/zookeeper-3.3.1
bin/zkServer.sh start
Make sure to test if ZooKeeper is running correctly by connecting to it
bin/zkCli.sh -server 127.0.0.1:2181
Start by copying hbase-0.20.6.tar.gz to c:\cygwin\usr\local and extracting it by issuing
tar xvzf hbase-0.20.6.tar.gz
in a Cygwin terminal.
Now it's time to create a symlink to your JRE directory in /usr/local/. Do this by typing:
ln -s /cygdrive/c/Program\ Files/Java/<jre name> /usr/local/<jre name>
in a Cygwin terminal. <jre name> will most likely be jre6, but be sure to double check this before making the link.
HBase's configuration files are located in /usr/local/hbase-0.20.6/conf/ (C:\cygwin\usr\local\hbase-0.20.6\conf), and to get HBase up and running we need to edit hbase-env.sh and hbase-default.xml.
In hbase-env.sh the JAVA_HOME, HBASE_IDENT_STRING and HBASE_MANAGES_ZK variables have to be set; this is done by editing the lines containing the variable names to read:
export JAVA_HOME=/usr/local/jre6
export HBASE_IDENT_STRING=$HOSTNAME
export HBASE_MANAGES_ZK=false
The last variable tells HBase not to use the bundled ZooKeeper server, as we've already installed a standalone one.
Next, the hbase-default.xml file has to be edited; the two properties that need to be set are hbase.rootdir and hbase.tmp.dir:
<property>
  <name>hbase.rootdir</name>
  <value>file:///C:/cygwin/tmp/hbase/data</value>
  <description>The directory shared by region servers. Should be
  fully-qualified to include the filesystem to use, e.g.:
  hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR</description>
</property>
<property>
  <name>hbase.tmp.dir</name>
  <value>C:/cygwin/tmp/hbase/tmp</value>
  <description>Temporary directory on the local filesystem.</description>
</property>
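As the description of hbase.rootdir notes, it can also point at a running HDFS instance instead of the local filesystem. With the fs.default.name configured earlier, that would look something like the following (untested in this setup; the /hbase path segment is just a conventional choice):

```xml
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://127.0.0.1:9100/hbase</value>
</property>
```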
Make sure that both directories exist and are writable by all users (chmod 777).
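For the values above, that means creating the Cygwin equivalents of those Windows paths, for example:

```shell
# C:/cygwin/tmp/... corresponds to /tmp/... inside Cygwin.
mkdir -p /tmp/hbase/data /tmp/hbase/tmp
chmod -R 777 /tmp/hbase
```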
The command for starting HBase is:
cd /usr/local/hbase-0.20.6
bin/start-hbase.sh
This section is very similar to what's found on the HBase wiki, the difference is the standalone ZooKeeper config.
Having done all these steps, it's time to start up the cluster.
The startup procedure should follow this order: first ZooKeeper, then Hadoop (DFS before MapReduce), and finally HBase.
So what you do is:
cd /usr/local/zookeeper-3.3.1
bin/zkServer.sh start

cd /usr/local/hadoop-0.21.0
bin/start-dfs.sh
bin/start-mapred.sh

cd /usr/local/hbase-0.20.6
bin/start-hbase.sh
In order to get my system up and running, I used tutorials and information posted by others; this post is simply an aggregation of several resources. These include: