
Setting up a Hadoop cluster on CentOS – Oppum Cloudera Manager style!!

OK, so it’s a cheesy title, but hey, it grabs attention, right? Hadoop clusters are serious business, so we need to inject a little bit of fun! Well, the first fun part is that this is a very cheap, not-for-production type of cluster. I don’t have a couple of racks to try out the full power of Hadoop. My intention was to go through a complete “soup to nuts” experience covering hardware, OS and the Hadoop stack, using only the resources at hand: a bunch of laptops that my friends in IT procured for me.

My end result? Three laptops (two old Dells and one IBM ThinkPad) running the Cloudera Hadoop distribution. I am posting all the steps I went through for those interested in a similar experience. This is a pretty detailed post (18 steps), but at the end you should have your own complete Hadoop stack.

The Cloudera Manager installation is “generally” a UI-based, click-through affair. However, it’s very sensitive to your machine’s networking.

You have to get a few things absolutely right (Goldilocks style) – especially relating to your hostnames, DNS, ports etc.

To begin with, my machine configurations were as follows:

1st machine (IBM Thinkpad) : Intel quad core, 16 GB RAM, 500 GB disk

2nd machine: Intel dual core, 8 GB RAM, 80 GB disk

3rd machine: Intel core 2 duo, 2 GB RAM, 80 GB disk

What? You didn’t think cheap meant that cheap? Yup – but remember – this is a purely prototyping instance.

I loaded these laptops with CentOS v6.2. I am going to start with the CentOS installation process:

1)   I used NetInstall (which means you need an Ethernet or wireless connection): I created a boot CD, and the installation media was downloaded from the URL http://mirror.centos.org/centos/6/os/x86_64

2)   NOTE: When you are asked to provide a hostname, make sure to provide a real, fully qualified hostname instead of localhost.localdomain. You can change it later, but it’s better to get it done here.

3)      Click through most of your install

4)      When you are asked to provide a root password, I recommend using the same root password on all the machines that will be part of the cluster, to keep it simple, unless you want to generate RSA public and private keys for SSH.

5)   You now need to make sure you have the Oracle Java SDK installed on your machine. For CDH 4, the certified version is 1.6.0_31, so I recommend using that or anything after it.

  • First check whether other versions of Java are installed (most likely OpenJDK 1.6 or 1.7)
  • If you have OpenJDK, run yum remove java-1.6.0-openjdk (if you have another version of Java, substitute that version)
  • If you have some other Java, do what you need to uninstall it
  • Download Oracle Java from the URL below (this is version 6; browse to the appropriate version from this site):
  • http://www.oracle.com/technetwork/java/javase/downloads/jdk6u35-downloads-1836443.html
  • chmod +x jdk-6u35-linux-x64-rpm.bin
  • ./jdk-6u35-linux-x64-rpm.bin
  • Set the alternatives if you want to:

# alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_35/jre/bin/java 20000

# alternatives --install /usr/bin/jar jar /usr/java/jdk1.6.0_35/bin/jar 20000

# alternatives --install /usr/bin/javac javac /usr/java/jdk1.6.0_35/bin/javac 20000

# alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.6.0_35/jre/bin/javaws 20000

# alternatives --set java /usr/java/jdk1.6.0_35/jre/bin/java

# alternatives --set javaws /usr/java/jdk1.6.0_35/jre/bin/javaws

# alternatives --set javac /usr/java/jdk1.6.0_35/bin/javac

# alternatives --set jar /usr/java/jdk1.6.0_35/bin/jar
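If you want to double-check which Java ends up on the path, here is a minimal sketch that compares the reported version against the 1.6.0_31 baseline mentioned in step 5. The helper name java6_update_ok is my own invention, and it only understands 1.6.0 version strings; anything else (a 1.7 build, for instance) is simply reported as not matching.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Cloudera tooling): succeeds when the
# given version string is a 1.6.0 build at update 31 or later.
java6_update_ok() {
    version="$1"
    case "$version" in
        1.6.0_*) update="${version#1.6.0_}" ;;
        *) return 1 ;;
    esac
    # Drop any non-digit suffix such as "-b10"
    update="${update%%[!0-9]*}"
    [ -n "$update" ] && [ "$update" -ge 31 ]
}

# "java -version" prints to stderr, e.g.: java version "1.6.0_35"
if command -v java >/dev/null 2>&1; then
    installed=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    if java6_update_ok "$installed"; then
        echo "Java $installed looks fine for CDH 4"
    else
        echo "Java '$installed' is not a 1.6.0 build at update 31 or later"
    fi
fi
```

Run it on every node after installing the JDK; a mismatch usually means an old OpenJDK is still first on the path.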

6)      Open a terminal window and type uname -a or hostname (this should show the hostname that you specified during installation)
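Since Cloudera Manager is so picky about hostnames, a quick scripted check can save some head-scratching. This is only a sketch: is_real_fqdn is a made-up helper, and its rule (the name must contain a dot and must not start with localhost) is my own approximation of "fully qualified".

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the name has a domain part and is not
# the installer default (localhost or localhost.localdomain).
is_real_fqdn() {
    case "$1" in
        ""|localhost|localhost.*) return 1 ;;
        *.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prefer the fully qualified name; fall back to plain hostname if -f fails
name=$(hostname -f 2>/dev/null || hostname 2>/dev/null || echo "")
if is_real_fqdn "$name"; then
    echo "Hostname OK: $name"
else
    echo "Hostname '$name' is not fully qualified; fix it before installing Cloudera Manager"
fi
```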

7)      On each machine, first disable SELinux by setting SELINUX=disabled in /etc/selinux/config
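If you would rather not edit the file by hand on every laptop, something like the following should do it. Treat it as a sketch: disable_selinux_in is a name I made up, and it assumes the file already contains a SELINUX= line, as a stock CentOS config does.

```shell
#!/bin/sh
# Hypothetical helper: set SELINUX=disabled in the given config file,
# keeping a backup copy next to it as <file>.bak.
disable_selinux_in() {
    config="$1"
    cp "$config" "$config.bak"
    # Anchored on "SELINUX=" so the SELINUXTYPE= line is left alone
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$config.bak" > "$config"
}

# Usage (as root):
#   disable_selinux_in /etc/selinux/config
# The change applies after a reboot; "setenforce 0" switches the running
# system to permissive mode in the meantime.
```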

8)      Open /etc/hosts (you will need to sudo or su). Yours may look like this:

127.0.0.1              localhost.localdomain    localhost

::1                           localhost.localdomain localhost localhost4

Delete the ::1 line, and add your own host entry if it’s not already there. Then add the other machines that will be part of the cluster.

Eg: 192.168.x.x  Hadoop-main.yourdomain.com                                Hadoop-main

192.168.m.n  Hadoop-node2.yourdomain.com            Hadoop-node2
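Because the same entries have to go onto every node, I find it handy to script the edit so it can be re-run safely. The sketch below is illustrative only: add_host_entry is a hypothetical helper, and the IP address in the usage comment is a placeholder, not one of my real addresses.

```shell
#!/bin/sh
# Hypothetical helper: append "IP FQDN SHORTNAME" to a hosts file unless the
# FQDN is already listed on a non-comment line.
add_host_entry() {
    hosts_file="$1"; ip="$2"; fqdn="$3"; short="$4"
    if awk -v name="$fqdn" '
            $1 !~ /^#/ {
                for (i = 2; i <= NF; i++)
                    if (tolower($i) == tolower(name)) found = 1
            }
            END { exit !found }' "$hosts_file"; then
        return 0    # already present, nothing to do
    fi
    printf '%s\t%s\t%s\n' "$ip" "$fqdn" "$short" >> "$hosts_file"
}

# Usage (as root; the address is a placeholder):
#   add_host_entry /etc/hosts 192.168.1.10 Hadoop-main.yourdomain.com Hadoop-main
```

Running it twice with the same arguments leaves a single entry, so it is safe to put in a setup script that you run on each machine.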

8) Open /etc/resolv.conf and, for the “search” entry, add domain.local. Eg:

domain        yourdomain.com

search          yourdomain.com             domain.local

Repeat these entries on the other nodes in the cluster so the IPs/names are resolved properly.
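To confirm that a node picked up the change, you can check the search line with a few lines of shell. This is a rough sketch; has_search_domain is a hypothetical helper that only looks at lines starting with the lowercase keyword search.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a "search" line in the given
# resolv.conf-style file lists the wanted domain.
has_search_domain() {
    awk -v want="$2" '
        $1 == "search" {
            for (i = 2; i <= NF; i++)
                if ($i == want) found = 1
        }
        END { exit !found }' "$1"
}

# Usage:
#   has_search_domain /etc/resolv.conf domain.local && echo "search entry present"
# Then confirm each cluster name actually resolves, for example:
#   getent hosts Hadoop-node2
```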

9)      You will most likely need to change your firewall to allow access to the Cloudera Manager ports, 7180 and 7182. The best way is to open these ports only to hosts in the cluster. On all the machines that Cloudera Manager will connect to, you can do the following:

iptables -A INPUT -p tcp -s 192.168.x.x/24 --dport 22 -j ACCEPT

iptables -A INPUT -p tcp -s 192.168.x.x/24 --dport 7180 -j ACCEPT

iptables -A INPUT -p tcp -s 192.168.x.x/24 --dport 7182 -j ACCEPT

10)   Save it: service iptables save
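With three or more machines, typing the rules by hand gets old quickly. Here is a small sketch that prints the same rules for a given subnet so you can review them before applying. The function name print_cm_rules and the 192.168.1.0/24 subnet in the usage comment are assumptions for illustration; substitute your own network.

```shell
#!/bin/sh
# Hypothetical helper: print one ACCEPT rule per port (SSH plus the two
# Cloudera Manager ports from step 9) for the given source subnet.
print_cm_rules() {
    subnet="$1"
    for port in 22 7180 7182; do
        echo "iptables -A INPUT -p tcp -s $subnet --dport $port -j ACCEPT"
    done
}

# Review the output first, then apply and save as root, for example:
#   print_cm_rules 192.168.1.0/24
#   print_cm_rules 192.168.1.0/24 | sh
#   service iptables save
```

One thing to watch for: -A appends to the end of the chain, so if your existing INPUT chain already ends with a catch-all REJECT rule, the new rules may never be reached. In that case, inserting at the top with iptables -I INPUT is one way around it.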

11)   NOT RECOMMENDED but useful for testing : If you have the desktop, you can go into system->administration->Firewall and disable the firewall (hit apply).

After steps 6-9, reboot.

12)   Open a terminal window and get the Cloudera Manager Free Edition:

wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin

13)   Make it executable: chmod +x ./cloudera-manager-installer.bin

14)   ./cloudera-manager-installer.bin – this should launch the installer and the rest is history :)

15)   You should now automatically be launched into the admin console of CM.

16)   If you want to use a browser on a Windows laptop, you will either have to use the IP address (ip-address:7180) or edit your Windows hosts file to add the CM host. I suggest the latter, because the URL references are all based on the CM hostname; otherwise you will need to edit every URL by hand to put the IP in.

17)   To add your Linux machine alias to the Windows hosts file, do the following:

  • Search for Notepad in your start menu – right click and do “run as administrator” for notepad. This step is important because Windows will not let you edit the etc hosts file otherwise
  • Browse to C:\Windows\system32\drivers\etc and open the hosts file
  • Add the IP and hostname of the Linux machine running CM

18)   Remember where you may have disabled the entire firewall? If everything now works after this, then you can reconfigure the firewall with the right ports enabled. If you have the cluster manager running as a web app on one of the machines, you may have to open the Web server port so you can access it from your windows laptop.

Cloudera-Mgr-Laptop

That’s it! Your own Hadoop cluster on a budget!! In my next post I will attempt to run the weather processing example from the book “Hadoop: The Definitive Guide” by Tom White.
