
Sunday, November 9, 2014

How to Remote Debug Standalone Hadoop


When you run your MapReduce applications you may hit hiccups here and there and need a look inside. If you would rather attach a remote debugger than dig through logs to figure out what went wrong, the following is the procedure.

I am using IntelliJ IDEA as the IDE, but the process is similar for other IDEs.

1) In IntelliJ IDEA go to Run > Edit Configurations and click "+". Then add a "Remote" configuration for remote debugging.



2) You will see the following window after clicking on Remote. You can change the port used for remote debugging in this panel.



3) Open your Hadoop root folder and open etc/hadoop/hadoop-env.sh in your editor. At the bottom of the file add the following line. (Make sure the address matches the port you set in the IDE configuration.)

export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"


Now you can start Hadoop in standalone mode; because of suspend=y, the JVM will wait until you attach your IDE's debugger before resuming.
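To verify the setup, kick off any job and watch for the JDWP banner before attaching. A minimal sketch; the wordcount example is part of the stock distribution, but the input/output paths here are just illustrative:

$ cd $HADOOP_HOME
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar wordcount input output
# with suspend=y the JVM blocks at startup and prints:
#   Listening for transport dt_socket at address: 5005
# attach the IDE "Remote" configuration now and the job resumes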


Saturday, October 11, 2014

How to install Hadoop Standalone / Pseudo Distributed mode 2.X.X on Mac with OS X Mavericks


Image source: http://www.javacodegeeks.com/2011/05/hadoop-soft-introduction.html

I was searching for a complete tutorial on installing Hadoop on a Mac so I could play around with it. There are resources on installing Hadoop with Homebrew, the missing package manager for Mac ;). But I did not want to offload all the configuration burden onto it, since I needed to learn this from top to bottom. I played with a few approaches, and here are the configuration steps I followed.

1) Download the Hadoop binary. I used Hadoop 2.5.1, which is the latest at the moment.

http://www.apache.org/dyn/closer.cgi/hadoop/common/

2) Extract the binary and let's call the extracted location HADOOP_HOME.

e.g.: /Users/user1/software/hadoop-2.5.1
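If you prefer the terminal, the download and extraction look like this. A sketch; the archive.apache.org URL is an assumption, so substitute whichever mirror the link above gives you:

$ cd /Users/user1/software
$ curl -O http://archive.apache.org/dist/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz
$ tar -xzf hadoop-2.5.1.tar.gz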

3) Add HADOOP_HOME and JAVA_HOME as environment variables on your system. You can add them to ~/.bashrc or ~/.bash_profile.

Open the file by issuing the following command.


$ vim ~/.bash_profile

Add the following entries and change the paths according to your machine's configuration.


export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/Users/user1/software/hadoop-2.5.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

and then reload the configuration:


$ source ~/.bash_profile
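To confirm the variables took effect, check them in the same shell after sourcing (this assumes $HADOOP_HOME/bin is on PATH as above):

$ echo $HADOOP_HOME    # should print /Users/user1/software/hadoop-2.5.1
$ hadoop version       # first line should read "Hadoop 2.5.1"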
(Follow the next steps if you want to run in Pseudo-Distributed mode. If you do, you will also have to add input files to HDFS and fetch output files from HDFS.)

4) Navigate to HADOOP_HOME and edit the following files as below.

etc/hadoop/core-site.xml:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
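You can verify that Hadoop picked up the new values with the getconf tool that ships with the 2.x CLI:

$ bin/hdfs getconf -confKey fs.defaultFS     # should print hdfs://localhost:9000
$ bin/hdfs getconf -confKey dfs.replication  # should print 1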

5) Setup passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
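To double-check that key-based login works, BatchMode makes ssh fail instead of prompting for a passphrase:

$ ssh -o BatchMode=yes localhost echo ok    # prints "ok" on success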

(You may need to enable Remote Login in System Preferences --> Sharing if you have not already; otherwise the ssh login will be refused.)

6) Starting Hadoop in Pseudo-Distributed mode.
 
Navigate to $HADOOP_HOME

Format the filesystem:

$ bin/hdfs namenode -format


Start NameNode daemon and DataNode daemon:

$ sbin/start-dfs.sh
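If everything came up cleanly, jps (shipped with the JDK) should show the HDFS daemons:

$ jps
# expect NameNode, DataNode and SecondaryNameNode in the output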



The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

Browse the web interface for the NameNode; by default it is available at:

http://localhost:50070/
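With the daemons up, you can run a first job end to end. This follows the stock grep example from the Hadoop single-node docs; the input/output HDFS paths are just illustrative:

$ bin/hdfs dfs -mkdir -p /user/$(whoami)
$ bin/hdfs dfs -put etc/hadoop input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input output 'dfs[a-z.]+'
$ bin/hdfs dfs -cat output/*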

So good luck with all your MapReduce jobs. :)