Tuesday, 20 May 2014

How To Work Out The Naive Bayes Algorithm

A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. The main advantage of Naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification, because the attributes are assumed to be independent of one another.
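For reference, Bayes' theorem relates the posterior probability of a class C given the observed attributes X to quantities that can be estimated from the training data:

P(C | X) = P(X | C) * P(C) / P(X)

Since P(X) is the same for every class, it can be dropped when we only need to compare classes against each other.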

In general, machine learning algorithms need to be trained for supervised learning tasks such as classification and prediction.

Training means fitting the model on particular inputs so that it can later be tested on unknown inputs (which it has never seen before) and classify or predict them based on what it has learned. This is how most machine learning techniques, such as neural networks, SVMs, and Bayesian classifiers, work.

How to Apply Naive Bayes to Predict an Outcome
Let's try it out using an example.

(The training data used throughout this post is the well-known buys_computer dataset; its counts match all of the probabilities computed below.)

age          income   student   credit_rating   buys_computer
youth        high     no        fair            no
youth        high     no        excellent       no
middle_aged  high     no        fair            yes
senior       medium   no        fair            yes
senior       low      yes       fair            yes
senior       low      yes       excellent       no
middle_aged  low      yes       excellent       yes
youth        medium   no        fair            no
youth        low      yes       fair            yes
senior       medium   yes       fair            yes
youth        medium   yes       excellent       yes
middle_aged  medium   no        excellent       yes
middle_aged  high     yes       fair            yes
senior       medium   no        excellent       no
In the above training data we have two class labels for buys_computer: no and yes. And we know four attributes:


1. Whether the age is youth, middle_aged or senior.
2. Whether the income is high, low or medium.
3. Whether the person is a student or not.
4. Whether the credit rating is excellent or fair.


There are a few quantities to pre-compute from the training dataset for future predictions.


Prior Probabilities
-------------------

P(yes) = 9/14 = 0.643
  Out of the 14 training tuples (9 yes + 5 no), 9 have the class label "yes".
P(no) = 5/14 = 0.357
  Out of the same 14 training tuples, 5 have the class label "no".

Probability of Likelihood
-------------------------

P(youth|yes) = 2/9 = 0.222
  Given that the class label is "yes", the universe is 9 tuples; 2 of them are youth.
P(youth|no) = 3/5 = 0.600
...
...
P(fair|yes) = 6/9 = 0.667
P(fair|no) = 2/5 = 0.400
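All of these numbers come from simple counting over the 14 training tuples. As an illustration (this snippet is not part of the original post, and the class name NaiveBayesEstimates is just a placeholder), the priors and likelihoods above could be computed like this:

public class NaiveBayesEstimates {
    public static void main(String[] args) {
        int total = 14;          // all training tuples
        int yes = 9, no = 5;     // tuples per class label

        // Priors: class count divided by the total count
        double pYes = (double) yes / total;   // 9/14 = 0.643
        double pNo  = (double) no  / total;   // 5/14 = 0.357

        // Likelihoods: attribute-value count within a class divided by the class count
        double pYouthGivenYes = 2.0 / yes;    // 2/9 = 0.222
        double pYouthGivenNo  = 3.0 / no;     // 3/5 = 0.600

        System.out.printf("P(yes)=%.3f  P(no)=%.3f%n", pYes, pNo);
        System.out.printf("P(youth|yes)=%.3f  P(youth|no)=%.3f%n",
                pYouthGivenYes, pYouthGivenNo);
    }
}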

How to classify an outcome



Let's say we are given a new tuple whose class (buys_computer) is unknown. We are told that its properties are


X => age = youth, income = medium, student = yes, credit rating = fair

We need to 

 Maximize P(X|Ci)P(Ci), for i = 1, 2

P(Ci), the prior probability of each class, has already been computed above from the training tuples.
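Under the naive independence assumption, the class-conditional probability P(X|Ci) factorizes into a product of per-attribute likelihoods, so the quantity to compare for each class is

P(X|Ci) * P(Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci) * P(Ci)

Plugging in the pre-computed values for each class: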



P(yes | youth, medium, student = yes, fair)
      ∝ P(youth|yes) * P(medium|yes) * P(student=yes|yes) * P(fair|yes) * P(yes)
      = (0.222 * 0.444 * 0.667 * 0.667) * 0.643
      = 0.028

P(no | youth, medium, student = yes, fair)
      ∝ P(youth|no) * P(medium|no) * P(student=yes|no) * P(fair|no) * P(no)
      = (0.600 * 0.400 * 0.200 * 0.400) * 0.357
      = 0.007

Since 0.028 > 0.007, we classify this (youth, medium, student = yes, fair) tuple as likely to be "yes".


Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
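As a quick sanity check, here is a small self-contained Java sketch (not part of the original post; BuysComputerPrediction is a made-up class name) that reproduces the two scores above from the pre-computed probabilities and picks the larger one:

public class BuysComputerPrediction {
    public static void main(String[] args) {
        // Probabilities taken from the tables above for
        // X = (age = youth, income = medium, student = yes, credit_rating = fair)
        double scoreYes = 0.222 * 0.444 * 0.667 * 0.667 * 0.643; // P(X|yes) * P(yes) = 0.028
        double scoreNo  = 0.600 * 0.400 * 0.200 * 0.400 * 0.357; // P(X|no)  * P(no)  = 0.007

        String prediction = scoreYes > scoreNo ? "yes" : "no";
        System.out.printf("score(yes)=%.3f  score(no)=%.3f  ->  buys_computer = %s%n",
                scoreYes, scoreNo, prediction);
    }
}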


Saturday, 17 May 2014

Count Frequency Of Values In A Column Using Apache Pig


There may be situations where you need to count the occurrences of a value in a field.
Let this be the sample input bag.


user_id   course_name user_name
1           Social      Anju
2           Maths       Malu
1           English     Anju
1           Maths       Anju

Say we need to calculate the number of occurrences of each user_name:
Anju 3
Malu 1

In order to achieve this, the COUNT built-in function can be used.


COUNT Function in Apache Pig


The COUNT function computes the number of elements in a bag.
For group counts a preceding GROUP BY statement is required, and for global counts a GROUP ALL statement is required.

The basic idea to do the above example is to group by user_name and count the tuples in the bag.


--count.pig

 userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
             (user_id:long, course_name:chararray, user_name:chararray);
 groupedByUser = GROUP userAlias BY user_name;
 counted = FOREACH groupedByUser GENERATE group AS user_name, COUNT(userAlias) AS cnt;
 result = FOREACH counted GENERATE user_name, cnt;
 STORE result INTO '/home/sreeveni/myfiles/pig/OUT/count';

The COUNT function ignores NULLs; that is, a tuple in the bag will not be counted if the first field in that tuple is NULL.
COUNT_STAR can be used instead to count all tuples, including those with NULL values.




Monday, 12 May 2014

Configuring PasswordLess SSH for Apache Hadoop


In pseudo-distributed mode, we have to start daemons, and to do that, we need to have SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and fully distributed modes: it merely starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process. Pseudo-distributed mode is just a special case of fully distributed mode in which the (single) host is localhost, so we need to make sure that we can SSH to localhost and log in without having to enter a password.

If you cannot ssh to localhost without a passphrase, execute the following commands:

unmesha@unmesha-hadoop-virtual-machine:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/unmesha/.ssh/id_rsa): [press enter]
Enter passphrase (empty for no passphrase): [press enter]
Enter same passphrase again: [press enter]
Your identification has been saved in /home/unmesha/.ssh/id_rsa.
Your public key has been saved in /home/unmesha/.ssh/id_rsa.pub.
The key fingerprint is:
61:c5:33:9f:53:1e:4a:5f:e9:4d:19:87:55:46:d3:6b unmesha@unmesha-virtual-machine
The key's randomart image is:
+--[ RSA 2048]----+
|         ..    *%|
|         .+ . ++*|
|        o  = *.+o|
|       . .  = oE.|
|        S    ..  |
|                 |
|                 |
|                 |
|                 |
+-----------------+

unmesha@unmesha-hadoop-virtual-machine:~$ ssh-copy-id localhost
unmesha@localhost's password: 
Now try logging into the machine, with "ssh 'localhost'", and check in:

  ~/.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

Now you will be able to ssh without a password:

unmesha@unmesha-hadoop-virtual-machine:~$ ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local
unmesha@unmesha-virtual-machine:~$ 

Happy Hadooping ...

Sunday, 4 May 2014

Map-Only Jobs In Hadoop


There may be cases where a map-only job is needed and there is no reducer to execute. Here the mapper does all of its work on its InputSplit and there is nothing for a reducer to do. This can be achieved by setting job.setNumReduceTasks() to zero in the job configuration.

Job job = new Job(getConf(), "Map-Only Job");
job.setJarByClass(MaponlyDriver.class);

// With TextInputFormat and the identity Mapper, the map output
// (and therefore the final output) is LongWritable/Text.
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
/*
 * Set the number of reducers to 0 to make this a map-only job.
 */
job.setNumReduceTasks(0);

job.setMapperClass(Mapper.class); // identity mapper

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

boolean success = job.waitForCompletion(true);
return success ? 0 : 1;

This sets the number of reduce tasks to 0 and turns off the reduce phase.

job.setNumReduceTasks(0);

So the number of output files will be equal to the number of mappers, and the output files will be named part-m-00000, part-m-00001, and so on.

And once the number of reduce tasks is set to zero, the output will be unsorted, since no shuffle and sort phase runs.

If we do not set this property, a single identity reducer (the default Reducer class) is executed, which simply emits each incoming key and value unchanged, and the output file will be named part-r-00000.



Happy Hadooping ...

Saturday, 3 May 2014

Hadoop Installation Using Cloudera Package - Pseudo Distributed Mode (Single Node)

[Previous Post]

Hadoop can also be installed using Cloudera's packages, in fewer and easier steps. The difference is that Cloudera packages Apache Hadoop and some ecosystem projects together, with the configuration already pointing to localhost, so we do not need to edit the configuration files ourselves.

Installation using Cloudera Package.

Prerequisite

1. Java


Installation Steps

Step 1: Set JAVA_HOME in ~/.bashrc

unmesha@unmesha-hadoop-virtual-machine:~$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

Check your current location of java 

unmesha@unmesha-hadoop-virtual-machine:~$ sudo update-alternatives --config java
[sudo] password for unmesha: 
There is only one alternative in link group java: /usr/lib/jvm/java-7-oracle/jre/bin/java
Nothing to configure.

Set JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc 

Step 2: Download the package for your system from here.


Step 3: Install the Cloudera repository package

unmesha@unmesha-hadoop-virtual-machine:~$sudo dpkg -i cdh4-repository_1.0_all.deb


Step 4: Install Hadoop

unmesha@unmesha-hadoop-virtual-machine:~$sudo apt-get update 
unmesha@unmesha-hadoop-virtual-machine:~$sudo apt-get install hadoop-0.20-conf-pseudo


Step 5: Format Namenode

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hdfs namenode -format


Step 6: Start HDFS

unmesha@unmesha-hadoop-virtual-machine:~$for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done


Step 7: Create the /tmp Directory

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -mkdir /tmp 
unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -chmod -R 1777 /tmp


Step 8: Create the MapReduce system directories

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred


Step 9: Verify the HDFS File Structure

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -ls -R /


Step 10: Start MapReduce

unmesha@unmesha-hadoop-virtual-machine:~$for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done


Step 11: Set up user directory

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -mkdir /user/<your username>
unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -chown <user> /user/<your username> 
unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -mkdir /user/unmesha/new


Step 12: Run the grep example (you can also try out the wordcount example)

unmesha@unmesha-hadoop-virtual-machine:~$/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

Step 13: You can also stop the services

unmesha@unmesha-hadoop-virtual-machine:~$for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done

unmesha@unmesha-hadoop-virtual-machine:~$for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x stop ; done


Happy Hadooping ...


Wednesday, 30 April 2014

How To Create Tables In HIVE


Hive provides data warehousing facilities on top of an existing Hadoop cluster, along with an SQL-like interface.

You can create tables in two different ways.

1. Create an external table for local data

CREATE EXTERNAL TABLE students
(id INT, name STRING, batch STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  -- supply the field delimiter
LOCATION '/home/unmesha/students';              -- local FS path

2. Create a table for HDFS data

CREATE TABLE students
(id INT, name STRING, batch STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  -- supply the field delimiter
LOCATION '/user/unmesha/students';              -- HDFS path


Hadoop Installation For Beginners - Pseudo Distributed Mode (Single Node Cluster)


Hadoop is an open-source software framework capable of storing and processing large amounts of data (big data). The underlying ideas were developed at Google in its early days. Hadoop started as part of the open-source Nutch search engine project and was later spun out of Nutch as a project of its own.

Hadoop is composed of two components:
1. HDFS for storage
2. MapReduce for processing data in HDFS

Hadoop can be installed in three different ways:

1. Standalone Mode

Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
   $ mkdir input
   $ cp conf/*.xml input
   $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
   $ cat output/*

2. Pseudo Distributed Mode or Single Node Cluster

Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

3. Multi Node Cluster

Hadoop also runs fully distributed on clusters ranging from a few nodes to extremely large clusters with thousands of nodes.

The steps below explain how to install Hadoop in pseudo-distributed mode.


Prerequisite

1. Java (Latest Version)

> sudo add-apt-repository ppa:webupd8team/java
> sudo apt-get update
> sudo apt-get install oracle-java7-installer

2. SSH

> apt-get install ssh
> ssh localhost
[sudo]password:
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local


Configuring Passwordless SSH

In pseudo-distributed mode, we have to start daemons, and to do that, we need to have SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and fully distributed modes: it merely starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process. Pseudo-distributed mode is just a special case of fully distributed mode in which the (single) host is localhost, so we need to make sure that we can SSH to localhost and log in without having to enter a password.

If you cannot ssh to localhost without a passphrase, execute the following commands:

unmesha@unmesha-hadoop-virtual-machine:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/unmesha/.ssh/id_rsa): [press enter]
Enter passphrase (empty for no passphrase): [press enter]
Enter same passphrase again: [press enter]
Your identification has been saved in /home/unmesha/.ssh/id_rsa.
Your public key has been saved in /home/unmesha/.ssh/id_rsa.pub.
The key fingerprint is:
61:c5:33:9f:53:1e:4a:5f:e9:4d:19:87:55:46:d3:6b unmesha@unmesha-virtual-machine
The key's randomart image is:
+--[ RSA 2048]----+
|         ..    *%|
|         .+ . ++*|
|        o  = *.+o|
|       . .  = oE.|
|        S    ..  |
|                 |
|                 |
|                 |
|                 |
+-----------------+

unmesha@unmesha-hadoop-virtual-machine:~$ ssh-copy-id localhost
unmesha@localhost's password: 
Now try logging into the machine, with "ssh 'localhost'", and check in:

  ~/.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

Now you will be able to ssh without a password:

unmesha@unmesha-hadoop-virtual-machine:~$ ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local
unmesha@unmesha-virtual-machine:~$ 


Setting JAVA_HOME


Before running Hadoop, we need to tell it where Java is located on your system. If you have the JAVA_HOME environment variable set to point to a suitable Java installation, that will be used, and you don't have to configure anything further. Otherwise, you can set the Java installation that Hadoop uses by editing a configuration file (hadoop-env.sh, covered below) and specifying the JAVA_HOME variable.

unmesha@unmesha-hadoop-virtual-machine:~$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Check your current location of java 
unmesha@unmesha-hadoop-virtual-machine:~$ sudo update-alternatives --config java
[sudo] password for unmesha: 
There is only one alternative in link group java: /usr/lib/jvm/java-7-oracle/jre/bin/java
Nothing to configure.

If you have only one alternative, it will display as above; otherwise this command lists all the alternatives, with a * symbol next to the currently selected installation.
Next copy the path (the part before /jre/bin) and set it in ~/.bashrc:
unmesha@unmesha-hadoop-virtual-machine:~$ vi ~/.bashrc
Note: If you are unable to edit the file because vi is not installed, install it first:
apt-get install vim

Then add JAVA_HOME as the last line:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Open another terminal or reload the profile:

unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc 
Check that you are able to echo JAVA_HOME:
unmesha@unmesha-hadoop-virtual-machine:~$ echo $JAVA_HOME
/usr/lib/jvm/java-7-oracle

Hadoop Installation


Download the latest stable version of Hadoop from the Apache Mirrors.

Downloading: hadoop-2.3.0.tar.gz


Untarring the file

unmesha@unmesha-hadoop-virtual-machine:~$ tar xvfz hadoop-2.3.0.tar.gz 
unmesha@unmesha-hadoop-virtual-machine:~$ cd hadoop-2.3.0/
unmesha@unmesha-hadoop-virtual-machine:~/hadoop-2.3.0$ ls
bin  include  libexec      NOTICE.txt  sbin
etc  lib      LICENSE.txt  README.txt  share

Move hadoop-2.3.0 to /usr/local/hadoop
unmesha@unmesha-hadoop-virtual-machine:~$ sudo mv hadoop-2.3.0 /usr/local/hadoop

Add the following lines to ~/.bashrc:
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc 

Configuration

We need to configure 5 files  

1. core-site.xml  
2. mapred-site.xml
3. hdfs-site.xml
4. hadoop-env.sh
5. yarn-site.xml

1. core-site.xml

unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>
</configuration>

2. mapred-site.xml

By default, the /usr/local/hadoop/etc/hadoop/ folder contains a mapred-site.xml.template file, which has to be copied to mapred-site.xml. This file is used to specify which framework is being used for MapReduce.

unmesha@unmesha-hadoop-virtual-machine:~/$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
 </configuration>

Create two folders for the namenode and the datanode (don't use sudo for mkdir, so that your user owns them):


mkdir -p /usr/local/hadoop_store/hdfs/namenode
mkdir -p /usr/local/hadoop_store/hdfs/datanode

3. hdfs-site.xml

unmesha@unmesha-hadoop-virtual-machine:~/$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
 <configuration>
 <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>

4. hadoop-env.sh

unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

5. yarn-site.xml


unmesha@unmesha-hadoop-virtual-machine:~/$ vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
 <configuration>
 <property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
 </property>
 <property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
</configuration>

Now Format the namenode (Only done once)

unmesha@unmesha-hadoop-virtual-machine:~/$hdfs namenode -format

You will see something like this:

......14/04/30 12:37:42 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/04/30 12:37:42 INFO util.ExitUtil: Exiting with status 0
14/04/30 12:37:42 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at unmesha-virtual-machine/127.0.1.1
************************************************************/

Now we will start all the daemons:

unmesha@unmesha-hadoop-virtual-machine:~/$start-dfs.sh
unmesha@unmesha-hadoop-virtual-machine:~/$start-yarn.sh
To check which daemons are running, type "jps":
unmesha@unmesha-hadoop-virtual-machine:~/$jps
2243 NodeManager
2314 ResourceManager
1923 DataNode
2895 SecondaryNameNode
1234 Jps
1788 NameNode

In Hadoop there are two HDFS locations you will typically work with:

1. User's HDFS

 (Optional)

 Set up your user's home directory in HDFS. (Try with the sudo -u hdfs commands, or the plain hadoop fs commands.)

sudo -u hdfs hadoop fs -mkdir /user/<your username> 
sudo -u hdfs hadoop fs -chown <user> /user/<your username> 
  OR
hadoop fs -mkdir /user/<your username> 
hadoop fs -chown <user> /user/<your username> 

2. Root HDFS

hadoop fs -ls /

You can put your files in either location.

To put your files into your user's HDFS home directory, just leave the destination parameter empty (it automatically points to the user's HDFS home directory).



unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -put mydata 
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -ls 

Let's run an example.

Every programming language has a "Hello World" program.

Similarly, Hadoop has a "Hello World" program known as Word Count. Hadoop jobs basically run with two paths:


1. one directory or set of files as input, and

2. another, non-existing directory as the output path (Hadoop creates the output directory automatically).


So for the WordCount program the input will be a text file, and the output folder will contain the word counts for that file. You can copy the text file from Here, or else copy some paragraphs of text from Google or any other place, name that file, and place it in a folder.

unmesha@unmesha-hadoop-virtual-machine:~/$cd
unmesha@unmesha-hadoop-virtual-machine:~/$mkdir mydata
unmesha@unmesha-hadoop-virtual-machine:~/$cd mydata
unmesha@unmesha-hadoop-virtual-machine:~/mydata$vi input
# Paste into this input file

Now your input folder is ready. To run any Hadoop job we must first place the inputs in HDFS, since MapReduce programs read their inputs only from HDFS.
So now we need to put mydata into HDFS.

unmesha@unmesha-hadoop-virtual-machine:~/$cd
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -put mydata /

Hadoop shell commands are executed using the "hadoop" command.

What the above command does is put mydata into HDFS. The general form is
hadoop fs -put local/path hdfs/path

Note: If you have any issues copying the directory to HDFS, it is because of missing permissions. Copy or move your mydata directory to /tmp first:

unmesha@unmesha-hadoop-virtual-machine:~/$mv mydata /tmp

Then try the copy again with the new input location as the source.

Now we will run the wordcount program from hadoop-mapreduce-examples-2.3.0.jar, which contains several examples.

Any MapReduce program we write is packaged as a jar, and then the job is submitted to the cluster.

Basic command to run MapReduce Jobs

hadoop jar jarname.jar MainClass indir outdir

Run wordcount example

unmesha@unmesha-hadoop-virtual-machine:~/$cd
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount /mydata /output1

After the job finishes, look into /output1 to view the result:

unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -ls -R /output1  
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -cat /output1/part-r-00000    # This shows the wordcount result

For any job the result will be stored in part files.

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are available by default.
http://localhost:50070/ – web UI of the NameNode daemon
http://localhost:8088/ – web UI of the YARN ResourceManager (replaces the JobTracker UI from MRv1)
http://localhost:8042/ – web UI of the YARN NodeManager (replaces the TaskTracker UI from MRv1)

You can also track a running job using the URL that is displayed in the console while the job runs.


14/04/30 12:57:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1398885280814_0002
14/04/30 12:57:19 INFO impl.YarnClientImpl: Submitted application application_1398885280814_0002
14/04/30 12:57:21 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1398885280814_0002/
14/04/30 12:57:21 INFO mapreduce.Job: Running job: job_1398885280814_0002

Killing a Job

unmesha@unmesha-hadoop-virtual-machine:~/$cd
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop job -list
job_1398885280814_0002
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop job -kill job_1398885280814_0002
14/04/30 14:02:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/04/30 14:03:06 INFO impl.YarnClientImpl: Killed application application_1398885280814_0002
Killed job job_1398885280814_0002

To stop the single node cluster


unmesha@unmesha-hadoop-virtual-machine:~/$stop-all.sh

Hadoop can also be installed using Cloudera's packages in fewer steps; see the post Hadoop Installation Using Cloudera Package - Pseudo Distributed Mode (Single Node) above.


Happy Hadooping ...