## Tuesday, 20 May 2014

### How To Work Out The Naive Bayes Algorithm

#### Prior Probabilities

```Prior Probabilities
-------------------

P(yes) = 9/14 = 0.643
Out of the 14 training tuples (9 "yes" + 5 "no"), 9 have the class label "yes".
P(no) = 5/14 = 0.357
Out of the 14 training tuples (9 "yes" + 5 "no"), 5 have the class label "no".
```

#### Probability of Likelihood

```Probability of Likelihood
-------------------------

P(youth|yes) = 2/9 = 0.222
Given that the class label is "yes", the universe is the 9 "yes" tuples; 2 of them are youth.
P(youth|no) = 3/5 = 0.600
Given that the class label is "no", the universe is the 5 "no" tuples; 3 of them are youth.
...
P(fair|yes) = 6/9 = 0.667
P(fair|no) = 2/5 = 0.400
```

#### We need to maximize P(X|Ci)P(Ci), for i = 1, 2

P(Ci), the prior probability of each class, can be computed from the training tuples (see above). For the unseen tuple X = (age = youth, income = medium, student = yes, credit_rating = fair):

```
P(yes|X) ∝ P(X|yes) * P(yes)
= P(youth|yes) * P(medium|yes) * P(student=yes|yes) * P(fair|yes) * P(yes)
= (0.222 * 0.444 * 0.667 * 0.667) * 0.643
= 0.028

P(no|X) ∝ P(X|no) * P(no)
= P(youth|no) * P(medium|no) * P(student=yes|no) * P(fair|no) * P(no)
= (0.600 * 0.400 * 0.200 * 0.400) * 0.357
= 0.007

Since 0.028 > 0.007, naive Bayes predicts the class "yes" for X.
```
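The same calculation as a minimal Python sketch. The counts are copied from the probabilities worked out above; nothing else about the training data is assumed.

```
# Naive Bayes scoring for X = (youth, medium, student=yes, fair),
# using the counts behind the probabilities computed above.
priors = {"yes": 9 / 14, "no": 5 / 14}
class_totals = {"yes": 9, "no": 5}

# counts[cls][value] = number of training tuples of class cls with that value
counts = {
    "yes": {"youth": 2, "medium": 4, "student=yes": 6, "fair": 6},
    "no":  {"youth": 3, "medium": 2, "student=yes": 1, "fair": 2},
}

X = ["youth", "medium", "student=yes", "fair"]

scores = {}
for cls in priors:
    score = priors[cls]
    for value in X:
        score *= counts[cls][value] / class_totals[cls]
    scores[cls] = score

print(scores)                        # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))   # yes
```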

## Saturday, 17 May 2014

### Count Frequency Of Values In A Column Using Apache Pig

#### There may be situations where you need to count the occurrences of a value in a field. Let this be the sample input:

```user_id   course_name user_name
1           Social      Anju
2           Maths       Malu
1           English     Anju
1           Maths       Anju```

Say we need to calculate the number of occurrences of each user_name:
```Anju 3
Malu 1```

#### The COUNT function computes the number of elements in a bag. For per-group counts a preceding GROUP BY statement is needed, and for global counts a GROUP ALL statement. The basic idea for the above example is to group by user_name and count the tuples in each bag.

```--count.pig

-- replace the LOAD path with the location of your input data
userAlias = LOAD '/path/to/input' USING PigStorage('\t')
            AS (user_id:long, course_name:chararray, user_name:chararray);
groupedByUser = GROUP userAlias BY user_name;
counted = FOREACH groupedByUser GENERATE group AS user_name, COUNT(userAlias) AS cnt;
result = FOREACH counted GENERATE user_name, cnt;
STORE result INTO '/home/sreeveni/myfiles/pig/OUT/count';```
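For intuition, the same group-and-count logic can be sketched in a few lines of plain Python over the tab-separated sample above (the input file name here is only a placeholder):

```
from collections import Counter

# Count how many times each user_name appears in a tab-separated file.
# "input.tsv" is a placeholder for the sample data shown above.
counts = Counter()
with open("input.tsv") as f:
    for line in f:
        user_id, course_name, user_name = line.rstrip("\n").split("\t")
        counts[user_name] += 1

for user_name, cnt in counts.items():
    print(user_name, cnt)   # e.g. Anju 3, Malu 1
```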

## Monday, 12 May 2014

#### In pseudo-distributed mode, we have to start daemons, and to do that, we need to have SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and fully distributed modes: it merely starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process. Pseudo-distributed mode is just a special case of fully distributed mode in which the (single) host is localhost, so we need to make sure that we can SSH to localhost and log in without having to enter a password. If you cannot ssh to localhost without a passphrase, execute the following commands:

```unmesha@unmesha-hadoop-virtual-machine:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/unmesha/.ssh/id_rsa): [press enter]
Enter passphrase (empty for no passphrase): [press enter]
Enter same passphrase again: [press enter]
Your identification has been saved in /home/unmesha/.ssh/id_rsa.
Your public key has been saved in /home/unmesha/.ssh/id_rsa.pub.
The key fingerprint is:
61:c5:33:9f:53:1e:4a:5f:e9:4d:19:87:55:46:d3:6b unmesha@unmesha-virtual-machine
The key's randomart image is:
+--[ RSA 2048]----+
|         ..    *%|
|         .+ . ++*|
|        o  = *.+o|
|       . .  = oE.|
|        S    ..  |
|                 |
|                 |
|                 |
|                 |
+-----------------+

Now try logging into the machine, with "ssh 'localhost'", and check in:

~/.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.```
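Note that the public key also has to be added to ~/.ssh/authorized_keys; the last lines of the listing above are the output of that step, although the command itself is not shown. One common way is `ssh-copy-id localhost`, or equivalently `cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys`.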

#### Now you will be able to ssh without a password.

```unmesha@unmesha-hadoop-virtual-machine:~$ ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)

* Documentation:  https://help.ubuntu.com/

Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local
unmesha@unmesha-virtual-machine:~$ ```

## Saturday, 3 May 2014

### Hadoop Installation Using Cloudera Package - Pseudo Distributed Mode (Single Node)

[Previous Post]

#### Step 1: Set Java home in /etc/profile

```unmesha@unmesha-hadoop-virtual-machine:~$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)```

#### Check the current location of Java

```unmesha@unmesha-hadoop-virtual-machine:~$ sudo update-alternatives --config java
There is only one alternative in link group java: /usr/lib/jvm/java-7-oracle/jre/bin/java
Nothing to configure.```

#### Set JAVA_HOME

`export JAVA_HOME=/usr/lib/jvm/java-7-oracle`
`unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc `

#### Step 3: Install the CDH repository package

`unmesha@unmesha-hadoop-virtual-machine:~$ sudo dpkg -i cdh4-repository_1.0_all.deb`

```unmesha@unmesha-hadoop-virtual-machine:~$ sudo apt-get update
```
#### Step 5: Format Namenode

`unmesha@unmesha-hadoop-virtual-machine:~$ sudo -u hdfs hdfs namenode -format`

#### Step 6: Start HDFS

```unmesha@unmesha-hadoop-virtual-machine:~$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
```

#### Step 7: Create the /tmp Directory

```unmesha@unmesha-hadoop-virtual-machine:~$ sudo -u hdfs hadoop fs -mkdir /tmp
```
#### Step 8: Create the MapReduce system directories

```unmesha@unmesha-hadoop-virtual-machine:~$ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

```

#### Step 9: Verify the HDFS File Structure

`unmesha@unmesha-hadoop-virtual-machine:~$ sudo -u hdfs hadoop fs -ls -R /`

#### Step 10: Start MapReduce

```unmesha@unmesha-hadoop-virtual-machine:~$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done
```

#### Step 11: Set up user directory

```unmesha@unmesha-hadoop-virtual-machine:~$ sudo -u hdfs hadoop fs -mkdir /user/<your username>
unmesha@unmesha-hadoop-virtual-machine:~$ sudo -u hdfs hadoop fs -chown <user> /user/<your username>
```
#### Step 12: Run the grep example (you can also try the wordcount example)

`unmesha@unmesha-hadoop-virtual-machine:~$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'`
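Note that the input directory must already exist in HDFS (relative paths resolve against your HDFS home directory, /user/<your username>) and contain some files before the job is run. Once the job completes, the result can be printed with something like `hadoop fs -cat output/part-*` (the exact part-file names depend on the job).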

#### Step 13: You can also stop the services

```unmesha@unmesha-hadoop-virtual-machine:~$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done

unmesha@unmesha-hadoop-virtual-machine:~$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x stop ; done```

## Wednesday, 30 April 2014

### How To Create Tables In HIVE

#### 1. Create External table for local data

```CREATE EXTERNAL TABLE students
(id INT, name STRING, batch STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' -- supply delimiter
LOCATION '/home/unmesha/students'; -- local FS
```
#### 2. Create Table for HDFS data

```CREATE TABLE students
(id INT, name STRING, batch STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' -- supply delimiter
LOCATION '/user/unmesha/students'; -- HDFS path```
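Once a table points at an existing directory like this, the files already present there can be queried directly, for example with `SELECT * FROM students LIMIT 10;` (shown only as an illustration).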

### Hadoop Installation For Beginners - Pseudo Distributed Mode (Single Node Cluster)

Prerequisite

#### 1. Java

```> sudo add-apt-repository ppa:webupd8team/java
> sudo apt-get update
> sudo apt-get install oracle-java7-installer```

#### 2. SSH

```> sudo apt-get install ssh
> ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)
* Documentation:  https://help.ubuntu.com/
Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local
```

In pseudo-distributed mode, we have to start daemons, and to do that, we need to have SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and fully distributed modes: it merely starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process. Pseudo-distributed mode is just a special case of fully distributed mode in which the (single) host is localhost, so we need to make sure that we can SSH to localhost and log in without having to enter a password.

#### Once passwordless SSH is set up, you will be able to ssh without a password

```unmesha@unmesha-hadoop-virtual-machine:~$ ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)

* Documentation:  https://help.ubuntu.com/

Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local
unmesha@unmesha-virtual-machine:~$ ```

#### Before running Hadoop, we need to tell Hadoop where Java is located on your system. If you have the JAVA_HOME environment variable set to point to a suitable Java installation, that will be used, and you don’t have to configure anything further. Otherwise, you can set the Java installation that Hadoop uses by editing a configuration file (hadoop-env.sh) and specifying the JAVA_HOME variable.

```unmesha@unmesha-hadoop-virtual-machine:~$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)```
Check the current location of Java:
```unmesha@unmesha-hadoop-virtual-machine:~$ sudo update-alternatives --config java
There is only one alternative in link group java: /usr/lib/jvm/java-7-oracle/jre/bin/java
Nothing to configure.```

If you have only one alternative, it will display as above; otherwise this command lists all the alternatives, with a * marking the currently selected installation.
Next, copy the path (the part before /jre/bin) and set it in ~/.bashrc:
`unmesha@unmesha-hadoop-virtual-machine:~$ vi ~/.bashrc`
Note: If you are unable to edit the file, install vim first:
`apt-get install vim`

#### Then set JAVA_HOME on the last line:

`export JAVA_HOME=/usr/lib/jvm/java-7-oracle`

#### Open a new terminal or refresh the profile

`unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc `
Check that you can echo JAVA_HOME:
```unmesha@unmesha-hadoop-virtual-machine:~$ echo $JAVA_HOME
/usr/lib/jvm/java-7-oracle```

#### Untarring the file

```unmesha@unmesha-hadoop-virtual-machine:~$ tar xvfz hadoop-2.3.0.tar.gz
bin  include  libexec      NOTICE.txt  sbin
```

`unmesha@unmesha-hadoop-virtual-machine:~$ sudo mv hadoop-2.3.0 /usr/local/hadoop`

Set the below contents in ~/.bashrc:
```export HADOOP_INSTALL=/usr/local/hadoop
# add the Hadoop binaries to PATH so that the hdfs, hadoop, and start-*.sh commands used below can be found
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
```
```unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc
```

#### Configuration

We need to configure 5 files

1. core-site.xml
2. mapred-site.xml
3. hdfs-site.xml
4. hadoop-env.sh
5. yarn-site.xml

#### 1. core-site.xml

```unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
```

#### 2. mapred-site.xml

By default, the /usr/local/hadoop/etc/hadoop/ folder contains the /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file, which has to be copied/renamed to mapred-site.xml. This file is used to specify which framework is used for MapReduce.

```unmesha@unmesha-hadoop-virtual-machine:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
```

#### Create two folders for the namenode and datanode (don't use sudo for mkdir)

```mkdir -p /usr/local/hadoop_store/hdfs/namenode
mkdir -p /usr/local/hadoop_store/hdfs/datanode
```
#### 3. hdfs-site.xml

```unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>```

#### 4. hadoop-env.sh

```unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle```

#### 5. yarn-site.xml

```unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>```

#### Now Format the namenode (Only done once)

`unmesha@unmesha-hadoop-virtual-machine:~$ hdfs namenode -format`

#### You will see some thing like this

```......14/04/30 12:37:42 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/04/30 12:37:42 INFO util.ExitUtil: Exiting with status 0
14/04/30 12:37:42 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at unmesha-virtual-machine/127.0.1.1
************************************************************/```

#### Now we will start all the daemons

```unmesha@unmesha-hadoop-virtual-machine:~$ start-dfs.sh
unmesha@unmesha-hadoop-virtual-machine:~$ start-yarn.sh    # also start YARN, so the daemons listed by jps below come up
```
To check which daemons are running, type "jps":
```unmesha@unmesha-hadoop-virtual-machine:~$ jps
2243 NodeManager
2314 ResourceManager
1923 DataNode
2895 SecondaryNameNode
1234 Jps
1788 NameNode```
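If any of these daemons are missing, check the corresponding log files under $HADOOP_INSTALL/logs (here /usr/local/hadoop/logs) for the reason.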

In Hadoop there are 2 locations:

#### 1. User HDFS

Your HDFS home directory, /user/<your username>. Create it if it does not already exist:

```sudo -u hdfs hadoop fs -mkdir /user/<your username>
sudo -u hdfs hadoop fs -chown <user> /user/<your username>
OR
hadoop fs -mkdir /user/<your username>
hadoop fs -chown <user> /user/<your username>
```

#### 2. Root HDFS

``hadoop fs -ls /``

You can put your files in any location.

* To put your files into the user HDFS, just leave the last parameter empty (it automatically points to the user's HDFS home):

```unmesha@unmesha-hadoop-virtual-machine:~$ hadoop fs -put mydata
```

Let's run an example.

#### For the WordCount program, the input is a text file and the output folder will contain the word counts for that file. You can copy the text file from Here, or copy some paragraphs of text from Google or anywhere else, name the file, and place it in a folder.

```unmesha@unmesha-hadoop-virtual-machine:~$ cd
unmesha@unmesha-hadoop-virtual-machine:~$ mkdir mydata
unmesha@unmesha-hadoop-virtual-machine:~$ vi mydata/file1    # any file name will do
# Paste your sample text into this input file```
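For intuition, this is roughly what the WordCount job computes, sketched locally in plain Python (the input file path simply follows the step above and is only a placeholder):

```
from collections import Counter

# Split the input text into words and count each word's occurrences,
# which is what the MapReduce WordCount example produces for its input.
with open("mydata/file1") as f:      # placeholder path from the step above
    counts = Counter(f.read().split())

for word, count in counts.items():
    print(word, count)
```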

#### Now your input folder is ready. To run any Hadoop job we must place our inputs in HDFS, since MapReduce programs read their inputs only from HDFS. So now we need to put mydata into HDFS.

```unmesha@unmesha-hadoop-virtual-machine:~$ hadoop fs -put mydata /
```

#### Hadoop shell commands are executed using "hadoop". What the above command does is put mydata into HDFS: hadoop fs -put local/path hdfs/path. Note: if you have any issues copying the directory to HDFS, it is because of "No permission". Copy or move your mydata directory to /tmp:

`unmesha@unmesha-hadoop-virtual-machine:~$ mv mydata /tmp`

Then try the copy again, with the new location as the source.

Now we will run the wordcount program from hadoop-mapreduce-examples-2.3.0.jar, which contains several examples.

#### Basic command to run MapReduce Jobs

`hadoop jar jarname.jar MainClass indir outdir`

#### Run wordcount example

```unmesha@unmesha-hadoop-virtual-machine:~$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount /mydata /output1
# the examples jar path and the input directory may differ on your setup
```
#### After the job finishes, traverse to /output1 to view the result

```unmesha@unmesha-hadoop-virtual-machine:~$ hadoop fs -ls -R /output1
```

For any job the result will be stored in part files.
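To print the result to the console, something like `hadoop fs -cat /output1/part-r-00000` can be used; the exact part-file name is visible in the -ls listing above.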

#### You can also track the running job using a URL that is displayed in the console while the job is running.

```14/04/30 12:57:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1398885280814_0002
14/04/30 12:57:19 INFO impl.YarnClientImpl: Submitted application application_1398885280814_0002
14/04/30 12:57:21 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1398885280814_0002/
14/04/30 12:57:21 INFO mapreduce.Job: Running job: job_1398885280814_0002

```

#### Killing a Job

```unmesha@unmesha-hadoop-virtual-machine:~$ hadoop job -kill job_1398885280814_0002
14/04/30 14:02:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/04/30 14:03:06 INFO impl.YarnClientImpl: Killed application application_1398885280814_0002
Killed job job_1398885280814_0002
```

#### To stop the single node cluster

```unmesha@unmesha-hadoop-virtual-machine:~$ stop-all.sh
```