Tuesday, 8 April 2014

Name node is in safe mode - How to leave

The NameNode enters safe mode in unusual situations, for example when the disk is full, and also during the start-up phase. To leave safe mode, run the command below.

bin/hadoop dfsadmin -safemode leave

After running the above command, run hadoop fsck so that any inconsistencies that have crept into HDFS can be sorted out.

Use the hdfs command instead of the hadoop command on newer distributions:
hdfs dfsadmin -safemode leave

hadoop dfsadmin has been deprecated; all HDFS-related tasks have been moved to the separate hdfs command.

How to CONCAT multiple expressions in Pig

CONCAT joins two expressions, so nest the calls to concatenate more than two:

eg: CONCAT((chararray)$0, CONCAT('Pig', (chararray)$6));

Custom parameters to pig script

Custom parameters: input, output, delimiter, field

-- load data from HDFS or the local fs
original = load '$input' using PigStorage('$delimiter');
-- project a specific field into another relation
filtered = foreach original generate $split;
-- store data into HDFS or the local fs
store filtered into '$output' using PigStorage('$delimiter');
Run the script in local or MapReduce mode.

Local Mode
pig -x local -f customparam.pig -param input=employee.csv -param output=OUT/empdetails -param delimiter="," -param split='$3'

input - employee.csv
output - OUT/empdetails
delimiter - ,
split - selects the 4th field of each tuple (fields start from $0, $1, ...)

Thursday, 2 January 2014

K-means Clustering Algorithm

  How To Work Out K-means Clustering Algorithm

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. The distance measure used here is Euclidean; many other distance measures can also be used for k-means clustering.

Let's work out k-means using a simple example.
Problem 1:
Let's take a medicine example.

Here we have 4 medicines as our training data. Each medicine has 2 attributes, which represent the coordinates of the object. These 4 medicines belong to two clusters. We need to identify which medicines belong to cluster 1 and which to cluster 2.

k-means clustering starts with a value of k, which is user defined, i.e. the number of clusters. Here we need to classify the medicines into 2 clusters, so k = 2.

Next we need to identify the initial centroids. Let's select the first 2 points as our initial centroids.

 k = 2    Initial Centroids = (1,1) and (2,1)

Iteration 1:

For each point we need to calculate the distance to each centroid and find the minimum distance.

For point (1,1):
               d(C1) = sqrt((1-1)^2 + (1-1)^2) = 0
               d(C2) = sqrt((2-1)^2 + (1-1)^2) = 1
For point (2,1):
               d(C1) = sqrt((1-2)^2 + (1-1)^2) = 1
               d(C2) = sqrt((2-2)^2 + (1-1)^2) = 0
For point (4,3):
               d(C1) = sqrt((1-4)^2 + (1-3)^2) = 3.61
               d(C2) = sqrt((2-4)^2 + (1-3)^2) = 2.83
For point (5,4):
               d(C1) = sqrt((1-5)^2 + (1-4)^2) = 5
               d(C2) = sqrt((2-5)^2 + (1-4)^2) = 4.24

Next, assign each point to the cluster with the minimum distance:

                      C1 = (1,1)
                      C2 = (2,1), (4,3), (5,4)

Next, calculate the new centroids and repeat the iteration with them until the clusters converge.

New centroids:
                  C1 remains the same, since only (1,1) was assigned to it.
                  For C2 the new centroid is the average of the points listed above,
                  i.e. ((2+4+5)/3 , (1+3+4)/3)

                 C1 = (1,1)
                 C2 = (3.67,2.67)

Iteration 2: 

For each point, calculate the distance to the new centroids and assign the point to the cluster with the minimum distance:

                       C1 = (1,1), (2,1)
                       C2 = (4,3), (5,4)

New Centroids : 
                 C1 = (1.5,1)
                 C2 = (4.5,3.5)
Check whether the centroids have converged.
How to check: compare the old centroids with the new ones. If they are equal, k-means has converged; otherwise continue the iteration.
                     Old centroids: (1,1) and (3.67,2.67)
                     New centroids: (1.5,1) and (4.5,3.5)

                     Old centroids != New centroids

Iteration continues.

Iteration 3: 

For each point, calculate the distance to the new centroids and assign the point to the cluster with the minimum distance:

                       C1 = (1,1), (2,1)
                       C2 = (4,3), (5,4)

New Centroids : 
                 C1 = (1.5,1)
                 C2 = (4.5,3.5)

Check whether the centroids have converged.

                     Old Centroid :   (1.5,1) and (4.5,3.5)
                     New Centroid:  (1.5,1) and (4.5,3.5) 

                     Old Centroid = New Centroid

Therefore the algorithm stops.

Medicines A and B belong to cluster 1.
Medicines C and D belong to cluster 2.
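The whole procedure above can be sketched in a few lines of NumPy; the points and initial centroids are taken straight from this worked example.

```python
import numpy as np

# Medicines A, B, C, D with their two coordinates
points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
centroids = points[:2].copy()        # initial centroids (1,1) and (2,1), k = 2

while True:
    # distance from every point to every centroid (Euclidean)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)    # assign each point to its nearest centroid
    # new centroid = mean of the points assigned to each cluster
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(2)])
    if np.allclose(new_centroids, centroids):   # old == new: converged
        break
    centroids = new_centroids

# converges to C1 = (1.5, 1) and C2 = (4.5, 3.5),
# with A, B in cluster 1 and C, D in cluster 2
```

The loop reproduces the three iterations above: the assignment step, the centroid update, and the old-vs-new convergence check.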


Tuesday, 24 December 2013

SVM - Support Vector Machines

                                                  How To Work Out the SVM Algorithm

In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

1.1 SVM Binary Classification Algorithm

Example 1:

Step 1: Normalize the given data using the normalization equation.


Step 2: Compute the augmented matrix [A  -e], i.e. append a column of -1s to A.


Step 3: Compute H = D[A  -e]
where D is a diagonal matrix of the class labels, here 1 and -1.


Step 4: Compute U = V [I - H (I/V + H^T H)^-1 H^T] e
                where I = identity matrix (the inner I has the order of H^T H)
                      V = a scalar parameter, here 0.1

Step 5: Find w and gamma (in this example gamma = 0).
Step 6: Compute w^T x - gamma for all data points,
  e.g. x = column_matrix{x, y}

           Result of SVM Training

Step 7: Compare sign(w^T x - gamma) with the actual class labels.
You can see that the 3rd, 5th and 9th class labels are misclassified.
After training, the data points below

           5     9   falls in class label -1
           8     7   falls in class label -1
           8     5   falls in class label 1.

Three data points (the 3rd, 5th and 9th) are misclassified.
Accuracy is 70%.
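Steps 1-7 can be sketched with NumPy. The post's data matrix and normalization equation appear in figures that are not reproduced here, so the toy matrix A, the z-score normalization, and the recovery formulas w = A^T D U and gamma = -e^T D U (taken from the proximal SVM formulation that Step 4's equation comes from) are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data (the post's actual matrix A is in a figure):
# four points with two features each, labelled +1 / -1.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [5.0, 5.0],
              [6.0, 6.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

# Step 1: normalize each feature column (z-score, one possible choice)
A = (A - A.mean(axis=0)) / A.std(axis=0)

m, n = A.shape
e = np.ones((m, 1))
D = np.diag(d)

# Steps 2-3: augmented matrix [A  -e] and H = D[A  -e]
H = D @ np.hstack([A, -e])

# Step 4: U = V [I - H (I/V + H^T H)^-1 H^T] e, with V = 0.1
V = 0.1
inner = np.linalg.solve(np.eye(n + 1) / V + H.T @ H, H.T)  # (I/V + H^T H)^-1 H^T
U = V * (np.eye(m) - H @ inner) @ e

# Step 5: recover w and gamma from U (proximal-SVM formulas; an assumption here)
w = A.T @ D @ U
gamma = -float(e.T @ D @ U)

# Steps 6-7: classify with sign(w^T x - gamma) and compare with the labels
pred = np.sign(A @ w - gamma).ravel()
```

On this small symmetric set the predicted signs match all four labels; with the small V = 0.1 used in the post, a poorly scaled dataset can easily produce misclassifications like the 70% accuracy reported above.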

Saturday, 21 December 2013

Decision Tree Algorithm

How To Work Out the Decision Tree Algorithm

Decision Tree Induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer supervised learning algorithms are often compared. In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book Classification and Regression Trees (CART), which described the generation of binary decision trees. ID3 and CART were invented independently of one another at around the same time, yet follow a similar approach for learning decision trees from training tuples. These two cornerstone algorithms spawned a flurry of work on decision tree induction. ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees are constructed in a top-down recursive divide-and-conquer manner. Most algorithms for decision tree induction also follow such a top-down approach, which starts with a training set of tuples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is being built.

Problem 1:

Iteration 1:

Find the entropy of the class attribute, Play.
Count the number of yes and no values:
          No – 5
          Yes – 9
Total observations – 14

           Info([9,5]) = (-9/14 log 9/14 – 5/14 log 5/14)/log 2 = 0.940

Then find the gain for each attribute.

          InfoOutlook([2,3],[4,0],[3,2]) = (5/14(-2/5 log 2/5 – 3/5 log 3/5) + 4/14(-4/4 log 4/4) + 5/14(-3/5 log 3/5 – 2/5 log 2/5))/log 2

          InfoTemp([3,1],[4,2],[3,1])/log 2
          InfoHumidity([4,3],[6,1])/log 2
          InfoWindy([2,6],[3,3])/log 2

          gain(outlook)  = Info([9,5]) – InfoOutlook([2,3],[4,0],[3,2])=0.247
          gain(Temp)       = 0.029
          gain(humidity)   = 0.152
          gain(windy)      = 0.048

The attribute with the highest gain (outlook) is taken as the root node.

For outlook = overcast, all the class labels are yes, so no further recursion is needed on that branch. You can delete the rows with outlook = overcast, or continue with the same table.
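The Info and gain numbers above are easy to verify with a small Python helper (a sketch of the entropy arithmetic, not tied to any library):

```python
from math import log2

def info(counts):
    """Entropy of a class-count list, e.g. info([9, 5]) for 9 yes / 5 no."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, splits):
    """Information gain: parent entropy minus the weighted child entropies."""
    total = sum(parent)
    return info(parent) - sum(sum(s) / total * info(s) for s in splits)

info([9, 5])                               # ≈ 0.940
gain([9, 5], [[2, 3], [4, 0], [3, 2]])     # gain(outlook)  ≈ 0.247
gain([9, 5], [[3, 4], [6, 1]])             # gain(humidity) ≈ 0.152
```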

Iteration 2:

Now we need to identify which nodes come under outlook -> Sunny and outlook -> Rainy.
First look at the Sunny case. Filter the master table down to a child table that contains only the Sunny entries.

Count the number of yes and no values:
          No - 3
          Yes - 2

          Info([3,2])        = 0.97095

          gain(Temp)       = 0.57095
          gain(humidity) = 0.97095
          gain(windy)       = 0.419973

The highest gain is for humidity:
          gain(humidity) = 0.97095

So humidity comes under Sunny.
The new tree becomes:

From the above diagram, every humidity branch ends in a single class label, so there is no further iteration within humidity; the remaining branch is Rainy.

Iteration 3:

Now we need to identify which nodes come under outlook -> Rainy.
Filter the master table down to a child table that contains only the Rainy entries, as done in Iteration 2.

Count the yes and no values:
          No – 2
          Yes – 3
          Info([3,2])    = 0.97095

          gain(Temp)   = 0.02095
          gain(windy)  = 0.97095

The highest gain is for windy,

therefore it comes under outlook -> Rainy.

Now every leaf is pure, so there is no further iteration; Temperature is never used in the tree.


1. Start
2. Get the input
3. Extract the class-label column and find the total entropy
4. Associate each attribute with the class label and calculate the information gain for each attribute
5. Select the attribute with the highest gain
6. If the entropy of every branch is 0, go to step 7; else go to step 3
7. End
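These steps can be sketched in Python. The 14-row weather/Play table itself appears in the post only as an image, so the DATA list below is a reconstruction of the classic dataset that the yes/no counts above come from:

```python
from math import log2
from collections import Counter

# Reconstructed weather/"Play" table (9 yes, 5 no, matching the post's counts)
ATTRS = ['outlook', 'temp', 'humidity', 'windy']
DATA = [
    ('sunny',    'hot',  'high',   'false', 'no'),
    ('sunny',    'hot',  'high',   'true',  'no'),
    ('overcast', 'hot',  'high',   'false', 'yes'),
    ('rainy',    'mild', 'high',   'false', 'yes'),
    ('rainy',    'cool', 'normal', 'false', 'yes'),
    ('rainy',    'cool', 'normal', 'true',  'no'),
    ('overcast', 'cool', 'normal', 'true',  'yes'),
    ('sunny',    'mild', 'high',   'false', 'no'),
    ('sunny',    'cool', 'normal', 'false', 'yes'),
    ('rainy',    'mild', 'normal', 'false', 'yes'),
    ('sunny',    'mild', 'normal', 'true',  'yes'),
    ('overcast', 'mild', 'high',   'true',  'yes'),
    ('overcast', 'hot',  'normal', 'false', 'yes'),
    ('rainy',    'mild', 'high',   'true',  'no'),
]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting on attr (steps 3-4)."""
    g = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        g -= len(subset) / len(labels) * entropy(subset)
    return g

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                  # entropy 0: pure leaf (step 6)
        return labels[0]
    if not attrs:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))   # step 5
    tree = {best: {}}
    for value in set(r[best] for r in rows):   # recurse on each branch
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])
    return tree

rows = [dict(zip(ATTRS, r[:4])) for r in DATA]
labels = [r[4] for r in DATA]
tree = id3(rows, labels, ATTRS)
```

Running this yields outlook at the root, humidity under Sunny, and windy under Rainy, exactly as in the three iterations above.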

ID3 to C4.5

The only difference is that we also consider the "split info". The gain ratio of each attribute is gain / split info; the other procedures are the same.

ex. split info for outlook = info([5,4,5]), for 5 sunny, 4 overcast and 5 rainy, and likewise for the other attributes.
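A quick check of that arithmetic (standalone, so it redefines info(); the 0.247 is gain(outlook) from Iteration 1 above):

```python
from math import log2

def info(counts):
    """Entropy of a class-count list."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

split_info = info([5, 4, 5])          # outlook: 5 sunny, 4 overcast, 5 rainy
gain_ratio = 0.247 / split_info       # gain(outlook) / split info
```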


[1] K. P. Soman, Shyam Diwakar, V. Ajay (2006), Insight into Data Mining: Theory and Practice, Prentice-Hall of India Private Limited, New Delhi.