Wednesday, 21 January 2015

K-Nearest Neighbors Algorithm - KNN

KNN algorithm is a classification algorithm can be used in many application such as image processing,statistical design pattern and data mining.

As for any classification algorithm KN also have a model and Prediction part. Here model is simply the input dataset. While predicting output is a class membership. An object is classified by a majority vote of its neighbors (k), with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).
1.  If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
2.  If k=3, and the classlabels are Good =2 Bad=1,then the predicted classlabel will be Good,which contains the magority vote.

Lets see how to handle a sample data in KNN algorithm.


We have data from questionnaires survey and objective testing with two attribute to classify whether a special paper issue is good or not.

Here is for training sample.







Let this be the test sample





1. Determine the parameter k=the no.of nearest neighbours.
      Say  k=3
2. Calculate the distance between queryinstance and all the training samples. 

Coordinate of query instance is (3,7) ,instead of calculating the distance we compute square distance which is faster to calculate(without squareroot)


3. Sort the distance and determine Nearest neighbors based on the kth minimum distance.


4. Gather  the category Y  of the nearest neighbours .



-> the second row inthe last column that the category of nearest neighbours (Y) is not included becoz the rank of this data is more than 3(=k).

5. Use simple majority of the category of  nearest neighbors as the prediction value of query instance.

We have  2 good and 1 bad ,since,2>1 So we conclude that a new paper tissue that pass laboratory test with x1=3 and x2=7 is included in Good category.



Thursday, 1 January 2015

Cloudera Certified Hadoop Developer (CCD - 410)


I cleared Cloudera Certified Hadoop Developer (CCD – 410) examination on December 31 st 2014.And received the certificate from cloudera on the very next day.

If you are planning to do this certification you need to know hadoop in depth and have hands-on experience too.

I started Hadoop career from my MCA (2010 - 2013) Final Year Major Project and it paved me to my current Job. Around 6 months after joining I planned to write Cloudera Certification Exam. I had around 1+ year experience in hadoop ,learned and practised hadoop myself. I thought it will be nice to attend the training session to see if I missed out any of the pointers. I registered with Cloudera and attended Cloudera Hadoop training @ Banglore.It was from 27 th to 30 th of March 2014 at Ibis Hotel, Banglore.It was my first trip to Banglore and was little bit tensed.

There were 13 trainees including me.And I was the only lady among them. Allan Schweitz was our trainer and Vipin Nahal from OssCube assisted him.

It was 4 days training which helps to know Hadoop Framework in depth , they also cover Hadoop EcoSystem Projects and Hands on assignments.After 4 days we will be able to know hadoop in depth.The 4 days class was really good and informative.If some one knows hadoop it will be like waste of time but still you can clarify your doubts.

At the end of 4 th day we received a training certificate and had a group pic. 



After training we recieved 180-day subscription to Cloudera Official Developer Practice Test Subscription for CCDH.This self-assessment will help you to discover strengths and weaknesses in your understanding and skills around Apache Hadoop and prepares you across the entire range of topics covered in a Cloudera certification exam.

Finally by 31 st of December 2014 I appeared for Hadoop Certification exam and cleared CCD 410 successfully.

There were around 52 questions in total and all options were easy and the answers seems to be similar and tricky. Here is my Certificate.



Advices to pass the certification examination
  1. Please dont depend on Cheating sites , Most of their answers are wrong.View some sample dumps from cheating site.
  2. Go through Hadoop - Definitive guide
  3. Gather a good knowledge in Hadoop EcoSystem projects.
  4. If attending Cloudera Training you will recieve 180 days subscription test as mentioned above.You can practise them.If you are getting an overall grade greater than 75% you will surely pass the examination.
Details For Cloudera Certification

1. Exam Code: CCD-410
    Number of Questions: 50 - 55 live questions
    Time Limit: 90 minutes
    Passing Score: 70%
    Language: English, Japanese
    Price: USD $295, AUD $300, EUR €215, GBP £185, JPY ¥28,500

2. You can log on to Cloudera PearsonVue for registering your Certification Test.
    First you need to set up your profile in cloudera pearson vue site.Once you have registered you will see a link to register for the exam and subsequently you can choose date and location. It will then take you to payment options where you need to pay for your certification Exam.

All the very best!!

For further information or queries or doubts regarding Hadoop you can contact me.


Thursday, 11 December 2014

Computing Median In Hive

In statistics and probability theory, the median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half.

The median is the central point of a data set.

Consider the following data points: 1,4,5,6,7
The Median is "5".

Lets see how we will find median in Hive.

Consider a "test" table.
-------------------
|Name Age|
------------------
|A  23 |
|B    23 |
|C  20 |
------------------
hive> select * from test;
OK
A 23
B 23
C 20
Time taken: 4.219 seconds, Fetched: 3 row(s)

Lets say we are going to find the median for Age column in "test" table.
Our expected median is "23".

PERCENTILE(BIGINT col,0.5) function helps to compute median in hive.The 50th percentile would be the median.

Structure of  "test" table
hive> desc test;      
OK
firstname            string                                   
age                  int                                      
Time taken: 0.32 seconds, Fetched: 2 row(s)

Here we can see the column we are going to find median is in INT. We need to convert the column into BIGINT.

Lets try out the query
select percentile(cast(age as BIGINT), 0.5) from test; 
Here we casted age column into BIGINT.
hive> select percentile(cast(age as BIGINT), 0.5) from test1; 
Query ID = aibladmin_20141211140606_c61cb042-ed14-4048-8270-4cea1eece1c7 
Total jobs = 1 
Launching Job 1 out of 1 
.
.
OK 
23.0 
Time taken: 27.659 seconds, Fetched: 1 row(s)
23.0 is the expected result which is the median for [23,23,20].

Sunday, 7 December 2014

Joining Two Files Using MultipleInput In Hadoop MapReduce - MapSide Join

There are cases where we need to get 2 files as input and join them based on id or something like that.
Two different large data can be joined in map reduce programming also. Joins in Map phase refers as Map side join, while join at reduce side called as reduce side join.  
MapSide can be achieved using MultipleInputFormat in Hadoop.

Say I have 2 files ,One file with EmployeeID,Name,Designation and another file with EmployeeID,Salary,Department.

File1.txt
1 Anne,Admin
2 Gokul,Admin
3 Janet,Sales
4 Hari,Admin

AND

File2.txt
1 50000,A
2 50000,B
3 60000,A
4 50000,C

We will try to join these files into one based on EmployeeID
The result we aim at is 

1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C

Here in both file File1.txt,File2.txt we can see that we need to join the records based on id.  So the employeeId's are common.
We will write 2 map jobs to process these files.

Processing File1.txt
public void map(LongWritable k, Text value, Context context) throws IOException, InterruptedException
{
 String line=value.toString();
 String[] words=line.split("\t");
 keyEmit.set(words[0]);
 valEmit.set(words[1]);
 context.write(keyEmit, valEmit);
}

The above map job process File1.txt
String[] words=line.split("\t");
splits each line with \t space so words[0] will be the employeeId which we pass it as key and the rest as value.

eg: 1 Anne,Admin
words[0] = 1
words[1] = Anne,Admin

Or else you can also use KeyValueTextInputFormat.class as InputFormat. This class gives key as employeeId and the rest as value.
You dont need to split it.

Processing File2.txt
public void map(LongWritable k, Text v, Context context) throws IOException, InterruptedException
{
 String line=v.toString();
 String[] words=line.split(" ");
 keyEmit.set(words[0]);
 valEmit.set(words[1]);
 context.write(keyEmit, valEmit);
}

The above map job process File2.txt

eg: 1 50000,A
words[0] = 1
words[1] = 50000,A

If the files are of same delimiter and ID comes first you can resuse the same map job

Lets write a commomn Reducer task to join the data using key.
String merge = "";
public void reduce(Text key, Iterable<Text> values, Context context)
{
 int i =0;
 for(Text value:values)
 {
  if(i == 0){
   merge = value.toString()+",";
  }
  else{
   merge += value.toString();
  }
  i++;
 }
 valEmit.set(merge);
 context.write(key, valEmit);
}

Here we will be caching 1 data from a mapper and appends it to string "merge".
And emit employeeId as key and merge as value.

Now we need to furnish our Driver class to take 2 inputs and use MultipleInputFormat as InputFormat


public int run(String[] args) throws Exception {
 Configuration c=new Configuration();
 String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
 Path p1=new Path(files[0]);
 Path p2=new Path(files[1]);
 Path p3=new Path(files[2]);
 FileSystem fs = FileSystem.get(c);
 if(fs.exists(p3)){
  fs.delete(p3, true);
  }
 Job job = new Job(c,"Multiple Job");
 job.setJarByClass(MultipleFiles.class);
 MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
 MultipleInputs.addInputPath(job,p2, TextInputFormat.class, MultipleMap2.class);
 job.setReducerClass(MultipleReducer.class);
 .
 .
}

MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
MultipleInputs.addInputPath(job,p2, TextInputFormat.class, MultipleMap2.class);
p1,p2 are the Path variable holding 2 input files.
You can find the code in Github



Tuesday, 2 December 2014

Hive Bucketed Tables


In previous post we had seen how  to create partition tables in Hive.

Lets see how to create buckets in Hive table


The main difference between Hive partitioning and Bucketing is ,when we do partitioning, we create a partition for each unique value of the column. But there may be situation where we need to create lot of tiny partitions. But if you use bucketing, you can limit it to a number which you choose and decompose your data into those buckets. In hive a partition is a directory but a bucket is a file.



In hive, bucketing does not work by default. You will have to set following variable to enable bucketing. set hive.enforce.bucketing=true;


1. Creating a staging table to store your data

create external table stagingtbl (EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/aibladmin/Hive'; 

2. Create bucketed table

create table emp_bucket (EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) clustered by (department) into 3 buckets row format delimited fields terminated by ",";

3. Load data from stagingtbl to bucketed table

from stagingtbl insert into table emp_bucket 
       select employeeid,firstname,designation,salary,department;


4. Check how many data file have created in Hive metastore.


Lets check the table content in Hive warehouse




We can find 3 files in warehouse directory for department A,B and C.Each bucket contains unique values.

Monday, 1 December 2014

How To Drop A Particular Partition in HIVE


Hive Partition can be dropped using  

ALTER TABLE Tablename DROP IF EXISTS
 PARTITION(PartitionedID=PartitionVALUE);

Lets see an example.
Say I have an emp Hive Table where there are 3 partitions for Department(A,B,C).
Inorder to delete a particular Department use the below query.
ALTER TABLE emp DROP IF EXISTS
  PARTITION(Department='A');

Tuesday, 25 November 2014

[SOLVED] FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations in hive-0.14.0


CRUD operations are supported in Hive from 0.14 onwards.
See Wiki 

Hive supports data warehouse software facility,which facilitates querying and managing large datasets residing in distributed storage. In data warehouse there are situation where we need to update, delete etc transactions.In hive later versions UPDATE was not supported,but there were workarounds to do update a transaction

1. Update Statement In Hive For Small Tables
2. Update Statement In Hive For Large Tables using INSERT


Lets see how to do INSERT,UPDATE,DELETE in newer version of hive. 

Create a table "test"
CREATE EXTERNAL TABLE 
    test (EmployeeID Int,FirstName String,Designation  
        String,Salary Int,Department String) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY  "," 
    LOCATION '/user/hdfs/Hive';
We will try to update the salary of employee id 19 from 45,000 to 50,000.
 hive> UPDATE test 
           SET salary = 50000 
           WHERE employeeid = 19;

 FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction m anager that does not support these operations.

While applying above query it shows a semantic Exception.In order to allow update and delete we need to add additional settings in hive-site.xml and create table with ACID output format support.

To achieve the same follow below steps:

1. New Configuration Parameters for Transactions
 hive.support.concurrency – true
 hive.enforce.bucketing – true
 hive.exec.dynamic.partition.mode – nonstrict
 hive.txn.manager –org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
 hive.compactor.initiator.on – true
 hive.compactor.worker.threads – 1
You can set these configuration in hive-site.xml (after setting restart Hive ) for ever or via terminal.
2. Below query creates HiveTest table with ACID support
(To do Update,delete or Insert we need to create a table that support ACID properties)
 create table HiveTest 
   (EmployeeID Int,FirstName String,Designation String,
     Salary Int,Department String) 
   clustered by (department) into 3 buckets 
   stored as orc TBLPROPERTIES ('transactional'='true') ;
3. Load data into HiveTest from a staging table,which contains the original data.
 from stagingtbl 
   insert into table HiveTest 
   select employeeid,firstname,designation,salary,department;

4. UPDATE,DELETE and INSERT operations


1.UPDATE
 update HiveTest 
    set salary = 50000 
    where employeeid = 19; 

SYNOPSIS

  1. The referenced column must be a column of the table being updated.
  2. The value assigned must be an expression that Hive supports in the select clause.  Thus arithmetic operators, UDFs, casts, literals, etc. are supported.  Subqueries are not supported.
  3. Only rows that match the WHERE clause will be updated.
  4. Partitioning columns cannot be updated.
  5. Bucketing columns cannot be updated.
  6. In Hive 0.14, upon successful completion of this operation the changes will be auto-committed.


2. INSERT
 insert into table HiveTest 
     values(21,'Hive','Hive',0,'B');

SYNOPSIS

  1. Each row listed in the VALUES clause is inserted into table tablename.
  2. Values must be provided for every column in the table.  The standard SQL syntax that allows the user to insert values into only some columns is not yet supported.  To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to.
  3. Dynamic partitioning is supported in the same way as for INSERT...SELECT.
  4. If the table being inserted into supports ACID and a transaction manager that supports ACID is in use, this operation will be auto-committed upon successful completion.



3. DELETE
 delete from HiveTest
     where employeeid=19;

SYNOPSIS
  1. Only rows that match the WHERE clause will be deleted.
  2. In Hive 0.14, upon successful completion of this operation the changes will be auto-committed.