Monday, 29 September 2014

Comments On CCD-410 Sample Dumps


What do you think of these three questions mentioned in site CCD-410 Practice Exam Questions Demo 100% Pass-Guaranteed or Your Money Back!!!


QUESTION: 3

What happens in a MapReduce job when you set the number of reducers to one?

A. A single reducer gathers and processes all the output from all the mappers. The output is written in as many separate files as there are mappers.
B. A single reducer gathers and processes all the output from all the mappers. The output is written to a single file in HDFS.
C. Setting the number of reducers to one creates a processing bottleneck, and since the number of reducers as specified by the programmer is used as a reference value only, the MapReduceruntime provides a default setting for the number of reducers.
D. Setting the number of reducers to one is invalid, and an exception is thrown.

Answer:A

QUESTION: 4

In the standard word count MapReduce algorithm, why might using a combiner reduce theoverall Job running time?

A. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster.
B. Because combinersperform local aggregation of word counts, thereby reducing the number of mappers that need to run.
C. Because combiners perform local aggregation of word counts, and then transfer that data toreducers without writing the intermediate data to disk.
D. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be snuff let across the network to the reducers.

Answer:A

QUESTION: 5

You have user profile records in your OLTP database,that you want to join with weblogs you have already ingested into HDFS.How will you obtain these user records?

A. HDFS commands
B. Pig load
C. Sqoop import
D. Hive

Answer :B


Correct Answers

QUESTION 3: Answer B
QUESTION 4: Answer D
QUESTION 5: Answer C

See reviews on correct answer

Can we change the default key-value input seperator in Hadoop MapReduce

Previous Post


Yes, We can change it using "key.value.separator.in.input.line" property in Driver class.

There may be cases where we need to take each line with specific delimiter.


eg:
one	first line
two	second line
Inorder to read a file like this we will be using KeyValueTextInputFormat.class as it takes the line with TAB  as default seperator.

So while printing each line in map() , the key will be "one" and value will be "first line".


What if we need other delimiters instead of TAB delimiter


eg:
one,first line
two,second line

Here also we need to get key as "one" and value as "first line".

It is possible by adding an extra configuration along with KeyValueTextInputFormat to change the default seperator.

//New API
Configuration conf = new Configuration();
conf.set("key.value.separator.in.input.line", ","); 
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);

Monday, 22 September 2014

Matrix Multiplication in Hadoop MapReduce

Matrix multiplication is common and important algebraic operation.


Matrix multiplication is applied to file of format:MatrixName,row,col,element

A,0,1,1.0
A,0,2,2.0
A,0,3,3.0
A,0,4,4.0
B,3,1,10.0
B,3,2,11.0
A,1,0,5.0
A,1,1,6.0
A,1,2,7.0
A,1,3,8.0
A,1,4,9.0
B,0,1,1.0
B,0,2,2.0
B,1,0,3.0
B,1,1,4.0
B,1,2,5.0
B,2,0,6.0
B,2,1,7.0
B,2,2,8.0
B,3,0,9.0
B,4,0,12.0
B,4,1,13.0
B,4,2,14.0

Find code : GitHub




Monday, 8 September 2014

How To Set Counters In Hadoop MapReduce

Counters are a useful channel for gathering statistics about the job: for quality control or for application level-statistics.Lets see an example where Counters count the no of keys processed in reducer.


3 key points to set

1. Define counter in Driver class

static enum UpdateCount{
  CNT
}

2. Increment or set counter in Reducer

context.getCounter(UpdateCount.CNT).increment(1);

3. Get counter in Driver class 

c = job.getCounters().findCounter(UpdateCount.CNT).getValue();

Full code :  GitHub

You will be able to see the counters in console also.






Saturday, 23 August 2014

Example for Apriori Algorithm


Lets take a store data
pen,pencil
pencil,book,eraser
pen,book,eraser,chalk
pen,eraser,chalk
pen,pencil,book
pen,pencil,book,eraser
pen,Ink
pen,pencil,book
pen,pencil,eraser
pencil,book,chalk
To start with Apriori follow the below steps.
Step 1: Initially we need to find Item 1 Frequent Dataset
c1
------
book 6
chalk 3
eraser 6
pen 8
pencil 7
Ink 1
We will say that an item set is frequent if it appears in at least 3 transactions of the itemset: the value 3 is the support threshold.
Support count = 3
So the items less that support count can be discarded form F1 frequent Dataset.
so our new set will be
L1
------
book 6
chalk 3
eraser 6
pen 8
pencil 7
Step 2: We need to generate size 2 frequent item pair sets by joining L1 set
eg:{book} U {chalk} => {book,chalk} and so on..
{book,chalk}
{book,eraser}
{book,pen}
{book,pencil}


{chalk,eraser} 
{chalk,pen} 
{chalk,pencil}

{eraser,pen} 
{eraser,pencil} 

{pen,pencil}
Once the transactions are joined we need to identify the no occurence of the above data items in original transaction(That will be the support count of C2)
C2
----------------
{book,chalk} 2
{book,eraser} 2
{book,pen} 4
{book,pencil} 5


{chalk,eraser} 2
{chalk,pen} 2
{chalk,pencil} 0

{eraser,pen} 5
{eraser,pencil} 3

{pen,pencil} 5
Transactions less that support count can be discarded form C2 frequent Dataset
L2
----------------
{book,pen} 4
{book,pencil} 5
{eraser,pen} 5
{eraser,pencil} 3
{pen,pencil} 5
To find C3 loop through L2
eg: {book,pen} U {book,pencil} => {book,pen,pencil}
C3
-------------------------
{book,pen,pencil} 3
{chalk,eraser,pen} 2
{eraser,pen,pencil} 2
Transactions less that support count can be discarded form C3 frequent Dataset
L3
-------------------------
{book,pen,pencil} 3
There are no transaction to join further.
So our Frequent item sets are
L1:
-------
book 6
chalk 3
eraser 6
pen 8
pencil 7

L2:
-----------------
{book,pen} 4
{book,pencil} 5
{eraser,pen} 5
{eraser,pencil} 3
{pen,pencil} 5

L3
-------------------------
{book,pen,pencil} 3
Step 3: We need to generate Strong Assosiaction  Rules for frequent Set using L1,L2and L3

Say confidence is 60% and Support count is 3.So we have to find the Transactions with no.of item 3 (ie L3 as support count = 3) and which has a confidence >=60.Now we can identify L3 set
{book,pen,pencil} 3

Finding Ruleset
{book,pen} => pencil
{book,pencil} => pen
{pen,pencil} => book

pencil => {book,pen}
pen => {book,pencil}
book => {pen,pencil}

Now we need to find the confidence of each transaction
eg: {book,pen} => pencil
           = support Cnt{book,pen}/ support count({pen})

Therefore rules having confidence greater than and equal to 60 are
book,pen=>pencil 75.0
book,pencil=>pen 60.0
pen,pencil=>book 60.0
These are the strongest rules.
If a customer buys book and pen he have a tendency to buy a pencil too. Like wise if he buys book and pencil he may buy pen too.

Calculating Mean in Hadoop MapReduce


Given a csv files we will find mean of each column (Optimized approach)


Mapper

 Takes each input line and calculate the sum and stores the no of lines it sumed.Then sum get stored in a hash map with key as column Id.cleanup emits the sum and total line count inorder to take the overall mean.As we know each map only gets a block of input data.So while summing up we need to know how many elements summed up.


//Calculating sum
if (sumVal.isEmpty()) {
 //if sumval is empty add elements to sumval
 sumVal.putAll(mapLine);
 } else {
//calculating sum
 double sum = 0;
 for (Integer colId : mapLine.keySet()) {
  double val1 = mapLine.get(colId);
  double val2 = sumVal.get(colId);
 /*
  * calculating sum
 */
 sum = val1 + val2;
 sumVal.put(colId, sum);
 }
}


Reducer

 Sums of the values for each key.

Reducer calculates 2 sums.
  1. Sums the values for each key and 
  2. Sums total no.of linecount


for (TwovalueWritable value : values) {
 //Taking sum of values and total number of lines 
 sum += value.getSum();
 total += value.getTotalCnt();
 }
 //sum contains total sum of all elements in each column
 //total contains total no of elements in each column
 mean = sum / total;
 valEmit.set(mean);
 context.write(key, valEmit);


This approach helps in avoiding a large no of communication with reducer.Reducer needs only to sum up few values from mapper.
Say we have only 3 mappers and 4 columns in input set.Reducer only want to wait for 4 values from each mapper(no.of columns also considered)


Complete code : GitHub Link

Tuesday, 20 May 2014

How To WorkOut Navie Bayes Algorithm

A Naive Bayes Classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.The main advantage of naive Bayes is that it only requires a smaller amount of data for training inorder to estimate the class labels necessary for classification. Because independent variables are assumed.

In general all of Machine Learning Algorithms need to be trained for supervised learning tasks like classification, prediction etc. 

By training it means to train them on particular inputs so that later on we may test them for unknown inputs (which they have never seen before) for which they may classify or predict etc (in case of supervised learning) based on their learning. This is what most of the Machine Learning techniques like Neural Networks, SVM, Bayesian etc. are based upon. 

How to Apply NaiveBayes to Predict an Outcome
Let's try it out using an example.



In the above training data we have 2 class labels buys_computer No and Yes. And we know 4 characteristics.


1. Whether the age is youth,middle_aged or senior.
2. Whether income is high,low or medium.
3. Whether they have student or not.
4. Whether credit is excellent,fair.


There are many things to pre-compute from the training dataset for future prediction.


Prior Probabilities

Prior Probabilities
-------------------

P(yes) = 9/14 = 0.643
  Given that the class label is "yes" the universe is 14 = yes(9) + no(5). 9 of them is yes
P(no) = 5/14 = 0.357
  Given that the class label is "no" the universe is 14 = yes(9) + no(5). 5 of them is no

Probability of Likelihood

Probability of Likelihood
-------------------------

P(youth/yes) = 2/9 = 0.222
  Given that the class label is "yes" the universe is 9. 2 of them are youth.
P(youth/no) = 3/5 = 0.600
...
...
P(fair/yes) = 6/9 = 0.667
P(fair/no) = 2/5 = 0.400

How to classify an outcome



Let's say we are given the properties of an unknown buys_computer (class). We are told that the properties are


X => age = youth, income = medium, student = yes, credit rating = fair

We need to 

 Maximize P(X|Ci )P(Ci ), for i = 1, 2

P(Ci ) - the prior probability of each class, can be computed based on the training tuples:



P(yes/youth,medium,yes and fair) 
      = P(youth/yes)* P(medium/yes)* P(yes/yes)* P(fair/yes) * P(yes)
      = (0.222* 0.444* 0.667* 0.667) * 0.643
      = 0.028

P(no/youth,income,medium,yes and fair) 
      = P(youth/no)* P(medium/no)* P(yes/no)* P(fair/no) * P(no)
      = (0.600* 0.400* 0.200* 0.400) * 0.357
      = 0.007

(0.028 >> 0.007), we classify this youth/medium/yes/fair  as likely to be yes.


Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X.