Tuesday 4 August 2015

Hadoop: Partitioner


  • A custom partitioner can improve application performance by controlling how the intermediate data is spread across the reducers.
  • The number of reducers is set in the driver code.
  • The key,value pairs coming out of the mappers are distributed among all the reducers that are present; the partitioner decides which pair goes to which reducer.
  • Partitioner code:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPartitioner implements Partitioner<Text, IntWritable>
{
    public void configure(JobConf conf)
    {
        // nothing to configure here
    }

    // decide the reducer number (partition) based on the length of the key
    public int getPartition(Text key, IntWritable value, int numReduceTasks)
    {
        String s = key.toString();
        if (s.length() == 1)
        {
            return 0;
        }
        if (s.length() == 2)
        {
            return 1;
        }
        if (s.length() == 3)
        {
            return 2;
        }
        else
            return 3;
    }
}


  • Configure the partitioner class in the DriverCode:

conf.setPartitionerClass(MyPartitioner.class);

  • Configure the number of reduce tasks:

conf.setNumReduceTasks(4);

  • We can work with up to 10^5 reducers (maximum), so the output files range from

part-00000 to part-99999

Hello World Job


  • Step1: creating a file

J$ cat>file.txt
hi how are you
how is your job
how is your family
what is time now
what is the strength of hadoop
Ctrl+D (to save and exit)


  • Step2: loading file.txt from local file system to HDFS

J$ hadoop fs -put file.txt file

  • Step3: Writing programs

DriverCode.java
MapperCode.java
ReducerCode.java
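
The contents of these three files aren't shown here; below is a minimal sketch of what DriverCode.java could look like for this word-count job, assuming the old org.apache.hadoop.mapred (JobConf) API used elsewhere in these notes. MapperCode and ReducerCode are sketched later, in the MapReduce Architecture section.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class DriverCode
{
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(DriverCode.class);
        conf.setJobName("wordcount");

        // types of the (key,value) pairs produced by the job
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapperCode.class);
        conf.setReducerClass(ReducerCode.class);

        // args[0] = input file in HDFS ("file"), args[1] = output directory ("TestOutput")
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}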


  • Step4: Compiling all the above .java files

J$ javac -classpath $HADOOP_HOME/hadoop-core.jar *.java


  • Step5: creating the jar file


J$ jar cvf test.jar *.class
*.class: all the compiled .class files will be included


  • Step6: running the above test.jar on file (which is in HDFS)


J$ hadoop jar test.jar DriverCode file TestOutput

The screenshot of the above step will be available soon!

Combiner (HDFS + MapReduce)


  • What is a combiner? It is a mini reducer.
  • We use the combiner to decrease network traffic and increase performance.
  • The combiner runs on each mapper's output: it merges the key,value pairs that share the same key, so only the combined (unique-key) pairs are sent on towards the reducer.
  • Each mapper (i.e. each input split) has its own combiner; the outputs of all the combiners together bring the reducer into action, and the reducer then writes the final result into the O/P file (see the sketch just below).
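
A minimal sketch of how the combiner would be plugged in, assuming the word-count DriverCode and ReducerCode sketched in these notes (the reducer can double as the combiner here because summing counts is associative):

// inside DriverCode's main(), alongside the other JobConf calls
conf.setCombinerClass(ReducerCode.class);   // runs a mini reducer on each mapper's local output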


MapReduce Architecture


  • 'WordCountJob' is the "HelloWorld" program of Hadoop. Using this job we will walk through the MapReduce flow chart.
  • How we run our program:

                                      "J$ hadoop jar test.jar DriverCode file.txt TestOutput"

  • Now consider the file from before, divided into 4 input splits


INPUT SPLITS
           ---> 1st input split ---> RecordReader(Interface) ---> Mapper --->}
           ---> 2nd input split ---> RecordReader(Interface) ---> Mapper --->}  REDUCER
           ---> 3rd input split ---> RecordReader(Interface) ---> Mapper --->}  (Identity REDUCER)
           ---> 4th input split ---> RecordReader(Interface) ---> Mapper --->}


  • We do not need to write extra code for the RecordReader (interface); the Hadoop framework takes care of it.
  • How does the RecordReader read these input splits or records (i.e. on what basis does it convert each record into a [key, value] pair)?

 Answer: There are 4 input formats to do this:

1) TextInputFormat
2) KeyValueTextInputFormat
3) SequenceFileInputFormat
4) SequenceFileAsTextInputFormat

By Default: TextInputFormat
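
The input format is selected in the driver code; a small sketch, again assuming the JobConf-based driver used in these notes:

// inside DriverCode's main(); TextInputFormat is the default, so this line is optional
conf.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);
// for example, to read each line as key<TAB>value instead of (byteoffset, line):
// conf.setInputFormat(org.apache.hadoop.mapred.KeyValueTextInputFormat.class);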
 
    Now, if the format is TextInputFormat, how does the RecordReader convert each record into [key, value]? ---> (byteoffset, entire line).

byteoffset: the byte address (offset) of the line.                entire line: the line that is read by the RecordReader.

For example:

(0, hi how are you)
(15, how is your job)        <-- "hi how are you" is 14 characters plus the newline, so the next line starts at byte offset 15

  1. There will be as many (byteoffset, line) pairs as there are lines, and the mapper will run that many times.
  2. You will write only 1 mapper code.
  3. The mapper code produces the pairs given below (a sketch of such a mapper follows the list of pairs):

(hi,1)
(how,1)
(are,1)
(you,1)

hi/how/... : Text
1          : IntWritable
so each pair is of type (Text, IntWritable)
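
A hedged sketch of what MapperCode.java could look like for this job (old mapred API assumed; the tokenizing is the standard word-count approach):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapperCode extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    // key = byteoffset, value = the entire line handed over by the RecordReader
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens())
        {
            word.set(itr.nextToken());
            output.collect(word, ONE);   // emits pairs like (hi,1), (how,1), ...
        }
    }
}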


  1.  The data coming out of the mapper code is intermediate data, and it is sent further to the REDUCER for processing.
  2.  Keys shouldn't be duplicated, but values can be duplicated.
  3. There are two further phases through which the intermediate data passes: Shuffling & Sorting.
  4. SHUFFLING PHASE: combines all the values associated with a single identical key, e.g. (how,[1,1,1,1,1]), (is,[1,1,1,1,1,1]), etc.
  5. SORTING PHASE: done automatically; the keys are compared and sorted, giving non-duplicate key,value groups.
  6. This makes the system parallel, which is the main objective of Hadoop.
  7. In the Java collection framework, we use wrapper classes instead of primitive types:

Apart from these, Hadoop introduces box classes:

Wrapper class                           Primitive type                   Box class

1) Integer                              int                              IntWritable
2) Long                                 long                             LongWritable
3) Float                                float                            FloatWritable
4) Double                               double                           DoubleWritable
5) String                               -                                Text
6) Character                            char                             -do- (Text)


  • If I want to convert int --> IntWritable:

new IntWritable(value);
and for the reverse (IntWritable --> int):
get();

  • If I want to convert between String and Text:


new Text(string);    for String --> Text
toString();          for Text --> String
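
A small self-contained sketch of these conversions (the class name BoxClassDemo is only illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo
{
    public static void main(String[] args)
    {
        IntWritable boxedInt = new IntWritable(5);   // int --> IntWritable
        int plainInt = boxedInt.get();               // IntWritable --> int

        Text boxedText = new Text("hadoop");         // String --> Text
        String plainString = boxedText.toString();   // Text --> String

        System.out.println(plainInt + " " + plainString);
    }
}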


  • FINALLY THE REDUCER GIVES THE OUTPUT AND THAT OUTPUT IS PASSED TO THE RECORDWRITER (a sketch of such a reducer is given at the end of this section)

The RECORDWRITER knows how to write key,value pairs.

  • The output of the RECORDWRITER goes into the output file O/P, which is also termed part-00000.
  • Now there is an output directory named TestOutput. Besides the output file, it contains 2 more entries, namely:

                                                          _SUCCESS, _logs and part-00000
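
For completeness, a hedged sketch of what ReducerCode.java could look like for this word-count job (old mapred API assumed): it sums the values grouped under each key, and the framework then uses the RecordWriter to write the result into part-00000.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ReducerCode extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    // key = a word, values = all the 1s grouped for it after shuffling, e.g. (how,[1,1,1])
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));   // e.g. (how, 3)
    }
}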