Tuesday 4 August 2015

Hadoop: Partitioner


  • A custom Partitioner can improve application performance by balancing the work across reducers.
  • The number of reducers is set in the driver code.
  • The Partitioner decides which of the available reducers each (key, value) pair is sent to.
  • Partitioner code:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each (key, value) pair to a reducer based on the length of the key.
public class MyPartitioner implements Partitioner<Text, IntWritable>
{
    public void configure(JobConf conf)
    {
        // nothing to configure here
    }

    public int getPartition(Text key, IntWritable value, int numReduceTasks)
    {
        String s = key.toString();
        if (s.length() == 1)
        {
            return 0;
        }
        if (s.length() == 2)
        {
            return 1;
        }
        if (s.length() == 3)
        {
            return 2;
        }
        return 3;
    }
}


  • Configure the partitioner class in the DriverCode:

conf.setPartitionerClass(MyPartitioner.class);

  • Set the number of reduce tasks:

conf.setNumReduceTasks(4);

  • We can work with up to 10^5 reducers (maximum); the corresponding output files are named

part-00000 to part-99999
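
Putting the pieces together, the relevant lines inside the driver code might look like the following sketch (MapperCode and ReducerCode are assumed to be the WordCount classes from the Hello World job described later in these notes; the names are illustrative):

// Sketch of the driver fragment that wires in the custom partitioner.
JobConf conf = new JobConf(DriverCode.class);
conf.setJobName("wordcount-with-partitioner");

conf.setMapperClass(MapperCode.class);          // assumed WordCount mapper
conf.setReducerClass(ReducerCode.class);        // assumed WordCount reducer

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setPartitionerClass(MyPartitioner.class);  // route keys by key length
conf.setNumReduceTasks(4);                      // one reducer per partition 0..3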

Hello World Job


  • Step1: creating a file

J$ cat>file.txt
hi how are you
how is your job
how is your family
what is time now
what is the strength of hadoop
Ctrl+D (to save and exit)


  • Step2: loading file.txt from local file system to HDFS

J$hadoop fs -put file.txt file

  • Step3: Writing programs

DriverCode.java
MapperCode.java
ReducerCode.java
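
The three programs are not reproduced in these notes, so here is a rough sketch of MapperCode.java and ReducerCode.java, assuming the classic WordCount written against the old org.apache.hadoop.mapred API (class and variable names are illustrative; each class goes in its own .java file):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// MapperCode.java: emits (word, 1) for every word in each input line.
public class MapperCode extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable byteOffset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, one);   // e.g. (hi, 1), (how, 1), ...
        }
    }
}

// ReducerCode.java: sums the 1s collected for each word after shuffle/sort.
public class ReducerCode extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        output.collect(word, new IntWritable(sum));  // e.g. (how, 5)
    }
}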


  • Step4: Compiling all above .java files

J$javac -classpath $HADOOP_HOME/hadoop-core.jar *.java


  • Step5: creating jar file


J$jar cvf test.jar *.class
*.class: all the compiled .class files will be included in the jar


  • Step6: running the above test.jar on the file 'file' (which is in HDFS)


J$hadoop jar test.jar DriverCode file TestOutput
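
DriverCode itself is not shown in these notes; a minimal sketch of DriverCode.java (old mapred API, illustrative only), which maps the two command-line arguments 'file' and 'TestOutput' onto HDFS input and output paths, might look like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// DriverCode.java: args[0] is the HDFS input ("file"), args[1] the output directory ("TestOutput").
public class DriverCode {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DriverCode.class);
        conf.setJobName("wordcount");

        conf.setMapperClass(MapperCode.class);
        conf.setReducerClass(ReducerCode.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}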

A screenshot of the above steps will be available soon!

Combiner (HDFS + MapReduce)


  • What is a Combiner? It is a mini reducer.
  • We use the combiner to decrease network traffic and increase performance.
  • Duplicate (key, value) pairs in each mapper's output collection are merged by the combiner, so only the combined (key, value) pairs leave the map side.
  • Each mapper (i.e. each input split) has its own combiner; the outputs of all the combiners are then brought together for the reducer, and the reducer writes the final result to the O/P file (see the driver sketch below).
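
In the driver code the combiner is plugged in with a single call; a minimal sketch, assuming the WordCount ReducerCode above can double as the combiner because its summing logic is associative:

conf.setCombinerClass(ReducerCode.class);   // runs a mini reduce on each mapper's local output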


MapReduce Architecture


  • 'WordCountJob' is the "HelloWorld" program in Hadoop. Through this job we will walk through the MapReduce flow chart.
  • This is how we run our program:

                                      "J$hadoop jar test.jar DriverCode file.txt TestOutput"

  • Now consider the previous file, split into 4 input splits


INPUT SPLITs
           ---> 1st input split --> RecordReader(Interface) --> Mapper ->}
           ---> 2nd input split --> RecordReader(Interface) --> Mapper ->}  REDUCER
           ---> 3rd input split --> RecordReader(Interface) --> Mapper ->}  (Identity REDUCER)
           ---> 4th input split --> RecordReader(Interface) --> Mapper ->}


  • We need not write any extra code for the RecordReader (interface); the Hadoop framework takes care of it.
  • How does the RecordReader read these input splits or records? That is, on what basis does the RecordReader convert these records into [key, value] pairs?

 Answer: There are 4 formats to do this:

1) TextInputFormat
2) KeyValueTextInputFormat
3) SequenceFileInputFormat
4) SequenceFileAsTextInputFormat

By Default: TextInputFormat
 
    Now, if the format is TextInputFormat, how does the RecordReader convert each record into [key, value]? It produces (byteoffset, entireline).

byteoffset: the starting byte address of the line.                entireline: the line read by the RecordReader

For eg:

(0, hi how are you)
(16,how is your job)
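
As a small sketch, this default can also be set explicitly in the driver code (optional, since TextInputFormat is already the default input format):

conf.setInputFormat(TextInputFormat.class);   // RecordReader will emit (LongWritable byteoffset, Text line) pairs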

  1. There will be as many (byteoffset, entireline) pairs as there are lines, and the mapper will run that many times.
  2. You write only one MapperCode.
  3. The MapperCode produces the pairs given below:

(hi,1)
(how,1)
(are,1)
(you,1)
hi/how/... : Text
1 : IntWritable
so the mapper output type is (Text, IntWritable)


  1. The data after the MapperCode is intermediate data, and this data is further sent to the REDUCER for processing.
  2. Keys should not be duplicated, but values can be duplicated.
  3. The intermediate data now passes through two further phases: Shuffling & Sorting.
  4. SHUFFLING PHASE: combines all the values associated with a single identical key, e.g. (how,[1,1,1,1,1]), (is,[1,1,1,1,1,1]), etc.
  5. SORTING PHASE: done automatically; the keys are compared so that the reducer receives non-duplicate keys in sorted order.
  6. This is what makes it a parallel system: the main objective of Hadoop.
  7. In the Java collection framework, we use wrapper classes instead of primitive types:

Apart from these, Hadoop introduces box classes:

Wrapper class                           Primitive type                   Box Class

1) Integer                              int                              IntWritable
2) Long                                 long                             LongWritable
3) Float                                float                            FloatWritable
4) Double                               double                           DoubleWritable
5) String                               -                                Text
6) Character                            char                             -do- (Text)


  • If I want to convert int --> IntWritable:

new IntWritable(int);
and vice versa (IntWritable --> int):
get();

  • If I want to convert String --> Text (and back):

new Text(String);   // String --> Text
toString();         // Text --> String
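
A small, self-contained sketch of these conversions (the class name BoxClassDemo is just illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo {
    public static void main(String[] args) {
        // int <--> IntWritable
        IntWritable boxedInt = new IntWritable(42);  // int --> IntWritable
        int plainInt = boxedInt.get();               // IntWritable --> int

        // String <--> Text
        Text boxedText = new Text("hadoop");         // String --> Text
        String plainText = boxedText.toString();     // Text --> String

        System.out.println(plainInt + " " + plainText);
    }
}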


  • FINALLY, THE REDUCER GIVES ITS OUTPUT, AND THAT OUTPUT IS PASSED TO THE RECORDWRITER

The RECORDWRITER knows how to write the (key, value) pairs.

  • The output of the RECORDWRITER is written into the output file, which is also termed part-00000.
  • There is an output directory named TestOutput. This output directory contains:

                                                          _SUCCESS, _logs and part-00000.
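
To inspect the result from the command line (a sketch; the exact listing depends on the Hadoop version):

J$ hadoop fs -ls TestOutput
J$ hadoop fs -cat TestOutput/part-00000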

Wednesday 22 July 2015

Previous Continue: JobTracker


  • The JobTracker gets a heartbeat from each TaskTracker every 3 seconds.
  • If the JobTracker does not receive the heartbeat properly, it waits for the next 10 heartbeat intervals (i.e. 30 seconds).
  • If one TaskTracker is down, either because it is not working properly or because it is overloaded with requests from the JobTracker (being the nearest TaskTracker to the JobTracker), then the JobTracker sends the request to another TaskTracker, which may be a bit farther from the JobTracker, so that the task still gets completed.
  • We maintain highly reliable systems for the JobTracker as well.
  • There will be as many mappers as there are input splits.
  • There will be as many output files as there are reducers.

Wednesday 20 May 2015

HDFS: JobTracker


  • The JobTracker takes the client's request and processes that data in HDFS.
  • The JobTracker cannot talk to the DataNodes directly, but it can talk to the NameNode.
  • What the JobTracker says to the NameNode, in layman's language:
"Hey NameNode, I have a client with a file named File.txt, and he wants me to process his file and give its output to the 'testoutput' directory by running a program, say of 10 KB (program.java), on File.txt"
AND "I don't know which blocks I should take to process this request, so give me the details of the request, i.e. send me the MetaData."
  • Now the NameNode checks the file name File.txt (is it there or not?).
  • If the file is there, the NameNode simply sends the MetaData for that file (its block locations in the cluster) to the JobTracker.
  • The JobTracker then selects the nearest machine among the 3 replicas (the machines holding the 3 copies of the same data) and uploads the processing task (the 10 KB code) to it.
  • An Input Split is the set of blocks whose combination forms a file stored in HDFS. For example, if there is a 200 MB file and each block is 64 MB, the file is stored as 3 full blocks (192 MB) plus 1 block holding the remaining 8 MB (leaving 56 MB of that block unused).
  • The file to be stored in HDFS is known as the Input.
  • Uploading the program (code) to a block is known as a MAP.
  • Number of Input Splits = Number of Maps
  • Each DataNode has its own TaskTracker.
  • The TaskTracker is used by the JobTracker: the JobTracker gives tasks to the TaskTracker.
  • The TaskTracker's job is to find the nearest DataNode, fetch the data from it, and carry out the client's request that was assigned to the JobTracker.
  • If the data is not found on one of the replicas, the task is assigned to the TaskTracker of another replica.

HDFS: NameNode


  • The NameNode is also known as a Single Point of Failure.
  • Say the 3 replica copies are distributed among the local DataNodes (DN); each DataNode acknowledges back to its previous (linked) DataNode that it has received the replica of the file and stored it on its local disk.
  • Every DataNode gives a block report and heartbeat to the NameNode.
  • There is data called MetaData, in which the names of the files and the locations of their replicas are recorded.
  • If this MetaData is lost in any way, Hadoop is of no use, or rather we will not be able to get the benefits of Hadoop: the entire cluster becomes inaccessible and HDFS will not work for that cluster.
  • The MetaData is stored only on the NameNode's hard disk.
  • The NameNode generally communicates with cheap (commodity) hardware, but it is better to host the NameNode itself on highly reliable hardware.
  • To store data in the cluster, we generally create blocks of a large size, say 64 MB, instead of 4 KB blocks (the usual default size). Why? Because every block a file occupies produces a log entry in the MetaData; with 4 KB blocks a file would produce an enormous number of entries, whereas with 64 MB blocks far fewer entries are produced.
  • To get the data from the server/Hadoop, we write code in Java, Python, or any other language, whose size is measured in KBs. We then upload that code and get access to our data.
By this we are now ready to understand the concept of the JobTracker.

Tuesday 19 May 2015

HDFS

In JAVA there is a slogan "Write Once, Run Anywhere"; in HADOOP it is "Write Once, Read any number of times, and DON'T CHANGE THE CONTENT OF THE FILE" (Streaming Access Pattern).

HDFS has 5 services:

  • NameNode
  • Secondary NameNode
  • JobTracker
  • DataNode
  • TaskTracker
  1. The first three are Master services and the remaining two are Slave/Daemon services. We cannot see these services working, as they all work internally.
  2. The Master services can all talk to each other, and so can the Slave services.
  3. If 'NameNode' is a master node then its corresponding slave node is 'DataNode'. If 'JobTracker' is a master node then its corresponding slave node is 'TaskTracker'.
  4. A master service can talk to its own slave service but cannot talk to the slave service of another master service.
  5. The NameNode acts as a manager that decides in which sector or region of the storage devices the data is stored.

Hadoop

Now we are about to start with an open source framework, i.e. HADOOP.

What is Hadoop?


Ans: Hadoop is basically a combination of HDFS and MapReduce. Now the question arises: what are "HDFS and MapReduce"? To understand this, let's dig into the history of HADOOP.


  • It was founded by Doug Cutting.
  • In the very first place, Google analysed the problems of BIG DATA and, to overcome them, proposed the concept of GFS: the GOOGLE FILE SYSTEM. GFS is a distributed file system, designed to provide efficient, reliable access to data using large clusters of commodity hardware.
  • Then MapReduce was introduced on top of GFS: a model for processing and generating large data sets with a parallel, distributed algorithm on a cluster, working on the data (such as log files) stored in the storage devices.
  • In 2003, GFS: Google File System was launched.
  • In 2004, MapReduce was launched.
  • In 2006-07, HDFS: Hadoop Distributed File System was launched.
  • In 2007-08, Hadoop MapReduce (the technique to process the files in HDFS) was launched.
  • HDFS: Hadoop Distributed File System is the technique to store data on commodity hardware.
Why is Hadoop named 'Hadoop', and what is the reason behind its elephant symbol?

Ans: The answer is funny but true: Doug Cutting's child had an elephant-shaped toy named 'Hadoop'. From this, Cutting named the project Hadoop and gave it the elephant symbol.


Saturday 16 May 2015

Why? & What?


WHY HADOOP?


Big Data analytics and the Apache Hadoop open source project are rapidly emerging as the preferred solutions for addressing the business and technology trends that are disrupting traditional data management and processing.

Enterprises can gain a competitive advantage by being early adopters of  big data analytics.



WHAT IS HADOOP?


There are mainly 3 segments which we will discuss here; they are:


  • GFS
  • HDFS
  • MapReduce

In Industries

Big Data has provided many opportunities to small as well as giant MNCs in the fields of Financial Services, Healthcare & Life Sciences, Retail, Media and Telecommunications, Manufacturing, Advertising & Public Relations, Energy and Government.


  1. Retail
  • CRM-Customer Scoring
  • Store Siting and Layout
  • Fraud Detection/Prevention
  • Supply Chain Optimization
     2. Financial Services

  • Algorithmic Trading
  • Risk Analysis
  • Fraud Detection
  • Portfolio Analysis
     3. Manufacturing

  • Product Research
  • Engineering Analysis
  • Process & Quality Analysis
  • Distribution Optimization
     4. Government

  • Market Governance 
  • Counter-Terrorism
  • Econometrics
  • Health Informatics
     5. Advertising & Public Relations

  • Demand Signaling
  • Ad Targeting
  • Sentiment Analysis
  • Customer Acquisition
      6. Media & Telecommunications

  • Network Optimization
  • Customer Scoring
  • Churn Prevention
  • Fraud Prevention
      7. Energy

  • Smart Grid
  • Exploration
      8. Healthcare & Life Sciences

  • Pharmaco-Genomics
  • Bio-Informatics
  • Pharmaceutical Research
  • Clinical Outcomes Research

Friday 15 May 2015

Techniques( Big Data)

When is Big-Data really a hard problem?


  • When the operations on data are complex:
                   Simple counting is not a complex problem; modelling and reasoning with data of different kinds can get extremely complex.


  • Good news about Big-Data:
                   Often, because of the vast amount of data, modelling techniques can get simpler (e.g. smart counting can replace complex model-based analytics) as long as we can deal with the scale.



What matters when dealing with data?


  • Research areas (such as IR, KDD, ML, NLP, SemWeb, etc) are subcubes within the data cube.

Thursday 14 May 2015

Tools typically used in Big-Data scenarios


  • NoSQL databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
  • MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
  • Storage: S3, Hadoop Distributed File System
  • Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
  • Processing: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, TinkerPop

Wednesday 13 May 2015

Characterization of 'Big Data': Volume, Velocity, Variety (V3)

Big Data is characterized by three main Vs, i.e.
  • Volume
  • Velocity
  • Variety

SMART DATA not BIG DATA

Smart data is not big data. A link is attached (Click here) to understand this.


What is Big Data?

Nowadays, as the size of organisations keeps growing, the concept of BIG DATA has become a 'Trending_Topic'. It is essential for big organizations and enterprises that deal with Terabytes, Petabytes and Exabytes of data.

Big Data can be defined as


"extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions."


'Big Data'-A Boon for Organisation
Here are some real-world examples of Big Data in action:


  • Consumer product companies and retail organizations are monitoring social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception.



  • Manufacturers are monitoring minute vibration data from their equipment, which changes slightly as it wears down, to predict the optimal time to replace or maintain. Replacing it too soon wastes money; replacing it too late triggers an expensive work stoppage.
  • Manufacturers are also monitoring social networks, but with a different goal than marketers: They are using it to detect aftermarket support issues before a warranty failure becomes publicly detrimental.
  • The government is making data public at the national, state, and city levels for users to develop new applications that can generate public good.
  • Financial Services organizations are using data mined from customer interactions to slice and dice their users into finely tuned segments. This enables these financial institutions to create increasingly relevant and sophisticated offers.
  • Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising mediums.
  • Insurance companies are using Big Data analysis to see which home insurance applications can be immediately processed, and which ones need a validating in-person visit from an agent.
  • By embracing social media, retail organizations are engaging brand advocates, changing the perception of brand antagonists, and even enabling enthusiastic customers to sell their products.
  • Hospitals are analyzing medical data and patient records to predict those patients that are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
  • Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.
  • Sports teams are using data for tracking ticket sales and even for tracking team strategies.