We have already seen what Big Data is. There are many technologies and processes for working with Big Data, and Hadoop is one of the most popular.
Hadoop is not a single product, but a collection of components.
There is a core set of components in Hadoop:

- HDFS, which stands for Hadoop Distributed File System,
  - is Hadoop’s way of distributing the data over the network.
  - HDFS replicates the data over the network with a default replication factor of 3, thus providing better availability (from the CAP theorem) and making Hadoop more reliable.
  - HDFS also allows adding more nodes to the setup without affecting the existing ones, thus providing partition tolerance (from the CAP theorem) and making Hadoop more scalable (a short usage sketch follows this list).
- MapReduce, which
  - is Hadoop’s way of taking the processing to the nodes where the data resides.
  - MapReduce is a programming model implemented as a parallel, distributed algorithm on a cluster, and is designed for processing and generating large data sets much faster.
  - MapReduce can be considered a two-step process: one that splits a task into pieces (mapping) and one that combines the results (reducing); the classic word-count example follows this list.
- YARN, which is a newer version of the MapReduce architecture that separates the resource management and job scheduling/monitoring responsibilities of the original MapReduce into separate daemons (a small client sketch follows this list).
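As a rough illustration of how an application interacts with HDFS, here is a minimal Java sketch using Hadoop’s FileSystem API. The fs.defaultFS address and the file paths are placeholders, not values from any real cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; a real cluster would set this in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks and
        // replicates each block (3 copies by default) across the cluster.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input.txt"));

        // The replication factor can also be changed per file.
        fs.setReplication(new Path("/user/demo/input.txt"), (short) 2);

        fs.close();
    }
}
```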
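To make the mapping and reducing steps concrete, here is the classic word-count job in Java, essentially the example from the standard Hadoop tutorials: the mapper emits a (word, 1) pair for every word, and the reducer sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: split each input line into words and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum all the counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```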
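And as a small sketch of the YARN side, the following snippet uses the YarnClient API to ask the ResourceManager (the resource-management daemon) for the list of known applications; it assumes a yarn-site.xml pointing at a real cluster is on the classpath:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnExample {
  public static void main(String[] args) throws Exception {
    // YarnConfiguration picks up yarn-site.xml (ResourceManager address etc.).
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager daemon for all applications it knows about.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + " " + app.getName()
          + " " + app.getYarnApplicationState());
    }

    yarnClient.stop();
  }
}
```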
Hadoop is also an affordable solution: it runs on regular commodity hardware, from a single machine to many, and is open source.
There are also additional technologies like Pig, Hive, HBase, Storm, Spark, Shark etc. that work alongside the core set of HDFS, MapReduce and YARN and help in processing Big Data efficiently.
Spark is a system that many also consider a core technology, since it can work independently of MapReduce and can process data in memory.
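As a minimal sketch of that in-memory model, here is the same word count written against Spark 2.x’s Java API; the master URL and input path are placeholders. The cache() call keeps the intermediate dataset in memory, so later computations over it avoid re-reading the input:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // local[*] runs Spark in-process; a real deployment would use YARN or standalone.
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Placeholder path; Spark can read from HDFS, the local FS, and more.
    JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/user/demo/input.txt");

    // cache() keeps the words RDD in memory, so repeated computations
    // over it skip re-reading and re-splitting the input file.
    JavaRDD<String> words =
        lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()).cache();

    JavaPairRDD<String, Integer> counts =
        words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);

    for (Tuple2<String, Integer> t : counts.collect()) {
      System.out.println(t._1() + ": " + t._2());
    }

    sc.stop();
  }
}
```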