
Head first Hadoop ecosystem

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

HDFS is a distributed file system that stores data on commodity hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

MapReduce is a programming model and software framework for processing and generating large data sets. Applications written with it process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
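To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the example. It is plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence (hypothetical driver, not Hadoop)."""
    intermediate = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data on clusters"]))
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, the calls are independent of each other, which is what lets a framework spread them across machines.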

YARN is a resource management platform that manages the computing resources in a cluster and schedules users’ applications onto them.

Spark is a fast and general engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. It provides a mechanism to project structure on top of a variety of data formats and provides a simple SQL-like language called HiveQL to query the data.

Pig Latin is a high-level data flow language for analyzing large data sets; a script consists of a sequence of data flow statements. Pig compiles a script into one or more MapReduce jobs and executes them on Hadoop.

Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
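As a rough illustration of what "stateful computation over a stream" means, the sketch below keeps a running total per key while consuming events one at a time. It is plain Python, not the Flink API: `running_totals` and the event shape `(key, amount)` are assumptions chosen for this example, and the state lives in a local dictionary rather than in a distributed, fault-tolerant store.

```python
from collections import defaultdict
from itertools import count, islice
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(events: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Yield the updated total for an event's key after each event arrives."""
    state: Dict[str, int] = defaultdict(int)  # per-key state kept between events
    for key, amount in events:
        state[key] += amount
        yield key, state[key]


if __name__ == "__main__":
    # Bounded stream: a finite list of events.
    print(list(running_totals([("a", 1), ("b", 5), ("a", 2)])))

    # Unbounded stream: an endless generator; take only the first three results.
    endless = (("k", n) for n in count(1))
    print(list(islice(running_totals(endless), 3)))
```

The same function handles both cases because it never needs to see the end of its input; it emits a result as soon as each event is processed.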

Storm is a distributed, fault-tolerant, real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

  1. Licensed under CC BY-NC 4.0
  2. Copyright issue feedback, replace # with @
  3. Not all commands and scripts have been tested in a production environment; use them at your own risk
  4. No privacy information is collected here