♻️ MapReduce & Spark
MapReduce
| | MapReduce 1.0 | MapReduce 2.0 (YARN) |
|---|---|---|
| Programming Model | MapReduce | Multiple programming model support (e.g., MPI) |
| Resource Management | JobTracker (also schedules jobs) | ResourceManager (RM) |
| Application Parallelism | One application at a time | Multiple applications running in parallel |
| Resource Allocation | Slots, statically configured as either map or reduce slots | Containers, dynamically created and allocated to applications |
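
The MapReduce programming model named in the table can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's API: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical, and a real cluster would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both input records. In this sketch the intermediate pairs stay in a Python list; Hadoop MapReduce writes them to disk between the map and reduce phases.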
Hadoop vs Spark
| | Apache Hadoop | Apache Spark |
|---|---|---|
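| Intermediate Data | Disk-based (written to disk between stages) | In-memory (spills to disk when needed) |
| Fault Tolerance | HDFS replication | RDD lineage (lost partitions are recomputed) |
| Data Processing | Batch processing | Batch and near-real-time stream processing |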
| Languages | Primarily Java | Java, Scala, Python, R |
| Ecosystem | Hive, Pig, HBase | Spark SQL, Spark Streaming |
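
The "RDD lineage" entry in the table can be illustrated with a toy model. The sketch below is an assumption-laden simplification in plain Python, not the Spark API: `ToyRDD`, `compute`, and `lose_cache` are hypothetical names. The idea it shows is that each dataset remembers its parent and the function that produced it, so lost results can be rebuilt by replaying those steps instead of being restored from replicated copies.

```python
class ToyRDD:
    """Hypothetical, simplified stand-in for an RDD (not the Spark API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # original input data (root dataset only)
        self.parent = parent  # lineage: the dataset this one derives from
        self.fn = fn          # lineage: the transformation applied to the parent
        self.cache = None     # materialized result, which may be lost

    def map(self, fn):
        """Record a transformation lazily; nothing is computed yet."""
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        """Return the data, rebuilding it from lineage if the cache is empty."""
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            data = list(self.source)
        else:
            data = [self.fn(x) for x in self.parent.compute()]
        self.cache = data
        return data

    def lose_cache(self):
        """Simulate a node failure that drops the materialized result."""
        self.cache = None


base = ToyRDD(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
shifted = squared.map(lambda x: x + 1)

first = shifted.compute()

# Simulate losing both derived results, then recover them from lineage.
squared.lose_cache()
shifted.lose_cache()
recovered = shifted.compute()
```

Here `recovered` equals `first` because the chain of `map` steps is replayed from `base`. Real Spark tracks lineage per partition and recomputes only the partitions that were lost, which this sketch does not model.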