
♻️ MapReduce & Spark

MapReduce

| | MapReduce 1.0 | MapReduce 2.0 (YARN) |
| --- | --- | --- |
| Programming Model | MapReduce | Multiple programming model support (e.g., MPI) |
| Resource Management | JobTracker | ResourceManager (RM) |
| Application Parallelism | One application at a time | Multiple applications running in parallel |
| Resource Allocation | Slots, each statically designated for either map or reduce tasks | Containers, created dynamically and allotted to applications |
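
The difference between static slots and dynamic containers can be illustrated with a toy model. This is a minimal sketch, not Hadoop code: the function names, the slot counts, and the memory figures are all hypothetical, and real YARN scheduling also accounts for vcores, queues, and locality.

```python
def schedule_with_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """MapReduce 1.0 style (toy model): every slot is fixed as map-only
    or reduce-only, so idle reduce slots cannot run pending map tasks."""
    return {
        "map": min(map_tasks, map_slots),
        "reduce": min(reduce_tasks, reduce_slots),
    }


def schedule_with_containers(task_memory_mb, node_memory_mb):
    """YARN style (toy model): containers are carved out of one shared
    memory pool on request, regardless of the kind of task.

    task_memory_mb lists the memory request of each pending task, in order.
    Returns the indices of the tasks that were granted a container.
    """
    granted = []
    free = node_memory_mb
    for index, request in enumerate(task_memory_mb):
        if request <= free:
            free -= request
            granted.append(index)
    return granted


# Four pending map tasks, no reduce tasks, on a node with 2 + 2 slots:
# only two tasks run and both reduce slots sit idle.
print(schedule_with_slots(4, 0, map_slots=2, reduce_slots=2))

# The same four tasks (assumed 1024 MB each) on a 4096 MB node:
# all four receive a container.
print(schedule_with_containers([1024] * 4, node_memory_mb=4096))
```

In the slot model half of the node stays unused whenever the workload is all map or all reduce, which is the inefficiency that dynamically allotted containers avoid.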

Hadoop vs Spark

| | Apache Hadoop | Apache Spark |
| --- | --- | --- |
| Compute Storage | Disk-based | In-memory |
| Fault Tolerance | HDFS replication | RDD lineage |
| Data Processing | Batch processing | Batch, Real-time processing |
| Languages | Primarily Java | Java, Scala, Python, R |
| Ecosystem | Hive, Pig, HBase | Spark SQL, Spark Streaming |
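
To make the "RDD lineage" entry concrete, the sketch below models the idea in plain Python: instead of keeping replicated copies of computed data, a dataset remembers the chain of transformations that produced it and recomputes a lost partition from the source. `ToyRDD` and its methods are hypothetical illustrations, not the Spark API, and the source partitions are assumed to be durable.

```python
class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus the recipe (lineage)
    used to derive the current data from them. Not the Spark API."""

    def __init__(self, source_partitions, transforms=()):
        self._source = [list(p) for p in source_partitions]  # assumed durable
        self._transforms = tuple(transforms)  # the recorded lineage
        self._cache = {}  # partition index -> rows computed in memory

    def map(self, fn):
        # Record the step; nothing is computed yet.
        return ToyRDD(self._source, self._transforms + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._transforms + (("filter", predicate),))

    def _compute(self, index):
        # Replay the whole lineage for one partition.
        rows = self._source[index]
        for kind, fn in self._transforms:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows

    def partition(self, index):
        if index not in self._cache:
            self._cache[index] = self._compute(index)
        return self._cache[index]

    def lose_partition(self, index):
        # Simulate a worker failure that discards in-memory results.
        self._cache.pop(index, None)

    def collect(self):
        return [row for i in range(len(self._source)) for row in self.partition(i)]


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * 10).filter(lambda x: x > 20)
before = rdd.collect()
rdd.lose_partition(1)   # in-memory copy of partition 1 is gone
after = rdd.collect()   # partition 1 is rebuilt by replaying the lineage
print(before == after)
```

Only the lost partition is recomputed; the other partition is served from memory, which is why lineage-based recovery does not require storing multiple copies of intermediate results.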