🔸 Spark's Distributed Execution 🔸
• Spark is a distributed data processing engine that runs on clusters of machines.
• The components of Spark's distributed architecture work together to process data in parallel.
Let's discuss them →
There are multiple components involved:
1. Spark Driver
2. SparkSession
3. Spark Executor
4. Deployment Modes
5. Data Partitioning
🔸 Spark Driver
• Responsible for orchestrating parallel operations across the cluster.
• It instantiates the SparkSession and uses it to access the distributed components.
• It has multiple roles: talking to the cluster manager, requesting resources, and scheduling tasks on the executors (a sketch follows below).
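A minimal PySpark sketch of the driver at work; the app name and numbers are illustrative. The script itself runs on the driver, and only the tasks the driver schedules run on the executors:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-demo").getOrCreate()

df = spark.range(1_000_000)                   # lazy: nothing runs yet
doubled = df.selectExpr("id * 2 AS doubled")  # still lazy

# The action below makes the driver build a job, split it into
# tasks (one per partition), and schedule them on the executors.
print(doubled.count())

spark.stop()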
🔸 SparkSession
• It is the entry point to all of Spark's functionality. In the interactive shells it is pre-created as the variable spark (sc is the older handle to the underlying SparkContext); in a standalone application you create it yourself, as sketched below.
• Through it you create DataFrames, run SQL, and read data; note that managing and allocating resources across the cluster's nodes is the cluster manager's job, not the SparkSession's.
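A minimal sketch of building one yourself (the app name is illustrative):

from pyspark.sql import SparkSession

# Build (or reuse) the single SparkSession for this application.
spark = (SparkSession.builder
         .appName("my-app")
         .getOrCreate())

# The underlying SparkContext is still reachable if needed.
sc = spark.sparkContext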
🔸 Spark Executor
• Executors run on the worker nodes of the cluster and execute the tasks the driver assigns to them.
• In standalone mode there is typically one executor per node per application; cluster managers such as YARN and Kubernetes can place multiple executors on a node.
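Executor resources are requested per application. A sketch with illustrative values (spark.executor.instances is honored on YARN and Kubernetes):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("executor-sizing")
         .config("spark.executor.instances", "4")  # number of executors
         .config("spark.executor.cores", "2")      # cores per executor
         .config("spark.executor.memory", "4g")    # heap per executor
         .getOrCreate())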
🔸 Deployment Modes
• Spark supports multiple deployment modes, allowing it to run in different configurations and environments.
• It can run locally on a single machine, on its own standalone cluster manager, or on managers such as Apache Hadoop YARN and Kubernetes (see the sketch below).
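The master URL selects the environment; a sketch where host names and ports are placeholders (client vs. cluster deploy mode is usually chosen with spark-submit's --deploy-mode flag):

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("deploy-demo")

# Pick one master URL depending on where the app should run:
builder = builder.master("local[*]")                    # single machine, all cores
# builder = builder.master("spark://host:7077")         # Spark standalone cluster
# builder = builder.master("yarn")                      # Hadoop YARN
# builder = builder.master("k8s://https://host:6443")   # Kubernetes

spark = builder.getOrCreate()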
🔸 Distributed Data & Partitions
• Data is broken into partitions distributed across the cluster. Spark treats each partition as a high-level logical abstraction over a chunk of the data.
• Each executor preferentially reads the partitions closest to it (data locality). Partitioning enables efficient parallelism and minimizes network traffic; a sketch follows below.
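A quick sketch of inspecting and changing partitioning (the count of 8 is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())   # how the data is currently split

# Redistribute into 8 partitions so 8 tasks can run in parallel.
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())  # 8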
So that's it for distributed execution; we'll cover a new Spark concept in the next thread.
Any comments, suggestions, or corrections are welcome.
Thanks for reading 🧡
Consider following @capeandcode