1. Spark - SparkContext
SparkContext allows your Spark application to access the Spark cluster with the help of a resource manager, which can be one of three: Spark Standalone, YARN, or Apache Mesos.
The different master settings under which it can run are local mode, yarn-client, a Mesos URL, or a Spark standalone URL.
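As a minimal sketch (the application name and file-free local master below are placeholder choices), a SparkContext is created from a SparkConf:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "local[*]" runs everything in one JVM; on a cluster you would pass "yarn",
// a Mesos URL, or a standalone URL such as spark://host:7077 instead.
val conf = new SparkConf()
  .setAppName("MySparkApp")   // hypothetical application name
  .setMaster("local[*]")
val sc = new SparkContext(conf)
```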
1.1 SparkContext Functions
1.2 SparkContext Conclusion
Hence, SparkContext provides various functions in Spark, such as getting the current status of the Spark application, setting the configuration, cancelling a job, cancelling a stage, and much more. It is the entry point to Spark functionality and thus acts as the backbone of a Spark application.
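For illustration, a few of these operations on the public SparkContext API look like this (assuming `sc` is the SparkContext created above):

```scala
sc.setLogLevel("WARN")                            // set the log level
val appId  = sc.applicationId                     // id of the running application
val active = sc.statusTracker.getActiveJobIds()   // query the current job status
sc.cancelAllJobs()                                // cancel scheduled or running jobs
sc.stop()                                         // shut the context down when finished
```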
2. Spark - Stage
Spark Stage - An Introduction to the Physical Execution Plan
- ShuffleMapStage
- ResultStage
2.1 What are Stages in Spark?
A stage is a physical unit of execution: a set of parallel tasks, one per partition, with the boundary of a stage in Spark marked by shuffle dependencies.
Ultimately, submitting a Spark stage triggers the execution of the series of dependent parent stages. Every stage also carries a first job id, which is the id of the job that first submitted the stage.
2.2 ShuffleMapStage
ShuffleMapStage is an intermediate stage in the physical execution DAG: it produces the map output files for a shuffle, which the stages that depend on it later fetch.
2.3 ResultStage
ResultStage is the final stage in a job: it applies a function to one or many partitions of the target RDD in order to compute the result of an action.
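As a sketch of how these stages arise (placeholder input path, `sc` as before): the reduceByKey below introduces a shuffle dependency, so the map side runs as a ShuffleMapStage and the collect() action runs as the final ResultStage.

```scala
// Two-stage word count: everything up to the shuffle runs as a ShuffleMapStage;
// the collect() action at the end runs in the final ResultStage.
val counts = sc.textFile("hdfs:///data/input.txt")   // placeholder path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)         // shuffle dependency: marks the stage boundary
counts.collect()              // action: submits the job and its two stages
```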
3. Spark - Executor
Apache Spark Executor for Executing Spark Tasks
3.1 Spark Executor
Some conditions under which an Executor is created in Spark are:
- When CoarseGrainedExecutorBackend receives the RegisteredExecutor message (Spark Standalone and YARN only).
- When LocalEndpoint is created for local mode.
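The executors themselves are launched by the cluster manager, but the application controls how many it asks for and how large they are. A sketch using standard configuration properties (the values and application name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask for 4 executors, each with 2 cores and 2 GB of memory.
val conf = new SparkConf()
  .setAppName("ExecutorConfigDemo")        // hypothetical application name
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
```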
3.2 Creating Executor Instance
A Spark Executor instance is created from the following:
- The executor ID.
- The SparkEnv, through which the executor can access the local MetricsSystem and BlockManager, as well as the local serializer.
- The executor's hostname.
- A collection of user-defined JARs to add to the tasks' classpath (empty by default).
- A flag indicating whether it runs in local or cluster mode (disabled by default, i.e. cluster mode is assumed).
When creation succeeds, the following INFO message appears in the logs:
INFO Executor: Starting executor ID [executorId] on host [executorHostname]
3.3 Heartbeat Sender Thread
Basically, heartbeater is a single-thread daemon ScheduledThreadPoolExecutor. This thread pool is called driver-heartbeater.
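This is an internal detail, but the pattern is easy to illustrate. A sketch of the same idea (not Spark's actual code): a single daemon thread scheduled at a fixed interval, which in Spark is governed by spark.executor.heartbeatInterval (10 seconds by default).

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

// Single-thread, daemon scheduled executor, analogous to the driver-heartbeater.
val heartbeater = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, "driver-heartbeater")
    t.setDaemon(true)      // daemon, so it never blocks JVM shutdown
    t
  }
})

val beat = new Runnable {
  // Placeholder body: the real heartbeater reports task metrics to the driver over RPC.
  override def run(): Unit = println("heartbeat: reporting liveness and task metrics")
}

heartbeater.scheduleAtFixedRate(beat, 0L, 10L, TimeUnit.SECONDS)   // 10s, like the default interval
```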
3.4 Launching Task Method
3.5 "Executor task launch worker" Thread Pool (threadPool Property)
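A hedged sketch of this pattern (not Spark's source): each launched task is wrapped in a Runnable (Spark's TaskRunner) and submitted to a cached pool of daemon threads named "Executor task launch worker-N". The launchTask helper below is a hypothetical stand-in for the real method.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

// Cached pool of daemon threads using the "Executor task launch worker" naming scheme.
val workerId = new AtomicInteger(0)
val threadPool = Executors.newCachedThreadPool(new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, s"Executor task launch worker-${workerId.getAndIncrement()}")
    t.setDaemon(true)
    t
  }
})

// Hypothetical stand-in for Executor.launchTask: wrap the task and hand it to the pool.
def launchTask(taskId: Long): Unit =
  threadPool.execute(new Runnable {
    override def run(): Unit = println(s"running task $taskId")   // real code deserializes and runs the task
  })
```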
3.6 Conclusion
We have also learned how Spark executors are helpful for executing tasks. The major advantage is that we can have as many executors as we want; executors therefore help to enhance the performance of the Spark system.
4. Spark - RDD
Spark RDD – Introduction, Features & Operations of RDD
4.1 Resilient Distributed Dataset
- Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute missing or damaged partitions caused by node failures.
- Distributed, since the data resides on multiple nodes.
- Dataset represents the records of the data you work with. The user can load the dataset externally, from a JSON file, a CSV file, a text file, or a database via JDBC, with no specific data structure (see the sketch below).
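For example (placeholder path, `sc` as before), the simplest external-load case is a plain text file:

```scala
// Load an external text file as an RDD[String]; each element is one line of the file.
val lines = sc.textFile("hdfs:///data/input.txt")   // placeholder path
val preview = lines.take(3)                         // peek at a few records
```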
4.2 Why do we need RDD in Spark?
Apache Spark evaluates RDDs lazily: they are computed only when needed, the first time they are used in an action, so that Spark can pipeline the transformations. This saves a lot of time and improves efficiency. The programmer can also call a persist method to state which RDDs they want to reuse in future operations.
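A small sketch of this behaviour (placeholder path, `sc` as before): the transformations below only build lineage, nothing runs until the first action, and persist keeps the result around for later actions.

```scala
import org.apache.spark.storage.StorageLevel

val cleaned = sc.textFile("hdfs:///data/input.txt")   // placeholder path
  .filter(_.nonEmpty)
  .map(_.toLowerCase)
  .persist(StorageLevel.MEMORY_ONLY)    // mark for reuse; nothing is computed yet

val total    = cleaned.count()              // first action: triggers the evaluation
val distinct = cleaned.distinct().count()   // later action: reuses the persisted partitions
```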
4.3 Features of Spark RDD
5. Spark RDD Operations
- Transformations
- Actions
5.1 Transformations
a. Narrow Transformations
b. Wide Transformations
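As an illustration (`sc` as before): narrow transformations such as map and filter compute each output partition from a single parent partition, while wide transformations such as reduceByKey depend on many parent partitions and therefore require a shuffle.

```scala
val nums = sc.parallelize(1 to 100)

val narrow = nums.map(_ * 2).filter(_ % 3 == 0)               // narrow: map, filter (no shuffle)
val wide   = narrow.map(n => (n % 10, n)).reduceByKey(_ + _)  // wide: reduceByKey shuffles by key
```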
5.2 Actions
An action in Spark returns the final result of the RDD computations to the driver program (or writes it out to a storage system). It triggers execution of the lineage graph, loading the data into the original RDD and carrying out all the intermediate transformations.
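For example (`sc` as before), each of the calls below is an action that forces the lineage to execute and returns a value to the driver:

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)     // transformation: still lazy

val asArray = summed.collect()            // action: returns all records, e.g. Array(("a", 4), ("b", 2))
val howMany = summed.count()              // action: number of records
val biggest = summed.values.max()         // action: largest value in the RDD
```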
5.3 Conclusion – Spark RDD
Because of the limitations of RDD, and to make Spark more versatile, DataFrame and Dataset evolved.