December 10, 2017

Apache Spark - A dummy's introduction - 1 (Architecture)

A dummy's introduction is split in 4 parts:
  1. Architecture
  2. Data Structure
  3. Execution
  4. Example


I'll introduce the main components of the architecture in order as follows.

Spark Application

There is a manager of these workers, called the Driver Program. Many times this is referred to as the Spark Application itself. You just have to submit your application (the batch, in our case) to the Spark Application and it will take care of the rest. It will divide the whole application into tasks (e.g.: read input, process input, write output). It is also responsible for creating the groups of batches, or partitions (which is a more correct technical term). The driver is lazy (more on that in part 3).

Spark Context

Spark is a framework - in order to use this framework, we need to create an instance of Spark Application. In order to do this, we create an object of the class SparkContext. It is crucial as it establishes the connection to the spark cluster. The special data structures that Spark uses cannot be created on the cluster without SparkContext. Thus each application has to have the SparkContext.

Cluster Manager (Yarn)

Then there is a Cluster Manager (e.g.: Apache Mesos or Yarn). The Cluster Manager exists for resource management. It physically assigns all resources to all workers. Thus it works in coordination with the Spark Application and the workers. However the Spark Application is abstracted from the Cluster Manager's working. As a developer, we might never care to look into more details. However an operations person might be interested to look into its details. The Cluster Manager also runs the Spark UI where we can study the performance of the application. It lets you monitor running jobs, and view statistics and configuration.

Executor

The worker we have been referring to above is called an executor. It is the slave in Spark's master-slave architecture. It receives tasks from the Spark Application and it's job is to execute them as per the schedule. It is important to note that each executor has its own set of resources (as allocated by the Cluster Manager). Which means, to start with, your application (the batch's jar/war file) is copied on to every executor. They have their own variables, cache etc. In effect, you can imagine each executor works independently.

No comments:

Post a comment

Leave me a feedback, I'll be glad to hear you!

Popular Posts