Just knowing the data structure and its operations is not enough to understand how Spark operates. The order of execution matters. In fact, it is something a developer has to keep in mind at all times while writing code, so as to avoid silly mistakes.
Spark is lazy in execution
Always remember - Spark is lazy. Lazy is the keyword while writing code. What exactly do we mean?
Now, we can perform either actions or transformations on RDDs. A transformation always creates another RDD, and that's it. Once the application finishes, these newly transformed RDDs are discarded; they do not exist outside the application. Actions, however, are different. They return values. Maybe files or integers, but they return something which has a real purpose outside the scope of the application.
So when we say Spark is lazy, we mean that Spark doesn't execute its tasks until an action is performed. Spark is aware that transformations on their own yield no value, and it does nothing until it sees that an action is actually performed on the RDD.
Example: If we create an RDD of fruits, then we call the following functions:
- map - convert to upper case
- filter - keep only those starting with 'B'
- save result as a file
The code for the above example looks like this:
Steps 1 and 2 are transformations, while step 3 is an action. Since Spark is lazy, nothing will really happen until Spark reaches step 3.
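The same three steps can be mimicked in plain Python to make the laziness visible. This is only an analogy and not Spark code: Python's built-in `map` and `filter` happen to be lazy too, `list(...)` stands in for the save action, and the sample fruits and the `calls` log are assumptions made up for this sketch.

```python
# Plain-Python analogy of lazy evaluation; this is NOT the Spark API.
calls = []  # records every time a step actually does work

def to_upper(fruit):
    calls.append(("map", fruit))
    return fruit.upper()

def starts_with_b(fruit):
    calls.append(("filter", fruit))
    return fruit.startswith("B")

fruits = ["apple", "banana", "blueberry", "cherry"]  # hypothetical sample data

# Steps 1 and 2: building the pipeline does no work yet.
upper = map(to_upper, fruits)
only_b = filter(starts_with_b, upper)
work_before_action = len(calls)  # nothing has been called so far

# Step 3: the "action" consumes the pipeline; only now do steps 1 and 2 run.
result = list(only_b)
work_after_action = len(calls)   # one map call and one filter call per fruit
```

If the last step is never written, `to_upper` and `starts_with_b` are never called at all, which is the behavior described above.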
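What we expect is that each step runs as soon as it is written: first the map, then the filter, then the save.
But what happens is that steps 1 and 2 are only noted down, and all three steps run once step 3 is reached.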
This is a beautiful feature of the framework - the executors don't do any processing until they see there's something valuable coming out of it. Isn't that similar to human behavior? If we never called step 3, i.e. we never saved that file, why would we even bother to do steps 1 and 2? It would be a complete waste of time and resources to perform transformations without actions (it's useless to perform transformations when the result won't be kept in any way).
While Spark is lazy, Spark is lazy-efficient. Perhaps more efficient than we as developers are. It is like the saying often attributed to Bill Gates, that lazy people find efficient ways to do things.
In the above example, we imagined Spark working like a batch job, i.e. read all the inputs, then process them all, and then write them all at once. It may really not be so. Spark can run a sequence of transformations element by element, so that no intermediate data has to be stored.
So in our above example, the actual execution occurs as follows:
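The difference between the two orders of execution can be sketched in plain Python. This is an analogy under stated assumptions, not Spark code: the functions `batch` and `pipelined`, the log format and the sample fruits are all made up for illustration.

```python
# Plain-Python analogy; not Spark code. Sample data is hypothetical.
fruits = ["apple", "banana", "cherry"]

def batch(items):
    """Finish each step for all elements before starting the next step."""
    log = []
    upper = []
    for f in items:
        log.append(f"map {f}")
        upper.append(f.upper())      # intermediate list is kept in memory
    kept = []
    for f in upper:
        log.append(f"filter {f}")
        if f.startswith("B"):
            kept.append(f)           # second intermediate list
    for f in kept:
        log.append(f"write {f}")
    return log

def pipelined(items):
    """Push one element through every step before touching the next element."""
    log = []
    for f in items:
        log.append(f"map {f}")
        u = f.upper()
        log.append(f"filter {u}")
        if u.startswith("B"):
            log.append(f"write {u}")  # no intermediate lists are needed
    return log

batch_log = batch(fruits)
pipelined_log = pipelined(fruits)
```

Both logs contain the same operations; only their order differs. In the pipelined version "banana" is mapped, filtered and written before "cherry" is even looked at.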
So which approach does Spark take in which case? We cannot always tell in advance. But what we can say is that we can generally trust Spark with the approach it chooses. It is designed to pick an efficient one.
I think it's a good time now to understand how partitions are handled in Spark.
Now, we may have several executors in our cluster. That is completely up to us and how we configure Spark. What is not entirely up to us is how the data is partitioned. To cite an example, let's say we want to read an input file of roughly 10,000 bytes. We can only specify the minimum number of partitions that the input file is divided into. Spark makes a judgement and decides how many partitions of the original file will actually be made. If we specified a minimum of 2 partitions, we'd expect Spark to divide the file into two parts of 5,000 bytes each. But if Spark calculates that splitting the file into 10 partitions will make the application run better, it will create 10 parts of 1,000 bytes each. And at read time you as a developer can do nothing about it. You have to leave certain things to the framework, and this is one of them.
Similarly, the RDDs are also partitioned in reality. If we want to process an input file by loading it into an RDD, each executor reads its partition of the file and creates its "part" of the RDD. This means that in reality the whole RDD, as we imagine it, is distributed across different executors.
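The arithmetic of the 10,000-byte example can be sketched as follows. This is a minimal illustration, not how Spark computes its splits: `split_ranges` is a hypothetical helper, and the value of `chosen` is simply assumed here, since the rule Spark uses to pick the actual count is not modeled.

```python
def split_ranges(total_bytes, num_partitions):
    """Split [0, total_bytes) into num_partitions contiguous (start, end) ranges."""
    base, extra = divmod(total_bytes, num_partitions)
    ranges = []
    start = 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first ranges
        ranges.append((start, start + size))
        start += size
    return ranges

min_partitions = 2  # what the developer asks for
chosen = 10         # hypothetical count the framework settles on (at least the minimum)

two_parts = split_ranges(10_000, min_partitions)  # what we might expect
ten_parts = split_ranges(10_000, chosen)          # what we may actually get
```

Either way the ranges cover the whole file exactly once; the developer's number only sets a lower bound on how finely it is cut.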
It is important to remember at this point that each executor is isolated from the others. This means each executor has its own set of variables. So when the input file is to be read in partitions, the Spark Application just decides the boundaries of each partition. It sends each executor the start and end points of its partition, and that executor reads its particular part of the file. Note that for this to work every executor must be able to reach the whole file, for example through a shared or distributed file system, or through a copy at the same path on every worker.
Now, while writing output, each executor writes its own partition as a separate part file inside the output directory (part-00000, part-00001 and so on), and together these parts are treated as one output.
You also cannot control individual executors in any way. You can, however, control how all executors behave. How? By using the Spark API.
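What "its own set of variables" means in practice can be sketched in plain Python. This is a simulation under stated assumptions, not Spark code: the `Executor` class is hypothetical, and `copy.deepcopy` stands in for the way a variable is shipped to each executor as a separate copy.

```python
import copy

class Executor:
    """Stand-in for a Spark executor; a hypothetical class for this sketch only."""

    def __init__(self, state, lines):
        self.state = copy.deepcopy(state)  # each executor works on its own copy
        self.lines = lines

    def run(self):
        for _ in self.lines:
            self.state["count"] += 1       # updates the local copy only
        return self.state["count"]

driver_state = {"count": 0}
partitions = [["a", "b", "c"], ["d", "e"]]  # hypothetical data split in two

executors = [Executor(driver_state, part) for part in partitions]
local_counts = [ex.run() for ex in executors]
```

Each executor counts only the records of its own partition, and the original `driver_state` is left untouched. To get a total, the per-executor results have to be brought back and combined, which is what an action does.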
The Spark Application partitions the data, something we have understood. But is the order of the records lost anywhere?
The two transformations map and filter never change the order of rows. This means the nth line of the input is the nth record of the RDD, and when it is mapped into another RDD it still remains the nth record. A filter drops some records, but the ones that survive keep their original relative order.
However, there are other transformations which are heavier and do not maintain the order. I haven't introduced them so far - groupBy and join. You can read more about them later.
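A small plain-Python sketch shows why order survives partitioning here. It is an illustration and not Spark code: the input lines and the way they are sliced into three partitions are assumptions made up for the example.

```python
lines = ["apple", "banana", "blueberry", "cherry", "blackberry"]  # hypothetical input
partitions = [lines[0:2], lines[2:4], lines[4:5]]                  # contiguous slices

# Apply map and filter to each partition independently.
mapped = [[s.upper() for s in part] for part in partitions]
filtered = [[s for s in part if s.startswith("B")] for part in mapped]

# Read the partitions back in partition order.
flat_mapped = [s for part in mapped for s in part]
flat_filtered = [s for part in filtered for s in part]
```

Since both steps work on one record at a time and never move a record between partitions, reading the partitions back in order gives the records in their input order.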