September 26, 2017

Why Apache Spark and what is parallel processing?

Recently at Rakuten we successfully migrated our mainframe from COBOL to Java (from an old Fujitsu machine to Oracle Exalogic) - press release. It was almost a three-year-long project, and towards the end the entire company was working on it. I transitioned into the development team for the mainframe to rewrite several batches from COBOL to Java.

Now of course I cannot read COBOL (although it's not that different). We had machine translation software convert the COBOL to Java but honestly, that hardly helped. The converted code was not smart: it used zero OOP concepts, it used customized stacks to remember function calls (haha, imagine writing your own implementation to remember function calls), and it looked like COBOL had just changed its clothes. As a result, these batches performed very slowly - their run times were significantly higher than their COBOL counterparts for the same data.

So the job was simple - to make these batches faster, in fact much faster than their COBOL counterparts; otherwise, what's the point?

My team chose Apache Spark as the framework for processing data in parallel. In this post, I try to explore why they made this decision and what Spark looks like compared to traditional batch processing.

Background of our batches:


For most of the batch processing we did, there was a fixed pattern:

  1. Read data from database and/or file
  2. Process that data for each row of the database table/ each record of the file
  3. Write to database and/or file.

Assume we have an input file with 100 records which needs to be processed and written into an output file.
Traditional Processing
If we were to use traditional processing, we would read the first record, process it, and write it to the output. Then we would read the next record, process it, and write it. We would keep repeating this until all 100 records are done. This is a single flow - and as we can guess, it would take us a long time to complete the whole operation. Here's how it can be represented:
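To make that single flow concrete, here is a minimal Java sketch of the read-process-write pattern applied one record at a time. The file names and the process() method are made up for illustration; the real per-record logic would go where process() is.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class TraditionalBatch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter writer = new BufferedWriter(new FileWriter("output.txt"))) {
            String record;
            // One record at a time: read -> process -> write, repeated for all 100 records
            while ((record = reader.readLine()) != null) {
                writer.write(process(record));
                writer.newLine();
            }
        }
    }

    // Stand-in for whatever business logic each record needs
    private static String process(String record) {
        return record.toUpperCase();
    }
}
```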

Batch Processing

However, we already use batch processing. We divide our input records into groups - let's say groups of 20. That makes a total of 5 groups.
Here we read the first group together (i.e. 20 records), process them, and write them to the output. Then we read the next group, process it, and write it. We keep repeating this for all 5 groups. Of course this is much faster than traditional processing, since we save a lot of time switching between tasks and dealing with files. Something like this:
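As a rough sketch of the same idea (again with made-up file names and a toUpperCase() stand-in for the real logic), grouping records before processing could look like this:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class GroupedBatch {
    private static final int GROUP_SIZE = 20;

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter writer = new BufferedWriter(new FileWriter("output.txt"))) {
            List<String> group = new ArrayList<>(GROUP_SIZE);
            String record;
            while ((record = reader.readLine()) != null) {
                group.add(record);
                if (group.size() == GROUP_SIZE) {
                    processAndWrite(writer, group); // handle one group of 20 together
                    group.clear();
                }
            }
            if (!group.isEmpty()) {
                processAndWrite(writer, group);     // last partial group, if any
            }
        }
    }

    private static void processAndWrite(BufferedWriter writer, List<String> group) throws IOException {
        for (String record : group) {
            writer.write(record.toUpperCase());     // stand-in for the real logic
            writer.newLine();
        }
    }
}
```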

Parallel Processing

However, what if we used that same batch processing model as above, but with a twist? Let's say you had 5 different machines. You could give each group to one machine, and the total time taken would be roughly one fifth that of the batch processing model. This approach is called parallel processing. Roughly speaking, instead of one worker, you have multiple workers, each working on its own batch:
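The 5 machines are easiest to picture, but the same idea can be sketched on a single machine with threads. Below is a hedged Java sketch using an ExecutorService: the records are still split into groups of 20, but each group is handed to one of five workers and the groups are processed concurrently. File names and the toUpperCase() logic are placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelBatch {
    private static final int GROUP_SIZE = 20;

    public static void main(String[] args) throws IOException, InterruptedException, ExecutionException {
        List<String> records = Files.readAllLines(Paths.get("input.txt"));
        ExecutorService workers = Executors.newFixedThreadPool(5); // five "machines"

        // Submit each group of 20 records to a worker; groups run concurrently
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int start = 0; start < records.size(); start += GROUP_SIZE) {
            List<String> group = records.subList(start, Math.min(start + GROUP_SIZE, records.size()));
            futures.add(workers.submit(() -> {
                List<String> processed = new ArrayList<>();
                for (String record : group) {
                    processed.add(record.toUpperCase()); // stand-in for the real logic
                }
                return processed;
            }));
        }

        // Collect the results in the original group order and write them out
        List<String> output = new ArrayList<>();
        for (Future<List<String>> future : futures) {
            output.addAll(future.get());
        }
        Files.write(Paths.get("output.txt"), output);
        workers.shutdown();
    }
}
```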
This is also the approach that Spark takes to process our batches. The way Spark works is slightly different from the diagram above, but we will come to the details later, in the next post.
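Just as a preview (the details really are for the next post), here is what a batch with the same read-process-write shape could look like with Spark's Java RDD API. The paths and the per-record logic are placeholders; the point is that Spark splits the input into partitions and processes them in parallel for us.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkBatchSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("batch-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read the input as a distributed collection of lines
            JavaRDD<String> input = sc.textFile("input.txt");
            // Per-record logic goes in the map; partitions are processed in parallel
            JavaRDD<String> output = input.map(record -> record.toUpperCase());
            // Each partition is written out as its own part file
            output.saveAsTextFile("output-dir");
        }
    }
}
```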

So, why Spark?

In conclusion, we can say that batch processing is indeed much faster than traditional processing. Further, the reason parallel processing is faster than batch processing is that you operate on multiple groups at the same time instead of just one group at a time. Of course, this means we need more resources to achieve it (e.g. the number of workers required increases in the above diagrams).
The batches that ran on our mainframe had very little logic to them. They were about feeding data from one process to another based on a bunch of conditions. Of course, what I describe here is a very, very simplified version of what it really looks like. But in summary, there was more work in reading the input and writing the output than there was in processing the data in between. Spark fit this situation well because, even if we just got rid of the bloated processing in the converted code, we would have made practically no progress on the I/O side - the logic processing would have remained pretty much the same. Apache Spark blessed us with the power of forgetting about the logic and concentrating only on the I/O to speed up the processing times. It came with its own challenges, like writing custom file readers and writers, but that's a story for another post!

