Apache Spark innovates a lot of in the in-memory data processing area. With this framework, you are able to upload data to a cluster memory and work with this data extremely fast in the interactive mode (interactive mode is another important Spark feature btw…). One year back (10/10/2014) Databricks announced that Apache Spark was able to sort 100 terabytes of data in 23 minutes.
Here is an interesting question – what is the limit for the amount of data you can process interactively in a cluster? What if you had 100 terabytes of memory in your cluster? Memory is so quick you would think! Intuition tells you can use this memory to interactively process 100 terabytes of input data or at least half of this size. However, as usual in a distributed systems world, our intuition is wrong!
1. Response time
What would be a response time for a simple data processing scenario and for a more complicated one? Are we still in interactive mode? We’d like to think so but unfortunately, we are not. I saw in a practice scenario that response time for a simple scenario with a simple “where sum(), count() ” statements with 8 terabytes of data was 20-40 seconds. For a more complicated one and for more realistic scenarios (couple of “group bys” + couple of “joins”) response time was 3-5 minutes. This is definitely not what I call interactive mode!
In my daily life, I do analytics where the response time is critical. For me, I give it up to 3 or 10 seconds, okay perhaps even up to 15 seconds and still consider this interactive mode. Beyond this I would consider it actually batch mode. Several seconds or 3-5 minutes instead of 15-60 minutes might look like a incredible result compared to MapReduce-like on-disk processing. However, this is not interactive.
2. Where the interactivity end?
The maximum amount of memory I was able to process in the interactive mode with only a few seconds of delay was limited by 1 terabyte. With this, the efficiency was still good. However, beyond 1 Tb, I noticed that the response time was extremely delayed
My guess is that in order to improve efficiency (5-10 terrabytes with only several seconds delay) we would need to update our hardware (I’d like to try a cluster with the most powerful EC2 machines i2.8xlarge with 250 gigabytes of RAM memory) and tune software settings (Apache Spark driver settings, in-memory columnar format, and probably YARN settings)
Even with software and hardware upgrade, it is clear to me that the interactive mode limit doesn’t even come close to the 100 terabytes.
3. Read data to memory first
As you recall from previously, remember that it takes many seconds or even several minutes for each iteration of data processing. However, this is not the complete story. If you work in Ad Hoc analytics or create machine learning models your initial data set will most likely be stored in a cluster HDFS storage. This means that before the in-memory iterations you will be reading data from disks which takes much longer. The performance as usually depends on the hardware you have and the software settings. Most likely it will take between 15-30 minutes for an 5-8 terabytes data set. Even for 1 terabyte it might take 5 minutes or so.
Before jumping into the Apache Spark in-memory processing it is worth to make a plan for your analytical scenarios and estimate response time especially if your data size is more than 1 terabyte.