Spark or Hadoop? Let’s ask Google


Google Trends is a free service based on Google Search that can be quite useful in some cases. It lets you check how often particular search terms appear over time, relative to the total number of searches. It also lets you compare search volume between two or more items of interest, which makes it possible to infer how interest in these items has changed over time and, thus, to “guess” a general future direction.
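To make “relative to the total number of searches” concrete, here is a minimal sketch of that kind of normalization. The counts are invented for illustration and the function name `relative_interest` is hypothetical. Google does not publish raw counts, and its exact procedure may differ. The 0-100 scaling against the peak period mirrors how Google Trends presents its charts.

```python
def relative_interest(term_counts, total_counts):
    """Scale a term's share of all searches to 0-100, where 100 marks the peak period."""
    if len(term_counts) != len(total_counts):
        raise ValueError("both series must cover the same periods")
    # Share of all searches in each period, not the raw count.
    shares = [t / total for t, total in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * s / peak) for s in shares]


# Invented numbers: searches for one term and all searches, per period.
term = [50, 120, 300]
total = [10_000, 12_000, 20_000]
print(relative_interest(term, total))  # [33, 67, 100]
```

Note that the raw count grows sixfold in this made-up example, while the normalized interest only triples, because total search volume grew as well. This is why a Trends curve shows share of attention rather than absolute volume.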

Google Trends gained popularity when it was shown that search statistics could be used to track influenza-like illnesses, because certain search queries could be correlated with the relative frequency of doctor visits.

Below I describe an attempt to use Google Trends to understand the past and predict the future of a quite different world: Big Data Analytics, where competing technologies battle over which is the best, fastest, most versatile, and most complete technology to fuel the ongoing Big Data revolution.

The main competitors in that world are Hadoop and Spark (and, maybe, Storm).

Hadoop is based on the MapReduce and GFS (Google File System) papers from Google. It became an open source project in 2005. Hadoop is an Apache open-source framework that allows storing and processing data in a distributed fashion across clusters of computers using simple programming models. Computation runs where the data is located, which matches the concept of “data gravity” well (moving processing to the data and not the other way around).

Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is designed to run on commodity hardware, is resilient to hardware/software failures, and achieves data reliability via replication (typically, 3x). Hadoop is written in Java and has a Java interface. Using Python, R, etc. requires executable scripts and the Hadoop streaming API. HDFS (Hadoop Distributed File System), one of the main components of Hadoop, is a distributed file system designed narrowly for the “write once, read many” Hadoop paradigm. In spite of its relative maturity, the global Hadoop market is projected to grow at a CAGR of ~60% until 2020, which suggests that Hadoop is in great overall shape.
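As an illustration of the “executable scripts” approach, here is a minimal word-count sketch in the style of Hadoop streaming, where a mapper emits tab-separated key/value lines and a reducer receives them sorted by key. The function names and sample input are my own, and the sort-and-shuffle phase is simulated locally with `sorted`. In a real job, each function would live in its own script that reads `sys.stdin` and prints to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical input; sorted() stands in for Hadoop's sort/shuffle phase.
docs = ["Spark or Hadoop", "Hadoop and Spark and Storm"]
shuffled = sorted(mapper(docs))
for line in reducer(shuffled):
    print(line)
```

The reducer relies only on equal keys arriving next to each other, which is why it can process its input in a single pass without holding all words in memory.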

However, this Google Trends search shows that Spark, a member of Hadoop’s large ecosystem and at the same time its competitor, has already overtaken Hadoop in popularity. Interest in Spark is growing much faster than interest in Hadoop, which has mostly saturated.

[Chart: Google Trends search interest over time for Hadoop, Spark, and Storm]

What makes Spark so attractive? Mostly these few things:

  • Spark data is kept and processed in memory as a Resilient Distributed Dataset (RDD), which makes it much faster in general
  • Spark data is written to disk only when memory is full
  • For chains of sequential or repeated MapReduce-style jobs, Spark is also generally faster because it doesn’t shut down tasks immediately after they complete
  • This makes Machine Learning (iterative tasks), streaming, and any interactive programming much faster
  • Spark supports Scala, Java, Python, and R with SparkR. Interactive shells are provided for Scala, Python, and R.
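The benefit of keeping data in memory for iterative work can be sketched with a toy model. This is plain Python, not Spark’s API: the `ToyDataset` class, its `loads` counter, and the sample numbers are all hypothetical, and a “load” merely stands in for an expensive read from disk. In Spark itself, the comparable step is calling `cache()` or `persist()` on an RDD.

```python
class ToyDataset:
    """Toy stand-in for a dataset that is expensive to load (not Spark's API)."""

    def __init__(self, records, cache=False):
        self._records = list(records)
        self._cache_enabled = cache
        self._cached = None
        self.loads = 0  # number of simulated disk reads

    def scan(self):
        """Return all records, paying the load cost unless a cached copy exists."""
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.loads += 1  # simulated read from disk
        data = list(self._records)
        if self._cache_enabled:
            self._cached = data
        return data


def run_iterations(dataset, steps):
    """Nudge an estimate toward the data mean; every step scans all records."""
    estimate = 0.0
    for _ in range(steps):
        data = dataset.scan()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate


on_disk = ToyDataset([1.0, 2.0, 3.0, 6.0])
in_memory = ToyDataset([1.0, 2.0, 3.0, 6.0], cache=True)
run_iterations(on_disk, steps=10)
run_iterations(in_memory, steps=10)
print(on_disk.loads, in_memory.loads)  # 10 1
```

Both runs compute the same estimate. Only the number of simulated reads differs, which is the effect the list above attributes to in-memory processing of iterative tasks.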

So, apparently, Spark is already more popular than Hadoop and, if the trend continues, the popularity gap between the two will keep increasing. One other advantage of Spark over Hadoop (at least for now) is its ability to work with both batch data (like Hadoop) and streaming data. Streaming in Spark is really more like micro-batching, but it still works well for most data streaming tasks.
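The idea of micro-batching can be sketched in a few lines of plain Python. This is an illustration of the concept, not Spark’s streaming API: the `micro_batches` function, the millisecond timestamps, and the sample events are all assumptions, and events are assumed to arrive in time order.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive fixed-width time windows."""
    batch, window_end = [], None
    for ts, value in events:
        if window_end is None:
            window_end = ts + interval  # first window starts at the first event
        while ts >= window_end:
            yield batch  # close the current window (it may be empty)
            batch, window_end = [], window_end + interval
        batch.append(value)
    if batch:
        yield batch


# Hypothetical events: (timestamp in milliseconds, payload).
events = [(0, "a"), (400, "b"), (1100, "c"), (1200, "d"), (3500, "e")]
print(list(micro_batches(events, interval=1000)))
# [['a', 'b'], ['c', 'd'], [], ['e']]
```

Each event waits until its window closes before it is processed, so the batch interval sets a floor on latency. That is the trade-off against handling every record the moment it arrives.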

However, if you really want to stream your data and analyze it on the fly, then you might want to consider Apache Storm – the other competitor mentioned above and shown in the chart. Storm is designed to work with data as granular as one key-value pair and is optimized for pure streaming but not for batch processing. If we believe the Google Trends plot above, Apache Storm is already close to 60% of Apache Hadoop’s popularity, but the trend implies saturation and doesn’t suggest strong future growth.

All in all, assuming that Google Trends is a good tool to capture people’s interest in Big Data Analytics technologies, Spark is the center of attention today and will remain in the limelight in the near future. You can use this exact query to repeat my study in the future or modify it with the names of new technologies being introduced over time: link.
