Main Definitions – Big Data Analytics

This page reviews the main definitions and important concepts from the field of Big Data Analytics (BDA).

Big Data (BD) is typically defined using some of the following criteria:

  • It involves “4Vs”: Volume + Velocity + Variety + Veracity
  • Data can be both structured and unstructured, and data types can be both numerical and categorical
  • Data sets are so large and complex (high-dimensional, with mixed numerical and categorical types) that they are difficult to process using traditional database management tools or data processing applications
  • The data sizes and data types involved make traditional analytics inefficient
  • All of the above demands distributed, parallelized hardware/software and new data analysis algorithms, such as machine learning (ML)

Big Data Analytics (BDA) is typically defined using some of the following criteria:

  • The process of analyzing Big Data
  • The process of examining large, fast-moving, diverse, and mostly unstructured data sets (Big Data) to uncover hidden patterns, unknown correlations, and other useful information
  • 4V analytics, or analytics that works with data involving the “4Vs” mentioned above

Machine Learning (ML):

  • The field of computer science and mathematics concerned with algorithms and programs that can improve their performance through experience
  • ML algorithms use past data to learn (be trained) to perform better in the future
  • The most common training approaches are “supervised” and “unsupervised” learning (see the sketch after this list)
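
A minimal sketch of the two training styles, assuming scikit-learn is available (the tiny data set below is synthetic and purely illustrative):

```python
# Minimal sketch contrasting supervised and unsupervised learning.
# Assumes scikit-learn; the tiny data set below is synthetic.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: features X come with known labels y to learn from.
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
y = [0, 1, 0, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.85, 0.75]]))   # predicts the label of a new point

# Unsupervised: only X is given; the algorithm finds structure on its own.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                    # two discovered clusters
```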

See how AI relates to both ML and Deep Learning here.

Black Box Modeling

  • A black box model can be viewed in terms of its inputs and outputs, without any knowledge of its internal workings
  • Of course, the internal algorithms are written by people, and we understand how they work in each case. However, many models perform so many data manipulations to generate conclusions that it becomes difficult to explain simply how the model arrived at each particular conclusion
    • A good example of such a black box model is the “random forest” algorithm in machine learning
  • The main approach used to validate the accuracy of such a model and its conclusions is called “backtesting”, in which the model’s accuracy is assessed by running it against a data set with a known outcome (see the sketch after this list)
  • The combination of usefulness and the opaque nature of such models has led to what I call the “Black Box Paradox” (read below)
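
A minimal sketch of the backtesting idea, assuming scikit-learn is available: a “black box” random forest is trained and then scored against held-out data with known outcomes (the data set here is synthetic).

```python
# Backtesting sketch: train an opaque model, then validate its accuracy
# against a hold-out set where the true outcomes are already known.
# Assumes scikit-learn; the classification data set is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# The model's internals (hundreds of deep trees) are hard to explain,
# but its predictions can still be checked against known outcomes.
print("backtest accuracy:", accuracy_score(y_test, model.predict(X_test)))
```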

Predictive Analytics:

  • Predictive analytics is the use of data, statistical algorithms, and machine-learning techniques to identify the most probable future outcomes based on historical data
  • A good example of predictive analytics is weather forecasting, where one has to deal with various possible outcomes and determine the most likely scenario, typically using an ensemble of different models:

[Figure: ensemble of atmospheric model forecasts]
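
To make the ensemble idea concrete, here is a toy numerical sketch; the “models” are just synthetic perturbations of a single base forecast, not real weather models.

```python
# Toy ensemble forecast: many perturbed model runs predict tomorrow's
# temperature, and the ensemble mean is taken as the most likely outcome.
# Purely illustrative; all numbers are made up.
import random

random.seed(42)
base_forecast = 21.0  # deg C from a single deterministic model run

# Each ensemble member perturbs the base forecast slightly,
# mimicking different initial conditions or model physics.
members = [base_forecast + random.gauss(0, 1.5) for _ in range(20)]

ensemble_mean = sum(members) / len(members)
spread = max(members) - min(members)
print(f"most likely: {ensemble_mean:.1f} C, spread: {spread:.1f} C")
```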

Prescriptive Analytics:

  • Prescriptive analytics is the use of data, statistical algorithms, and machine-learning techniques to identify the most likely explanation for observed events and to prescribe the most beneficial action strategy (a toy sketch follows below)
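
A minimal sketch of the “prescribe the best action” step, using hypothetical outcome probabilities (for example, produced by a predictive model) and hypothetical payoffs:

```python
# Prescriptive sketch: given estimated outcome probabilities, prescribe
# the action with the best expected payoff.
# The actions, probabilities, and payoffs below are hypothetical.
outcome_probs = {"demand_high": 0.6, "demand_low": 0.4}

# payoff[action][outcome], in arbitrary currency units
payoff = {
    "increase_inventory": {"demand_high": 120, "demand_low": -40},
    "keep_inventory":     {"demand_high": 60,  "demand_low": 10},
}

def expected_value(action):
    return sum(outcome_probs[o] * payoff[action][o] for o in outcome_probs)

best_action = max(payoff, key=expected_value)
print(best_action, expected_value(best_action))  # prescribed action, expected payoff
```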

Big Data Analytics vs. Traditional Analytics:

[Figure: comparison of Big Data Analytics and traditional analytics]

Main Big Data Analytics Platform:

  • Hadoop is the leading analytics platform that keeps evolving with time

[Figure: Hadoop growth over time]

Basic facts about Hadoop:

  • The MapReduce/GFS solution from Google was used to create an open source project called “Hadoop” in 2005
  • Hadoop is an Apache open-source framework that makes it possible to store and process big data in a distributed environment across clusters of computers using simple programming models, with computation being moved to where the data is stored
  • It is designed to scale up from a few servers to thousands of machines, each offering local computation and storage
  • It is also designed to run on commodity hardware, and is resilient to hardware/software failures
  • It achieves data reliability via data replication (typically, 3x)
  • Hadoop is written in Java and has a Java interface; using Python, R, etc. requires executable scripts and the Hadoop Streaming API (see the sketch after this list)
  • HDFS is a distributed object store designed narrowly for the “write once, read many” Hadoop paradigm
  • The global Hadoop market is predicted to grow at a CAGR of around 60% until 2020
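
To illustrate the streaming approach mentioned in the list above, here is a minimal Python word-count sketch (mapper and reducer kept in one file for brevity; the file name and the exact submission command are assumptions that depend on the Hadoop setup):

```python
# Minimal Hadoop Streaming word count in Python.
# Mapper and reducer are combined in one script here for brevity;
# in practice they are often two separate executables.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so counts for a word arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

# Typical submission (jar path, file names, and options vary by distribution):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/in -output /data/out \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -file wordcount.py
```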

Spark is the next big thing

  • Spark is an in-memory alternative to Hadoop that is growing in popularity
  • In Spark, data is kept and processed in memory as Resilient Distributed Datasets (RDDs)
  • Data spills to disk only when memory is full, which makes Spark much faster for multiple sequential/repeated MapReduce jobs (it doesn’t shut down M/R tasks between steps)
  • This makes both machine learning (which is iterative) and any type of interactive programming much faster
  • Spark programming is done in Scala or Java, via interfaces to Python and R, or through an interactive shell (see the sketch after this list)
  • Spark is faster than Hadoop when processing the same amount of data (example)
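
A minimal PySpark sketch of the RDD idea, assuming a local Spark installation (the input strings and application name are placeholders):

```python
# Minimal PySpark sketch: a word count held in memory as an RDD.
# Assumes a local Spark installation; input data and app name are placeholders.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.parallelize(["big data analytics", "big data", "spark rdd"])
counts = (lines.flatMap(lambda s: s.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# cache() keeps the RDD in memory, so iterative or repeated jobs
# reuse it instead of recomputing from scratch.
counts.cache()
print(counts.collect())

sc.stop()
```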

Spark’s popularity is growing, which is confirmed by this Google Trends chart

  • Google Trends is a free service based on Google Search. It shows how often particular search terms appear over time, relative to the total number of searches, and lets you compare search volumes for two or more items of interest, making it possible to see how interest in them has changed over time and, thus, “guess” a general future direction.

[Figure: Google Trends chart showing growing interest in Spark]

Data Gravity Concept in Big Data

  • Data gravity is an analogy for the nature of data and its ability to attract additional applications and services. The Law of Gravity states that the attraction between objects is directly proportional to their mass.
  • Dave McCrory coined the term “data gravity” to describe the phenomenon in which the number and speed at which services, applications, and even customers are attracted to data increase as the mass of the data increases. This means that as data size (mass) grows, the data becomes harder and harder to move for financial and technical reasons. For example, selecting one specific cloud-based data storage and analytics platform (such as AWS or Azure) and using it for a while might make it expensive and complicated to migrate to another similar platform in the future.

Black Box Paradox in Big Data Analytics

  • This concept was introduced by Andrei Khurshudov and Daniel Lingenfelter in 2016, and follows an increase in popularity of the so-called “black box” (BB) algorithms in machine learning (ML) and Big Data Analytics (BDA)
  • Such “black-box” models are often the best in terms of prediction accuracy, but this accuracy often comes with less interpretability than other model choices, making it very difficult or nearly impossible to understand why some specific recommendations have been made
  • The “Black Box Paradox” we observe in data modeling and analytics changes the way models are scrutinized and accepted by the development community
  • Therefore, model developers might be tempted to gravitate towards the less transparent or completely opaque models as a way to achieve quick results
  • While this looks like a bonus for the model developers, they have to realize that the responsibility for developing and testing an accurate model is still on them and not on the end users