The “Black Box Paradox” in Big Data Analytics and Data-Driven Modeling


Some predictive models are analytical and based on first principles, while others are solely data-driven. Analytical models are often based on a human’s understanding of nature, while data-driven models attempt to model nature using data alone. Some data-driven models, such as linear regression, are transparent and interpretable, while other “black-box” models are not transparent at all and can be difficult to interpret.

Disciplines such as physics, chemistry, engineering, mathematics, and others typically rely on analytical, numerical, and statistical models to explain results, understand intrinsic relationships, and make predictions (inter-/extrapolations).

Some other disciplines – such as Big Data Analytics – deal with huge volumes of numerical and categorical data that are often noisy and incomplete.  Supervised machine-learning (ML) models are data-driven and are often the preferred choice for predictive models in these disciplines. These models, many of which were developed in the fields of statistics and computer science, learn structure in the data through a training process. In many practical settings, the learned model improves as the amount of data available to train the model grows. Some ML models result in “black-box” predictions, especially when a large amount of training data is used to learn complicated non-linear relationships in the data. Such “black-box” models are often the best in terms of prediction accuracy, but this accuracy often comes with less interpretability than other model choices.

A “classical” study involving analytical or statistical modeling will likely start with a general model description providing disclaimers about the main assumptions used in the model, the limits of its applicability, and cautionary notes for future users of the model (see, for example, [1]). Then, the study will dive into the fine details of the model itself and provide all the mathematical formulas, approximations, and heuristics used – everything required for the reader to reproduce the model and its results. In the case of numerical modeling, a detailed procedure for how the modeling was done will be provided, with references to the underlying methods used (e.g., Monte Carlo simulation). In both cases, a well-written study will address the shortcuts taken and the uncertainties in the inputs, and will try to provide an accurate estimate of the prediction error. Finally, the study will present its results and conclusions.

The above approach allows the reader to understand the fine details of the model and all of its elements: data, algorithms, logic, etc. This understanding allows for the model’s adoption, its testing and criticism, or both. It also allows one to draw important conclusions, such as “factor X” impacting “outcome Y” linearly or “factor Z” being inversely proportional to “outcome Y”. All of this is extremely valuable for understanding why the results are what they are. In fact, this is how most readers and users perform a ‘sanity check’ of the model and its results before they adopt it – by looking at those elements of the model and checking whether the model is self-consistent and whether its statements make physical, mathematical, and general sense.

Let’s now consider a different approach to modeling that is frequently used in Big Data Analytics: relying on so-called “black box” (BB) algorithms from the field of machine learning. One of the best-known examples of such an algorithm – the Random Forest algorithm – was introduced by Leo Breiman in 2001 [2] and is used to address a broad spectrum of problems and practical applications.

In the case of a BB algorithm, the model (think again of the Random Forest as a perfect example) is trained on a training subset of the total available dataset. In a random forest, this training set is then randomly sampled to create several different sample training sets. A separate decision tree is trained to perform regression or classification on each sample training set (this process is called bootstrap aggregation, or bagging [3]), resulting in several fitted trees. Each tree will have some training examples that were not included in its sample training set, and the error on these left-out examples gives a measure of accuracy and a means to compare different models; this process is called validation. Each tree is built from slightly different data but answers the same question: what is the best decision-making logic that explains a given outcome for given input data?
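
As a rough illustration of the training step just described (not code from the original study), the sketch below uses scikit-learn’s RandomForestClassifier on a synthetic, purely hypothetical dataset; the library performs the bootstrap sampling internally, and the out-of-bag score plays the role of the validation on left-out examples mentioned above.

```python
# Minimal sketch of bagging / random-forest training, assuming scikit-learn is available.
# The dataset is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in dataset: 1000 labeled examples with 10 numeric attributes.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out a test subset (30%); the rest is the training subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(
    n_estimators=100,   # number of decision trees in the forest
    bootstrap=True,     # each tree is fitted on a random bootstrap sample of the training set
    oob_score=True,     # estimate accuracy from the examples each tree did not see (out-of-bag)
    random_state=42,
)
forest.fit(X_train, y_train)

# Out-of-bag accuracy: the "validation on left-out examples" described above.
print("Out-of-bag accuracy:", forest.oob_score_)
```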

Only when one combines all of the trees (see figure) – or collects one “consensus answer” from the entire forest – is the algorithm’s job truly done. This is very similar to a public vote: different people think differently and use slightly different information about an issue (or receive slightly different information and interpret it differently), but when a decision needs to be made, the majority vote decides the outcome of the debate. This model can make accurate predictions when applied correctly, but the large number of trees obscures the explanation of why a prediction was made.

[Figure: random forest]
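
Continuing the same hypothetical sketch, the “consensus answer” can be made visible by querying each fitted tree individually and taking the majority vote. Note that scikit-learn’s forest.predict actually averages the trees’ class probabilities, which usually, but not always, agrees with the simple majority vote shown here.

```python
import numpy as np

# Ask every tree in the fitted forest for its own answer on a single example,
# then take the majority vote: the forest's "consensus answer".
example = X_test[:1]
votes = np.array([int(tree.predict(example)[0]) for tree in forest.estimators_])
consensus = np.bincount(votes).argmax()

print("Votes per class:     ", np.bincount(votes))
print("Majority-vote answer:", consensus)
print("Forest's own answer: ", int(forest.predict(example)[0]))
```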

Similar to traditional models, this “black box” model allows for accuracy testing. The “test” subset contains data (30% of the available data, for example) not used for training or validation and has known, or labeled, answers. The test set is used to confirm the overall predictive power of the trained and validated forest. The so-called “confusion matrix” summarizes the accuracy of predicting both “passers” and “failures” and reports the overall accuracy of the model. The attributes that matter most to the trees in this forest are also ranked by their importance.
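
Staying with the same hypothetical sketch, the held-out test subset yields the confusion matrix and overall accuracy, and the fitted forest exposes impurity-based attribute importances that can be ranked:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Evaluate on the test subset that was never used for training or validation.
y_pred = forest.predict(X_test)
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("Overall test accuracy:", accuracy_score(y_test, y_pred))

# Rank the input attributes by their importance to the trees in the forest.
importances = forest.feature_importances_
for rank, idx in enumerate(importances.argsort()[::-1], start=1):
    print(f"{rank}. attribute {idx}: importance {importances[idx]:.3f}")
```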

Therefore, in many ways, the “black box” model is no different from the classical models. Ultimately, it takes a known input, runs it through the “trained model”, and compares the answer to the known answers in order to estimate the accuracy of the model’s predictions.

After the model is fully tested and satisfies the accuracy requirements, it starts accepting new data and making new predictions.

What is truly different for this type of BB model – when compared to the classical models – is how little both the model creator and the model user can understand about why some particular answer was reached and what leads to such an answer.

With the classical model, one can essentially reproduce its decision-making process step by step and test every such step, since all of those steps are “visible” and the model is “transparent” to the user.

While BB models can address some classes of problems better than their classical counterparts, what really happens inside the black box stays inside the black box.

For the end-user, “black box” models trade transparency for the answer.

This is perfectly acceptable for some users, who want an accurate answer much more than an understanding of all the reasons and relationships leading to it. For other users, however, it might be completely unacceptable.

This brings us to the main theme of this article: what does this lack of transparency mean in practice?

This is the right time to introduce the “BB Paradox” – an interesting (psychological) phenomenon that we observed in practical settings where model interpretability was not absolutely required.  The “paradox” could be formulated as follows:

Less transparent models are generally accepted by the engineering and development community faster and with less resistance or questioning than their more “transparent” counterparts.

In other words, people seem to trust the models they don’t completely understand over the models they can understand in fine detail.

The way this works in practice can be described by the following approximate sequence for reviewing and deciding on a traditional, classical model:

  1. Model assumptions are presented and reviewed

  2. Model machinery (equations, heuristics, approximations) is scrutinized

  3. Data used in the model are reviewed

  4. General results and conclusions are reviewed

  5. Sub-conclusions are analyzed and checked for sanity and consistency

  6. Model and its results are either accepted or rejected

The difference between the above process flow for a classical model and that for a BB model is that – for the BB model – steps 2) and 5) are either skipped completely or reduced to a superficial discussion along the lines of “the forest finds the answer better than any one decision tree”, “the confusion matrix confirms the efficacy of the model”, or “the model was validated with empirical data”, etc.

The main conclusions from the above discussion are:

  • The “Black Box Paradox” in modeling that we observe changes the way models are scrutinized (less) and accepted (more easily) by the development community:

    • Less transparent models are generally accepted by the engineering and development community faster and with less resistance or questioning than their more “transparent” counterparts.

  • Therefore, model developers might be tempted to gravitate towards those less transparent or completely opaque models as a way to achieve quick results

  • While this looks like a bonus for the model developers, they have to realize that the responsibility for developing and testing an accurate model is still on them and not on the end users

  • Users of such BB models have to keep improving their knowledge of the tools they use, so that they can better validate and question the models and tools they rely on

In closing, the above “BB Paradox” prompts us to think about a possible near future in which most analytics will be outsourced to fully automated, semi-intelligent systems resembling black-box algorithms, with decision-making logic that is mostly opaque to us. Would this be the phase when we (humans) stop questioning automated decisions completely and simply accept every recommendation made to us by the machines? Some of this seems to be happening already.

Acknowledgement

This paper first appeared in CEOReview in 2016 and was co-authored by Daniel Lingenfelter.

References:

[1] Daniel J. Lingenfelter, Andrei Khurshudov, and Dimitar Vlassarev. Efficient Disk Drive Performance Model for Realistic Workloads. IEEE Transactions on Magnetics, 50 (5): 1–9, 2014.

[2] Leo Breiman. Random Forests. Machine Learning, 45 (1): 5–32, 2001.

[3] Leo Breiman. Bagging Predictors. Machine Learning, 24 (2): 123–140, 1996.