My wife is an expert wine-buyer and every good wine bottle she brings home has a little story attached to it. Even though, occasionally, I (secretly) don’t enjoy the taste of some of those wines, I know they are all considered to be of “high quality” and “very popular”. Some of them simply don’t match my taste.
However, there have been plenty of wines I’ve tried in the past that were just plain bad. This made me think about the wine manufacturers: why do they even sell a particular (bad) wine? Can’t they just predict a customer’s response by tasting their own wine, or by objectively measuring a few things about a wine’s chemistry and physics?
Wine-making is a big industry and there are already quite a few studies done and papers published in this field trying to answer this question. In fact, after a quick search online, I found a few data science studies trying to address it, and many of them were referring to the following paper:
- P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
This particular publication came with a couple of datasets available for everyone to play around with, so I decided to use them for my next project. As an advisor to BigML, which is a leading analytics company, I wanted to analyze this wine quality data using their online platform and see 1) whether I could answer my question about predicting wine quality from some objective measurements and 2) how quickly this could be accomplished using the BigML online solution.
First, I downloaded the wine composition and quality assessment data from here: there are two datasets available, with 1599 entries for red wine and 4898 entries for white. Even though I prefer red wine, I decided to go with the larger (white wine) dataset for my study.
The dataset included 11 wine features, such as residual sugar, density, pH, alcohol and a few others (check the dataset if interested), and one numerical value for the quality of each wine, expressed as a number between 0 (very bad) and 10 (excellent).
Because I felt that a regression analysis (one with a numerical output) would be too noisy and inaccurate, I decided to simply split the dataset into two classes: bad wine (quality 0-6) and good wine (quality 7-10).
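For readers who would rather script this step than do it by hand, here is a minimal Python sketch of the same two-class split. The function names, the `quality_class` column name, and the tiny sample rows are all made up for illustration; I am also assuming the file is semicolon-delimited with a `quality` column, so adjust both to match your copy of the data.

```python
import csv
import io

# Threshold from the text: 7-10 is "good", 0-6 is "bad".
GOOD_THRESHOLD = 7


def quality_class(score, threshold=GOOD_THRESHOLD):
    """Map a 0-10 quality score to the label 'good' or 'bad'."""
    if not 0 <= score <= 10:
        raise ValueError(f"quality score out of range: {score}")
    return "good" if score >= threshold else "bad"


def add_quality_class(rows):
    """Return copies of the rows with an extra 'quality_class' column."""
    return [
        {**row, "quality_class": quality_class(int(row["quality"]))}
        for row in rows
    ]


# Illustrative rows only (NOT taken from the real dataset).
# Assumption: semicolon-delimited file with a 'quality' column.
SAMPLE_CSV = """alcohol;pH;quality
9.5;3.10;6
11.2;3.25;7
10.1;3.00;5
12.4;3.30;8
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV), delimiter=";"))
labeled = add_quality_class(rows)

for row in labeled:
    print(row["quality"], row["quality_class"])
```

To run this on a real file, replace the `io.StringIO(SAMPLE_CSV)` part with an open file handle and write `labeled` back out with `csv.DictWriter` before uploading it.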
I used Excel® to do these initial data manipulations and then imported the dataset into the BigML online portal (a simple drag and drop). Notice in the picture below how convenient it is to see all the distributions for each data column and their descriptive statistics.