Note: We’re using the Microsoft version of R present in Machine Learning Services component of SQL Server 2017. If you’d like to know more, visit https://docs.microsoft.com/en-us/sql/advanced-analytics/r/r-services
With Machine Learning omnipresent, the hottest topic these days is facial recognition, self-driving cars, medical diagnostics, fraud detection, and beating champion GO players. We’re already reading articles about how AI is getting to where machines will write novels, design airplanes, and perform psychoanalysis. I have no doubt someday we’re going to see an AI robot family doctor going on home visits. I just don’t think it’s going to happen anytime soon. But that doesn’t matter. Fact is, this technology has proven itself already with some impressive accomplishments – but because we’re looking towards all of those applications so far in the future, we’re losing sight of how many amazing advances we can make with applications here today. Many of these applications aren’t that visible to the outside, but they can make powerful improvements. I’m here to discuss one of those domains that speaks directly business intelligence: the Data Warehouse
The ubiquitousness of Machine Learning and some of the Big Data technologies, had gotten businesses questioning the need for the Data Warehouse at all. Then they realized their mistake when they understood the Data Warehouse has a unique value proposition which is difficult to replace. After making strides in Machine Learning, they looked to way to bring it together with Business Intelligence. For now, most of what we see are separate processes each with their own unique workflows and outputs. In most new architecture diagrams, we find a couple of arrows pointing between each other suggesting there are data flows between them, but nothing more. I believe that is a good start, but I think it’s time to define some concrete strategies to tie these processes together in a way that enhances both of them.
The outputs that come from Data Science are the result of a particular workflow which is tied to the way it uses data. The Data Scientists load data, cleanse it, create models with it, and then do one of two things: create a presentation for the board, or automate predictions for specific operational tasks such as fraud detection or medical diagnostics. This has been the essential mainstay for research and predictive modelling in bioinformatics, physics, mathematics, and other scientific endeavours. But it’s only recently that businesses have noticed the capabilities this technology has to increase their bottom line and gain a competitive edge. Naturally, they take the same approach as a traditional scientific workflow. There’s nothing wrong with that, of course. It works. I would change nothing except for one thing. I’d add an additional output which I’ll be explaining in this article.
Meanwhile, the Data Warehouse has always been there to offer an accurate, up-to-date vision of the business by looking at the past without much to offer in built-in predictive capabilities. It does provide a long list of features relating to structure and information conformity, with context being high on the list – something that’s missing in current business oriented Data Science. Data Science output provides an answer to a specific question. The question might be “What will happen to our sales if we raise the price of our product?” The answer makes its way to a report or a presentation, but it’s not tied to the rest of the business. The prediction might tell you what the effect on sales might be, but without new analysis, it won’t be able to give any insight you what other numbers will change as a result of a change in sales, such as inventory levels, VAT costs, commission payments, and even employee retention.
The Data Warehouse, on the other hand, is designed with analytical power which comes from the dimensional structure that ties sales figures to inventory, finance, HR, and purchasing. It can give you an indication of how a change in product price changed other numbers in the past from which you can try to infer future movement. Select a date and all the figures will reflect that date. Since the numbers are all related to common contexts, if you also select a particular region, the numbers will once again change to reflect that slice. This slice and dice functionality, as it’s commonly called, is what gives users the ability to quickly navigate through the state of the business, asking new questions as they go along. This highly structured, navigational capability is unique to data warehouses.
Data Science outputs, on the other hand, respond with answers that often need to be interpreted by Data Scientists, which is one of the main reasons for the need to prepare a presentation or a paper before the business can act on the results. It’s not that Data Warehouses can’t contain information about predicted numbers; they lack the ability to produce them. Data Science has no problem generating predictions, it’s just not all that good at managing them afterwards. In reality, with very little modification to dimensional models, we can easily integrate a predictive element. All we need is input that comes from these models. If we can capture the output of a predictive model and load it into the structured environment of a Data Warehouse, then maybe we could take advantage of the features in both methods. The predictions will have a context provided by the Data Warehouse in such a way than all business users can use them to make business decisions just like they do by looking at past information.
In the same way that predictive models can enhance Business Intelligence, Data Warehouses can boost the performance of Machine Learning. Data integration and cleansing from source takes a Data Scientist on average about 75% of the total time of analysis. By sourcing from the Data Warehouse, the Data Scientist is pulling from data that has already been cleansed, tested, vetted, and structured. In this way, we set the stage for Data Science workflows to complete analysis up to four times faster. With a loop that has predictive modelling adding new data elements to the Data Warehouse, which in turn can serve as further inputs to predictive modelling, we have the potential to make exponential gains in new information usable by both sides of the analytic platform.
We’re going to start with a very simple exercise. We’ll create a customer segmentation model in R and integrate it into the BI platform so that we can make predictions directly from the Data Warehouse. Then we’ll process an SSAS cube with this data in it so that we can view the predicted data alongside the real data. The Data Warehouse that up until now only contained backward looking information will now contain a little bit of forward looking predictive information.
In this case, we have a database of customers, some which have signed up for one of our in-store credit cards, which happens to be quite profitable. Next, we have a batch of new customers. Our Data Warehouse tells us that it takes on average about two years before a customer applies for a card. We’d like to know which of the new customers we should propose our new store card to get them to use the card sooner. By using the characteristics of the customers who already use our card, we’ll try to predict who among the newcomers is likely to respond favorably to an offer to join in an attempt to reduce the time it takes for them to acquire it.
The workflow is going to look like this:
- Examine the new data
- Design a predictive model using R
- Train model from existing data in the SQL Server Data Warehouse
- Score the new prospects and store the information in the Data Warehouse
- Store the model itself in the Data Warehouse
- Run predictions using the model and store the results in the Data Wareehouse
- Make the new predictions available to business users
As you can see from the figure below, we’re aiming for a unified architecture that’s a mixture of the standard Data Warehouse and the Machine Learning workflow. The Machine Learning algorithms take data from the Data Warehouse to train predictive models and then send data back into the Data Warehouse so that it can be added to the analysis and user access layers. What we’re doing is creating a persistent output target for Machine Learning for business intelligence purposes. What follows is the details of how we accomplish this.
Building the Model
There are a variety of different model types we could try, but in this case, we’ll settle on the decision tree. This model makes it easy to understand the logic it uses to makes predictions. As with most predictive models, we need to understand the data so we can pick the more useful features to train it with. First, let’s examine the data and build the model locally to make sure it’ll meet or objectives.
We’ll start by extracting the data from the existing customer view in the data warehouse into an R dataset so we can get an idea of what it looks like.
First we load some of the R libraries we’re going to need
Then we’ll use the RevoscaleR functionality to load the customer data we need to train and evaluate the model
Looking at a summary gives us a good idea of what the overall distribution of values are:
## Age MaritalStatus Gender YearlyIncome TotalChildren
## Min. :28.00 M:10011 M:9351 Min. : 10000 0:5165
## 1st Qu.:36.00 S: 8473 F:9133 1st Qu.: 30000 1:3619
## Median :43.00 Median : 60000 2:3779
## Mean :45.18 Mean : 57306 3:2194
## 3rd Qu.:53.00 3rd Qu.: 70000 4:2303
## Max. :98.00 Max. :170000 5:1424
## NumberChildrenAtHome Education Occupation
## 0:11116 Bachelors :5356 Professional :5520
## 1: 2460 Partial College :5064 Management :3075
## 2: 1648 High School :3294 Skilled Manual:4577
## 3: 1204 Partial High School:1581 Clerical :2928
## 4: 1089 Graduate Degree :3189 Manual :2384
## 5: 967
## IsHomeOwner NumberCarsOwned YearsSinceFirstPurchase CommuteDistance
## Mode :logical 0:4238 0: 506 1-2 Miles :3232
## FALSE:5982 1:4883 1:12523 0-1 Miles :6310
## TRUE :12502 2:6457 2: 3225 2-5 Miles :3234
## 3:1645 3: 2216 5-10 Miles:3214
## 4:1261 4: 14 10+ Miles :2494
## CountryRegionCode StateProvinceCode IsCardUser
## AU:3591 CA :4444 Mode :logical
## US:7819 WA :2263 FALSE:16497
## CA:1571 ENG :1913 TRUE :1987
## DE:1780 NSW :1559
## GB:1913 BC :1559
## FR:1810 OR :1073
Already we can see that only about 12% of our customers signed up for the customer store card. It’s also interesting that we have as many homeowners as we do as well as the mean age group being 45.
By plotting the individual features we can be sure that there aren’t any obvious skews that could make it a poor choice for training a model. For example, the decision tree model would be next to useless if the data happened to have 90% males when we know the gender distribution should be roughly even.
The following plots show the distribution of customers by feature with the left side showing the population that did not sign up for the store card like the ones on the right did.
Nothing seems to stand out much except for the country. It’s obvious we have many more US customers than other countries, but that likely reflects the population of store customers anyway. At first glance, customers who have been coming to the store for 3 years or more have a higher percentage chance of signing up for the card. The data model should bring value by detecting the combination of features that identify a potential store card user in a way that’s not immediately obvious by just looking at the features individually.
Training the Model
We’ll take 80% of the data for training so that we can use the remaining 20% to evaluate the accuracy of the model later on. The first variable, IsCardUser, is the dependent variable; the value we’re trying to predict. The algorithm evaluates all the remaining variables to calculate how they affect the outcome.
Note that we’re using a RevoscaleR version of the Decision Tree model. This model is inspired heavily from the rpart library available on CRAN. If you’re already familiar with rpart, you’ll feel right at home with rxDTree. The Microsoft version is better equipped to handle large, distributed data sets, but the syntax is nearly the same. The resulting model is similar, but not the same. In the next section, you’ll see how to manage the compatibility issues when you need to use the model with other libraries that expect to be working with rpart trees.
Once we’ve built the tree, we’ll take a look at the model itself. One big advantage of the Decision Tree is how easy it is to see how it “thinks”.
Displaying the Decision Tree
The rpart.plot library, as you can see from the name, is designed to work with rpart generated Decision Trees. The rxAddInheritance function in RevoscaleR converts the rxDTree model to the rpart one so we can plot it with this library.
The tree seems a bit “too” accurate as we get below 5 levels. For example, the tree in the lowest left portion suggests that a customer would be a card user if they were not married but have 1 or 3 children (not 2). That one, unique case may be an accurate reflection of the training data we gave it, but it’ll probably fail with new data. Rather than recalculate the tree with a more rigorous complexity parameter (cp), we can prune the branches that split too easily by playing with the value until we get something that looks more reasonable. The higher the value of the cp parameter, the more nodes we eliminate. The algorithm uses this parameter to require a significance
that exceeds this value before splitting into deeper nodes:
prunedTree <- prune.rxDTree(tr,cp=0.03)
It should now look a bit more reasonable:
Based on the root level split, the model is saying that those in lower income brackets are most likely to sign up for the card except if they only have one child or no children. If they have more than one child, then it’s the single parents that are the most likely to sign up for the card .. unless the married ones have less than 3 cars. We can actually walk through the tree and gain some understanding of our customers even before using it to make predictions.
Now we’ll take the remaining 20% of the data from before and we’ll use the model to predict customers’ probability of signing up for the card to see if it works.
Next, we’re going to run through some formal testing methods to get an objective measure of how well our model is performing.
Evaluating the Model
The confusion matrix below should give us an idea of how the predictions went. The row labels refer to the predicted values and the columns represent the observed values. On the top left corner we have the number of people that we we predicted wouldn’t sign up for a card and indeed they didn’t. On the right of that is the number of misses. We predicted they wouldn’t take the card, but they actually did. On the bottom left is the number of people we thought would sign up for the card but who didn’t. And finally, at the bottom right, the number we were looking for: the number of times we correctly predicted who would sign up for a card.
By calculating common classification performance measures, we can get an idea of how useful this model is for us. Let’s find out what these numbers are and I’ll explain what they mean right after.
The measures themselves are only numbers which don’t tell us if the model is good or bad. It depends on how we plan to use it. For example, the 0.92 displays high accuracy. This might sound pretty good at first glance, but if you remember the summary table we saw before, only a little over 12% of customers have cards. If the model over-states the number of false negatives by suggesting that nobody wants the card, it’ll be accurate alright, but not very useful.
The precision tells us what proportion of the population that we predicted to be card users are indeed card users. Again, a low number here indicates that we’re flagging more people as card users than actually are. 42% accuracy seems low, but remember that there are relatively few card users out of the whole population. Less than half the users we find are actually card users, but we picked them out of a much larger pool – which brings us to the third measure on the list.
Sensitivity (or Recall) tells us how well we did finding people who actually have cards as we predicted they would have them. In our case, this is probably the most important measure. We want to make sure we don’t miss too many of these. 74% is pretty good. It’s far better than a random guess, which would be closer to 10%.
Specificity is the opposite of sensitivity. Of all the people we said would NOT have a card, how good were we at finding them? Again, because the population of non card users is so high, this one is hard to miss. For our data set, this measure is not that impressive. But if we had a particularly low number, that would be a strong signal that something is broken in our model.
The following is the an overall score that reflects the balance between precision and sensitivity. It’s a moderate score which we would hope to be a high as possible. It’s a bit moderate because the low precision number brings it down a bit. But because we’re not bothered about less than stellar precision, the 54% score is acceptable enough to make the model useful.
Now that we have the model, we can receive a file with prospective customers and run it through the model to score the predictions. Then we’d send the results back to whoever requested it, such as the Marketing department. That’s the usual step in the life cycle of predictive analysis. Another option is to find a way to operationalize this model so that it can be directly used by the business. There are a number of ways to do this:
- Publish the model to a web service so that developers can write applications that consume it
- Create a Shiny (an RStudio web platform) web application that allows users to interact with it
- Integrate the model into the Data Warehouse
These are all good options, but in this article, I’m going to focus on the last one – integrating the model in the data loading process of the Data Warehouse.
To do this, we need to create an R function that’ll train the Decision Tree model we’ve defined. Then we’ll use some of Microsoft’s utility functions to automatically create and publish a SQL Server stored procedure that’ll run the R code on the server.
Once we have that, then we can manually code another stored procedure in T-SQL which will load data into the Decision Tree model, train it, and store the resulting model in a table that we can then use for predictions.
Speaking of predictions, we’ll then manually code yet another stored procedure in T-SQL that reads the model stored in a table and runs a batch of predictions against a data set. This last procedure is going to make it so we can call SQL Server from an ETL load in SSIS, feed it independent variables, get the prediction output, and insert it into the data flow that finishes in a dimension in the Data Warehouse.
We’ll show you how this works in the steps that follow.
Publishing the model to SQL Server
We start by creating a function which will take a training data set and build the same tree that we used earlier. We create this as a function because we’re going to export it to SQL Server so that the model can be trained where the data resides.
The package SQLRUtils contains the StoredProcedure function which creates a SQL Server Stored Procedure object, and can also generate a text file containing the DDL commands that can be used to create the stored procedure using T-SQL. In this example, the StoredProcedure function is going to register the stored procedure to the SQL Server database for later use.
The code above automatically generates the following T-SQL stored procedure and creates it on the server:
Here we manually create a stored procedure that encapsulates the process of building the predictive model by calling the stored procedure above and storing it in the database, ready to be used.
There are many ways we can use this stored procedure. We can call it from withing SQL Server using T-SQL code, or we can write C# code to call it and return a dataset for predictions. In this case, we’re using SSIS to integrate the predictions into the dimensional loading workflow. In this task, we’re loading new customers into the table every day, and as we do, we use the customers’ attributes to predict if they will be card users or not.
The Prediction task in the diagram above calls a stored procedure which returns a table of predicted values. The stored procedure, which we also created manually, is defined as follows:
Now when we connect to to the SSAS Cube with Excel, notice that we can filter on the customers that currently have no cards, but that are predicted to eventually sign up for one. Because we can view profits by customer in the cube, it’s simple to see how much of a profit we make from the potential card users and even where they are located.
We also used a generalized linear model in the same way as described above in order to predict future month sales and integrated the figures in the fact table. We only made predictions at the product level yet we have predictions at the upper levels of the product hierarchy as well including the grand total for all sales. The category, subcategory, and grand total levels were automatically summed up by the the SSAS cube.
The Data warehouse is designed with well-defined methodolgies that call for rigid structures to support the freedom needed for ad-hoc data exploration. Data Science adds value through a completely different workflow, but by adding the outputs from Machine Learning to the Data Warehouse, we can take advantage of the mechanics behind its structure in such a way that business users across the board can benefit from the fruits of predictive analysis. We also gain some additional benefits:
- Backtesting. With the predicitions in the Data Warehouse, it becomes easy to find how accurate we were and where we could improve
- Auditability. If decisions were made in the past that no longer seem to hold true, it’s simple to generate predictions based on previous models and compare them to new ones
- Combinations. If we predict that we’re going to have an increase of 10% in card using customers, and we also have a separate prediction of future sales by card user, then by combining the two, we can create a third prediction based on the increase in sales by new card users without having to create a new model.
- Aggregation. If I can predict future sales by product, I can use the dimensional features of the Data Warehouse to aggregate forecasted sales figures along product categories without needing to create new models
- Preservation. Rather than getting lost or discarded, predictive and analytical models are stored for use by others. Over time, this process builds a library which becomes as valuable a corporate asset as the information itself
I’d like to thank Francesco Civardi for taking the time to discuss some of the points regarding Data Science workflows. As a man with a rare talent for mastering both Data Science and Data Management, his feedback was most valuable.