Academics and researchers have been practicing statistical and Machine Learning techniques like regression analysis, linear programming, supervised and unsupervised learning for ages, but now, these same people suddenly find themselves much closer to the world of software development than ever before. And as always, each discipline comes with it’s own set of methods and ideas. One of these ideas I see pop up a little too often from the data science side is this notion that it’s much better to store data in flat files, using databases only as a last resort.
Some very respected data scientists have gone as far as proclaiming online that database engines are needed only as a way to catch the spillover of data that doesn’t fit in memory. They argue that databases are too complicated and besides, memory is so much faster than disk. For example, here’s a quote from Hadley Wickham, one of the most prolific and well-respected developers and data scientists in the R community:
“If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating”
Hadley Wickham – https://dbplyr.tidyverse.org/articles/dbplyr.html
I can appreciate the power of this argument. It’s true that memory is 10,000 times faster than disk and so it would stand to reason that any data manipulation done there should complete a few thousand times faster as well. Unfortunately, this over-simplification is probably going to lead to some poor design decisions.
Unfortunately, I suspect that this view is common among many (not all) data scientists, particularly those that haven’t had the chance to work with databases. When someone suggests that the use of databases is “frustrating”, I can only assume they’re frustrated by their lack of familiarity with the platform.
I recently came across an article by Kan Nishida, a data scientist who writes for and maintains a good data science blog. The gist of this article also attacks SQL on the basis of its capabilities:
There are bunch of data that is still in the relational database, and SQL provides a simple grammar to access to the data in a quite flexible way. As long as you do the basic query like counting rows and calculating the grand total you can get by for a while, but the problem is when you start wanting to analyze the data beyond the way you normally do to calculate a simple grand total, for example.
That SQL is simple or not is an assessment which boils down to individual experience and preference. But I will disagree that the language is not suited for in-depth analysis beyond sums and counts. I’ll take it a step further and assert that not only is SQL ideally suited for many forms of data analysis, but that the RDBMS engine itself is uniquely designed to provide the speed, multi-user management, and security needed to support serious data science in a way that simply using dplyr or other similar technologies can’t.
I also don’t want to overstate a database engine’s capacity, of course. Python, Pandas, R, and dplyr all have analytical and computational capabilities that far exceed anything you’ll get from a relational database engine. I use these tools every day. It would be foolish at best to try to perform logistic regression or to build a classification tree with SQL when you have R or Python at your disposal.
To be clear, I have nothing but profound respect for both Hadley Wickham and Kan Nishida. Hadley is the author of a suite of R tools that I use every single day and which are one of the things that makes R the compelling tool that it is. Through his blog, Kan has contributed a great deal to the promotion of data science. But I do I respectfully disagree with their assessment of databases. I’m hoping that they might take the time to reconsider their position.
In the meantime, I’m going to do my best is to clarify what seems to be a confusing issue to some, and while I’m at it, I’ll offer some compelling reasons for when and why you should have a good RDBMS handy.
Big Data Set and Evaluation Methodology
For this article I’m using a dataset which tracks airline delays from 1987 to 2012 worldwide. It’s 31 gigabytes large, well publicized, and has all sorts of interesting features that we can analyze. If you’re interested in getting a version of this file for yourself you can download it here: https://packages.revolutionanalytics.com/datasets/AirOnTime87to12/ .
Many desktops and laptops have 8 gigabytes of ram with decent desktop systems having 16 to 32 gigabytes of RAM. The one I’m using for this article has 64 gigabytes which leaves plenty of room to spare to load this data in memory if we choose. The environment is as follows:
For the file-based examples:
Memory: 64 gigabytes
Disk: Samsung 500 GB EVO SSD M.2 with 3,200 MB/sec read speed
OS : Windows Pro 64 Bit
R Version: R Open 3.50
For the database examples:
Memory: 16 gigabytes
Disk: 2TB Seagate Barracuda, 3.5″ HDD, SATA III – 6Gb/s, 7200rpm, 64MB Cache, 9.5ms
OS : Windows Pro 64 Bit
R Version: R Open 3.50
RDBMS: SQL Server 2017 Enterprise
I’ll first start by addressing the issue of performance by answering the question: Is it faster to analyze data in memory or is it faster to use a database? I’ll first load the entire file 31 gigabyte file in memory and perform some basic analysis on airline delays and clock the time each query takes. I’ll then query the exact same data stored in a SQL Server database on the desktop I did the prior analysis on – and I’ll also record the time it takes. If the people I mentioned earlier are right, the times should show that the memory-based dplyr manipulations are faster than the equivalent database queries or at least close enough to be worth using in favor of a database engine.
File-Based, Example Using dplyr
First, this is the code needed to load the file. It takes a bit over a minute and a half to load the file in memory from an M.2 SSD disk. It takes over 12 minutes from a regular 7200 RPM hard drive.
#' #' Load files #' # Clear everything..first rm(list = ls()) gc() library(tictoc) library(data.table) library(dplyr) library(pryr) bigFile <- "bigfile.csv" fileSize <- round(file.info(bigFile)$size/(1000*1000*1000)) sprintf("File is about %d gb in size", fileSize) sprintf("%d threads available", getDTthreads()) tic("Load big file") dh <- fread(bigFile, showProgress = T ) toc()
 "File is about 32 gb in size"  "8 threads available" Load big file: 104.87 sec elapsed
First, I’ll start with a case in which the data frame should be faster than running in a database since we’re scanning the entire data set in memory. I’ll simply do a count, an average, and a standard deviation of the entire dataset:
tic("Simple summary") dh %>% summarize(Mean_Delay = mean(arrdelay, na.rm = T), Flights = n(), SDev = sd(arrdelay, na.rm = T)) toc()
Mean_Delay Flights SDev 1 6.5666 148619655 31.55641 Simple summary: 2.01 sec elapsed
I’m using a case study from Michael Kane’s chapter in the book Data Science in R called Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays. In this chapter he uses some queries to illustrate the cases which can cause difficulties in dealing with larger data sets. Of course, in this case, we’re using a data set over two and half times as large.
The first one he uses is to count the number of flights that occur on Saturdays in 1987 and 1988:
tic("Count Saturday flights between 1987 and 1988") dh %>% filter(dayofweek == "Sat" & year %in% 1987:1988) %>% summarize(flights = n()) timer.value <- toc()
Even though the filter brings back fewer rows to count, there is a price to pay for the filtering:
flights 1 871165 Count Saturday flights between 1987 and 1988: 2.34 sec elapsed
This is another one of Michael Kane’s scenarios used to illustrate how we could use SQLite – but we’re using dplyr instead:
tic("Grouping by year") dh %>% filter(dayofweek == "Sat") %>% group_by(year) %>% summarize(flights = n()) toc()
The times have doubled, but then again, we’ve added a grouping function as well as filtering:
# A tibble: 26 x 2 year flights <int> <int> 1 1987 173370 2 1988 697795 3 1989 668080 4 1990 694251 5 1991 652991 6 1992 656920 7 1993 663606 8 1994 695245 9 1995 695286 10 1996 699527 # ... with 16 more rows Grouping by year: 5.53 sec elapsed
The following is a scenario proposed by Kan Nishida on his blog which seeks to return a list of the top 10 most delayed flights by carrier. This is one of the cases which Ken uses to illustrate “why R is 100x better than SQL” .
tic("Top 10 delays by carrier") dh %>% select(uniquecarrier, dayofweek, flightdate, origin, dest, arrdelay) %>% group_by(uniquecarrier) %>% top_n(-10, arrdelay) %>% arrange(uniquecarrier, arrdelay) toc()
This takes a whopping 11.6 seconds, but then again, it’s performing some complicated data wrangling using a colossal data set.
# A tibble: 315 x 6 # Groups: uniquecarrier  uniquecarrier dayofweek flightdate origin dest arrdelay <chr> <chr> <chr> <chr> <chr> <int> 1 9E Sat 2009-02-07 EVV DTW -107 2 9E Fri 2010-08-27 JFK IND -87 3 9E Tues 2010-11-09 JFK IND -87 4 9E Tues 2010-08-10 JFK IND -83 5 9E Tues 2010-07-06 JFK IND -81 6 9E Mon 2010-09-06 JFK IND -81 7 9E Tues 2010-09-14 JFK IND -81 8 9E Sat 2010-09-04 JFK IND -80 9 9E Sat 2010-09-11 JFK IND -80 10 9E Sat 2009-04-25 BMI DTW -80 # ... with 305 more rows Top 10 delays by carrier: 11.6 sec elapsed
With such results, one can understand why it seems that running code in memory acceptable. But is it optimal? I loaded the exact same CSV file in the database. The following queries will return the same result sets as in the previous examples.
Examples Using a Database
There’s no need to load files. We only need to establish a connection:
library(tictoc) library(data.table) library(RODBC) con <- odbcConnect("SQLServer")
First we start with the simple summary:
tic("Simple summary") sql <- "select Mean_Delays=avg(arrdelay*1.0), Flights = count(1), sdev = stdev(arrdelay) from airlines.dbo.air_delays" sqlQuery(con, sql) toc()
This runs 20 milliseconds slower than the dplyr version. Of course one would expect this since the database can provide limited added value in a full scan as compared to memory.
Mean_Delays Flights sdev 1 6.432151 148619655 31.24553 Simple summary: 2.03 sec elapsed
Now, let’s try to repeat the second scenario which applies some filters:
tic("Count Saturday flights between 1987 and 1988 (db)") sql <- "select count(1) as flights from airlines.dbo.air_delays where [year] between 1987 and 1988 and [dayofweek] = 'Sat'" sqlQuery(con, sql) toc()
The difference is enormous! It takes 10 milliseconds instead of 2.34 seconds. In other words, it’s 234 times faster than the file based version :
flights 1 871165 Count Saturday flights between 1987 and 1988 (db): 0.01 sec elapsed
This is the same grouping scenario as above:
tic("Grouping by year (db)") sql <- "select [year], count(1) as flights from airlines.dbo.air_delays where [dayofweek] = 'Sat' group by [year] order by [year]" sqlQuery(con, sql) toc()
Again, the database engine excels at this kind of query. It takes 40 milliseconds instead of 5.53 seconds, meaning it’s more than 130 times faster than the memory-based query.
year flights 1 1987 173370 2 1988 697795 3 1989 668080 4 1990 694251 5 1991 652991 6 1992 656920 7 1993 663606 8 1994 695245 9 1995 695286 10 1996 699527 ... Grouping by year (db): 0.04 sec elapsed
Now here is the scenario that Kan promised was 100 times better in R than SQL. Kan points out and Hadley implies that the SQL language is verbose and complex. Again, I’m not looking to debate that issue since it looks to me like this could be a subjective judgement based on experience rather than on any objective measure. Having worked intensively with SQL for decades, I personally find the query below to be very simple to write because it matches a common pattern I’ve come across many times. But I can fully understand how someone who has less experience with SQL can find this a bit daunting at first.
Instead, I want t evaluate this by the speed and with the needed resource requirements:
tic("Top 10 delays by carrier") sql <- " with carr as ( select distinct uniquecarrier from airlines.dbo.air_delays ) select a.uniquecarrier ,b.* from carr a cross apply ( select top(10) x.[dayofweek], x.flightdate, x.origin, x.dest, x.arrdelay from airlines.dbo.air_delays x where x.uniquecarrier = a.uniquecarrier order by x.arrdelay ) b " sqlQuery(con, sql) toc()
Again, the results come back 25 times faster in the database. But it’s not just the multiplicative factor that highlights the difference – there’s a tangible perceived difference. If this query become part of an operationalized data science application such as R Shiny or ML Server, users will find that this query feels slow at 11 seconds while data that returns in less than half a second feels .. instantaneous. I can find no argument to suggest, based on this use case, that dplyr is any way better than a database, much less “a 100x better”.
uniquecarrier dayofweek flightdate origin dest arrdelay 1 AS Tues 1987-10-27 SJC BUR -720 2 AS Tues 1987-10-27 BUR SFO -715 3 AS Tues 1987-10-27 SFO BUR -715 4 AS Tues 1987-10-27 BUR SJC -710 5 AS Thur 2005-03-10 GEG SEA -692 6 AS Sun 2004-05-16 PHX SEA -664 7 AS Sat 1989-02-18 JNU ANC -658 8 AS Sun 1987-11-01 JNU SEA -652 9 AS Sat 1987-11-14 KTN JNU -608 10 AS Fri 2003-01-31 OAK SEA -574 ... Top 10 delays by carrier: 0.45 sec elapsed
Databases are especially good at joining multiple data sets together to return a single result but dplyr also provides this ability. The dataset comes with a file of information about individual airplanes. In Michael Kane’s chapter, he seeks to find a way to analyze the impact of the age of the plane on delays. While he didn’t make use of this file, I thought we could simply join the data from the plane information to find the manufactured year of the plane as compared to the year of the flight. This is an imperfect analysis, but it’s a good use case for a join. This is the dplyr version:
tic("Mean delays by week day including average plane age - memory") dh %>% inner_join(planes,by = c("tailnum" = "tailnum")) %>% mutate(plane_age=year-mfgyear) %>% group_by(dayofweek) %>% summarize(Mean_Delay = mean(arrdelay, na.rm = T), Flights = n() , Mean_Plane_Age = mean(plane_age, na.rm = T) ) toc()
Strangely, this operation required more memory than my system has. It reached the limits for my system.
"Error: cannot allocate vector of size 246.4 Mb"
This is a twist on Hadley Wickham’s rule. In fact, it’s not enough that your data fit in memory – it also needs to have unspecified amounts of extra RAM available for hashing and sorting:
The same query poses no problem for the database at all:
tic("Plane age and delays") sql <- "select a.[year], avg(arrDelay*1.0) as Mean_Delay, count(1) as Flights, avg(A.year-P.mfgyear*1.0) as Mean_Plane_Age from airlines.dbo.air_delays A join airlines.dbo.planes P on A.tailnum = P.tailnum group by A.[year] order by A.[year]" sqlQuery(con, sql) toc()
year Mean_Delay Flights Mean_Plane_Age 1 1995 6.837042 1665427 -0.389181 2 1996 9.833741 1705829 0.553131 3 1997 7.372220 1770314 1.485291 4 1998 7.253959 1846509 2.254350 5 1999 7.634752 2042197 2.923195 6 2000 10.001133 2212597 3.611676 7 2001 2.480787 256182 2.752562 8 2002 3.027531 2023428 4.836912 9 2003 3.641663 3393691 4.519689 10 2004 6.371273 4540424 4.852204 ... Plane age and delays: 5.99 sec elapsed
Performance Times (in seconds)
|Group by Year||5.53||0.05|
|Top 10 Delays by Carrier||11.6||0.45|
|Plane Age and Delays||(failed)||5.99|
Keep in mind that the database environment I used for this example is very much on the low-end. In many corporate environments, the database server will be running with multiple CPUs, arrays of SSD disks on Raid 10 platforms, and 256 gigabytes of RAM. Under those conditions, the database times could be reduced even further.
When Should You Consider Using a Database?
As we can see from the cases above, you should use a database if performance is important to you, particularly in larger datasets. We only used 31 gigabytes in this dataset and we could see a dramatic improvement in performance, but the effects would be even more pronounced in larger datasets. Modern database engines are the result of decades in design for optimal data retrieval using sophisticated indexing and caching strategies that simply don’t exist in simpler tools. The idea that an R library like dplyr is going to be the one technology that beats speed tests against SQL Server, Oracle, or PostgreSQL is just ludicrous.
Beyond just the performance benefits, there are other important reasons to use a database in a data science project. Oddly enough, I agree with Kan Nishida in his conclusion where he states:
.. The best strategy I keep seeing with many R users is to fine tune the SQL queries to extract the data at a manageable size for your PC’s memory either by filtering, aggregating, or sampling, then import into R instance in memory so that you can quickly and iteratively explore and analyze the data with all the statistical horse powers.
It’s a mistake to try to use R to do those things that database engines excel at. Anytime you’re looking towards the past to understand what already happened, you’ll want to compile aggregates, filters, and joins from historical data to form a picture of a state at a period in time. Business Intelligence tools have been perfecting this since the early 90’s, so it makes little sense to try to replicate that capability with R now. Where R and Python shine is in their power to build statistical models of varying complexity which then get used to make predictions about the future. It would be perfectly ludicrous to try to use a SQL engine to create those same models in the same way it makes no sense to use R to create sales reports.
The database engine should be seen as a way to offload the more power-hungry and more tedious data operations from R or Python, leaving those tools to apply their statistical modeling strengths. This division of labor make it easier to specialize your team. It makes more sense to hire experts that fully understand databases to prepare data for the persons in the team who are specialized in machine learning rather than ask for the same people to be good at both things.
1 – Server-side computing
If your data is going to be made available to a team of 2 or more people, you’ll want to be sure that your infrastructure can support concurrent access to your data; something relational database engines are especially good at. Scaling from 2 to several thousand users is not an issue.
You could put the file on a server to be used by R Shiny or ML Server, but doing makes it nearly impossible to scale beyond few users. In our Airline Data example, the same 30 gigabyte dataset will load separately for each user connection. So if it costs 30 gigabytes of memory for one user, for 10 concurrent users, you would need to find a way to make 300 gigabytes of RAM available somehow.
2 – Scalability
This article used a 30 gigabyte file as an example, but there are many cases when data sets are much larger. In financial institutions, research organizations, and telecoms, it’s possible to be working with several terabytes of data at a time. This is easy work for relational database systems, many which are designed to handle petabytes of data if needed. For the purposes of data science, you’ll need the ability to query subsets of data as well as aggregates computed on the server in a way that would be impossible on a server.
3 – Clean data source
Although I didn’t show it as part of this article, I had to perform quite a bit of cleanup of the data file such as converting strings to dates, removing “None” from numeric columns, and removing a few records that seemed corrupt. I also added some columns to provide a classifier for some numeric ranges so as to convert an arrival delay from a number to a value such as “Early”, “On Time”, “Late”, “Very Late”, etc. This is a time-consuming operation that would be good to perform once and then store the results so that you and other team members can be spared the expense of doing it every time you want to perform your analysis.
4 – Security
Most RDBMS engines come with robust security features that make it possible for data to be read-only for some users, inaccessible to others, and read/writable for super-users. If you’re working with confidential data, you can make use of more advanced security features that block access to portions of information or even obfuscate specific columns that contain things such as NHS or SSN numbers.
So When should you NOT use a Database?
Going back to Hadley Wickham’s rule, I’ll say he’s right to forgo the use of a database in the following cases:
1 – You have a small dataset
If a dataset contains thousands of relatively narrow rows, the database might not use indexes to optimize performance anyway even if it has them. Loading data in memory won’t tax resources much, and performance will remain acceptable
2 – You’re only performing a “one-time” analysis
It’s probably more time effective to wait the few extra minutes for your tasks to complete than it is to spend time creating, loading, and indexing a database that you’ll delete at the end anyway
3 – You’re the only user of the data
The data can stay in your local workstation or other user environment without needing to worry about other users also accessing the data, and if it’s small enough, then it’s likely not worth the bother of using a database
Kan Nishida illustrates in his blog how calculating the overall median is so much more difficult in SQL than in R. This is true but he also chose a function that happens not to be built-in the SQL language, so it’s easy to make the case that R is more efficient – for this function. While I don’t think it’s fair to judge SQL vs. R on this one function like he does, I do think that this does a good job of highlighting the fact that certain computations are more efficient in R than in SQL. To get the most out of each of these platforms, we need to have a good idea of when to use one or the other. As a general rule, vectorized operations are going to be more efficient in R and row-based operations are going to be better in SQL.
As a rule of thumb, let SQL do the heavy lifting when you need to sort, filter, join, group, compute built-in aggregates using SUM, AVG, STDEV, MIN, MAX, COUNT, and when you’re using SQL to perform integrity checks. Also use SQL to perform common types of BI-type analysis. There’s really no point in using a Machine Learning tool to tell you what the percent difference in sales is between this quarter and last quarter; let SQL do that. Use R or Python when you need to perform higher order statistical functions including regressions of all kinds, neural networks, decision trees, clustering, and the thousands of other variations available.
In other words, use SQL to retrieve the data just the way you need it. If you find you’re still doing cleanup work on your data using R after retrieving it from the database, then you’re underutilizing SQL. Then use R or Python to build your predictive models. The end result should be faster development, more possible iterations to build your models, and faster response times.
Just because your dataset fits in memory, it doesn’t mean that you’ll enjoy better performance than you would with a database. In fact, you may handicapping your Machine Learning efforts unnecessarily by ignoring your SQL RDBMS as a data management platform.
SQL is not a machine learning tool by any stretch, but it’s possibly the most powerful tool available to manage the data you need for your Machine Learning projects. If you’re moving towards operationalization, then it becomes essential to use one. R and Python are top class tools for Machine Learning and should be used as such. While these languages come with clever and convenient data manipulation tools, it would be a mistake to think that they can be a replacement for platforms that specialize in data management. Let SQL bring you the data exactly like you need it, and let the Machine Learning tools do their own magic.