Data Preparation in SQL, with Cheat Sheet!
(This article originally appeared on KDNuggets.com here. For more, visit https://www.kdnuggets.com/)
If your raw data is in a SQL-based data lake, why spend the time and money to export the data into a new platform for data prep?
Whether we like it or not, data prep is a major part of every data science project. Data preparation consists of tasks to prepare data in a repeatable process for use in business analytics, including data acquisition, data storage and handling, data cleaning, and early-stages of feature engineering.
"Why is it that data preparation is often described as 80% of the work involved in data-related tasks, and do you think this is an accurate generalization?"
—Matthew Mayo, KDnuggets"80%? I often hear >90%!"
—Sebastian Raschka, Michigan State UniversityFrom Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders, KDnuggets
A fundamental question is how and where to do the data prep. There are valid reasons to pick different tools and methods:
- Which tool or languages are your staff/peers most familiar with?
- Do you need a repeatable process, scheduled to run on a regular basis?
- Where is the bulk of your data currently located?
- What data volume and velocity are you dealing with?
- What current cloud services or on-prem resources exist to run the process?
- Which tools are most compatible with the current architecture in your organization?
- Does your dataset contain private elements (HIPA, FERPA, GDPR) that require special protection?
For many organizations, the answers to those questions are going to lead to SQL. Not only is SQL widely known and used in most organizations, but it also leverages existing database resources, security, and pipelines. If your raw data is in a SQL-based data lake, why spend the time and money to export the data into a new platform for data prep?
The following "Quick Reference" guide will give a sampling of SQL approaches to each of the steps in data preparation. This is not meant to be an exhaustive list of SQL functions or options, but rather a starting point.
Click for high resolution
Download the Quick Reference PDF here.
A final word on creating an interface to your model. SQL views allow you to wrap up the complexity of many of the data prep steps in a clean, secure, modular format. Rather than embedding long, complex queries in your Python or R code, you can create a view that allows access to that code in a simple, reusable format. Views are also a great way to apply security on private data elements by masking or hiding those from model access.
If you are already investing in high-performant database licenses, why not leverage them for data science by doing data prep in SQL?
Stay in the loop.
Subscribe to our newsletter for a weekly update on the latest podcast, news, events, and jobs postings.