Automating Data Exploration With R

Build the tools needed to quickly turn data into model-ready data sets

Last updated 2022-01-10 | Rating: 4.7

What you'll learn

- Build a pipeline to automate the processing of raw data for discovery and modeling
- Know the main steps to prepare data for modeling
- Know how to handle the different data types in R
- Understand data imputation
- Treat categorical data properly with binarization (making dummy columns)
- Apply feature engineering to dates, integers, and real numbers
- Apply variable selection with correlation and significance tests
- Model and measure prepared data using both supervised and unsupervised modeling

Requirements

* Basic understanding of R programming
* Some statistical and modeling knowledge

Description

As data scientists and analysts, we face constant, repetitive tasks when approaching new data sets. This class aims to automate many of those tasks so we can get to the actual analysis as quickly as possible. Of course, there will always be exceptions to the rule; some manual work and customization will be required. But overall, a large swath of that work can be automated by building a smart pipeline, and that is what we'll do here. This is especially important in the era of big data, where handling variables by hand isn't always possible.

It is also a great learning strategy to think in terms of a processing pipeline and to understand, design, and build each stage as a separate and independent unit.

Who this course is for:

  • Anyone who wants or needs to process raw data for exploration and modeling in R

Course content

6 sections • 21 lectures

What is Covered in this Class Preview 02:12

Big Picture - Data Scrubbing Preview 02:47

Let's briefly talk big picture so we're all on the same page.

Optional - Getting RStudio Preview 02:57

A brief video on where to download base R and RStudio - skip this if you already have them up and running.

Common Data Readers Preview 16:22

Let's take a look at popular data readers from the base package, readr, and data.table.
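
As a minimal sketch of what this lecture compares (the file name data.csv is a placeholder), here are the three readers side by side:

    # Base R: flexible, but slower on large files
    df1 <- read.csv("data.csv", stringsAsFactors = FALSE)

    # readr: faster, guesses column types, returns a tibble
    library(readr)
    df2 <- read_csv("data.csv")

    # data.table: fastest on big files, returns a data.table
    library(data.table)
    df3 <- fread("data.csv")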

Dates - Reading and Casting Dates Preview 19:46

Let's start by looking at dates and how to format them properly.
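
As a quick illustration of casting, base R's as.Date accepts an explicit format string for non-ISO dates (the sample strings below are made up):

    d1 <- as.Date("2022-01-10")                        # ISO dates need no format
    d2 <- as.Date("01/10/2022", format = "%m/%d/%Y")   # US-style dates
    d3 <- as.Date("10 Jan 2022", format = "%d %b %Y")  # %b assumes an English locale
    format(d2, "%Y-%m-%d")                             # back to text in a canonical form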

Text Data - Ways to Quantify Free-Form Text Preview 17:55

We need to find clever ways of turning text data into numbers. You can choose to ignore any text and just model off the numerical variables, but you would be leaving a lot of intelligence on the table.
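
As one small example of what "turning text into numbers" can mean, here are three simple features built with base R (the sample strings are invented):

    txt <- c("great product, fast shipping", "slow and broken", NA)
    len_chars  <- ifelse(is.na(txt), 0L, nchar(txt))                      # text length
    word_count <- ifelse(is.na(txt), 0L, lengths(strsplit(txt, "\\s+")))  # rough word count
    has_broken <- as.integer(!is.na(txt) & grepl("broken", txt))          # keyword indicator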

Text Data - Categories Preview 18:40

In a few cases you can turn text directly into factors and model it - let's do it the right way and see what it takes to do it properly.
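
For context, one common base R way to binarize a categorical column (not necessarily the exact approach built in this lecture) is model.matrix:

    df <- data.frame(color = c("red", "blue", "red", "green"), x = 1:4)
    dummies <- model.matrix(~ color - 1, data = df)  # one 0/1 column per level
    df_binarized <- cbind(df["x"], dummies)          # drop the text, keep the dummies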

Text Data - Categories 2 & Pipeline Check Preview 12:02

Let's do a pipeline check and upgrade our Binarize_Features function.

Imputing Data - Dealing with Missing Data Preview 13:47

Let's look at imputing missing data with zeros or the mean value of the feature.
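
A minimal sketch of both strategies; the helper name impute_column is hypothetical, not a function from the course:

    impute_column <- function(x, method = c("zero", "mean")) {
      method <- match.arg(method)
      fill <- if (method == "zero") 0 else mean(x, na.rm = TRUE)
      x[is.na(x)] <- fill   # replace missing values with the chosen fill
      x
    }

    v <- c(1, NA, 3, NA, 5)
    impute_column(v, "mean")   # NAs become 3, the feature mean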

Pipeline Check Preview 08:24

Caret Library - nearZeroVar Preview 06:28

Here is a look at a cool function from the caret package: nearZeroVar. It can tell you which features have little or no variance (no PDF associated with this video).
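
A small usage sketch, using iris with an artificial constant column so nearZeroVar has something to flag:

    library(caret)
    data(iris)
    iris$constant <- 1                          # zero-variance column for illustration
    nearZeroVar(iris, saveMetrics = TRUE)       # freqRatio, percentUnique, zeroVar, nzv
    drop_idx <- nearZeroVar(iris)               # indices of near-zero-variance columns
    clean <- iris[ , -drop_idx, drop = FALSE]   # remove them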

Engineering Dates - Getting Additional Features out of Dates Preview 16:19

Let's see what additional numerical data we can pull out of date features.
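
As a sketch of the idea, a single Date column can yield several numeric features (the dates below are arbitrary):

    d <- as.Date(c("2021-12-25", "2022-01-10"))
    date_features <- data.frame(
      year       = as.integer(format(d, "%Y")),
      month      = as.integer(format(d, "%m")),
      day        = as.integer(format(d, "%d")),
      weekday    = as.integer(format(d, "%u")),   # 1 = Monday ... 7 = Sunday
      is_weekend = as.integer(format(d, "%u") %in% c("6", "7"))
    )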

Numerical Engineering - Integers and Real Numbers Preview 12:20

Just like we squeezed more intelligence out of dates, here we'll apply the same principles to integers and real numbers.
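
A few common transformations of this kind, shown on made-up values:

    x <- c(1, 18, 250, 3000)
    log_x <- log1p(x)                  # log(1 + x) compresses long right tails
    z_x   <- as.numeric(scale(x))      # z-score standardization
    bin_x <- cut(x, breaks = c(0, 10, 100, 1000, Inf),
                 labels = c("tiny", "small", "medium", "large"))  # binning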

Pipeline Check Preview 06:19

Correlations Preview 21:52

Let's look at pairwise correlations and how to access the results programmatically.
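
A base R sketch of working with a correlation matrix programmatically, here on a few mtcars columns as stand-in data:

    num  <- mtcars[ , c("mpg", "hp", "wt", "disp")]
    cmat <- cor(num, use = "pairwise.complete.obs")

    # Flatten the matrix so pairs can be filtered and sorted in code
    cor_pairs <- data.frame(a = rownames(cmat)[row(cmat)],
                            b = colnames(cmat)[col(cmat)],
                            r = as.vector(cmat))
    cor_pairs <- cor_pairs[cor_pairs$a < cor_pairs$b, ]  # keep each pair once
    cor_pairs[order(-abs(cor_pairs$r)), ]                # strongest correlations first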

Caret Library - findCorrelation Preview 04:49

A look at the findCorrelation function from the caret package (no PDF associated with this lecture).
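
A usage sketch with mtcars as stand-in data; the 0.75 cutoff is an arbitrary choice:

    library(caret)
    num  <- mtcars[ , c("mpg", "hp", "wt", "disp", "drat")]
    high <- findCorrelation(cor(num), cutoff = 0.75, names = TRUE)
    high                                          # columns suggested for removal
    reduced <- num[ , setdiff(names(num), high)]  # keep the rest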

Hunting Outliers Preview 10:31

Finding outliers in feature sets using the mean and standard deviation.
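
A minimal sketch of that idea; the helper name flag_outliers and the 3-standard-deviation threshold are illustrative choices, not prescriptions from the lecture:

    flag_outliers <- function(x, k = 3) {
      mu <- mean(x, na.rm = TRUE)
      s  <- sd(x, na.rm = TRUE)
      abs(x - mu) > k * s        # TRUE where x is more than k SDs from the mean
    }

    set.seed(1)
    x <- c(rnorm(100), 50)       # inject one obvious outlier
    which(flag_outliers(x))      # index 101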

Random Forest - Titanic Data Set Preview 15:46

Let's see how our pipeline functions work on the Titanic data set and a random forest model.
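
A hedged sketch of the modeling step; the file titanic.csv and its column names (Survived, Pclass, Sex, Age, Fare) are assumed to match the classic Kaggle training set:

    library(randomForest)
    titanic <- read.csv("titanic.csv", stringsAsFactors = FALSE)
    titanic <- na.omit(titanic[ , c("Survived", "Pclass", "Sex", "Age", "Fare")])
    titanic$Survived <- factor(titanic$Survived)
    titanic$Sex      <- factor(titanic$Sex)

    set.seed(42)
    fit <- randomForest(Survived ~ ., data = titanic, ntree = 500)
    print(fit)        # out-of-bag error estimate and confusion matrix
    varImpPlot(fit)   # which features the forest leaned on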

GBM (Generalized Boosted Models)/Caret - Diabetes Data Set - 1 Preview 13:55

Here we use the caret package, two of our pipeline functions, and a GBM model to predict hospital readmissions.
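
A runnable sketch of the caret-plus-GBM pattern, with a two-class subset of iris standing in for the diabetes readmission data (the gbm package must be installed for caret to call):

    library(caret)
    two <- droplevels(subset(iris, Species != "setosa"))  # binary toy problem

    set.seed(42)
    ctrl <- trainControl(method = "cv", number = 5)
    fit  <- train(Species ~ ., data = two, method = "gbm",
                  trControl = ctrl, verbose = FALSE)
    fit   # cross-validated accuracy across the default tuning grid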

GBM - 2 Preview 14:52

K-means, Unsupervised Modeling Preview 14:11
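
A minimal k-means sketch on iris; scaling first matters because k-means is distance-based, and centers = 3 is chosen here to match the known species count:

    num <- scale(iris[ , 1:4])
    set.seed(42)
    km <- kmeans(num, centers = 3, nstart = 25)
    table(km$cluster, iris$Species)   # compare clusters with the known labels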