Feature Engineering For Machine Learning

Transform the variables in your data and build better performing machine learning models

Last updated 2022-01-10 | 4.7

What you'll learn

Learn multiple techniques for missing data imputation
Transform categorical variables into numbers while capturing meaningful information
Learn how to deal with infrequent, rare and unseen categories
Transform skewed variables so that their distribution is closer to Gaussian
Convert numerical variables into discrete variables
Remove outliers from your variables
Extract meaningful features from dates and time variables
Learn techniques used in organisations worldwide and in data competitions
Increase your repertoire of techniques to preprocess data and build more powerful machine learning models

Requirements

* A Python installation
* Jupyter notebook installation
* Python coding skills
* Some experience with Numpy and Pandas
* Familiarity with Machine Learning algorithms
* Familiarity with Scikit-Learn

Description

Discover plenty of methods and tools to engineer the variables in your datasets and build more powerful machine learning models.


Learn and master popular and cutting-edge feature engineering techniques in this comprehensive course.


In this course, you will learn:


  • How to impute missing data

  • How to encode categorical variables

  • How to transform numerical variables or carry out discretization

  • How to remove outliers

  • How to extract features from date and time variables

  • How to implement these techniques in Python professionally and elegantly.


At the end of the course, you will be able to implement a sequence of feature transformations in a single and elegant pipeline, which will allow you to put your predictive models into production with maximum efficiency.
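
As a rough illustration of what such a pipeline can look like, here is a minimal sketch with Scikit-learn. The column names, imputation strategies and the final model are hypothetical placeholders, not the course's exact example:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical column names, for illustration only.
    numeric_cols = ["age", "income"]
    categorical_cols = ["city", "product"]

    # One branch per variable type: impute, then scale or encode.
    preprocessing = ColumnTransformer([
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ])

    # The whole sequence of transformations plus the model in one object,
    # which can be fitted, cross-validated and deployed as a single unit.
    model = Pipeline([
        ("preprocessing", preprocessing),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])

Because the transformations and the model live in a single object, the same fitted pipeline can be applied to new data at prediction time, which is what makes deployment straightforward.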


Content Overview


This is the most comprehensive online course in variable engineering. You will learn a huge variety of engineering techniques used worldwide to clean and transform your data. The methods discussed in this course are based on scientific articles, white papers, data science competitions, and, of course, my own experience as a data scientist.


In this course, you will first learn the most widely used techniques for variable engineering, and then more advanced and cutting-edge methodologies that capture information while encoding or transforming your variables. You will find detailed explanations of the procedures, their advantages, limitations, and underlying assumptions, as well as the best programming practices to implement them in Python.


By the end of the course, you will be able to simplify your feature engineering procedures by utilizing open-source Python libraries like Scikit-learn, Category encoders, and Feature-engine.


All topics include hands-on Python code examples that you can use for reference, practice, and re-use in your own projects. In addition, the code is updated regularly to keep up with new trends and new Python library releases.


So what are you waiting for? Enroll today, embrace the power of feature engineering and build better machine learning models.

Who this course is for:

  • Data Scientists who want to get started in pre-processing datasets to build machine learning models
  • Data Scientists who want to learn more techniques for feature engineering for machine learning
  • Data Scientists who want to improve their coding skills and best programming practices for feature engineering
  • Software engineers, mathematicians and academics switching careers into data science
  • Data Scientists who want to try different feature engineering techniques on data competitions
  • Software engineers who want to learn how to use Scikit-learn and other open-source packages for feature engineering

Course content

14 sections • 139 lectures

Introduction Preview 05:16

Course curriculum overview Preview 06:00

Course requirements Preview 03:08

How to approach this course Preview 01:09

Setting up your computer Preview 01:27

Course Material Preview 01:59

Download Jupyter notebooks Preview 00:15

Download datasets Preview 01:23

Download course presentations Preview 00:04

Moving Forward Preview 02:14

FAQ: Data Science, Python programming, datasets, presentations and more... Preview 00:44

Variables | Intro Preview 02:37

Numerical variables Preview 05:03

Categorical variables Preview 03:43

Date and time variables Preview 01:58

Mixed variables Preview 02:16

Quiz about variable types

Variable characteristics Preview 02:43

Missing data Preview 06:46

Cardinality - categorical variables Preview 05:03

Rare Labels - categorical variables Preview 04:54

Linear models assumptions Preview 09:13

Linear model assumptions - additional reading resources (optional) Preview 00:35

Variable distribution Preview 05:08

Outliers Preview 08:27

Variable magnitude Preview 03:08

Variable characteristics and machine learning models Preview 00:10

Table illustrating the advantages and disadvantages of different machine learning algorithms, as well as their requirements in terms of feature engineering, and common applications. 

Additional reading resources Preview 00:38

Introduction to missing data imputation Preview 03:58

Complete Case Analysis Preview 06:46

In this lecture, I describe complete case analysis: what it is, what assumptions it makes, and the implications and consequences of handling missing values with this method.
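
For reference, a minimal pandas sketch of complete case analysis, using a toy DataFrame with illustrative variable names:

    import pandas as pd

    df = pd.DataFrame({
        "age": [25, None, 40, 31],
        "income": [50000, 62000, None, 48000],
    })

    # Complete case analysis: keep only the rows with no missing value
    # in any variable. Note how much data can be lost when several
    # variables contain missing observations.
    complete_cases = df.dropna()
    print(f"Retained {len(complete_cases)} of {len(df)} rows")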

Mean or median imputation Preview 07:53

In this lecture, I describe what replacing missing values with the mean or median of the variable involves, the assumptions, advantages and disadvantages of this method, and how it may affect the performance of machine learning algorithms.
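
A minimal sketch of median imputation with Scikit-learn's SimpleImputer (covered in detail later in the course); the variables are hypothetical:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                       "income": [50000, 62000, np.nan, 48000]})

    # Learn the median of each variable from the data, then use those
    # values to fill in the missing observations.
    imputer = SimpleImputer(strategy="median")
    df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)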

Arbitrary value imputation Preview 06:42

End of distribution imputation Preview 04:53

Frequent category imputation Preview 06:56

Missing category imputation Preview 04:05

Random sample imputation Preview 14:17

In this lecture, I describe what random sample imputation is, its advantages, and the care that should be taken if this method is implemented in a business setting.
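
A minimal pandas sketch of the idea, assuming a single variable with missing values; fixing the random seed keeps the imputation reproducible, which matters in business settings:

    import numpy as np
    import pandas as pd

    s = pd.Series([25, np.nan, 40, 31, np.nan, 29], name="age")

    # Draw random values from the observed (non-missing) part of the
    # variable, one for each missing observation, and fill the gaps.
    n_missing = s.isna().sum()
    random_sample = s.dropna().sample(n=n_missing, replace=True, random_state=0)
    random_sample.index = s[s.isna()].index
    s_imputed = s.fillna(random_sample)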

Adding a missing indicator Preview 05:26

Here I describe the process of adding one additional binary variable to capture those observations where data is missing. 
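
A minimal sketch of adding a missing indicator with pandas; the variable name is illustrative:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [50000, np.nan, 62000, np.nan]})

    # A binary flag captures where the original value was missing,
    # so that information is preserved even after imputation.
    df["income_missing"] = df["income"].isna().astype(int)
    df["income"] = df["income"].fillna(df["income"].median())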

Mean or median imputation with Scikit-learn Preview 10:33

Arbitrary value imputation with Scikit-learn Preview 05:35

Frequent category imputation with Scikit-learn Preview 03:48

Missing category imputation with Scikit-learn Preview 02:46

Adding a missing indicator with Scikit-learn Preview 04:06

Automatic determination of imputation method with Sklearn Preview 08:24

Introduction to Feature-engine Preview 05:10

Mean or median imputation with Feature-engine Preview 04:51

Arbitrary value imputation with Feature-engine Preview 03:30

End of distribution imputation with Feature-engine Preview 04:46

Frequent category imputation with Feature-engine Preview 01:38

Missing category imputation with Feature-engine Preview 02:57

Random sample imputation with Feature-engine Preview 02:28

Continues from the previous lecture: I describe what random sample imputation is, its advantages, and the care that should be taken if this method is implemented in a business setting.

Adding a missing indicator with Feature-engine Preview 04:06

Overview of missing value imputation methods Preview 00:08

Conclusion: when to use each missing data imputation method Preview 01:27

Multivariate Imputation Preview 03:31

KNN Impute Preview 04:22

KNN Impute - Demo Preview 07:04

MICE Preview 07:07

missForest Preview 01:07

MICE and missForest - Demo Preview 03:58

Additional Reading resources (Optional) Preview 00:12

Categorical encoding | Introduction Preview 06:49

One hot encoding Preview 06:09

Important: Feature-engine version 1.0.0 Preview 00:22

One-hot-encoding: Demo Preview 14:12

One hot encoding of top categories Preview 03:06

One hot encoding of top categories | Demo Preview 08:35

Ordinal encoding | Label encoding Preview 01:50

Ordinal encoding | Demo Preview 08:08

Count or frequency encoding Preview 03:11

Count encoding | Demo Preview 04:33

Target guided ordinal encoding Preview 02:41

Target guided ordinal encoding | Demo Preview 08:30

Mean encoding Preview 02:16

Mean encoding | Demo Preview 05:31

Probability ratio encoding Preview 06:13

Weight of evidence (WoE) Preview 04:36

Weight of Evidence | Demo Preview 12:38

Comparison of categorical variable encoding Preview 10:36

Rare label encoding Preview 04:31

In this lecture I will describe and compare two methods commonly used to replace rare labels. Rare labels are categories within a categorical variable that contain very few observations, and can therefore affect the performance of tree-based machine learning algorithms.

In this lecture I will focus on variables with one predominant category.
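
A minimal pandas sketch of grouping rare labels under a single category; the variable and the 20% frequency threshold are illustrative, and the course also shows how to do this with open-source libraries:

    import pandas as pd

    s = pd.Series(["blue", "blue", "red", "red", "green", "violet"], name="colour")

    # Compute the relative frequency of each category and replace those
    # below the chosen threshold with the label "Rare".
    freq = s.value_counts(normalize=True)
    rare_labels = freq[freq < 0.20].index
    s_encoded = s.where(~s.isin(rare_labels), "Rare")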

Rare label encoding | Demo Preview 10:25

In this lecture I will describe and compare two methods commonly used to replace rare labels. Rare labels are categories within a categorical variable that contain very few observations, and can therefore affect the performance of tree-based machine learning algorithms.

In this lecture I will focus on variables with few categories.

Binary encoding and feature hashing Preview 06:12

Summary table of encoding techniques Preview 00:05

Additional reading resources Preview 00:18

Variable Transformation | Introduction Preview 04:48

Variable Transformation with Numpy and SciPy Preview 07:38

Variable Transformation with Scikit-learn Preview 07:03

Variable transformation with Feature-engine Preview 03:41

Discretisation | Introduction Preview 03:01

Equal-width discretisation Preview 04:06

Important: Feature-engine v 1.0.0 Preview 00:17

Equal-width discretisation | Demo Preview 11:18

Equal-frequency discretisation Preview 04:13

Equal-frequency discretisation | Demo Preview 07:16

K-means discretisation Preview 04:13

K-means discretisation | Demo Preview 02:43

Discretisation plus categorical encoding Preview 02:54

Discretisation plus encoding | Demo Preview 05:45

Discretisation with classification trees Preview 05:05

Discretisation with decision trees using Scikit-learn Preview 11:55

Discretisation with decision trees using Feature-engine Preview 03:48

Domain knowledge discretisation Preview 03:52

Additional reading resources Preview 00:08

Outlier Engineering | Intro Preview 07:42

Outlier trimming Preview 07:21

Outlier capping with IQR Preview 06:24

In this lecture I will describe a common method to handle outliers in numerical variables: capping them at boundaries derived from the inter-quartile range. This method is commonly used in surveys as well as in other business settings.
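
A minimal pandas sketch of IQR-based capping on a single numerical variable; the 1.5 multiplier is the usual convention:

    import pandas as pd

    s = pd.Series([12, 15, 14, 13, 16, 95], name="value")

    # Boundaries at 1.5 times the inter-quartile range beyond the quartiles.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Capping: outliers are pulled back to the boundaries instead of
    # being removed from the dataset.
    s_capped = s.clip(lower=lower, upper=upper)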

Outlier capping with mean and std Preview 04:44

This lecture continues from the previous one.

I continue to describe common methods to handle outliers in numerical variables, here capping at the mean plus or minus a multiple of the standard deviation. These methods are commonly used in surveys as well as in other business settings.
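
And a minimal sketch of capping at the mean plus or minus three standard deviations, a common choice for roughly Gaussian variables; the multiplier of 3 is an assumption of the example:

    import pandas as pd

    s = pd.Series([12, 15, 14, 13, 16, 95], name="value")

    # Boundaries at the mean +/- 3 standard deviations.
    lower = s.mean() - 3 * s.std()
    upper = s.mean() + 3 * s.std()
    s_capped = s.clip(lower=lower, upper=upper)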

Outlier capping with quantiles Preview 03:17

Arbitrary capping Preview 03:33

Important: Feature-engine v1.0.0 Preview 00:07

Additional reading resources Preview 00:05

Feature scaling | Introduction Preview 03:43

Standardisation Preview 05:30

Standardisation | Demo Preview 04:38

Mean normalisation Preview 04:01

Mean normalisation | Demo Preview 05:20

Scaling to minimum and maximum values Preview 03:23

MinMaxScaling | Demo Preview 03:00

Maximum absolute scaling Preview 03:01

MaxAbsScaling | Demo Preview 03:44

Scaling to median and quantiles Preview 02:45

Robust Scaling | Demo Preview 02:03

Scaling to vector unit length Preview 05:50

Scaling to vector unit length | Demo Preview 05:17

Additional reading resources Preview 00:09

Engineering mixed variables Preview 03:13

Engineering mixed variables | Demo Preview 06:10

Engineering datetime variables Preview 04:43

Engineering dates | Demo Preview 08:16

Engineering time variables and different timezones Preview 04:33

Putting it all together Preview 06:43

Feature Engineering Pipeline Preview 08:22

Classification pipeline Preview 13:14

Regression pipeline Preview 13:50

Beat the performance of my ML model by engineering features!!!

In this assignment, your challenge is to beat the performance of my Lasso regression model by implementing different feature engineering steps ONLY!!! The performance of my current model, as shown in the attached notebook, is: test rmse: 32603, test r2: 0.845.
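
If you want to check your own attempt against that baseline, here is a minimal sketch of computing the same metrics with Scikit-learn. The synthetic data, the Lasso alpha and the train/test split are placeholders; substitute the assignment's dataset and your own feature engineering pipeline:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the assignment's dataset.
    X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = Lasso(alpha=0.001, random_state=0)
    model.fit(X_train, y_train)

    # Report the same metrics used in the assignment: RMSE and R2 on the test set.
    preds = model.predict(X_test)
    print("test rmse:", np.sqrt(mean_squared_error(y_test, preds)))
    print("test r2:", r2_score(y_test, preds))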

Feature engineering pipeline with cross-validation Preview 06:47

More examples Preview 00:04