
Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models
Ebook · 686 pages · 4 hours


About this ebook

Extract accurate information from data to train and improve machine learning models using NumPy, SciPy, pandas, and scikit-learn libraries




Key Features



  • Discover solutions for feature generation, feature extraction, and feature selection


  • Uncover the end-to-end feature engineering process across continuous, discrete, and unstructured datasets


  • Implement modern feature extraction techniques using Python's pandas, scikit-learn, SciPy and NumPy libraries



Book Description



Feature engineering is invaluable for developing and enriching your machine learning models. In this cookbook, you will work with the best tools to streamline your feature engineering pipelines and to simplify and improve the quality of your code.

Using Python libraries such as pandas, scikit-learn, Featuretools, and Feature-engine, you'll learn how to work with both continuous and discrete datasets and be able to transform features from unstructured datasets. You will develop the skills necessary to select the best features as well as the most suitable extraction techniques. This book will cover Python recipes that will help you automate feature engineering to simplify complex processes. You'll also get to grips with different feature engineering strategies, such as the Box-Cox, power, and log transforms, across machine learning, reinforcement learning, and natural language processing (NLP) domains.

By the end of this book, you'll have discovered tips and practical solutions to all of your feature engineering problems.




What you will learn



  • Simplify your feature engineering pipelines with powerful Python packages


  • Get to grips with imputing missing values


  • Encode categorical variables with a wide set of techniques


  • Extract insights from text quickly and effortlessly


  • Develop features from transactional data and time series data


  • Derive new features by combining existing variables


  • Understand how to transform, discretize, and scale your variables


  • Create informative variables from date and time



Who this book is for



This book is for machine learning professionals, AI engineers, data scientists, and NLP and reinforcement learning engineers who want to optimize and enrich their machine learning models with the best features. Knowledge of machine learning and Python coding will assist you with understanding the concepts covered in this book.

Language: English
Release date: Jan 22, 2020
ISBN: 9781789807820



    Python Feature Engineering Cookbook


    Over 70 recipes for creating, engineering, and transforming features to build machine learning models

    Soledad Galli

    BIRMINGHAM - MUMBAI

    Python Feature Engineering Cookbook

    Copyright © 2020 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Commissioning Editor: Pravin Dhandre

    Acquisition Editor: Devika Battike

    Content Development Editor: Nathanya Dias

    Senior Editor: Ayaan Hoda

    Technical Editor: Manikandan Kurup

    Copy Editor: Safis Editing

    Project Coordinator: Aishwarya Mohan

    Proofreader: Safis Editing

    Indexer: Manju Arasan

    Production Designer: Aparna Bhagat

    First published: January 2020

    Production reference: 1210120

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    ISBN 978-1-78980-631-1

    www.packt.com

    Packt.com

    Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

    Why subscribe?

    Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

    Improve your learning with Skill Plans built especially for you

    Get a free eBook or video every month

    Fully searchable for easy access to vital information

    Copy and paste, print, and bookmark content

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

    At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

    Contributors

    About the author

    Soledad Galli is a lead data scientist with more than 10 years of experience in world-class academic institutions and renowned businesses. She has researched, developed, and put into production machine learning models for insurance claims, credit risk assessment, and fraud prevention. Soledad received a Data Science Leaders' award in 2018 and was named one of LinkedIn's voices in data science and analytics in 2019. She is passionate about enabling people to step into and excel in data science, which is why she mentors data scientists and speaks at data science meetings regularly. She also teaches online courses on machine learning in a prestigious Massive Open Online Course platform, which have reached more than 10,000 students worldwide.

    About the reviewer

    Greg Walters has been involved with computers and computer programming since 1972. He is well versed in Visual Basic, Visual Basic .NET, Python, and SQL, and is an accomplished user of MySQL, SQLite, Microsoft SQL Server, Oracle, C++, Delphi, Modula-2, Pascal, C, 80x86 Assembler, COBOL, and Fortran. He is a programming trainer and has trained numerous people on many pieces of computer software, including MySQL, Open Database Connectivity, Quattro Pro, Corel Draw!, Paradox, Microsoft Word, Excel, DOS, Windows 3.11, Windows for Workgroups, Windows 95, Windows NT, Windows 2000, Windows XP, and Linux. He is semi-retired and has written over 100 articles for the Full Circle magazine. He is open to working as a freelancer on various projects.

    Packt is searching for authors like you

    If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

    Table of Contents

    Title Page

    Copyright and Credits

    Python Feature Engineering Cookbook

    About Packt

    Why subscribe?

    Contributors

    About the author

    About the reviewer

    Packt is searching for authors like you

    Preface

    Who this book is for

    What this book covers

    To get the most out of this book

    Download the example code files

    Download the color images

    Conventions used

    Sections

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Get in touch

    Reviews

    Foreseeing Variable Problems When Building ML Models

    Technical requirements

    Identifying numerical and categorical variables

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Quantifying missing data

    Getting ready

    How to do it...

    How it works...

    Determining cardinality in categorical variables

    Getting ready

    How to do it...

    How it works...

    There's more...

    Pinpointing rare categories in categorical variables

    Getting ready

    How to do it...

    How it works...

    Identifying a linear relationship

    How to do it...

    How it works...

    There's more...

    See also

    Identifying a normal distribution

    How to do it...

    How it works...

    There's more...

    See also

    Distinguishing variable distribution

    Getting ready

    How to do it...

    How it works...

    See also

    Highlighting outliers

    Getting ready

    How to do it...

    How it works...

    Comparing feature magnitude

    Getting ready

    How to do it...

    How it works...

    Imputing Missing Data

    Technical requirements

    Removing observations with missing data

    How to do it...

    How it works...

    See also

    Performing mean or median imputation

    How to do it...

    How it works...

    There's more...

    See also

    Implementing mode or frequent category imputation

    How to do it...

    How it works...

    See also

    Replacing missing values with an arbitrary number

    How to do it...

    How it works...

    There's more...

    See also

    Capturing missing values in a bespoke category

    How to do it...

    How it works...

    See also

    Replacing missing values with a value at the end of the distribution

    How to do it...

    How it works...

    See also

    Implementing random sample imputation

    How to do it...

    How it works...

    See also

    Adding a missing value indicator variable

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Performing multivariate imputation by chained equations

    Getting ready

    How to do it...

    How it works...

    There's more...

    Assembling an imputation pipeline with scikit-learn

    How to do it...

    How it works...

    See also

    Assembling an imputation pipeline with Feature-engine

    How to do it...

    How it works...

    See also

    Encoding Categorical Variables

    Technical requirements

    Creating binary variables through one-hot encoding

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Performing one-hot encoding of frequent categories

    Getting ready

    How to do it...

    How it works...

    There's more...

    Replacing categories with ordinal numbers

    How to do it...

    How it works...

    There's more...

    See also

    Replacing categories with counts or frequency of observations

    How to do it...

    How it works...

    There's more...

    Encoding with integers in an ordered manner

    How to do it...

    How it works...

    See also

    Encoding with the mean of the target

    How to do it...

    How it works...

    See also

    Encoding with the Weight of Evidence

    How to do it...

    How it works...

    See also

    Grouping rare or infrequent categories

    How to do it...

    How it works...

    See also

    Performing binary encoding

    Getting ready

    How to do it...

    How it works...

    See also

    Performing feature hashing

    Getting ready

    How to do it...

    How it works...

    See also

    Transforming Numerical Variables

    Technical requirements

    Transforming variables with the logarithm

    How to do it...

    How it works...

    See also

    Transforming variables with the reciprocal function

    How to do it...

    How it works...

    See also

    Using square and cube root to transform variables

    How to do it...

    How it works...

    There's more...

    Using power transformations on numerical variables

    How to do it...

    How it works...

    There's more...

    See also

    Performing Box-Cox transformation on numerical variables

    How to do it...

    How it works...

    See also

    Performing Yeo-Johnson transformation on numerical variables

    How to do it...

    How it works...

    See also

    Performing Variable Discretization

    Technical requirements

    Dividing the variable into intervals of equal width

    How to do it...

    How it works...

    See also

    Sorting the variable values in intervals of equal frequency

    How to do it...

    How it works...

    Performing discretization followed by categorical encoding

    How to do it...

    How it works...

    See also

    Allocating the variable values in arbitrary intervals

    How to do it...

    How it works...

    Performing discretization with k-means clustering

    How to do it...

    How it works...

    Using decision trees for discretization

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Working with Outliers

    Technical requirements

    Trimming outliers from the dataset

    How to do it...

    How it works...

    There's more...

    Performing winsorization

    How to do it...

    How it works...

    There's more...

    See also

    Capping the variable at arbitrary maximum and minimum values

    How to do it...

    How it works...

    There's more...

    See also

    Performing zero-coding – capping the variable at zero

    How to do it...

    How it works...

    There's more...

    See also

    Deriving Features from Dates and Time Variables

    Technical requirements

    Extracting date and time parts from a datetime variable

    How to do it...

    How it works...

    See also

    Deriving representations of the year and month

    How to do it...

    How it works...

    See also

    Creating representations of day and week

    How to do it...

    How it works...

    See also

    Extracting time parts from a time variable

    How to do it...

    How it works...

    Capturing the elapsed time between datetime variables

    How to do it...

    How it works...

    See also

    Working with time in different time zones

    How to do it...

    How it works...

    See also

    Performing Feature Scaling

    Technical requirements

    Standardizing the features

    How to do it...

    How it works...

    See also

    Performing mean normalization

    How to do it...

    How it works...

    There's more...

    See also

    Scaling to the maximum and minimum values

    How to do it...

    How it works...

    See also

    Implementing maximum absolute scaling

    How to do it...

    How it works...

    There's more...

    See also

    Scaling with the median and quantiles

    How to do it...

    How it works...

    See also

    Scaling to vector unit length

    How to do it...

    How it works...

    See also

    Applying Mathematical Computations to Features

    Technical requirements

    Combining multiple features with statistical operations

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Combining pairs of features with mathematical functions

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Performing polynomial expansion

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Deriving new features with decision trees

    Getting ready

    How to do it...

    How it works...

    There's more...

    Carrying out PCA

    Getting ready

    How to do it...

    How it works...

    See also

    Creating Features with Transactional and Time Series Data

    Technical requirements

    Aggregating transactions with mathematical operations

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Aggregating transactions in a time window

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Determining the number of local maxima and minima

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Deriving time elapsed between time-stamped events

    How to do it...

    How it works...

    There's more...

    See also

    Creating features from transactions with Featuretools

    How to do it...

    How it works...

    There's more...

    See also

    Extracting Features from Text Variables

    Technical requirements

    Counting characters, words, and vocabulary

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Estimating text complexity by counting sentences

    Getting ready

    How to do it...

    How it works...

    There's more...

    Creating features with bag-of-words and n-grams

    Getting ready

    How to do it...

    How it works...

    See also

    Implementing term frequency-inverse document frequency

    Getting ready

    How to do it...

    How it works...

    See also

    Cleaning and stemming text variables

    Getting ready

    How to do it...

    How it works...

    Other Books You May Enjoy

    Leave a review - let other readers know what you think

    Preface

    Python Feature Engineering Cookbook covers well-demonstrated recipes focused on solutions that will assist machine learning teams in identifying and extracting features to develop highly optimized and enriched machine learning models. This book includes recipes to extract and transform features from structured datasets, time series, transactional data, and text. It includes recipes concerned with automating the feature engineering process, along with a wide arsenal of tools for categorical variable encoding, missing data imputation, and variable discretization. Further, it provides different strategies for feature transformation, such as the Box-Cox transform and other mathematical operations, and covers the use of decision trees to combine existing features into new ones. Each of these recipes is demonstrated in practical terms with the help of NumPy, SciPy, pandas, scikit-learn, Featuretools, and Feature-engine in Python.

    Throughout this book, you will practice feature generation, feature extraction, and transformation, leveraging scikit-learn's feature engineering arsenal, Featuretools, and Feature-engine, using Python and its powerful libraries.

    Who this book is for

    This book is intended for machine learning professionals, AI engineers, and data scientists who want to optimize and enrich their machine learning models with the best features. Prior knowledge of machine learning and Python coding is expected.

    What this book covers

    Chapter 1, Foreseeing Variable Problems When Building ML Models, covers how to identify the different problems that variables may present and that challenge machine learning algorithm performance. We'll learn how to identify missing data in variables, quantify the cardinality of a variable, and much more besides.
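As a taste of the kind of checks this chapter covers, quantifying missing data and cardinality takes only a couple of pandas calls. The toy DataFrame below is made up for illustration and is not one of the book's datasets:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value in each variable
df = pd.DataFrame({"city": ["London", "Paris", "London", np.nan],
                   "age": [25, np.nan, 40, 31]})

# Fraction of missing values per variable
missing_pct = df.isnull().mean()

# Cardinality: number of unique categories (nunique() ignores NaN)
cardinality = df["city"].nunique()
```

Here `missing_pct` reports 0.25 for each column, and `cardinality` is 2, since `nunique()` skips the missing entry.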

    Chapter 2, Imputing Missing Data, explains how to engineer variables that show missing information for some observations. In a typical dataset, variables will display values for certain observations, while values will be missing for others. We'll introduce various techniques to fill in those missing values, along with the code to execute each technique.
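As a rough sketch of one such recipe, median imputation plus a missing indicator can be written with plain pandas; the values below are made up for illustration. The book also assembles the same ideas into pipelines with scikit-learn and Feature-engine:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a gap in the "age" variable
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0]})

# Median imputation: replace NaN with the variable's median
df["age_imputed"] = df["age"].fillna(df["age"].median())

# A missing indicator preserves the fact that the value was absent
df["age_was_missing"] = df["age"].isnull().astype(int)
```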

    Chapter 3, Encoding Categorical Variables, introduces various classical and widely used techniques to transform categorical variables into numerical variables, demonstrates a technique for reducing the dimensionality of variables with high cardinality, and shows how to tackle infrequent values. This chapter also includes more complex techniques for encoding categorical variables, as described and used in the 2009 KDD competition.
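For a flavor of two of the simpler encodings covered, the sketch below shows one-hot encoding and count encoding with plain pandas, on a made-up variable:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "London", "Rome"]})

# One-hot encoding: one binary variable per category
onehot = pd.get_dummies(df["city"], prefix="city")

# Count encoding: replace each category with its number of observations
df["city_count"] = df["city"].map(df["city"].value_counts())
```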

    Chapter 4, Transforming Numerical Variables, uses various recipes to transform numerical variables, typically non-Gaussian, into variables that follow a more Gaussian-like distribution by applying multiple mathematical functions.
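As a quick sketch of what such transformations look like with NumPy (illustrative values only; the chapter also covers the Box-Cox and Yeo-Johnson transformations):

```python
import numpy as np

# A right-skewed variable (hypothetical values)
skewed = np.array([1.0, 4.0, 9.0, 100.0, 10000.0])

log_t = np.log(skewed)    # logarithm (positive values only)
sqrt_t = np.sqrt(skewed)  # square root
recip_t = 1.0 / skewed    # reciprocal (non-zero values only)
```

Each transformation compresses the long right tail, pulling the distribution closer to a Gaussian-like shape.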

    Chapter 5, Performing Variable Discretization, covers how to create bins and distribute the values of a variable across them. The aim of this technique is to improve the spread of values across a range. It includes well-established and frequently used techniques, such as equal-width and equal-frequency discretization, as well as more complex processes, such as discretization with decision trees.
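A minimal sketch of equal-width versus equal-frequency discretization using pandas (the 1–100 series is a made-up example):

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1, 101))  # integers 1..100

equal_width = pd.cut(values, bins=5)   # 5 intervals of equal width
equal_freq = pd.qcut(values, q=5)      # 5 intervals of ~equal frequency
```

With a uniform series the two agree, but on skewed data equal-frequency bins adapt their widths so each bin holds roughly the same number of observations.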

    Chapter 6, Working with Outliers, teaches a few mainstream techniques to remove outliers from the variables in the dataset. We'll also learn how to cap outliers at a given arbitrary minimum/maximum value.
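For illustration, one common capping rule is the inter-quartile range (IQR) proximity rule; the sketch below, on made-up numbers, caps an outlier with plain pandas:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Cap values beyond the IQR proximity rule boundaries
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr
capped = s.clip(lower=lower, upper=upper)
```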

    Chapter 7, Deriving Features from Dates and Time Variables, describes how to create features from dates and time variables. Date variables can't be used as-is to build machine learning models, for multiple reasons. We'll learn how to combine information from multiple time variables, such as calculating the time elapsed between them, and, importantly, how to work with variables in different time zones.
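A small, illustrative sketch with the pandas `.dt` accessor, on hypothetical timestamps:

```python
import pandas as pd

df = pd.DataFrame({
    "purchase": pd.to_datetime(["2020-01-22 14:30", "2020-03-05 09:15"])
})

# Extract date and time parts from the datetime variable
df["year"] = df["purchase"].dt.year
df["month"] = df["purchase"].dt.month
df["hour"] = df["purchase"].dt.hour

# Elapsed time between two datetime variables, expressed in days
df["delivered"] = df["purchase"] + pd.Timedelta(days=3)
df["days_to_delivery"] = (df["delivered"] - df["purchase"]).dt.days
```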

    Chapter 8, Performing Feature Scaling, covers the methods that we can use to put variables on the same scale. We'll learn how to standardize variables, how to scale them to the minimum and maximum values, and how to perform mean normalization or scale to the vector norm, among other techniques.
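As a hedged sketch, standardization and min-max scaling can be written directly with NumPy (the chapter itself leans on scikit-learn's scalers):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization: zero mean, unit variance
standardized = (x - x.mean()) / x.std()

# Min-max scaling: values mapped onto the [0, 1] interval
minmax = (x - x.min()) / (x.max() - x.min())
```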

    Chapter 9, Applying Mathematical Computations to Features, explains how to create new variables from existing ones by utilizing different mathematical computations. We'll learn how to create new features through the addition/difference/multiplication/division of existing variables and more. We will also learn how to expand the feature space with polynomial expansion and how to combine features using decision trees.
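A minimal sketch of combining existing variables arithmetically, on made-up financial figures (the variable names are hypothetical, not from the book's datasets):

```python
import pandas as pd

df = pd.DataFrame({"debt": [100.0, 200.0], "income": [1000.0, 500.0]})

# New features from arithmetic between existing variables
df["total"] = df["debt"] + df["income"]
df["debt_to_income"] = df["debt"] / df["income"]

# Statistics computed across a group of features
df["mean_of_pair"] = df[["debt", "income"]].mean(axis=1)
```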

    Chapter 10, Creating Features with Transactional and Time Series Data, covers how to create static features from transactional information, so that we obtain a static view of a customer, or client, at any point in time. We'll learn how to combine features using math operations, across transactions, in specific time windows and capture time between transactions. We'll also discuss how to determine time between special events. We'll briefly dive into signal processing and learn how to determine and quantify local maxima and local minima.
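The core idea of collapsing transactions into a static, one-row-per-customer view can be sketched with a pandas groupby (toy transactions, for illustration only):

```python
import pandas as pd

tx = pd.DataFrame({
    "customer": ["A", "A", "B", "A", "B"],
    "amount": [10.0, 20.0, 5.0, 30.0, 15.0],
})

# Aggregate each customer's transactions with mathematical operations
features = tx.groupby("customer")["amount"].agg(["sum", "mean", "max", "count"])
```

The resulting table has one row per customer, ready to join onto a modeling dataset.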

    Chapter 11, Extracting Features from Text Variables, explains how to derive features from text variables. We will learn how to capture the complexity of a text by counting the number of characters, words, and sentences, as well as the vocabulary and lexical variety. We will also learn how to create bag-of-words representations and how to implement TF-IDF, with and without n-grams.
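The simplest of these text features are one-liners in pandas; the two documents below are invented examples:

```python
import pandas as pd

docs = pd.Series(["Feature engineering is fun",
                  "Text features from raw text"])

n_chars = docs.str.len()                     # characters per document
n_words = docs.str.split().str.len()         # words per document
vocab = set(" ".join(docs).lower().split())  # unique words in the corpus
```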

    To get the most out of this book

    Python Feature Engineering Cookbook will help machine learning practitioners improve their data preprocessing and manipulation skills, empowering them to modify existing variables or create new features from existing data. You will learn how to implement many feature engineering techniques with multiple open source tools, streamlining and simplifying code while adhering to coding best practices. To make the most of this book, you are expected to have an understanding of machine learning and machine learning algorithms, some previous experience with data processing, and a degree of familiarity with datasets. In addition, working knowledge of Python and some familiarity with Python numerical computing libraries such as NumPy, pandas, Matplotlib, and scikit-learn will be beneficial. You should be experienced in using Python through Jupyter Notebooks or interactively through a Python console or Command Prompt, or have experience with a dedicated Python IDE, such as PyCharm or Spyder.

    Download the example code files

    You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

    You can download the code files by following these steps:

    Log in or register at www.packt.com.

    Select the Support tab.

    Click on Code Downloads.

    Enter the name of the book in the Search box and follow the onscreen instructions.

    Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

    WinRAR/7-Zip for Windows

    Zipeg/iZip/UnRarX for Mac

    7-Zip/PeaZip for Linux

    The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789806311_ColorImages.pdf.

    Conventions used

    There are a number of text conventions used throughout this book.

    CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The nunique() method ignores missing values by default.

    A block of code is set as follows:

    import pandas as pd
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import PolynomialFeatures

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    X_train['A7'] = np.where(X_train['A7'].isin(frequent_cat), X_train['A7'], 'Rare')
    X_test['A7'] = np.where(X_test['A7'].isin(frequent_cat), X_test['A7'], 'Rare')

    Any command-line input or output is written as follows:

    $ pip install feature-engine

    Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: Click the Download button.

    Warnings or important notes appear like this.

    Tips and tricks appear like this.

    Sections

    In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

    To give clear instructions on how to complete a recipe, we use these sections as follows:

    Getting ready

    This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

    How to do it…

    This section contains the steps required to follow the recipe.

    How it works…

    This section usually consists of a detailed explanation of what happened in the previous section.

    There's more…

    This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.

    See also

    This section provides helpful links to other useful information for the recipe.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Reviews

    Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

    For more information about Packt, please visit packt.com.

    Foreseeing Variable Problems When Building ML Models

    A variable is a characteristic, number, or quantity that can be measured or counted. Most variables in a dataset are either numerical or categorical. Numerical variables take numbers as values and can be discrete or continuous, whereas for categorical variables, the values are selected from a group of categories.
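A first pass at telling the two apart can rely on pandas dtypes; the DataFrame below is a made-up example, not one of the book's datasets:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 31],
                   "city": ["London", "Paris", "Rome"]})

# Numeric dtypes suggest numerical variables; object/category dtypes
# suggest categorical ones (though dtypes alone can mislead, e.g.
# numeric codes that really label categories)
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
```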
