Ebook: 691 pages (5 hours)


About this ebook

If data is the new oil, then machine learning is the drill. As companies gain access to ever-increasing quantities of raw data, the ability to deliver state-of-the-art predictive models that support business decision-making becomes more and more valuable.

In this book, you’ll work on an end-to-end project based around a realistic data set and split up into bite-sized practical exercises. This creates a case-study approach that simulates the working conditions you’ll experience in real-world data science projects.

You’ll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of data exploration and data processing, before moving on to fitting, evaluating, and tuning algorithms such as regularized logistic regression and random forest.

Now in its second edition, this book will take you through the end-to-end process of exploring data and delivering machine learning models. Updated for 2021, this edition includes brand new content on XGBoost, SHAP values, algorithmic fairness, and the ethical concerns of deploying a model in the real world.

By the end of this data science book, you’ll have the skills, understanding, and confidence to build your own machine learning models and gain insights from real data.

Language: English
Release date: Jul 29, 2021
ISBN: 9781800569447


    Data Science Projects with Python

    second edition

    Copyright © 2021 Packt Publishing

    All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Author: Stephen Klosterman

    Reviewers: Ashish Jain and Deepti Miyan Gupta

    Managing Editor: Mahesh Dhyani

    Acquisitions Editors: Sneha Shinde and Anindya Sil

    Production Editor: Shantanu Zagade

    Editorial Board: Megan Carlisle, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Abhishek Rane, Brendan Rodrigues, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

    First published: April 2019

    Second edition: July 2021

    Production reference: 1280721

    ISBN: 978-1-80056-448-0

    Published by Packt Publishing Ltd.

    Livery Place, 35 Livery Street

    Birmingham B3 2PB, UK

    Table of Contents

    Preface

    1. Data Exploration and Cleaning

    Introduction

    Python and the Anaconda Package Management System

    Indexing and the Slice Operator

    Exercise 1.01: Examining Anaconda and Getting Familiar with Python

    Different Types of Data Science Problems

    Loading the Case Study Data with Jupyter and pandas

    Exercise 1.02: Loading the Case Study Data in a Jupyter Notebook

    Getting Familiar with Data and Performing Data Cleaning

    The Business Problem

    Data Exploration Steps

    Exercise 1.03: Verifying Basic Data Integrity

    Boolean Masks

    Exercise 1.04: Continuing Verification of Data Integrity

    Exercise 1.05: Exploring and Cleaning the Data

    Data Quality Assurance and Exploration

    Exercise 1.06: Exploring the Credit Limit and Demographic Features

    Deep Dive: Categorical Features

    Exercise 1.07: Implementing OHE for a Categorical Feature

    Exploring the Financial History Features in the Dataset

    Activity 1.01: Exploring the Remaining Financial Features in the Dataset

    Summary

    2. Introduction to Scikit-Learn and Model Evaluation

    Introduction

    Exploring the Response Variable and Concluding the Initial Exploration

    Introduction to Scikit-Learn

    Generating Synthetic Data

    Data for Linear Regression

    Exercise 2.01: Linear Regression in Scikit-Learn

    Model Performance Metrics for Binary Classification

    Splitting the Data: Training and Test Sets

    Classification Accuracy

    True Positive Rate, False Positive Rate, and Confusion Matrix

    Exercise 2.02: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python

    Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?

    Exercise 2.03: Obtaining Predicted Probabilities from a Trained Logistic Regression Model

    The Receiver Operating Characteristic (ROC) Curve

    Precision

    Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve

    Summary

    3. Details of Logistic Regression and Feature Exploration

    Introduction

    Examining the Relationships Between Features and the Response Variable

    Pearson Correlation

    Mathematics of Linear Correlation

    F-test

    Exercise 3.01: F-test and Univariate Feature Selection

    Finer Points of the F-test: Equivalence to the t-test for Two Classes and Cautions

    Hypotheses and Next Steps

    Exercise 3.02: Visualizing the Relationship Between the Features and Response Variable

    Univariate Feature Selection: What it Does and Doesn't Do

    Understanding Logistic Regression and the Sigmoid Function Using Function Syntax in Python

    Exercise 3.03: Plotting the Sigmoid Function

    Scope of Functions

    Why Is Logistic Regression Considered a Linear Model?

    Exercise 3.04: Examining the Appropriateness of Features for Logistic Regression

    From Logistic Regression Coefficients to Predictions Using Sigmoid

    Exercise 3.05: Linear Decision Boundary of Logistic Regression

    Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients

    Summary

    4. The Bias-Variance Trade-Off

    Introduction

    Estimating the Coefficients and Intercepts of Logistic Regression

    Gradient Descent to Find Optimal Parameter Values

    Exercise 4.01: Using Gradient Descent to Minimize a Cost Function

    Assumptions of Logistic Regression

    The Motivation for Regularization: The Bias-Variance Trade-Off

    Exercise 4.02: Generating and Modeling Synthetic Classification Data

    Lasso (L1) and Ridge (L2) Regularization

    Cross-Validation: Choosing the Regularization Parameter

    Exercise 4.03: Reducing Overfitting on the Synthetic Data Classification Problem

    Options for Logistic Regression in Scikit-Learn

    Scaling Data, Pipelines, and Interaction Features in Scikit-Learn

    Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data

    Summary

    5. Decision Trees and Random Forests

    Introduction

    Decision Trees

    The Terminology of Decision Trees and Connections to Machine Learning

    Exercise 5.01: A Decision Tree in Scikit-Learn

    Training Decision Trees: Node Impurity

    Features Used for the First Splits: Connections to Univariate Feature Selection and Interactions

    Training Decision Trees: A Greedy Algorithm

    Training Decision Trees: Different Stopping Criteria and Other Options

    Using Decision Trees: Advantages and Predicted Probabilities

    A More Convenient Approach to Cross-Validation

    Exercise 5.02: Finding Optimal Hyperparameters for a Decision Tree

    Random Forests: Ensembles of Decision Trees

    Random Forest: Predictions and Interpretability

    Exercise 5.03: Fitting a Random Forest

    Checkerboard Graph

    Activity 5.01: Cross-Validation Grid Search with Random Forest

    Summary

    6. Gradient Boosting, XGBoost, and SHAP Values

    Introduction

    Gradient Boosting and XGBoost

    What Is Boosting?

    Gradient Boosting and XGBoost

    XGBoost Hyperparameters

    Early Stopping

    Tuning the Learning Rate

    Other Important Hyperparameters in XGBoost

    Exercise 6.01: Randomized Grid Search for Tuning XGBoost Hyperparameters

    Another Way of Growing Trees: XGBoost's grow_policy

    Explaining Model Predictions with SHAP Values

    Exercise 6.02: Plotting SHAP Interactions, Feature Importance, and Reconstructing Predicted Probabilities from SHAP Values

    Missing Data

    Saving Python Variables to a File

    Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP

    Summary

    7. Test Set Analysis, Financial Insights, and Delivery to the Client

    Introduction

    Review of Modeling Results

    Feature Engineering

    Ensembling Multiple Models

    Different Modeling Techniques

    Balancing Classes

    Model Performance on the Test Set

    Distribution of Predicted Probability and Decile Chart

    Exercise 7.01: Equal-Interval Chart

    Calibration of Predicted Probabilities

    Financial Analysis

    Financial Conversation with the Client

    Exercise 7.02: Characterizing Costs and Savings

    Activity 7.01: Deriving Financial Insights

    Final Thoughts on Delivering a Predictive Model to the Client

    Model Monitoring

    Ethics in Predictive Modeling

    Summary

    Appendix

    Preface

    About the Book

    If data is the new oil, then machine learning is the drill. As companies gain access to ever-increasing quantities of raw data, the ability to deliver state-of-the-art predictive models that support business decision-making becomes more and more valuable.

    In this book, you'll work on an end-to-end project based around a realistic data set and split up into bite-sized practical exercises. This creates a case-study approach that simulates the working conditions you'll experience in real-world data science projects.

    You'll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of data exploration and data processing, before moving on to fitting, evaluating, and tuning algorithms such as regularized logistic regression and random forest.

    Now in its second edition, this book will take you through the end-to-end process of exploring data and delivering machine learning models. Updated for 2021, this edition includes brand new content on XGBoost, SHAP values, algorithmic fairness, and the ethical concerns of deploying a model in the real world.

    By the end of this data science book, you'll have the skills, understanding, and confidence to build your own machine learning models and gain insights from real data.

    About the Author

    Stephen Klosterman is a Machine Learning Data Scientist with a background in math, environmental science, and ecology. His education includes a Ph.D. in Biology from Harvard University, where he was an assistant teacher of the Data Science course. His professional experience includes work in the environmental, health care, and financial sectors. At work, he likes to research and develop machine learning solutions that create value, and that stakeholders understand. In his spare time, he enjoys running, biking, paddleboarding, and music.

    Objectives

    Load, explore, and process data using the pandas Python package

    Use Matplotlib to create effective data visualizations

    Implement predictive machine learning models with scikit-learn and XGBoost

    Use lasso and ridge regression to reduce model overfitting

    Build ensemble models of decision trees, using random forest and gradient boosting

    Evaluate model performance and interpret model predictions

    Deliver valuable insights by making clear business recommendations

    Audience

    Data Science Projects with Python – Second Edition is for anyone who wants to get started with data science and machine learning. If you're keen to advance your career by using data analysis and predictive modeling to generate business insights, then this book is the perfect place to begin. To quickly grasp the concepts covered, it is recommended that you have basic experience with programming in Python or another similar language (R, MATLAB, C, etc.). Additionally, basic knowledge of statistics, including topics such as probability and linear regression, would be useful, as would a willingness to learn about these topics on your own while reading this book.

    Approach

    Data Science Projects with Python takes a practical case study approach to learning, teaching concepts in the context of a real-world dataset. Clear explanations will deepen your knowledge, while engaging exercises and challenging activities will reinforce it with hands-on practice.

    About the Chapters

    Chapter 1, Data Exploration and Cleaning, gets you started with Python and Jupyter notebooks. The chapter then explores the case study dataset and delves into exploratory data analysis, quality assurance, and data cleaning using pandas.

    Chapter 2, Introduction to Scikit-Learn and Model Evaluation, introduces you to the evaluation metrics for binary classification models. You'll learn how to build and evaluate binary classification models using scikit-learn.

    Chapter 3, Details of Logistic Regression and Feature Exploration, dives deep into logistic regression and feature exploration. You'll learn how to generate correlation plots of many features and a response variable and interpret logistic regression as a linear model.

    Chapter 4, The Bias-Variance Trade-Off, explores the foundational machine learning concepts of overfitting, underfitting, and the bias-variance trade-off by examining how the logistic regression model can be extended to address the overfitting problem.

    Chapter 5, Decision Trees and Random Forests, introduces you to tree-based machine learning models. You'll learn how to train decision trees for machine learning purposes, visualize trained decision trees, and train random forests and visualize the results.

    Chapter 6, Gradient Boosting, XGBoost, and SHAP Values, introduces you to two key concepts: gradient boosting and Shapley additive explanations (SHAP). You'll learn to train XGBoost models and understand how SHAP values can be used to provide individualized explanations for model predictions from any dataset.

    Chapter 7, Test Set Analysis, Financial Insights, and Delivery to the Client, presents several techniques for analyzing a model test set for deriving insights into likely model performance in the future. The chapter also describes key elements to consider when delivering and deploying a model, such as the format of delivery and ways to monitor the model as it is being used.

    Hardware Requirements

    For the optimal student experience, we recommend the following hardware configuration:

    Processor: Intel Core i5 or equivalent

    Memory: 4 GB RAM

    Storage: 35 GB available space

    Software Requirements

    You'll also need the following software installed in advance:

    OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit or Windows 10 64-bit, Ubuntu Linux, or the latest version of OS X

    Browser: Google Chrome/Mozilla Firefox Latest Version

    Notepad++/Sublime Text as IDE (this is optional, as you can practice everything using the Jupyter Notebook on your browser)

    Python 3.8+ (this book uses Python 3.8.2), installed either from https://python.org or via Anaconda as recommended below. At the time of writing, the SHAP library used in Chapter 6, Gradient Boosting, XGBoost, and SHAP Values, is not compatible with Python 3.9. Hence, if you are using Python 3.9 as your base environment, we suggest that you set up a Python 3.8 environment as described in the next section.

    Python libraries as needed (Jupyter, NumPy, Pandas, Matplotlib, and so on, installed via Anaconda as recommended below)

    Installation and Setup

    Before you start this book, it is recommended to install the Anaconda package manager and use it to coordinate installation of Python and its packages.

    Code Bundle

    Please find the code bundle for this book, hosted on GitHub at https://github.com/PacktPublishing/Data-Science-Projects-with-Python-Second-Ed.

    Anaconda and Setting up Your Environment

    You can install Anaconda by visiting the following link: https://www.anaconda.com/products/individual. Scroll down to the bottom of the page and download the installer relevant to your system.

    It is recommended to create an environment in Anaconda to do the exercises and activities in this book, which have been tested against the software versions indicated here. Once you have Anaconda installed, open a Terminal, if you're using macOS or Linux, or a Command Prompt window in Windows, and do the following:

    Create an environment with most required packages. You can call it whatever you want; here it's called dspwp2. Copy and paste, or type the entire statement here on one line in the terminal:

    conda create -n dspwp2 python=3.8.2 jupyter=1.0.0 pandas=1.2.1 scikit-learn=0.23.2 numpy=1.19.2 matplotlib=3.3.2 seaborn=0.11.1 python-graphviz=0.15 xlrd=2.0.1

    Type 'y' and press [Enter] when prompted.

    Activate the environment:

    conda activate dspwp2

    Install the remaining packages:

    conda install -c conda-forge xgboost=1.3.0 shap=0.37.0

    Type 'y' and [Enter] when prompted.

    You are ready to use the environment. To deactivate it when finished:

    conda deactivate

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions

    Code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: By typing conda list at the command line, you can see all the packages installed in your environment.

    A block of code is set as follows:

    import numpy as np #numerical computation

    import pandas as pd #data wrangling

    import matplotlib.pyplot as plt #plotting package

    #Next line helps with rendering plots

    %matplotlib inline

    import matplotlib as mpl #add'l plotting functionality

    mpl.rcParams['figure.dpi'] = 400 #high res figures

    import graphviz #to visualize decision trees

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Create a new Python 3 notebook from the New menu as shown.

    Code Presentation

    Lines of code that span multiple lines are split using a backslash ( \ ). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.

    For example:

    my_new_lr = LogisticRegression(penalty='l2', dual=False,\

                                   tol=0.0001, C=1.0,\

                                   fit_intercept=True,\

                                   intercept_scaling=1,\

                                   class_weight=None,\

                                   random_state=None,\

                                   solver='lbfgs',\

                                   max_iter=100,\

                                   multi_class='auto',\

                                   verbose=0, warm_start=False,\

                                   n_jobs=None, l1_ratio=None)

    Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:

    import pandas as pd

    import matplotlib.pyplot as plt #import plotting package

    #render plotting automatically

    %matplotlib inline

    Get in Touch

    Feedback from our readers is always welcome.

    General feedback: If you have any questions about this book, please mention the book title in the subject of your message and email us at [email protected].

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you could report this to us. Please visit www.packtpub.com/support/errata and complete the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you could provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Please Leave a Review

    Let us know what you think by leaving a detailed, impartial review on Amazon. We appreciate all feedback – it helps us continue to make great products and help aspiring developers build their skills. Please spare a few minutes to give your thoughts – it makes a big difference to us. You can leave a review by clicking the following link: https://packt.link/r/1800564481.

    1. Data Exploration and Cleaning

    Overview

    In this chapter, you will take your first steps with Python and Jupyter notebooks, some of the most common tools data scientists use. You'll then take the first look at the dataset for the case study project that will form the core of this book. You will begin to develop an intuition for quality assurance checks that data needs to be put through before model building. By the end of the chapter, you will be able to use pandas, the top package for wrangling tabular data in Python, to do exploratory data analysis, quality assurance, and data cleaning.

    Introduction

    Most businesses possess a wealth of data on their operations and customers. Reporting on this data in the form of descriptive charts, graphs, and tables is a good way to understand the current state of the business. However, in order to provide quantitative guidance on future business strategies and operations, it is necessary to go a step further. This is where the practices of machine learning and predictive modeling are needed. In this book, we will show how to go from descriptive analyses to concrete guidance for future operations, using predictive models.

    To accomplish this goal, we'll introduce some of the most widely used machine learning tools via Python and many of its packages. You will also get a sense of the practical skills necessary to execute successful projects: inquisitiveness when examining data and communication with the client. Time spent looking in detail at a dataset and critically examining whether it accurately meets its intended purpose is time well spent. You will learn several techniques for assessing data quality here.

    In this chapter, after getting familiar with the basic tools for data exploration, we will discuss a few typical working scenarios for how you may receive data. Then, we will begin a thorough exploration of the case study dataset and help you learn how you can uncover possible issues, so that when you are ready for modeling, you may proceed with confidence.

    Python and the Anaconda Package Management System

    In this book, we will use the Python programming language. Python is a top language for data science and is one of the fastest-growing programming languages. A commonly cited reason for Python's popularity is that it is easy to learn. If you have Python experience, that's great; however, if you have experience with other languages, such as C, Matlab, or R, you shouldn't have much trouble using Python. You should be familiar with the general constructs of computer programming to get the most out of this book. Examples of such constructs are for loops and if statements that guide the control flow of a program. No matter what language you have used, you are likely familiar with these constructs, which you will also find in Python.

    A key feature of Python that is different from some other languages is that it is zero-indexed; in other words, the first element of an ordered collection has an index of 0. Python also supports negative indexing, where the index -1 refers to the last element of an ordered collection and negative indices count backward from the end. The slice operator, :, can be used to select multiple elements of an ordered collection from within a range, starting from the beginning, or going to the end of the collection.
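    These conventions can be sketched in a few lines (the list and variable names here are illustrative, not from the book):

```python
# A short list of strings to demonstrate indexing conventions
colors = ['red', 'green', 'blue', 'yellow']

print(colors[0])    # first element (zero-indexed): 'red'
print(colors[-1])   # negative indexing counts from the end: 'yellow'
print(colors[1:3])  # slice from index 1 up to, but not including, 3
print(colors[:2])   # from the beginning up to (not including) index 2
print(colors[2:])   # from index 2 through the end
```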

    Indexing and the Slice Operator

    Here, we demonstrate how indexing and the slice operator work. To have something to index, we will create a list, which is a mutable ordered collection that can contain any type of data, including numerical and string types. Mutable just means the elements of the list can be changed after they are first assigned. To create the numbers for our list, which will be consecutive integers, we'll use the built-in range() Python function. The range() function technically creates an iterator that we'll convert to a list using the list() function, although you need not be concerned with that detail here. The following screenshot shows a list of the first five positive integers being printed on the console, as well as a few indexing operations, and changing the first item of the list to a new value of a different data type:

    Figure 1.1: List creation and indexing

    A few things to notice about Figure 1.1: the endpoint of an interval is open for both slice indexing and the range() function, while the starting point is closed. In other words, notice how when we specify the start and end of range(), endpoint 6 is not included in the result but starting point 1 is. Similarly, when indexing the list with the slice [:3], this includes all elements of the list with indices up to, but not including, 3.
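    Since Figure 1.1 is a screenshot, here is a sketch of the operations it depicts, reconstructed from the surrounding description (the exact replacement value is an assumption):

```python
# First five positive integers: range(1, 6) includes 1 but not 6
example_list = list(range(1, 6))
print(example_list)      # [1, 2, 3, 4, 5]

# The slice [:3] includes indices up to, but not including, 3
print(example_list[:3])  # [1, 2, 3]

# Lists are mutable: replace the first item with a different data type
example_list[0] = 'one'  # replacement value chosen for illustration
print(example_list)      # ['one', 2, 3, 4, 5]
```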

    We've referred to ordered collections, but Python also includes unordered collections. An important one of these is called a dictionary. A dictionary is an unordered collection of key:value pairs. Instead of looking up the values of a dictionary by integer indices, you look them up by keys, which could be numbers or strings. A dictionary can be created using curly braces {} with the key:value pairs separated by commas. The following screenshot shows an example of creating a dictionary with counts of fruit, examining the number of apples, and then adding a new type of fruit and its count:

    Figure 1.2: An example dictionary
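    The dictionary pictured in Figure 1.2 can be sketched as follows, using the fruit counts that appear later in Exercise 1.01 (the new fruit added at the end is an illustrative assumption):

```python
# Dictionary of fruit counts, with string keys and integer values
example_dict = {'apples': 5, 'oranges': 8, 'bananas': 13}

# Look up a value by its key rather than by an integer index
print(example_dict['apples'])

# Add a new key:value pair for a new type of fruit (hypothetical entry)
example_dict['pears'] = 4
print(example_dict)
```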

    There are many other distinctive features of Python and we just want to give you a flavor here, without getting into too much detail. In fact, you will probably use packages such as pandas (pandas) and NumPy (numpy) for most of your data handling in Python. NumPy provides fast numerical computation on arrays and matrices, while pandas provides a wealth of data wrangling and exploration capabilities on tables of data called DataFrames. However, it's good to be familiar with some of the basics of Python—the language that sits at the foundation of all of this. For example, indexing works the same in NumPy and pandas as it does in Python.
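    As a brief illustration of this point, the same zero-indexing, negative indexing, and slice syntax carry over directly to a NumPy array (the array values here are arbitrary):

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Indexing and slicing behave just as they do for Python lists
print(arr[0])    # first element: 10
print(arr[-1])   # last element: 50
print(arr[1:4])  # elements at indices 1, 2, and 3
```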

    One of the strengths of Python is that it is open source and has an active community of developers creating amazing tools. We will use several of these tools in this book. A potential pitfall of having open source packages from different contributors is the dependencies between various packages. For example, if you want to install pandas, it may rely on a certain version of NumPy, which you may or may not have installed. Package management systems make life easier in this respect. When you install a new package through the package management system, it will ensure that all the dependencies are met. If they aren't, you will be prompted to upgrade or install new packages as necessary.

    For this book, we will use the Anaconda package management system, which you should already have installed. While we will only use Python here, it is also possible to run R with Anaconda.

    Note: Environments

    It is recommended to create a new Python 3.x environment for this book. Environments are like separate installations of Python, where the set of packages you have installed can be different, as well as the version of Python. Environments are useful for developing projects that need to be deployed in different versions of Python, possibly with different dependencies. For general information on this, see https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html. See the Preface for specific instructions on setting up an Anaconda environment for this book before you begin the upcoming exercises.

    Exercise 1.01: Examining Anaconda and Getting Familiar with Python

    In this exercise, you will examine the packages in your Anaconda installation and practice with some basic Python control flow and data structures, including a for loop, dict, and list. This will confirm that you have completed the installation steps in the preface and show you how Python syntax and data structures may be a little different from other programming languages you may be familiar with. Perform the following steps to complete the exercise:

    Note

    Before executing the exercises and the activity in this chapter, please make sure you have followed the instructions regarding setting up your Python environment as mentioned in the Preface. The code file for this exercise can be found here: https://packt.link/N0RPT.

    Open up Terminal, if you're using macOS or Linux, or a Command Prompt window in Windows. If you're using an environment, activate it by typing conda activate followed by the environment name. Then type conda list at the command line. You should observe an output similar to the following:

    Figure 1.3: Selection of packages from conda list

    You can see all the packages installed in your environment, including the packages we will directly interact with, as well as their dependencies which are needed for them to function. Managing dependencies among packages is one of the main advantages of a package management system.

    Note

    For more information about Anaconda and command-line interaction, check out this cheat sheet: https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf.

    Type python in Terminal to open a command-line Python interpreter. You should obtain an output similar to the following:

    Figure 1.4: Command-line Python


    You should see some information about your version of Python, as well as the Python Command Prompt (>>>). When you type after this prompt, you are writing Python code.

    Note

    Although we will be using the Jupyter notebook in this book, one of the aims of this exercise is to go through the basic steps of writing and running Python programs on the Command Prompt.

    Write a for loop at the Command Prompt to print the values 0 to 4 using the following code (the three dots at the beginning of the second and third lines appear automatically when you write code in the command-line Python interpreter; if you're instead writing in a Jupyter notebook, they won't appear):

    for counter in range(5):

    ...    print(counter)

    ...

    Once you hit Enter when you see ... on the prompt, you should obtain this output:

    Figure 1.5: Output of a for loop at the command line


    Notice that in Python, the opening of the for loop is followed by a colon, and the body of the loop requires indentation. It's typical to use four spaces to indent a code block. Here, the for loop prints the values produced by the range() iterator, accessing each one in turn through the counter variable via the in keyword.
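    As an aside you can try in the same interpreter session (this goes slightly beyond the exercise), range() also accepts optional start and step arguments in addition to the stop value shown above:

    ```python
    # range(stop) counts from 0 up to, but not including, stop
    print(list(range(5)))         # [0, 1, 2, 3, 4]

    # range(start, stop, step) lets you control where counting
    # begins and how far apart successive values are
    print(list(range(2, 10, 2)))  # [2, 4, 6, 8]
    ```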

    Note

    For many more details on Python code conventions, refer to the following: https://www.python.org/dev/peps/pep-0008/.

    Now, we will return to our dictionary example. The first step here is to create the dictionary.

    Create a dictionary of fruits (apples, oranges, and bananas) using the following code:

    example_dict = {'apples':5, 'oranges':8, 'bananas':13}

    Convert the dictionary to a list using the list() function, as shown in the following snippet:

    dict_to_list = list(example_dict)

    dict_to_list

    Once you run the preceding code, you should obtain the following output:

    ['apples', 'oranges', 'bananas']

    Notice that when this is done and we examine the contents, only the keys of the dictionary have been captured in the list. If we wanted the values, we would have had to use the dictionary's .values() method, as in list(example_dict.values()). Also, notice that the list of dictionary keys appears in the same order in which we wrote them when creating the dictionary. In Python 3.7 and later, dictionaries are guaranteed to preserve insertion order; in earlier versions, they were unordered collection types and this was not guaranteed.
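    The distinction between keys and values can be seen directly in the interpreter (a quick sketch, not part of the exercise steps):

    ```python
    example_dict = {'apples': 5, 'oranges': 8, 'bananas': 13}

    # list() on a dictionary captures only its keys
    print(list(example_dict))           # ['apples', 'oranges', 'bananas']

    # the .values() method gives the values instead
    print(list(example_dict.values()))  # [5, 8, 13]

    # and .items() gives (key, value) pairs
    print(list(example_dict.items()))
    ```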

    One convenient thing you can do with lists is to concatenate them with the + operator. As an example, in the next step, we will combine the existing list of fruits with a list that contains just one more type of fruit, overwriting the variable that held the original list.

    Use the + operator to combine the existing list of fruits with a new list containing only one fruit (pears):

    dict_to_list = dict_to_list + ['pears']

    dict_to_list

    Your output will be as follows:

    ['apples', 'oranges', 'bananas', 'pears']
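    Note that appending a single item could also be done with the list's .append() method, which modifies the list in place rather than building a new one (an equivalent alternative, not used in this exercise):

    ```python
    dict_to_list = ['apples', 'oranges', 'bananas']

    # .append() adds one element to the end of the existing list
    dict_to_list.append('pears')
    print(dict_to_list)  # ['apples', 'oranges', 'bananas', 'pears']
    ```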

    What if we wanted to sort our list of fruit types?

    Python provides a built-in sorted() function that can be used for this; it will return a sorted version of the input. In our case, this means the list of fruit types will be sorted alphabetically.

    Sort the list of fruits in alphabetical order using the sorted() function, as shown in the following snippet:

    sorted(dict_to_list)

    Once you run the preceding code, you should see the following output:

    ['apples', 'bananas', 'oranges', 'pears']
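    Note that sorted() returns a new list and leaves the original unchanged; it also accepts a reverse argument, and lists have an in-place .sort() method. A quick sketch of these variations (beyond what the exercise requires):

    ```python
    fruits = ['apples', 'oranges', 'bananas', 'pears']

    # sorted() returns a new sorted list; fruits itself is unchanged
    print(sorted(fruits))                # ['apples', 'bananas', 'oranges', 'pears']

    # reverse=True sorts in descending order
    print(sorted(fruits, reverse=True))  # ['pears', 'oranges', 'bananas', 'apples']

    # .sort() sorts the list in place and returns None
    fruits.sort()
    print(fruits)                        # ['apples', 'bananas', 'oranges', 'pears']
    ```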

    That's enough Python for now. We will show you how to execute the code for this book, so your Python knowledge should improve along the way. While you have the Python interpreter open, you may wish to run the code examples shown in Figures 1.1 and 1.2. When you're done with the interpreter, you can type quit() to exit.

    Note

    As you learn more and inevitably want to try new things, consult the official Python documentation: https://docs.python.org/3/.

    Different Types of Data Science Problems

    Much of your time as a data scientist is likely to be spent wrangling data: figuring out how to get it, getting it, examining it, making sure it's correct and complete, and joining it with other types of data. pandas is a widely used tool for data analysis in Python, and it can facilitate the data exploration process for you, as we will see in this chapter. However, one of the key goals of this book is to start you on your journey to becoming a machine learning data scientist, for which you will need to master the art and science of predictive modeling. This means using a mathematical model, or idealized mathematical formulation, to learn relationships within the data, in the hope of making accurate and useful predictions when new data comes in.

    For predictive modeling use cases, data is typically organized in a tabular structure, with features and a response variable. For example, if you want to predict the price of a house based on some characteristics about it, such as area and number of bedrooms, these attributes would be considered the features and the price of the house would be the response variable. The response variable
