Ebook: 691 pages (5 hours)


About this ebook

If data is the new oil, then machine learning is the drill. As companies gain access to ever-increasing quantities of raw data, the ability to deliver state-of-the-art predictive models that support business decision-making becomes more and more valuable.

In this book, you’ll work on an end-to-end project based around a realistic data set and split up into bite-sized practical exercises. This creates a case-study approach that simulates the working conditions you’ll experience in real-world data science projects.

You’ll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of data exploration and data processing, before moving on to fitting, evaluating, and tuning algorithms such as regularized logistic regression and random forest.

Now in its second edition, this book will take you through the end-to-end process of exploring data and delivering machine learning models. Updated for 2021, this edition includes brand new content on XGBoost, SHAP values, algorithmic fairness, and the ethical concerns of deploying a model in the real world.

By the end of this data science book, you’ll have the skills, understanding, and confidence to build your own machine learning models and gain insights from real data.

Language: English
Release date: Jul 29, 2021
ISBN: 9781800569447


    Data Science Projects with Python

    second edition

    Copyright © 2021 Packt Publishing

    All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Author: Stephen Klosterman

    Reviewers: Ashish Jain and Deepti Miyan Gupta

    Managing Editor: Mahesh Dhyani

    Acquisitions Editors: Sneha Shinde and Anindya Sil

    Production Editor: Shantanu Zagade

    Editorial Board: Megan Carlisle, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Abhishek Rane, Brendan Rodrigues, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

    First published: April 2019

    Second edition: July 2021

    Production reference: 1280721

    ISBN: 978-1-80056-448-0

    Published by Packt Publishing Ltd.

    Livery Place, 35 Livery Street

    Birmingham B3 2PB, UK

    Table of Contents

    Preface

    1. Data Exploration and Cleaning

    Introduction

    Python and the Anaconda Package Management System

    Indexing and the Slice Operator

    Exercise 1.01: Examining Anaconda and Getting Familiar with Python

    Different Types of Data Science Problems

    Loading the Case Study Data with Jupyter and pandas

    Exercise 1.02: Loading the Case Study Data in a Jupyter Notebook

    Getting Familiar with Data and Performing Data Cleaning

    The Business Problem

    Data Exploration Steps

    Exercise 1.03: Verifying Basic Data Integrity

    Boolean Masks

    Exercise 1.04: Continuing Verification of Data Integrity

    Exercise 1.05: Exploring and Cleaning the Data

    Data Quality Assurance and Exploration

    Exercise 1.06: Exploring the Credit Limit and Demographic Features

    Deep Dive: Categorical Features

    Exercise 1.07: Implementing OHE for a Categorical Feature

    Exploring the Financial History Features in the Dataset

    Activity 1.01: Exploring the Remaining Financial Features in the Dataset

    Summary

    2. Introduction to Scikit-Learn and Model Evaluation

    Introduction

    Exploring the Response Variable and Concluding the Initial Exploration

    Introduction to Scikit-Learn

    Generating Synthetic Data

    Data for Linear Regression

    Exercise 2.01: Linear Regression in Scikit-Learn

    Model Performance Metrics for Binary Classification

    Splitting the Data: Training and Test Sets

    Classification Accuracy

    True Positive Rate, False Positive Rate, and Confusion Matrix

    Exercise 2.02: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python

    Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?

    Exercise 2.03: Obtaining Predicted Probabilities from a Trained Logistic Regression Model

    The Receiver Operating Characteristic (ROC) Curve

    Precision

    Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve

    Summary

    3. Details of Logistic Regression and Feature Exploration

    Introduction

    Examining the Relationships Between Features and the Response Variable

    Pearson Correlation

    Mathematics of Linear Correlation

    F-test

    Exercise 3.01: F-test and Univariate Feature Selection

    Finer Points of the F-test: Equivalence to the t-test for Two Classes and Cautions

    Hypotheses and Next Steps

    Exercise 3.02: Visualizing the Relationship Between the Features and Response Variable

    Univariate Feature Selection: What it Does and Doesn't Do

    Understanding Logistic Regression and the Sigmoid Function Using Function Syntax in Python

    Exercise 3.03: Plotting the Sigmoid Function

    Scope of Functions

    Why Is Logistic Regression Considered a Linear Model?

    Exercise 3.04: Examining the Appropriateness of Features for Logistic Regression

    From Logistic Regression Coefficients to Predictions Using Sigmoid

    Exercise 3.05: Linear Decision Boundary of Logistic Regression

    Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients

    Summary

    4. The Bias-Variance Trade-Off

    Introduction

    Estimating the Coefficients and Intercepts of Logistic Regression

    Gradient Descent to Find Optimal Parameter Values

    Exercise 4.01: Using Gradient Descent to Minimize a Cost Function

    Assumptions of Logistic Regression

    The Motivation for Regularization: The Bias-Variance Trade-Off

    Exercise 4.02: Generating and Modeling Synthetic Classification Data

    Lasso (L1) and Ridge (L2) Regularization

    Cross-Validation: Choosing the Regularization Parameter

    Exercise 4.03: Reducing Overfitting on the Synthetic Data Classification Problem

    Options for Logistic Regression in Scikit-Learn

    Scaling Data, Pipelines, and Interaction Features in Scikit-Learn

    Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data

    Summary

    5. Decision Trees and Random Forests

    Introduction

    Decision Trees

    The Terminology of Decision Trees and Connections to Machine Learning

    Exercise 5.01: A Decision Tree in Scikit-Learn

    Training Decision Trees: Node Impurity

    Features Used for the First Splits: Connections to Univariate Feature Selection and Interactions

    Training Decision Trees: A Greedy Algorithm

    Training Decision Trees: Different Stopping Criteria and Other Options

    Using Decision Trees: Advantages and Predicted Probabilities

    A More Convenient Approach to Cross-Validation

    Exercise 5.02: Finding Optimal Hyperparameters for a Decision Tree

    Random Forests: Ensembles of Decision Trees

    Random Forest: Predictions and Interpretability

    Exercise 5.03: Fitting a Random Forest

    Checkerboard Graph

    Activity 5.01: Cross-Validation Grid Search with Random Forest

    Summary

    6. Gradient Boosting, XGBoost, and SHAP Values

    Introduction

    Gradient Boosting and XGBoost

    What Is Boosting?

    Gradient Boosting and XGBoost

    XGBoost Hyperparameters

    Early Stopping

    Tuning the Learning Rate

    Other Important Hyperparameters in XGBoost

    Exercise 6.01: Randomized Grid Search for Tuning XGBoost Hyperparameters

    Another Way of Growing Trees: XGBoost's grow_policy

    Explaining Model Predictions with SHAP Values

    Exercise 6.02: Plotting SHAP Interactions, Feature Importance, and Reconstructing Predicted Probabilities from SHAP Values

    Missing Data

    Saving Python Variables to a File

    Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP

    Summary

    7. Test Set Analysis, Financial Insights, and Delivery to the Client

    Introduction

    Review of Modeling Results

    Feature Engineering

    Ensembling Multiple Models

    Different Modeling Techniques

    Balancing Classes

    Model Performance on the Test Set

    Distribution of Predicted Probability and Decile Chart

    Exercise 7.01: Equal-Interval Chart

    Calibration of Predicted Probabilities

    Financial Analysis

    Financial Conversation with the Client

    Exercise 7.02: Characterizing Costs and Savings

    Activity 7.01: Deriving Financial Insights

    Final Thoughts on Delivering a Predictive Model to the Client

    Model Monitoring

    Ethics in Predictive Modeling

    Summary

    Appendix

    Preface

    About the Book

    If data is the new oil, then machine learning is the drill. As companies gain access to ever-increasing quantities of raw data, the ability to deliver state-of-the-art predictive models that support business decision-making becomes more and more valuable.

    In this book, you'll work on an end-to-end project based around a realistic data set and split up into bite-sized practical exercises. This creates a case-study approach that simulates the working conditions you'll experience in real-world data science projects.

    You'll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of data exploration and data processing, before moving on to fitting, evaluating, and tuning algorithms such as regularized logistic regression and random forest.

    Now in its second edition, this book will take you through the end-to-end process of exploring data and delivering machine learning models. Updated for 2021, this edition includes brand new content on XGBoost, SHAP values, algorithmic fairness, and the ethical concerns of deploying a model in the real world.

    By the end of this data science book, you'll have the skills, understanding, and confidence to build your own machine learning models and gain insights from real data.

    About the Author

    Stephen Klosterman is a Machine Learning Data Scientist with a background in math, environmental science, and ecology. His education includes a Ph.D. in Biology from Harvard University, where he was an assistant teacher of the Data Science course. His professional experience includes work in the environmental, health care, and financial sectors. At work, he likes to research and develop machine learning solutions that create value, and that stakeholders understand. In his spare time, he enjoys running, biking, paddleboarding, and music.

    Objectives

    Load, explore, and process data using the pandas Python package

    Use Matplotlib to create effective data visualizations

    Implement predictive machine learning models with scikit-learn and XGBoost

    Use lasso and ridge regression to reduce model overfitting

    Build ensemble models of decision trees, using random forest and gradient boosting

    Evaluate model performance and interpret model predictions

    Deliver valuable insights by making clear business recommendations

    Audience

    Data Science Projects with Python – Second Edition is for anyone who wants to get started with data science and machine learning. If you're keen to advance your career by using data analysis and predictive modeling to generate business insights, then this book is the perfect place to begin. To quickly grasp the concepts covered, it is recommended that you have basic experience with programming in Python or another similar language (R, MATLAB, C, etc.). Additionally, basic knowledge of statistics, including topics such as probability and linear regression, would be useful, as would a willingness to learn about these topics on your own while reading this book.

    Approach

    Data Science Projects with Python takes a practical case study approach to learning, teaching concepts in the context of a real-world dataset. Clear explanations will deepen your knowledge, while engaging exercises and challenging activities will reinforce it with hands-on practice.

    About the Chapters

    Chapter 1, Data Exploration and Cleaning, gets you started with Python and Jupyter notebooks. The chapter then explores the case study dataset and delves into exploratory data analysis, quality assurance, and data cleaning using pandas.

    Chapter 2, Introduction to Scikit-Learn and Model Evaluation, introduces you to the evaluation metrics for binary classification models. You'll learn how to build and evaluate binary classification models using scikit-learn.

    Chapter 3, Details of Logistic Regression and Feature Exploration, dives deep into logistic regression and feature exploration. You'll learn how to generate correlation plots of many features and a response variable and interpret logistic regression as a linear model.

    Chapter 4, The Bias-Variance Trade-Off, explores the foundational machine learning concepts of overfitting, underfitting, and the bias-variance trade-off by examining how the logistic regression model can be extended to address the overfitting problem.

    Chapter 5, Decision Trees and Random Forests, introduces you to tree-based machine learning models. You'll learn how to train decision trees for machine learning purposes, visualize trained decision trees, and train random forests and visualize the results.

    Chapter 6, Gradient Boosting, XGBoost, and SHAP Values, introduces you to two key concepts: gradient boosting and Shapley additive explanations (SHAP). You'll learn to train XGBoost models and understand how SHAP values can be used to provide individualized explanations for model predictions from any dataset.

    Chapter 7, Test Set Analysis, Financial Insights, and Delivery to the Client, presents several techniques for analyzing a model test set for deriving insights into likely model performance in the future. The chapter also describes key elements to consider when delivering and deploying a model, such as the format of delivery and ways to monitor the model as it is being used.

    Hardware Requirements

    For the optimal student experience, we recommend the following hardware configuration:

    Processor: Intel Core i5 or equivalent

    Memory: 4 GB RAM

    Storage: 35 GB available space

    Software Requirements

    You'll also need the following software installed in advance:

    OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit or Windows 10 64-bit, Ubuntu Linux, or the latest version of OS X

    Browser: Google Chrome/Mozilla Firefox Latest Version

    Notepad++/Sublime Text as IDE (this is optional, as you can practice everything using the Jupyter Notebook on your browser)

    Python 3.8+ (this book uses Python 3.8.2), installed either from https://python.org or via Anaconda as recommended below. At the time of writing, the SHAP library used in Chapter 6, Gradient Boosting, XGBoost, and SHAP Values, is not compatible with Python 3.9. Hence, if you are using Python 3.9 as your base environment, we suggest that you set up a Python 3.8 environment as described in the next section.

    Python libraries as needed (Jupyter, NumPy, Pandas, Matplotlib, and so on, installed via Anaconda as recommended below)

    Installation and Setup

    Before you start this book, it is recommended to install the Anaconda package manager and use it to coordinate installation of Python and its packages.

    Code Bundle

    Please find the code bundle for this book, hosted on GitHub at https://github.com/PacktPublishing/Data-Science-Projects-with-Python-Second-Ed.

    Anaconda and Setting up Your Environment

    You can install Anaconda by visiting the following link: https://www.anaconda.com/products/individual. Scroll down to the bottom of the page and download the installer relevant to your system.

    It is recommended to create an environment in Anaconda to do the exercises and activities in this book, which have been tested against the software versions indicated here. Once you have Anaconda installed, open a Terminal, if you're using macOS or Linux, or a Command Prompt window in Windows, and do the following:

    Create an environment with most required packages. You can call it whatever you want; here it's called dspwp2. Copy and paste, or type the entire statement here on one line in the terminal:

    conda create -n dspwp2 python=3.8.2 jupyter=1.0.0 pandas=1.2.1 scikit-learn=0.23.2 numpy=1.19.2 matplotlib=3.3.2 seaborn=0.11.1 python-graphviz=0.15 xlrd=2.0.1

    Type 'y' and press [Enter] when prompted.

    Activate the environment:

    conda activate dspwp2

    Install the remaining packages:

    conda install -c conda-forge xgboost=1.3.0 shap=0.37.0

    Type 'y' and [Enter] when prompted.

    You are ready to use the environment. To deactivate it when finished:

    conda deactivate

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions

    Code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: By typing conda list at the command line, you can see all the packages installed in your environment.

    A block of code is set as follows:

    import numpy as np #numerical computation

    import pandas as pd #data wrangling

    import matplotlib.pyplot as plt #plotting package

    #Next line helps with rendering plots

    %matplotlib inline

    import matplotlib as mpl #add'l plotting functionality

    mpl.rcParams['figure.dpi'] = 400 #high res figures

    import graphviz #to visualize decision trees

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Create a new Python 3 notebook from the New menu as shown.

    Code Presentation

    Lines of code that span multiple lines are split using a backslash ( \ ). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.

    For example:

    my_new_lr = LogisticRegression(penalty='l2', dual=False,\

                                   tol=0.0001, C=1.0,\

                                   fit_intercept=True,\

                                   intercept_scaling=1,\

                                   class_weight=None,\

                                   random_state=None,\

                                   solver='lbfgs',\

                                   max_iter=100,\

                                   multi_class='auto',\

                                   verbose=0, warm_start=False,\

                                   n_jobs=None, l1_ratio=None)

    Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:

    import pandas as pd

    import matplotlib.pyplot as plt #import plotting package

    #render plotting automatically

    %matplotlib inline

    Get in Touch

    Feedback from our readers is always welcome.

    General feedback: If you have any questions about this book, please mention the book title in the subject of your message and email us at [email protected].

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you could report this to us. Please visit www.packtpub.com/support/errata and complete the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you could provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Please Leave a Review

    Let us know what you think by leaving a detailed, impartial review on Amazon. We appreciate all feedback – it helps us continue to make great products and help aspiring developers build their skills. Please spare a few minutes to give your thoughts – it makes a big difference to us. You can leave a review by clicking the following link: https://packt.link/r/1800564481.

    1. Data Exploration and Cleaning

    Overview

    In this chapter, you will take your first steps with Python and Jupyter notebooks, some of the most common tools data scientists use. You'll then take the first look at the dataset for the case study project that will form the core of this book. You will begin to develop an intuition for quality assurance checks that data needs to be put through before model building. By the end of the chapter, you will be able to use pandas, the top package for wrangling tabular data in Python, to do exploratory data analysis, quality assurance, and data cleaning.

    Introduction

    Most businesses possess a wealth of data on their operations and customers. Reporting on this data in the form of descriptive charts, graphs, and tables is a good way to understand the current state of the business. However, in order to provide quantitative guidance on future business strategies and operations, it is necessary to go a step further. This is where the practices of machine learning and predictive modeling are needed. In this book, we will show how to go from descriptive analyses to concrete guidance for future operations, using predictive models.

    To accomplish this goal, we'll introduce some of the most widely used machine learning tools via Python and many of its packages. You will also get a sense of the practical skills necessary to execute successful projects: inquisitiveness when examining data and communication with the client. Time spent looking in detail at a dataset and critically examining whether it accurately meets its intended purpose is time well spent. You will learn several techniques for assessing data quality here.

    In this chapter, after getting familiar with the basic tools for data exploration, we will discuss a few typical working scenarios for how you may receive data. Then, we will begin a thorough exploration of the case study dataset and help you learn how you can uncover possible issues, so that when you are ready for modeling, you may proceed with confidence.

    Python and the Anaconda Package Management System

    In this book, we will use the Python programming language. Python is a top language for data science and is one of the fastest-growing programming languages. A commonly cited reason for Python's popularity is that it is easy to learn. If you have Python experience, that's great; however, if you have experience with other languages, such as C, Matlab, or R, you shouldn't have much trouble using Python. You should be familiar with the general constructs of computer programming to get the most out of this book. Examples of such constructs are for loops and if statements that guide the control flow of a program. No matter what language you have used, you are likely familiar with these constructs, which you will also find in Python.

    A key feature of Python that is different from some other languages is that it is zero-indexed; in other words, the first element of an ordered collection has an index of 0. Python also supports negative indexing, where the index -1 refers to the last element of an ordered collection and negative indices count backward from the end. The slice operator, :, can be used to select multiple elements of an ordered collection from within a range, starting from the beginning, or going to the end of the collection.
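    These conventions can be sketched in a few lines (the list and variable names here are illustrative, not from the book):

```python
# A short list of strings to demonstrate indexing conventions
colors = ['red', 'green', 'blue', 'yellow']

print(colors[0])    # first element (zero-indexed): 'red'
print(colors[-1])   # negative indexing counts from the end: 'yellow'
print(colors[1:3])  # slice from index 1 up to, but not including, 3
print(colors[:2])   # from the beginning up to (not including) index 2
print(colors[2:])   # from index 2 through the end
```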

    Indexing and the Slice Operator

    Here, we demonstrate how indexing and the slice operator work. To have something to index, we will create a list, which is a mutable ordered collection that can contain any type of data, including numerical and string types. Mutable just means the elements of the list can be changed after they are first assigned. To create the numbers for our list, which will be consecutive integers, we'll use the built-in range() Python function. The range() function technically creates an iterator that we'll convert to a list using the list() function, although you need not be concerned with that detail here. The following screenshot shows a list of the first five positive integers being printed on the console, as well as a few indexing operations, and changing the first item of the list to a new value of a different data type:

    Figure 1.1: List creation and indexing

    A few things to notice about Figure 1.1: the endpoint of an interval is open for both slice indexing and the range() function, while the starting point is closed. In other words, notice how when we specify the start and end of range(), endpoint 6 is not included in the result but starting point 1 is. Similarly, when indexing the list with the slice [:3], this includes all elements of the list with indices up to, but not including, 3.
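    Since Figure 1.1 is a screenshot, here is a sketch of the operations it depicts, reconstructed from the surrounding description (the exact replacement value is an assumption):

```python
# First five positive integers: range(1, 6) includes 1 but not 6
example_list = list(range(1, 6))
print(example_list)      # [1, 2, 3, 4, 5]

# The slice [:3] includes indices up to, but not including, 3
print(example_list[:3])  # [1, 2, 3]

# Lists are mutable: replace the first item with a different data type
example_list[0] = 'one'  # replacement value chosen for illustration
print(example_list)      # ['one', 2, 3, 4, 5]
```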

    We've referred to ordered collections, but Python also includes unordered collections. An important one of these is called a dictionary. A dictionary is an unordered collection of key:value pairs. Instead of looking up the values of a dictionary by integer indices, you look them up by keys, which could be numbers or strings. A dictionary can be created using curly braces {} with the key:value pairs separated by commas. The following screenshot shows an example of creating a dictionary with counts of fruit, examining the number of apples, and then adding a new type of fruit and its count:

    Figure 1.2: An example dictionary
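    The dictionary pictured in Figure 1.2 can be sketched as follows, using the fruit counts that appear later in Exercise 1.01 (the new fruit added at the end is an illustrative assumption):

```python
# Dictionary of fruit counts, with string keys and integer values
example_dict = {'apples': 5, 'oranges': 8, 'bananas': 13}

# Look up a value by its key rather than by an integer index
print(example_dict['apples'])

# Add a new key:value pair for a new type of fruit (hypothetical entry)
example_dict['pears'] = 4
print(example_dict)
```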

    There are many other distinctive features of Python and we just want to give you a flavor here, without getting into too much detail. In fact, you will probably use packages such as pandas (pandas) and NumPy (numpy) for most of your data handling in Python. NumPy provides fast numerical computation on arrays and matrices, while pandas provides a wealth of data wrangling and exploration capabilities on tables of data called DataFrames. However, it's good to be familiar with some of the basics of Python—the language that sits at the foundation of all of this. For example, indexing works the same in NumPy and pandas as it does in Python.
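    As a brief illustration of this point, the same zero-indexing, negative indexing, and slice syntax carry over directly to a NumPy array (the array values here are arbitrary):

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Indexing and slicing behave just as they do for Python lists
print(arr[0])    # first element: 10
print(arr[-1])   # last element: 50
print(arr[1:4])  # elements at indices 1, 2, and 3
```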

    One of the strengths of Python is that it is open source and has an active community of developers creating amazing tools. We will use several of these tools in this book. A potential pitfall of having open source packages from different contributors is the dependencies between various packages. For example, if you want to install pandas, it may rely on a certain version of NumPy, which you may or may not have installed. Package management systems make life easier in this respect. When you install a new package through the package management system, it will ensure that all the dependencies are met. If they aren't, you will be prompted to upgrade or install new packages as necessary.

    For this book, we will use the Anaconda package management system, which you should already have installed. While we will only use Python here, it is also possible to run R with Anaconda.

    Note: Environments

    It is recommended to create a new Python 3.x environment for this book. Environments are like separate installations of Python, where the set of packages you have installed can be different, as well as the version of Python. Environments are useful for developing projects that need to be deployed in different versions of Python, possibly with different dependencies. For general information on this, see https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html. See the Preface for specific instructions on setting up an Anaconda environment for this book before you begin the upcoming exercises.

    Exercise 1.01: Examining Anaconda and Getting Familiar with Python

    In this exercise, you will examine the packages in your Anaconda installation and practice with some basic Python control flow and data structures, including a for loop, dict, and list. This will confirm that you have completed the installation steps in the preface and show you how Python syntax and data structures may be a little different from other programming languages you may be familiar with. Perform the following steps to complete the exercise:

    Note

    Before executing the exercises and the activity in this chapter, please make sure you have followed the instructions regarding setting up your Python environment as mentioned in the Preface. The code file for this exercise can be found here: https://packt.link/N0RPT.

    Open up Terminal, if you're using macOS or Linux, or a Command Prompt window in Windows. If you're using an environment, activate it by typing conda activate followed by the environment name. Then type conda list at the command line. You should observe an output similar to the following:

    Figure 1.3: Selection of packages from conda list

    You can see all the packages installed in your environment, including the packages we will directly interact with, as well as their dependencies which are needed for them to function. Managing dependencies among packages is one of the main advantages of a package management system.

    Note

    For more information about Anaconda and command-line interaction, check out this cheat sheet: https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf.

    Type python in Terminal to open a command-line Python interpreter. You should obtain an output similar to the following:

    Figure 1.4: Command-line Python


    You should see some information about your version of Python, as well as the Python Command Prompt (>>>). When you type after this prompt, you are writing Python code.

    Note

    Although we will be using the Jupyter notebook in this book, one of the aims of this exercise is to go through the basic steps of writing and running Python programs on the Command Prompt.

    Write a for loop at the Command Prompt to print the values 0 to 4 using the following code (the three dots at the beginning of the second and third lines appear automatically when you write code in the command-line Python interpreter; if you're instead writing in a Jupyter notebook, they won't appear):

    for counter in range(5):

    ...    print(counter)

    ...

    Once you hit Enter when you see ... on the prompt, you should obtain this output:

    Figure 1.5: Output of a for loop at the command line


    Notice that in Python, the opening of the for loop is followed by a colon, and the body of the loop requires indentation. It's typical to use four spaces to indent a code block. Here, the for loop prints the values produced by the range() iterator, accessing each one in turn through the counter variable via the in keyword.
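    As an aside you can try in the same interpreter session (this goes slightly beyond the exercise), range() also accepts optional start and step arguments in addition to the stop value shown above:

    ```python
    # range(stop) counts from 0 up to, but not including, stop
    print(list(range(5)))         # [0, 1, 2, 3, 4]

    # range(start, stop, step) lets you control where counting
    # begins and how far apart successive values are
    print(list(range(2, 10, 2)))  # [2, 4, 6, 8]
    ```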

    Note

    For many more details on Python code conventions, refer to the following: https://www.python.org/dev/peps/pep-0008/.

    Now, we will return to our dictionary example. The first step here is to create the dictionary.

    Create a dictionary of fruits (apples, oranges, and bananas) using the following code:

    example_dict = {'apples':5, 'oranges':8, 'bananas':13}

    Convert the dictionary to a list using the list() function, as shown in the following snippet:

    dict_to_list = list(example_dict)

    dict_to_list

    Once you run the preceding code, you should obtain the following output:

    ['apples', 'oranges', 'bananas']

    Notice that when this is done and we examine the contents, only the keys of the dictionary have been captured in the list. If we wanted the values, we would have had to use the dictionary's .values() method, as in list(example_dict.values()). Also, notice that the list of dictionary keys appears in the same order in which we wrote them when creating the dictionary. In Python 3.7 and later, dictionaries are guaranteed to preserve insertion order; in earlier versions, they were unordered collection types and this was not guaranteed.
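    The distinction between keys and values can be seen directly in the interpreter (a quick sketch, not part of the exercise steps):

    ```python
    example_dict = {'apples': 5, 'oranges': 8, 'bananas': 13}

    # list() on a dictionary captures only its keys
    print(list(example_dict))           # ['apples', 'oranges', 'bananas']

    # the .values() method gives the values instead
    print(list(example_dict.values()))  # [5, 8, 13]

    # and .items() gives (key, value) pairs
    print(list(example_dict.items()))
    ```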

    One convenient thing you can do with lists is to concatenate them with the + operator. As an example, in the next step, we will combine the existing list of fruits with a list that contains just one more type of fruit, overwriting the variable that held the original list.

    Use the + operator to combine the existing list of fruits with a new list containing only one fruit (pears):

    dict_to_list = dict_to_list + ['pears']

    dict_to_list

    Your output will be as follows:

    ['apples', 'oranges', 'bananas', 'pears']
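    Note that appending a single item could also be done with the list's .append() method, which modifies the list in place rather than building a new one (an equivalent alternative, not used in this exercise):

    ```python
    dict_to_list = ['apples', 'oranges', 'bananas']

    # .append() adds one element to the end of the existing list
    dict_to_list.append('pears')
    print(dict_to_list)  # ['apples', 'oranges', 'bananas', 'pears']
    ```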

    What if we wanted to sort our list of fruit types?

    Python provides a built-in sorted() function that can be used for this; it will return a sorted version of the input. In our case, this means the list of fruit types will be sorted alphabetically.

    Sort the list of fruits in alphabetical order using the sorted() function, as shown in the following snippet:

    sorted(dict_to_list)

    Once you run the preceding code, you should see the following output:

    ['apples', 'bananas', 'oranges', 'pears']
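    Note that sorted() returns a new list and leaves the original unchanged; it also accepts a reverse argument, and lists have an in-place .sort() method. A quick sketch of these variations (beyond what the exercise requires):

    ```python
    fruits = ['apples', 'oranges', 'bananas', 'pears']

    # sorted() returns a new sorted list; fruits itself is unchanged
    print(sorted(fruits))                # ['apples', 'bananas', 'oranges', 'pears']

    # reverse=True sorts in descending order
    print(sorted(fruits, reverse=True))  # ['pears', 'oranges', 'bananas', 'apples']

    # .sort() sorts the list in place and returns None
    fruits.sort()
    print(fruits)                        # ['apples', 'bananas', 'oranges', 'pears']
    ```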

    That's enough Python for now. We will show you how to execute the code for this book, so your Python knowledge should improve along the way. While you have the Python interpreter open, you may wish to run the code examples shown in Figures 1.1 and 1.2. When you're done with the interpreter, you can type quit() to exit.

    Note

    As you learn more and inevitably want to try new things, consult the official Python documentation: https://docs.python.org/3/.

    Different Types of Data Science Problems

    Much of your time as a data scientist is likely to be spent wrangling data: figuring out how to get it, getting it, examining it, making sure it's correct and complete, and joining it with other types of data. pandas is a widely used tool for data analysis in Python, and it can facilitate the data exploration process for you, as we will see in this chapter. However, one of the key goals of this book is to start you on your journey to becoming a machine learning data scientist, for which you will need to master the art and science of predictive modeling. This means using a mathematical model, or idealized mathematical formulation, to learn relationships within the data, in the hope of making accurate and useful predictions when new data comes in.

    For predictive modeling use cases, data is typically organized in a tabular structure, with features and a response variable. For example, if you want to predict the price of a house based on some characteristics about it, such as area and number of bedrooms, these attributes would be considered the features and the price of the house would be the response variable. The response variable
