Machine Learning with LightGBM and Python: A practitioner's guide to developing production-ready machine learning systems
Ebook, 512 pages, 3 hours


About this ebook

Machine Learning with LightGBM and Python is a comprehensive guide to learning the basics of machine learning and progressing to building scalable machine learning systems that are ready for release.
This book will get you acquainted with the high-performance gradient-boosting LightGBM framework and show you how it can be used to solve various machine-learning problems to produce highly accurate, robust, and predictive solutions. Starting with simple machine learning models in scikit-learn, you’ll explore the intricacies of gradient boosting machines and LightGBM. You’ll be guided through various case studies to better understand the data science processes and learn how to practically apply your skills to real-world problems. As you progress, you’ll elevate your software engineering skills by learning how to build and integrate scalable machine-learning pipelines to process data, train models, and deploy them to serve secure APIs using Python tools such as FastAPI.
By the end of this book, you’ll be well equipped to use various state-of-the-art tools that will help you build production-ready systems, including FLAML for AutoML, PostgresML for operating ML pipelines using Postgres, high-performance distributed training and serving via Dask, and creating and running models in the cloud with AWS SageMaker.

Language: English
Release date: Sep 29, 2023
ISBN: 9781800563056

    Book preview

    Machine Learning with LightGBM and Python - Andrich van Wyk


    Machine Learning with LightGBM and Python

    Copyright © 2023 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Niranjan Naikwadi

    Publishing Product Manager: Tejashwini R

    Senior Editor: Gowri Rekha

    Content Development Editor: Manikandan Kurup

    Technical Editor: Kavyashree K S

    Copy Editor: Safis Editing

    Project Coordinator: Farheen Fathima

    Proofreader: Safis Editing

    Indexer: Subalakshmi Govindhan

    Production Designer: Shyam Sundar Korumilli

    Marketing Coordinator: Vinishka Kalra

    First published: September 2023

    Production reference: 1220923

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK.

    ISBN: 978-1-80056-474-9

    www.packtpub.com

    Countless nights and weekends have been dedicated to completing this book, and I would like to thank my wife, Irene, for her eternal support, without which, nobody would be reading any of this. Further, I’m grateful to my daughter, Emily, for inspiring me to reach a little further.

    – Andrich van Wyk

    Contributors

    About the author

    Andrich van Wyk has 15 years of experience in machine learning R&D, building AI-driven solutions, and consulting in the AI domain. He also has broad experience as a software engineer and architect with over a decade of industry experience working on enterprise systems.

    He graduated cum laude with an M.Sc. in Computer Science from the University of Pretoria, focusing on neural networks and evolutionary algorithms.

    Andrich enjoys writing about machine learning engineering and the software industry at large. He currently resides in South Africa with his wife and daughter.

    About the reviewers

    Valentine Shkulov is a renowned visiting lecturer at a top tech university, where he seamlessly melds academia with real-world expertise as a distinguished Data Scientist in Fintech and E-commerce. His ingenuity in crafting ML-driven solutions has transformed businesses, from tech giants to budding startups. Valentine excels at introducing AI innovations and refining current systems, ensuring they profoundly influence vital business metrics. His passion for navigating product challenges has established him as a pioneer in leveraging ML to elevate businesses.

    Above all, a heartfelt thanks to my spouse, the unwavering pillar of support in my remarkable journey.

    Kayewan M Karanjia has over 7 years of experience in machine learning, artificial intelligence (AI), and data technologies, and brings a wealth of expertise to his current role at DrDoctor. Here, as a machine learning engineer, he is dedicated to implementing advanced machine learning models that have a direct impact on enhancing healthcare services and process optimization for the NHS. In the past, he has also worked with multiple MNCs such as Reliance Industries Limited, and implemented solutions for the government of India.

    Table of Contents

    Preface

    Part 1: Gradient Boosting and LightGBM Fundamentals

    1

    Introducing Machine Learning

    Technical requirements

    What is machine learning?

    Machine learning paradigms

    Introducing models, datasets, and supervised learning

    Models

    Hyperparameters

    Datasets

    Overfitting and generalization

    Supervised learning

    Model performance metrics

    A modeling example

    Decision tree learning

    Entropy and information gain

    Building a decision tree using C4.5

    Overfitting in decision trees

    Building decision trees with scikit-learn

    Decision tree hyperparameters

    Summary

    References

    2

    Ensemble Learning – Bagging and Boosting

    Technical requirements

    Ensemble learning

    Bagging and random forests

    Random forest

    Gradient-boosted decision trees

    Gradient descent

    Gradient boosting

    Gradient-boosted decision tree hyperparameters

    Gradient boosting in scikit-learn

    Advanced boosting algorithm – DART

    Summary

    References

    3

    An Overview of LightGBM in Python

    Technical requirements

    Introducing LightGBM

    LightGBM optimizations

    Hyperparameters

    Limitations of LightGBM

    Getting started with LightGBM in Python

    LightGBM Python API

    LightGBM scikit-learn API

    Building LightGBM models

    Cross-validation

    Parameter optimization

    Predicting student academic success

    Summary

    References

    4

    Comparing LightGBM, XGBoost, and Deep Learning

    Technical requirements

    An overview of XGBoost

    Comparing XGBoost and LightGBM

    Python XGBoost example

    Deep learning and TabTransformers

    What is deep learning?

    Introducing TabTransformers

    Comparing LightGBM, XGBoost, and TabTransformers

    Predicting census income

    Detecting credit card fraud

    Summary

    References

    Part 2: Practical Machine Learning with LightGBM

    5

    LightGBM Parameter Optimization with Optuna

    Technical requirements

    Optuna and optimization algorithms

    Introducing Optuna

    Optimization algorithms

    Pruning strategies

    Optimizing LightGBM with Optuna

    Advanced Optuna features

    Summary

    References

    6

    Solving Real-World Data Science Problems with LightGBM

    Technical requirements

    The data science life cycle

    Defining the data science life cycle

    Predicting wind turbine power generation with LightGBM

    Problem definition

    Data collection

    Data preparation

    EDA

    Modeling

    Model deployment

    Communicating results

    Classifying individual credit scores with LightGBM

    Problem definition

    Data collection

    Data preparation

    EDA

    Modeling

    Model deployment and results

    Summary

    References

    7

    AutoML with LightGBM and FLAML

    Technical requirements

    Automated machine learning

    Automating feature engineering

    Automating model selection and tuning

    Risks of using AutoML systems

    Introducing FLAML

    Cost Frugal Optimization

    BlendSearch

    FLAML limitations

    Case study – using FLAML with LightGBM

    Feature engineering

    FLAML AutoML

    Zero-shot AutoML

    Summary

    References

    Part 3: Production-ready Machine Learning with LightGBM

    8

    Machine Learning Pipelines and MLOps with LightGBM

    Technical requirements

    Introducing machine learning pipelines

    Scikit-learn pipelines

    Understanding MLOps

    Deploying an ML pipeline for customer churn

    Building an ML pipeline using scikit-learn

    Building an ML API using FastAPI

    Containerizing our API

    Deploying LightGBM to Google Cloud

    Summary

    9

    LightGBM MLOps with AWS SageMaker

    Technical requirements

    An introduction to AWS and SageMaker

    AWS

    SageMaker

    SageMaker Clarify

    Building a LightGBM ML pipeline with Amazon SageMaker

    Setting up a SageMaker session

    Preprocessing step

    Model training and tuning

    Evaluation, bias, and explainability

    Deploying and monitoring the LightGBM model

    Results

    Summary

    References

    10

    LightGBM Models with PostgresML

    Technical requirements

    Introducing PostgresML

    Latency and round trips

    Getting started with PostgresML

    Training models

    Deploying and prediction

    PostgresML dashboard

    Case study – customer churn with PostgresML

    Data loading and preprocessing

    Training and hyperparameter optimization

    Predictions

    Summary

    References

    11

    Distributed and GPU-Based Learning with LightGBM

    Technical requirements

    Distributed learning with LightGBM and Dask

    GPU training for LightGBM

    Setting up LightGBM for the GPU

    Running LightGBM on the GPU

    Summary

    References

    Index

    Other Books You May Enjoy

    Preface

    Welcome to Machine Learning with LightGBM and Python: A Practitioner’s Guide to Developing Production-Ready Machine Learning Systems. In this book, you’ll embark on a rich journey, taking you from the foundational principles of machine learning to the advanced realms of MLOps. The cornerstone of our exploration is LightGBM, a powerful and flexible gradient-boosting framework that can be harnessed for a wide range of machine-learning challenges.

    This book is tailor-made for anyone passionate about transforming raw data into actionable insights using the power of Machine Learning (ML). Whether you’re an ML novice eager to get your hands dirty or an experienced data scientist seeking to master the intricacies of LightGBM, there’s something in here for you.

    The digital era has equipped us with a treasure trove of data. However, the challenges often lie in extracting meaningful insights from this data and deploying scalable, efficient, and reliable models in production environments. This book will guide you in overcoming these challenges. By diving into gradient boosting, the data science life cycle, and the nuances of production deployment, you will gain a comprehensive skill set to navigate the ever-evolving landscape of ML.

    Each chapter is designed with practicality in mind. Real-world case studies interspersed with theoretical insights ensure your learning is grounded in tangible applications. Our focus on LightGBM, which sometimes gets overshadowed by more mainstream algorithms, provides a unique lens to appreciate and apply gradient boosting in various scenarios.

    For those curious about what sets this book apart, it’s our pragmatic approach. We go beyond merely explaining algorithms or tools and prioritize hands-on applications, case studies, and real-world challenges, ensuring you’re not just reading about ML but also doing it.

    As we traverse through the chapters, remember that the world of ML is vast and constantly evolving. This book, while comprehensive, is a stepping stone in your lifelong journey of learning and exploration in the domain. As you navigate the world of LightGBM, data science, MLOps, and more, keep your mind open, your curiosity alive, and your hands ready to code.

    Who this book is for

    Machine Learning with LightGBM and Python: A Practitioner’s Guide to Developing Production-Ready Machine Learning Systems is tailored for a broad spectrum of readers passionate about harnessing data’s power through ML. The target audience for this book includes the following:

    Beginners in ML: Individuals just stepping into the world of ML will find this book immensely beneficial. It starts with foundational ML principles and introduces them to gradient boosting using LightGBM, making it an excellent entry point for newcomers.

    Experienced data scientists and ML practitioners: For those who are already familiar with the landscape of ML but want to deepen their knowledge of LightGBM and/or MLOps, this book offers advanced insights, techniques, and practical applications.

    Software engineers and architects looking to learn more about data science: Software professionals keen on transitioning to data science or integrating ML into their applications will find this book valuable. The book approaches ML theoretically and practically, emphasizing hands-on coding and real-world applications.

    MLOps engineers and DevOps professionals: Individuals working in the field of MLOps or those who wish to understand the deployment, scaling, and monitoring of ML models in production environments will benefit from the chapters dedicated to MLOps, pipelines, and deployment strategies.

    Academicians and students: Faculty members teaching ML, data science, or related courses, as well as students pursuing these fields, will find this book to be both an informative textbook and a practical guide.

    Knowledge of Python programming is necessary. Familiarity with Jupyter notebooks and Python environments is a bonus. No prior knowledge of ML is required.

    In essence, anyone with a penchant for data, a background in Python programming, and an eagerness to explore the multifaceted world of ML using LightGBM will find this book a valuable addition to their repertoire.

    What this book covers

    Chapter 1, Introducing Machine Learning, starts our journey into ML, viewing it through the lens of software engineering. We will elucidate vital concepts central to the field, such as models, datasets, and the various learning paradigms, ensuring clarity with a hands-on example using decision trees.

    Chapter 2, Ensemble Learning – Bagging and Boosting, delves into ensemble learning, focusing on bagging and boosting techniques applied to decision trees. We will explore algorithms such as random forests, gradient-boosted decision trees, and more advanced concepts such as Dropouts meet Multiple Additive Regression Trees (DART).

    Chapter 3, An Overview of LightGBM in Python, examines LightGBM, an advanced gradient-boosting framework with tree-based learners. Highlighting its unique innovations and enhancements to ensemble learning, we will guide you through its Python APIs. A comprehensive modeling example using LightGBM, enriched with advanced validation and optimization techniques, sets the stage for a deeper dive into data science and production ML systems.

    Chapter 4, Comparing LightGBM, XGBoost, and Deep Learning, pits LightGBM against two prominent tabular data modeling methods – XGBoost and deep neural networks (DNNs), specifically TabTransformer. We will assess each method’s complexity, performance, and computational cost through evaluations of two datasets. The essence of this chapter is ascertaining LightGBM’s competitiveness in the broader ML landscape, rather than an in-depth study of XGBoost or DNNs.

    Chapter 5, LightGBM Parameter Optimization with Optuna, focuses on the pivotal task of hyperparameter optimization, introducing the Optuna framework as a potent solution. Covering various optimization algorithms and strategies to prune the hyperparameter space, this chapter guides you through a hands-on example of refining LightGBM parameters using Optuna.

    Chapter 6, Solving Real-World Data Science Problems with LightGBM, methodically breaks down the data science process, applying it to two distinct case studies – a regression and a classification problem. The chapter illuminates each step of the data science life cycle. You will experience hands-on modeling with LightGBM, paired with comprehensive theory. This chapter also serves as a blueprint for data science projects using LightGBM.

    Chapter 7, AutoML with LightGBM and FLAML, delves into automated machine learning (AutoML), emphasizing its significance in simplifying and expediting data engineering and model development. We will introduce FLAML, a notable library that automates model selection and fine-tuning with efficient hyperparameter algorithms. Through a practical case study, you will witness FLAML’s synergy with LightGBM and the transformative Zero-Shot AutoML functionality, which renders the tuning process obsolete.

    Chapter 8, Machine Learning Pipelines and MLOps with LightGBM, moves on from modeling intricacies to the world of production ML. It introduces you to ML pipelines, ensuring consistent data processing and model building, and ventures into MLOps, a fusion of DevOps and ML, which is vital to deploying resilient ML systems.

    Chapter 9, LightGBM MLOps with AWS SageMaker, steers our journey toward Amazon SageMaker, Amazon Web Services’ comprehensive suite to craft and maintain ML solutions. We will deepen our understanding of ML pipelines by delving into advanced areas such as bias detection, explainability in models, and the nuances of automated, scalable deployments.

    Chapter 10, LightGBM Models with PostgresML, introduces PostgresML, a distinct MLOps platform and a PostgreSQL database extension that facilitates ML model development and deployment directly via SQL. This approach, while contrasting with the scikit-learn programming style that we’ve embraced, showcases the benefits of database-level ML, particularly regarding data movement efficiencies and faster inferencing.

    Chapter 11, Distributed and GPU-Based Learning with LightGBM, delves into the expansive realm of training LightGBM models, leveraging distributed computing clusters and GPUs. By harnessing distributed computing, you will understand how to substantially accelerate training workloads and manage datasets that exceed a single machine’s memory capacity.

    To get the most out of this book

    This book is written assuming that you have some knowledge of Python programming. None of the Python code is very complex, so even understanding the basics of Python should be enough to get you through most of the code examples.

    Jupyter notebooks are used for the practical examples in all the chapters. Jupyter Notebook is an open source tool that allows you to create code notebooks that contain live code, visualizations, and markdown text. Tutorials to get started with Jupyter Notebook are available at https://realpython.com/jupyter-notebook-introduction/ and at https://plotly.com/python/ipython-notebook-tutorial/.

    We recommend using Anaconda for Python environment management when setting up your own environment. Anaconda also bundles many data science packages, so you don’t have to install them individually. Anaconda can be downloaded from https://www.anaconda.com/download. Notably, the book is accompanied by a GitHub repository, which includes an Anaconda environment file, to create the environment required to run the code examples in this book.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Practical-Machine-Learning-with-LightGBM-and-Python. If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are several text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The code is almost identical to our classification example – instead of a classifier, we use DecisionTreeRegressor as our model and calculate mean_absolute_error instead of the F1 score.

    A block of code is set as follows:

    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    import seaborn as sns
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    model = DecisionTreeRegressor(random_state=157, max_depth=3, min_samples_split=2)
    model = model.fit(X_train, y_train)
    mean_absolute_error(y_test, model.predict(X_test))
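
    The fragment above relies on names defined elsewhere in the book (X_train, X_test, y_train, and y_test), so it will not run on its own. As a minimal sketch, it can be made self-contained by generating stand-in data. The synthetic dataset from make_regression and the 80/20 split are assumptions chosen for illustration and are not the book's data; only the model settings come from the fragment itself.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the book works with its own datasets).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=157)

# Hold out 20% of the rows for testing (the split ratio is also an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=157
)

# Same model settings as the fragment above: a shallow regression tree.
model = DecisionTreeRegressor(random_state=157, max_depth=3, min_samples_split=2)
model = model.fit(X_train, y_train)

# Mean absolute error on the held-out rows; lower is better.
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE: {mae:.2f}")
```

    Fixing random_state in the data generation, the split, and the model keeps the result repeatable from run to run, which makes it easier to compare against changes to max_depth or min_samples_split.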

    Any command-line input or output is written as follows:

    conda create -n your_env_name python=3.9

    Bold: Indicates a new term, an important word, or words you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Therefore, data preparation and cleaning are essential parts of the machine-learning process.

    Tips or important notes

    Appear in blocks such as these.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please get in touch with us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share Your Thoughts

    Once you’ve read Machine Learning with LightGBM and Python, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help
