The Machine Learning Solutions Architect Handbook: Practical strategies and best practices on the ML lifecycle, system design, MLOps, and generative AI

About this ebook

David Ping, Head of GenAI and ML Solution Architecture for global industries at AWS, provides expert insights and practical examples to help you become a proficient ML solutions architect, linking technical architecture to business-related skills.
You'll learn about ML algorithms, cloud infrastructure, system design, MLOps, and how to apply ML to solve real-world business problems. David explains the generative AI project lifecycle and examines Retrieval Augmented Generation (RAG), an effective architecture pattern for generative AI applications. You’ll also learn about open-source technologies, such as Kubernetes/Kubeflow, for building a data science environment and ML pipelines before building an enterprise ML architecture using AWS. Alongside coverage of ML risk management and the different stages of AI/ML adoption, the biggest new addition to the handbook is its deep exploration of generative AI.
By the end of this book, you’ll have gained a comprehensive understanding of AI/ML across all key aspects, including business use cases, data science, real-world solution architecture, risk management, and governance. You’ll possess the skills to design and construct ML solutions that effectively cater to common use cases and follow established ML architecture patterns, enabling you to excel as a true professional in the field.

Language: English
Release date: Apr 15, 2024
ISBN: 9781805124825


    Book preview

    The Machine Learning Solutions Architect Handbook - David Ping


    The Machine Learning Solutions Architect Handbook

    Second Edition

    Practical strategies and best practices on the ML lifecycle, system design, MLOps, and generative AI

    David Ping

    The Machine Learning Solutions Architect Handbook

    Second Edition

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Publishing Product Manager: Bhavesh Amin

    Acquisition Editor – Peer Reviews: Gaurav Gavas

    Project Editor: Amisha Vathare

    Content Development Editor: Tanya D’cruz

    Copy Editor: Safis Editing

    Technical Editor: Anjitha Murali

    Proofreader: Safis Editing

    Indexer: Hemangini Bari

    Presentation Designer: Ajay Patule

    Developer Relations Marketing Executive: Monika Sangwan

    First published: January 2022

    Second edition: April 2024

    Production reference: 1080424

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK.

    ISBN 978-1-80512-250-0

    www.packt.com

    Contributors

    About the author

    David Ping is a seasoned technology executive with over 25 years of experience in the technology and financial services sectors. Specializing in cloud architecture, AI/ML, generative AI, ML platforms, and data analytics, he currently leads a global AI/ML solutions architecture team for industries at AWS, guiding companies worldwide in deploying cutting-edge AI/ML solutions. Previously holding executive roles at Credit Suisse and JPMorgan, David began his career as a software engineer at Intel after graduating with an engineering degree from Cornell University.

    About the reviewers

    Sepehr Pakbaz has been developing software since 2000 and has experience in full-stack software development, working with a variety of programming languages and frameworks such as Python, JavaScript, .NET, and, more recently, Golang. He has also worked as a product owner, consultant, and cloud solution architect. He has worked for companies like IBM and Microsoft in the past and is currently a Solutions Architect at Amazon Web Services. Additionally, he works as a consultant for his own company, Starspak LLC, as a side hustle.

    Chakravarthy Nagarajan is a technology evangelist with 23 years of industry experience in ML, big data, and high-performance computing. He is currently working as a Principal AI/ML Specialist Solutions Architect at Amazon Web Services, based in the Bay Area, USA. He helps customers solve complex real-world business problems by building prototypes with end-to-end AI/ML solutions on cloud and edge devices. His specializations include generative AI, computer vision, natural language processing, time series forecasting, and personalization. In his current role, Chakravarthy helps customers across start-ups, enterprises, and ISVs throughout North America solve their business problems using AI and ML solutions.

    Amit Nandi is a Solutions and Enterprise Architect specializing in driving innovation across diverse industries, including financial, pharmaceutical, manufacturing, and retail. He is recognized for architecting and implementing groundbreaking business paradigms through the integration of big data technologies, real-time streaming, and cutting-edge ML and AI solutions. He built an ML/AI-powered cybersecurity platform and enabled MLOps for the research team of a large pharmaceutical company.

    Join our community on Discord

    Join our community’s Discord space for discussions with the author and other readers:

    https://packt.link/mlsah

    Contents

    Preface

    Who this book is for

    What this book covers

    To get the most out of this book

    Get in touch

    Navigating the ML Lifecycle with ML Solutions Architecture

    ML versus traditional software

    ML lifecycle

    Business problem understanding and ML problem framing

    Data understanding and data preparation

    Model training and evaluation

    Model deployment

    Model monitoring

    Business metric tracking

    ML challenges

    ML solutions architecture

    Business understanding and ML transformation

    Identification and verification of ML techniques

    System architecture design and implementation

    ML platform workflow automation

    Security and compliance

    Summary

    Exploring ML Business Use Cases

    ML use cases in financial services

    Capital market front office

    Sales trading and research

    Investment banking

    Wealth management

    Capital market back office operations

    Net Asset Value review

    Post-trade settlement failure prediction

    Risk management and fraud

    Anti-money laundering

    Trade surveillance

    Credit risk

    Insurance

    Insurance underwriting

    Insurance claim management

    ML use cases in media and entertainment

    Content development and production

    Content management and discovery

    Content distribution and customer engagement

    ML use cases in healthcare and life sciences

    Medical imaging analysis

    Drug discovery

    Healthcare data management

    ML use cases in manufacturing

    Engineering and product design

    Manufacturing operations – product quality and yield

    Manufacturing operations – machine maintenance

    ML use cases in retail

    Product search and discovery

    Targeted marketing

    Sentiment analysis

    Product demand forecasting

    ML use cases in the automotive industry

    Autonomous vehicles

    Perception and localization

    Decision and planning

    Control

    Advanced driver assistance systems (ADAS)

    Summary

    Exploring ML Algorithms

    Technical requirements

    How machines learn

    Overview of ML algorithms

    Consideration for choosing ML algorithms

    Algorithms for classification and regression problems

    Linear regression algorithms

    Logistic regression algorithms

    Decision tree algorithms

    Random forest algorithm

    Gradient boosting machine and XGBoost algorithms

    K-nearest neighbor algorithm

    Multi-layer perceptron (MLP) networks

    Algorithms for clustering

    Algorithms for time series analysis

    ARIMA algorithm

    DeepAR algorithm

    Algorithms for recommendation

    Collaborative filtering algorithm

    Multi-armed bandit/contextual bandit algorithm

    Algorithms for computer vision problems

    Convolutional neural networks

    ResNet

    Algorithms for natural language processing (NLP) problems

    Word2Vec

    BERT

    Generative AI algorithms

    Generative adversarial network

    Generative pre-trained transformer (GPT)

    Large Language Model

    Diffusion model

    Hands-on exercise

    Problem statement

    Dataset description

    Setting up a Jupyter Notebook environment

    Running the exercise

    Summary

    Data Management for ML

    Technical requirements

    Data management considerations for ML

    Data management architecture for ML

    Data storage and management

    AWS Lake Formation

    Data ingestion

    Kinesis Firehose

    AWS Glue

    AWS Lambda

    Data cataloging

    AWS Glue Data Catalog

    Custom data catalog solution

    Data processing

    ML data versioning

    S3 partitions

    Versioned S3 buckets

    Purpose-built data version tools

    ML feature stores

    Data serving for client consumption

    Consumption via API

    Consumption via data copy

    Special databases for ML

    Vector databases

    Graph databases

    Data pipelines

    Authentication and authorization

    Data governance

    Data lineage

    Other data governance measures

    Hands-on exercise – data management for ML

    Creating a data lake using Lake Formation

    Creating a data ingestion pipeline

    Creating a Glue Data Catalog

    Discovering and querying data in the data lake

    Creating an Amazon Glue ETL job to process data for ML

    Building a data pipeline using Glue workflows

    Summary

    Exploring Open-Source ML Libraries

    Technical requirements

    Core features of open-source ML libraries

    Understanding the scikit-learn ML library

    Installing scikit-learn

    Core components of scikit-learn

    Understanding the Apache Spark ML library

    Installing Spark ML

    Core components of the Spark ML library

    Understanding the TensorFlow deep learning library

    Installing TensorFlow

    Core components of TensorFlow

    Hands-on exercise – training a TensorFlow model

    Understanding the PyTorch deep learning library

    Installing PyTorch

    Core components of PyTorch

    Hands-on exercise – building and training a PyTorch model

    How to choose between TensorFlow and PyTorch

    Summary

    Kubernetes Container Orchestration Infrastructure Management

    Technical requirements

    Introduction to containers

    Overview of Kubernetes and its core concepts

    Namespaces

    Pods

    Deployment

    Kubernetes Job

    Kubernetes custom resources and operators

    Services

    Networking on Kubernetes

    Security and access management

    API authentication and authorization

    Hands-on – creating a Kubernetes infrastructure on AWS

    Problem statement

    Lab instruction

    Summary

    Open-Source ML Platforms

    Core components of an ML platform

    Open-source technologies for building ML platforms

    Implementing a data science environment

    Building a model training environment

    Registering models with a model registry

    Serving models using model serving services

    The Gunicorn and Flask inference engine

    The TensorFlow Serving framework

    The TorchServe serving framework

    KFServing framework

    Seldon Core

    Triton Inference Server

    Monitoring models in production

    Managing ML features

    Automating ML pipeline workflows

    Apache Airflow

    Kubeflow Pipelines

    Designing an end-to-end ML platform

    ML platform-based strategy

    ML component-based strategy

    Summary

    Building a Data Science Environment Using AWS ML Services

    Technical requirements

    SageMaker overview

    Data science environment architecture using SageMaker

    Onboarding SageMaker users

    Launching Studio applications

    Preparing data

    Preparing data interactively with SageMaker Data Wrangler

    Preparing data at scale interactively

    Processing data as separate jobs

    Creating, storing, and sharing features

    Training ML models

    Tuning ML models

    Deploying ML models for testing

    Best practices for building a data science environment

    Hands-on exercise – building a data science environment using AWS services

    Problem statement

    Dataset description

    Lab instructions

    Setting up SageMaker Studio

    Launching a JupyterLab notebook

    Training the BERT model in the Jupyter notebook

    Training the BERT model with the SageMaker Training service

    Deploying the model

    Building ML models with SageMaker Canvas

    Summary

    Designing an Enterprise ML Architecture with AWS ML Services

    Technical requirements

    Key considerations for ML platforms

    The personas of ML platforms and their requirements

    ML platform builders

    Platform users and operators

    Common workflow of an ML initiative

    Platform requirements for the different personas

    Key requirements for an enterprise ML platform

    Enterprise ML architecture pattern overview

    Model training environment

    Model training engine using SageMaker

    Automation support

    Model training lifecycle management

    Model hosting environment

    Inference engines

    Authentication and security control

    Monitoring and logging

    Adopting MLOps for ML workflows

    Components of the MLOps architecture

    Monitoring and logging

    Model training monitoring

    Model endpoint monitoring

    ML pipeline monitoring

    Service provisioning management

    Best practices in building and operating an ML platform

    ML platform project execution best practices

    ML platform design and implementation best practices

    Platform use and operations best practices

    Summary

    Advanced ML Engineering

    Technical requirements

    Training large-scale models with distributed training

    Distributed model training using data parallelism

    Parameter server overview

    AllReduce overview

    Distributed model training using model parallelism

    Naïve model parallelism overview

    Tensor parallelism/tensor slicing overview

    Implementing model-parallel training

    Achieving low-latency model inference

    How model inference works and opportunities for optimization

    Hardware acceleration

    Central processing units (CPUs)

    Graphics processing units (GPUs)

    Application-specific integrated circuit

    Model optimization

    Quantization

    Pruning (also known as sparsity)

    Graph and operator optimization

    Graph optimization

    Operator optimization

    Model compilers

    TensorFlow XLA

    PyTorch Glow

    Apache TVM

    Amazon SageMaker Neo

    Inference engine optimization

    Inference batching

    Enabling parallel serving sessions

    Picking a communication protocol

    Inference in large language models

    Text Generation Inference (TGI)

    DeepSpeed-Inference

    FastTransformer

    Hands-on lab – running distributed model training with PyTorch

    Problem statement

    Dataset description

    Modifying the training script

    Modifying and running the launcher notebook

    Summary

    Building ML Solutions with AWS AI Services

    Technical requirements

    What are AI services?

    Overview of AWS AI services

    Amazon Comprehend

    Amazon Textract

    Amazon Rekognition

    Amazon Transcribe

    Amazon Personalize

    Amazon Lex V2

    Amazon Kendra

    Amazon Q

    Evaluating AWS AI services for ML use cases

    Building intelligent solutions with AI services

    Automating loan document verification and data extraction

    Loan document classification workflow

    Loan data processing flow

    Media processing and analysis workflow

    E-commerce product recommendation

    Customer self-service automation with intelligent search

    Designing an MLOps architecture for AI services

    AWS account setup strategy for AI services and MLOps

    Code promotion across environments

    Monitoring operational metrics for AI services

    Hands-on lab – running ML tasks using AI services

    Summary

    AI Risk Management

    Understanding AI risk scenarios

    The regulatory landscape around AI risk management

    Understanding AI risk management

    Governance oversight principles

    AI risk management framework

    Applying risk management across the AI lifecycle

    Business problem identification and definition

    Data acquisition and management

    Risk considerations

    Risk mitigations

    Experimentation and model development

    Risk considerations

    Risk mitigations

    AI system deployment and operations

    Risk considerations

    Risk mitigations

    Designing ML platforms with governance and risk management considerations

    Data and model documentation

    Lineage and reproducibility

    Observability and auditing

    Scalability and performance

    Data quality

    Summary

    Bias, Explainability, Privacy, and Adversarial Attacks

    Understanding bias

    Understanding ML explainability

    LIME

    SHAP

    Understanding security and privacy-preserving ML

    Differential privacy

    Understanding adversarial attacks

    Evasion attacks

    PGD attacks

    HopSkipJump attacks

    Data poisoning attacks

    Clean-label backdoor attack

    Model extraction attack

    Attacks against generative AI models

    Defense against adversarial attacks

    Robustness-based methods

    Detector-based method

    Open-source tools for adversarial attacks and defenses

    Hands-on lab – detecting bias, explaining models, training privacy-preserving mode, and simulating adversarial attack

    Problem statement

    Detecting bias in the training dataset

    Explaining feature importance for a trained model

    Training privacy-preserving models

    Simulate a clean-label backdoor attack

    Summary

    Charting the Course of Your ML Journey

    ML adoption stages

    Exploring AI/ML

    Disjointed AI/ML

    Integrated AI/ML

    Advanced AI/ML

    AI/ML maturity and assessment

    Technical maturity

    Business maturity

    Governance maturity

    Organization and talent maturity

    Maturity assessment and improvement process

    AI/ML operating models

    Centralized model

    Decentralized model

    Hub and spoke model

    Solving ML journey challenges

    Developing the AI vision and strategy

    Getting started with the first AI/ML initiative

    Solving scaling challenges with AI/ML adoption

    Solving ML use case scaling challenges

    Solving technology scaling challenges

    Solving governance scaling challenges

    Summary

    Navigating the Generative AI Project Lifecycle

    The advancement and economic impact of generative AI

    What industries are doing with generative AI

    Financial services

    Healthcare and life sciences

    Media and entertainment

    Automotive and manufacturing

    The lifecycle of a generative AI project and the core technologies

    Business use case selection

    FM selection and evaluation

    Initial screening via manual assessment

    Automated model evaluation

    Human evaluation

    Assessing AI risks for FMs

    Other evaluation consideration

    Building FMs from scratch via pre-training

    Adaptation and customization

    Domain adaptation pre-training

    Fine-tuning

    Reinforcement learning from human feedback

    Prompt engineering

    Model management and deployment

    The limitations, risks, and challenges of adopting generative AI

    Summary

    Designing Generative AI Platforms and Solutions

    Operational considerations for generative AI platforms and solutions

    New generative AI workflow and processes

    New technology components

    New roles

    Exploring generative AI platforms

    The prompt management component

    FM benchmark workbench

    Supervised fine-tuning and RLHF

    FM monitoring

    The retrieval-augmented generation pattern

    Open-source frameworks for RAG

    LangChain

    LlamaIndex

    Evaluating a RAG pipeline

    Advanced RAG patterns

    Designing a RAG architecture on AWS

    Choosing an LLM adaptation method

    Response quality

    Cost of the adaptation

    Implementation complexity

    Bringing it all together

    Considerations for deploying generative AI applications in production

    Model readiness

    Decision-making workflow

    Responsible AI assessment

    Guardrails in production environments

    External knowledge change management

    Practical generative AI business solutions

    Generative AI-powered semantic search engine

    Financial data analysis and research workflow

    Clinical trial recruiting workflow

    Media entertainment content creation workflow

    Car design workflow

    Contact center customer service operation

    Are we close to having artificial general intelligence?

    The symbolic approach

    The connectionist/neural network approach

    The neural-symbolic approach

    Summary

    Other Books You May Enjoy

    Index


    Preface

    As artificial intelligence (AI) continues to gain traction across diverse industries, the need for proficient machine learning (ML) solutions architects is on the rise. These professionals play a pivotal role in bridging business requirements with ML solutions, crafting ML technology platforms that address both business and technical challenges. This book is designed to equip individuals with a comprehensive understanding of business use cases, ML algorithms, system architecture patterns, ML tools, AI risk management, enterprise AI adoption strategies, and the emerging field of generative AI.

    Upon completing this book, you will possess a comprehensive understanding of AI/ML and generative AI topics, encompassing business use cases, scientific principles, technological underpinnings, architectural considerations, risk management, operational aspects, and the journey towards enterprise adoption. Moreover, you will acquire hands-on technical proficiency with a diverse array of open-source and AWS technologies, empowering you to build and deploy cutting-edge AI/ML and generative AI solutions effectively. This holistic knowledge and practical skillset will enable you to articulate and address the multifaceted challenges and opportunities presented by these disruptive technologies.

    Who this book is for

    This book is designed for two primary audiences: developers and cloud architects who are looking for guidance and hands-on learning materials to become ML solutions architects, and experienced ML architecture practitioners and data scientists who are looking to develop a broader understanding of industry ML use cases, enterprise data and ML architecture patterns, data management and ML tools, ML governance, and advanced ML engineering techniques. This book can also benefit data engineers and cloud system administrators looking to understand how data management and cloud system architecture fit into the overall ML platform architecture. Risk professionals, AI product managers, and technology decision makers will also benefit from topics on AI risk management, business AI use cases, and ML maturity journey and best practices.

    This book assumes you have some Python programming knowledge and are familiar with AWS services. Some of the chapters are designed for ML beginners to learn the core ML fundamentals, and they might overlap with the knowledge already possessed by experienced ML practitioners.

    What this book covers

    Chapter 1, Navigating the ML Lifecycle with ML Solutions Architecture, introduces ML solutions architecture functions, covering its fundamentals and scope.

    Chapter 2, Exploring ML Business Use Cases, talks about real-world applications of AI/ML across various industries such as financial services, healthcare, media entertainment, automotive, manufacturing, and retail.

    Chapter 3, Exploring ML Algorithms, introduces common ML and deep learning algorithms for classification, regression, clustering, time series, recommendations, computer vision, natural language processing, and generative AI tasks. You will get hands-on experience of setting up a Jupyter server and building ML models on your local machine.

    Chapter 4, Data Management for ML, addresses the crucial topic of data management for ML, detailing how to leverage an array of AWS services to construct robust data management architectures. You will develop hands-on skills with AWS services for building data management pipelines for ML.

    Chapter 5, Exploring Open-Source ML Libraries, covers the core features of scikit-learn, Spark ML, PyTorch, and TensorFlow, and how to use these ML libraries for data preparation, model training, and model serving. You will practice building deep learning models using TensorFlow and PyTorch.

    Chapter 6, Kubernetes Container Orchestration Infrastructure Management, introduces containers, Kubernetes concepts, Kubernetes networking, and Kubernetes security. Kubernetes is a core open-source infrastructure for building open-source ML solutions. You will also practice setting up the Kubernetes platform on AWS EKS and deploying an ML workload in Kubernetes.

    Chapter 7, Open-Source ML Platforms, talks about the core concepts and the technical details of various open-source ML platform technologies, such as Kubeflow, MLflow, Airflow, and Seldon Core. The chapter also covers how to use these technologies to build a data science environment and ML automation pipeline.

    Chapter 8, Building a Data Science Environment Using AWS ML Services, introduces various AWS managed services for building data science environments, including Amazon SageMaker, Amazon ECR, and Amazon CodeCommit. You will also get hands-on experience with these services to configure a data science environment for experimentation and model training.

    Chapter 9, Designing an Enterprise ML Architecture with AWS ML Services, talks about the core requirements for an enterprise ML platform, discusses the architecture patterns and best practices for building an enterprise ML platform on AWS, and dives deep into the various core ML capabilities of SageMaker and other AWS services.

    Chapter 10, Advanced ML Engineering, provides insights into advanced ML engineering aspects such as distributed model training and low-latency model serving, crucial for meeting the demands of large-scale model training and high-performance serving requirements. You will also get hands-on experience with distributed data parallel model training using a SageMaker training cluster.

    Chapter 11, Building ML Solutions with AWS AI Services, will introduce AWS AI services and the types of problems these services can help solve without building an ML model from scratch. You will learn about the core capabilities of some key AI services and where they can be leveraged for building ML-powered business applications.

    Chapter 12, AI Risk Management, explores AI risk management principles and frameworks, providing comprehensive coverage of AI risk scenarios, governance principles, and risk mitigation considerations across the entire ML lifecycle. It also explains how ML platforms can facilitate governance through documentation, model inventory maintenance, and monitoring processes.

    Chapter 13, Bias, Explainability, Privacy, and Adversarial Attacks, delves into the technical aspects of various risks, providing in-depth explanations of bias detection techniques, model explainability methods, privacy preservation approaches, as well as adversarial attack scenarios and corresponding mitigation strategies.

    Chapter 14, Charting the Course of Your ML Journey, outlines the stages of adoption and presents a corresponding maturity model designed to facilitate progress along the ML journey. Additionally, it addresses key considerations essential for overcoming the hurdles encountered throughout this process.

    Chapter 15, Navigating the Generative AI Project Lifecycle, discusses the advancement and economic impact of generative AI and industry trends in its adoption, and guides readers through the stages of a generative AI project, from ideation to deployment, exploring the core technologies as well as the limitations and challenges along the way.

    Chapter 16, Designing Generative AI Platforms and Solutions, explores generative AI platform architecture, the retrieval-augmented generation (RAG) application architecture and best practices, considerations for generative AI production deployment, and practical generative AI-powered business applications across diverse industry use cases.

    The chapter finishes with a discussion on artificial general intelligence (AGI) and various theoretical approaches the research community has taken in their pursuit of AGI.

    To get the most out of this book

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    As for the hardware/software requirements for the book, all you will need is a Windows or Mac machine and an AWS account.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/The-Machine-Learning-Solutions-Architect-and-Risk-Management-Handbook-Second-Edition/. If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/gbp/9781805122500.

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

    A block of code is set as follows:

    import pandas as pd
    churn_data = pd.read_csv("churn.csv")
    churn_data.head()

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    # The following command calculates the various statistics for the features.
    churn_data.describe()

    # The following command displays the histograms for the different features.
    # You can replace the column names to plot the histograms for other features.
    churn_data.hist(['CreditScore', 'Age', 'Balance'])

    # The following command calculates the correlations among features.
    churn_data.corr()

    Any command-line input or output is written as follows:

    ! pip3 install --upgrade tensorflow

    Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "An example of a deep learning-based solution is the Amazon Echo virtual assistant."

    Warnings or important notes appear like this.

    Tips and tricks appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share your thoughts

    Once you’ve read The Machine Learning Solutions Architect Handbook, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below:

    https://packt.link/free-ebook/9781805122500

    Submit your proof of purchase.

    That’s it! We’ll send your free PDF and other benefits to your email directly.

    1

    Navigating the ML Lifecycle with ML Solutions Architecture

    The field of artificial intelligence (AI) and machine learning (ML) has had a long history. Over the last 70+ years, ML has evolved from checker game-playing computer programs in the 1950s to advanced AI capable of beating the human world champion in the game of Go. More recently, Generative AI (GenAI) technology such as ChatGPT has been taking the industry by storm, generating huge interest among company executives and consumers alike, and promising new ways to transform business areas such as drug discovery, media content creation, financial report analysis, and consumer product design. Along the way, the technology infrastructure for ML has also evolved from a single machine/server for small experiments and models to highly complex end-to-end ML platforms capable of training, managing, and deploying tens of thousands of ML models. The hyper-growth in the AI/ML field has resulted in the creation of many new professional roles across a range of industries, such as MLOps engineer, AI/ML product manager, ML software engineer, AI risk manager, and AI strategist.

    Machine learning solutions architecture (ML solutions architecture) is another relatively new discipline that is playing an increasingly critical role in the full end-to-end ML lifecycle as ML projects become increasingly complex in terms of business impact, science sophistication, and the technology landscape.

    This chapter will help you understand where ML solutions architecture fits in the full data science lifecycle. We will discuss the steps it takes to get an ML project from the ideation stage to production, as well as the challenges organizations face when implementing an ML initiative, such as use case identification, data quality issues, and a shortage of ML talent. Finally, we will finish the chapter by briefly discussing the core focus areas of ML solutions architecture, including system architecture, workflow automation, and security and compliance.

    In this chapter, we are going to cover the following main topics:

    ML versus traditional software

    The ML lifecycle and its key challenges

    What is ML solutions architecture, and where does it fit in the overall lifecycle?

    Upon completing this chapter, you will understand the role of an ML solutions architect and what business and technology areas you need to focus on to support end-to-end ML initiatives. The intent of this chapter is to offer a fundamental introduction to the ML lifecycle for those in the early stages of their exploration in the field. Experienced ML practitioners may wish to skip this foundational overview and proceed directly to more advanced content.

    The more advanced section commences in Chapter 4; however, many technical practitioners may find Chapter 2 helpful, as they often need a stronger business understanding of where ML can be applied in different businesses and workflows. Additionally, Chapter 3 could prove beneficial for certain practitioners, as it provides an introduction to ML algorithms for those new to the topic and can also serve as a refresher for those practicing these concepts regularly.

    ML versus traditional software

    Before I started working in the field of AI/ML, I spent many years building computer software platforms for large financial services institutions. Some of the business problems I worked on had complex rules, such as identifying companies for comparable analysis for investment banking deals or creating a master database for all the different companies’ identifiers from the different data providers. We had to implement hardcoded rules in database-stored procedures and application server backends to solve these problems. We often debated if certain rules made sense or not for the business problems we tried to solve.

    As rules changed, we had to reimplement the rules and make sure the changes did not break anything. To test new releases or changes, we often relied on human experts to exhaustively test and validate all the business logic implemented before the production release. It was a very time-consuming and error-prone process and required a significant amount of engineering, testing against the documented specification, and rigorous change management for deployment every time new rules were introduced or existing rules needed to be changed. We often relied on users to report business logic issues in production, and when an issue was reported, we sometimes had to open up the source code to troubleshoot or explain the logic of how it worked. I remember I often asked myself if there were better ways to do this.

    After I started working in the field of AI/ML, I began solving many similar challenges using ML techniques. With ML, I did not need to come up with complex decision-making rules that often require deep data and domain expertise to create and maintain. Instead, I focused on collecting high-quality data and used ML algorithms to learn the rules and patterns from the data directly. This new approach eliminated many of the challenging aspects of creating new rules (for example, the deep domain expertise requirement, or avoiding human bias) and maintaining existing ones. To validate the model before the production release, we could examine model performance metrics such as accuracy. While it still required data science expertise to interpret the model metrics against the nature of the business problems and dataset, it did not require exhaustive manual testing of all the different scenarios. When a model was deployed into production, we would monitor whether the model performed as expected by watching for any significant changes in production data versus the data we had collected for model training. We would collect new unseen data and labels for production data and test the model performance periodically to ensure that its predictive accuracy remained robust when faced with new, previously unseen production data. To explain why a model made a decision the way it did, we did not need to open up the source code to re-examine the hardcoded logic. Instead, we would rely on ML techniques to help explain the relative importance of different input features to understand what factors were most influential in the decision-making by the ML models.
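    To make this contrast concrete, the following minimal sketch (an illustration, not code from the book) replaces a hypothetical hardcoded fraud rule with a classifier that learns the decision pattern from labeled data; the feature layout, threshold values, and tiny dataset are assumptions made purely for readability:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical transaction features: [amount, hour_of_day, merchant_risk_score]
    X = np.array([[120.0, 14, 0.2], [5300.0, 3, 0.9], [75.5, 11, 0.1], [9800.0, 2, 0.8]])
    y = np.array([0, 1, 0, 1])  # 0 = not fraud, 1 = fraud (illustrative labels)

    # Traditional software approach: a hand-written rule that engineers must maintain
    def rule_based_flag(amount, hour, risk):
        return amount > 5000 and hour < 6 and risk > 0.7

    # ML approach: the model learns the decision pattern directly from the data
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(model.predict([[6200.0, 4, 0.85]]))  # predicted label for a new transaction
    print(model.feature_importances_)          # relative importance of each input feature

    The feature_importances_ output illustrates the kind of explainability signal mentioned above: it indicates which inputs most influenced the learned decisions, without anyone having to re-read hardcoded logic.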

    The following figure shows a graphical view of the process differences between developing a piece of software and training an ML model:

    Figure 1.1: ML and computer software

    Now that you know the difference between ML and traditional software, it is time to dive deep into understanding the different stages in an ML lifecycle.

    ML lifecycle

    One of the early ML projects that I worked on was a fascinating yet daunting sports predictive analytics problem for a major league brand. I was given a list of predictive analytics outcomes to think about to see if there were ML solutions for the problems. I was a casual viewer of the sport; I didn’t know anything about the analytics to be generated, nor the rules of the games in the detail that was needed. I was provided with some sample data but had no idea what to do with it.

    The first thing I started to work on was an immersion in the sport itself. I delved into the intricacies of the game, studying the different player positions and events that make up each game and play. Only after being armed with the newfound domain knowledge did the data start to make sense. Together with the stakeholder, we evaluated the impact of the different analytics outcomes and assessed the modeling feasibility based on the data we had. With a clear understanding of the data, we came up with a couple of top ML analytics with the most business impact to focus on. We also decided how they would be integrated into the existing business workflow, and how they would be measured on their impacts.

    Subsequently, I delved deeper into the data to ascertain what information was available and what was lacking. The raw dataset had a lot of irrelevant data points that needed to be removed while the relevant data points needed to be transformed to provide the strongest signals for model training. I processed and prepared the dataset based on a few of the ML algorithms I had considered and conducted experiments to determine the best approach. I lacked a tool to track the different experiment results, so I had to document what I had done manually. After some initial rounds of experimentation, it became evident that the existing data was not sufficient to train a high-performance model. Hence, I decided to build a custom deep learning model to incorporate data of different modalities as the data points had temporal dependencies and required additional spatial information for the modeling. The data owner was able to provide the additional datasets I required, and after more experiments with custom algorithms and significant data preparations and feature engineering, I eventually trained a model that met the business objectives.

    After completing the model, another hard challenge began – deploying and operationalizing the model in production and integrating it into the existing business workflow and system architecture. We engaged in many architecture and engineering discussions and eventually built out a deployment architecture for the model.

    As you can see from my personal experience, the journey from business idea to ML production deployment involved many steps. A typical lifecycle of an ML project follows a formal structure, which includes several essential stages like business understanding, data acquisition and understanding, data preparation, model building, model evaluation, and model deployment. Since a big component of the lifecycle is experimentation with different datasets, features, and algorithms, the whole process is highly iterative. Furthermore, it is essential to note that there is no guarantee of a successful outcome. Factors such as the availability and quality of data, feature engineering techniques (the process of using domain knowledge to extract useful features from raw data), and the capability of the learning algorithms, among others, can all affect the final results.

    Figure 1.2: ML lifecycle

    The preceding figure illustrates the key steps in ML projects, and in the subsequent sections, we will delve into each of these steps in greater detail.

    Business problem understanding and ML problem framing

    The first stage in the lifecycle is business understanding. This stage involves understanding the business goals and defining business metrics that can measure the project’s success. The following are some examples of business goals:

    Cost reduction for operational processes, such as document processing.

    Mitigation of business or operational risks, such as fraud and compliance.

    Product or service revenue improvements, such as better target marketing, new insight generation for better decision making, and increased customer satisfaction.

    To measure the success, you may use specific business metrics such as the number of hours reduced in a business process, an increased number of true positive frauds detected, a conversion rate improvement from target marketing, or a reduction in the churn rate. This is an essential step to get right to ensure there is sufficient justification for an ML project and that the outcome of the project can be successfully measured.

    After you have defined the business goals and business metrics, you need to evaluate if there is an ML solution for the business problem. While ML has a wide scope of applications, it is not always an optimal solution for every business problem.

    Data understanding and data preparation

    The saying that data is the new oil holds particularly true for ML. Without the required data, you cannot move forward with an ML project. That’s why the next step in the ML lifecycle is data acquisition, understanding, and preparation.

    Based on the business problems and ML approach, you will need to gather and comprehend the available data to determine if you have the right data and data volume to solve the ML problem. For example, suppose the business problem to address is credit card fraud detection. In that case, you will need datasets such as historical credit card transaction data, customer demographics, account data, device usage data, and networking access data. Detailed data analysis is then necessary to determine if the dataset features and quality are sufficient for the modeling tasks. You also need to decide if the data needs labeling, such as fraud or not-fraud. During this step, depending on the data quality, a significant amount of data wrangling might be performed to prepare and clean the data and to generate the dataset for model training and model evaluation.
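    As a minimal sketch of what this wrangling step can look like (with hypothetical file and column names, not code from the book), the raw transaction extract might be deduplicated, labeled, and split into training and test sets as follows:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical raw credit card transaction extract
    transactions = pd.read_csv("transactions.csv")

    # Basic cleaning: drop duplicates and rows missing critical fields
    transactions = transactions.drop_duplicates()
    transactions = transactions.dropna(subset=["amount", "customer_id"])

    # Encode the label column provided by the fraud investigation team as 0/1
    transactions["label"] = (transactions["fraud_status"] == "fraud").astype(int)

    # Hold out a stratified test set for later model evaluation
    train_df, test_df = train_test_split(
        transactions, test_size=0.2, stratify=transactions["label"], random_state=42
    )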

    Model training and evaluation

    Using the training and validation datasets established, a data scientist must run a number of experiments using different ML algorithms and dataset features for feature selection and model development. This is a highly iterative process and could require numerous runs of data processing and model development to find the right algorithm and dataset combination for optimal model performance. In addition to model performance, factors such as data bias and model explainability may need to be considered to comply with internal or regulatory requirements.

    Prior to deployment into production, the model quality must be validated using the relevant technical metrics, such as the accuracy score. This is usually accomplished using a holdout dataset, also known as a test dataset, to gauge how the model performs on unseen data. It is crucial to understand which metrics are appropriate for model validation, as they vary depending on the ML problems and the dataset used. For example, model accuracy would be a suitable validation metric for a document classification use case if the number of document types is relatively balanced. However, model accuracy would not be a good metric to evaluate the model performance for a fraud detection use case – this is because the number of frauds is small and even if the model predicts not-fraud all the time, the model accuracy could still be very high.
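    The following sketch (with made-up labels and predictions, purely for illustration) shows why accuracy alone can be misleading on a highly imbalanced fraud dataset and why metrics such as precision and recall are usually more informative:

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # 1,000 transactions, of which only 10 are fraud (label 1)
    y_true = [1] * 10 + [0] * 990
    # A useless model that always predicts "not fraud"
    y_pred = [0] * 1000

    print(accuracy_score(y_true, y_pred))                    # 0.99 - looks excellent
    print(recall_score(y_true, y_pred))                      # 0.0  - catches no fraud at all
    print(precision_score(y_true, y_pred, zero_division=0))  # 0.0  - no correct fraud predictions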

    Model deployment

    After the model is fully trained and validated to meet the expected performance metric, it can be deployed into production and the business workflow. There are two main deployment concepts here. The first involves deploying the model itself so that a client application can use it to generate predictions. The second is integrating this prediction workflow into a business workflow application. For example, deploying the credit fraud model would mean either hosting the model behind an API for real-time prediction or packaging it so it can be loaded dynamically to support batch predictions. Moreover, this prediction workflow also needs to be integrated into business workflow applications for fraud detection, which might include fraud detection on real-time transactions, decision automation based on the prediction output, and detailed fraud analytics.
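    As a minimal sketch of the first deployment concept (hosting the model behind an API for real-time prediction), a trained model could be wrapped in a small Flask service; the model file name, feature layout, and route below are assumptions for illustration only, not the book's reference implementation:

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("fraud_model.joblib")  # hypothetical serialized scikit-learn model

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()                       # e.g. {"features": [6200.0, 4, 0.85]}
        prediction = model.predict([payload["features"]])  # scikit-learn expects a 2D array
        return jsonify({"fraud": int(prediction[0])})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)

    In production, such a service would typically sit behind a process manager such as Gunicorn, a pattern revisited later in the book when discussing inference engines.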

    Model monitoring

    The ML lifecycle does not end with model deployment. Unlike software, whose behavior is highly deterministic since developers explicitly code its logic, an ML model can behave differently in production from how it behaved during model training and validation. This could be caused by changes in the production data characteristics, data distribution, or the potential manipulation of request data. Therefore, model monitoring is an important post-deployment step for detecting model performance degradation (a.k.a. model drift) or dataset distribution change in the production environment (a.k.a. data drift).
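    One very simple form of data drift monitoring compares the distribution of a production feature against its training-time baseline. The sketch below is an illustration under assumed file names and an assumed alert threshold, using a two-sample Kolmogorov–Smirnov test rather than any specific monitoring product:

    import numpy as np
    from scipy.stats import ks_2samp

    # Feature values captured at training time vs. a recent window of production traffic
    train_amounts = np.load("train_amounts.npy")  # hypothetical saved baseline
    prod_amounts = np.load("prod_amounts.npy")    # hypothetical recent production sample

    statistic, p_value = ks_2samp(train_amounts, prod_amounts)
    if p_value < 0.01:  # assumed significance threshold for raising a drift alert
        print(f"Possible data drift detected (KS statistic = {statistic:.3f})")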

    Business metric tracking

    The actual business impact should be tracked and measured as an ongoing process to ensure the model delivers the expected business benefits. This may involve comparing the business metrics before and after the model deployment, or A/B testing where a business metric is compared between workflows with or without the ML model. If the model does not deliver the expected benefits, it should be re-evaluated for improvement opportunities. This could also mean framing the business problem as a different ML problem. For example, if churn prediction does not help improve customer satisfaction, then consider a personalized product/service offering to solve the problem.
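    For the A/B testing approach mentioned above, a minimal sketch (with made-up counts) might compare the conversion rate of the workflow that uses the ML model against the workflow that does not, using a two-proportion z-test:

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical campaign results: conversions and audience sizes for each variant
    conversions = [420, 310]   # with ML-based targeting, without
    audience = [10000, 10000]

    z_stat, p_value = proportions_ztest(count=conversions, nobs=audience)
    print(f"Conversion lift p-value: {p_value:.4f}")  # a small p-value suggests a real difference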

    ML challenges

    Over the years, I have worked on many real-world problems using ML solutions and encountered different challenges faced by different industries during ML adoptions.

    I often get the same question when working on ML projects: We have a lot of data – can you help us figure out what insights we can generate using ML? I refer to companies with this question as having a business use case challenge. Not being able to identify business use cases for ML is a very big hurdle for many companies. Without a properly identified business problem and its value proposition and benefit, it becomes difficult to initiate an ML project.

    In my conversations with different companies across industries, data-related challenges emerge as a frequent issue. This includes data quality, data inventory, data accessibility, data governance, and data availability. This problem affects both data-poor and data-rich companies and is often exacerbated by data silos, data security, and industry regulations.

    The shortage of data science and ML talent is another major challenge I have heard about from many companies. Companies, in general, are having a tough time attracting and retaining top ML talent, which is a common problem across all industries. As ML platforms become more complex and the scope of ML projects increases, the need for other ML-related functions starts to surface. Nowadays, in addition to just data scientists, an organization would also need functional roles for ML product management, ML infrastructure engineering, and ML operations management.

    Based on my experiences, I have observed that cultural acceptance of ML-based solutions is another significant challenge for broad adoption. There are individuals who perceive ML as a threat to their job functions, and their lack of knowledge in ML makes them hesitant to adopt these new methods in their business workflows.

    The practice of ML solutions architecture aims to help solve some of the challenges in ML. In the next section, we will explore ML solutions architecture and its role in the ML lifecycle.

    ML solutions architecture

    When I initially worked with companies as an ML solutions architect, the landscape was quite different from what it is now. The focus was mainly on data science and modeling, and the problems at hand were small in scope. Back then, most of the problems could be solved using simple ML techniques. The datasets were small, and the infrastructure required was not too demanding. The scope of the ML initiative at these companies was limited to a few data scientists or teams. As an ML architect at that time, I primarily needed to have solid data science skills and general cloud architecture knowledge to get the job done.

    In more recent years, the landscape of ML initiatives has become more intricate and multifaceted, necessitating involvement from a broader range of functions and personas at companies. My engagement has expanded to include discussions with business executives about ML strategies and organizational design to facilitate the broad adoption of AI/ML throughout their enterprises. I have been tasked with designing more complex ML platforms, utilizing a diverse range of technologies for large enterprises to meet stringent security and compliance requirements. ML workflow orchestration and operations have become increasingly crucial topics of discussion, and more and more companies are looking to train large ML models with enormous amounts of training data. The number of ML models trained and deployed by some companies has skyrocketed to tens of thousands from a few dozen models in just a few years. Furthermore, sophisticated and security-sensitive customers have sought guidance on topics such as ML privacy, model explainability, and data and model bias. As an ML solutions architect, I’ve noticed that the skills and knowledge required to be successful in this role have evolved significantly.

    Trying to navigate the complexities of a business, data, science, and technology landscape can be a daunting task. As an ML solutions architect, I have seen firsthand the challenges that companies face in bringing all these pieces together. In my view, ML solutions architecture is an essential discipline that serves as a bridge connecting the different components of an ML initiative. Drawing on my years of experience working with companies of all sizes and across diverse industries, I believe that an ML solutions architect plays a pivotal role in identifying business needs, developing ML solutions to address these needs, and designing the technology platforms necessary to run these solutions. By collaborating with various business and technology partners, an ML solutions architect can help companies unlock the full potential of their data and realize tangible benefits from their ML initiatives.

    The following figure illustrates the core functional areas covered by the ML solutions architecture:

    Figure 1.3: ML solutions architecture coverage

    In the following sections, we will explore each of these areas in greater detail:

    Business understanding: Business problem understanding and transformation using AI and ML.

    Identification and verification of ML techniques: Identification and verification of ML techniques for solving specific ML problems.

    System architecture of the ML technology platform: System architecture design and implementation of the ML technology platforms.

    MLOps: ML platform automation technical design.

    Security and compliance: Security, compliance, and audit considerations for the ML platform and ML models.

    So, let’s dive in!

    Business understanding and ML transformation

    The goal of the business workflow analysis is to identify inefficiencies in the workflows and determine if ML can be applied to help eliminate pain points, improve efficiency, or even create new revenue opportunities.

    Picture this: you are tasked with improving a call center’s operations. You know there are inefficiencies that need to be addressed, but you’re not sure where to start. That’s where business workflow analysis comes in. By analyzing the call center’s workflows, you can identify pain points such as long customer wait times, knowledge gaps among agents, and the inability to extract customer insights from call recordings. Once you have identified these issues, you can determine what data is available and which business metrics need to be improved. This is where ML comes in. You can use ML to create virtual assistants for common customer inquiries, transcribe audio recordings to allow for text analysis, and detect customer intent for product cross-sell and up-sell. But sometimes, you need to modify the business process to incorporate ML solutions. For example, if you want to use call recording analytics to generate insights for cross-selling or up-selling products, but there’s no established process to act on those insights, you may need to introduce an automated target marketing process or a proactive outreach process by the sales team.

    Identification and verification of ML techniques

    Once you have come up with a list of ML options, the next step is to determine if the assumption behind the ML approach is valid. This could involve conducting simple proof of concept (POC) modeling to validate the available dataset and modeling approach, a technology POC using pre-built AI services, or testing of ML frameworks. For example, you might want to test the feasibility of text transcription from audio files using an existing text transcription service or build a customer propensity model for a new product conversion from a marketing campaign.
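    For instance, a quick technology POC for the audio transcription idea could call a managed transcription service directly. The sketch below uses Amazon Transcribe through boto3, with the job name and S3 locations as placeholder assumptions; it is only meant to show how lightweight such a feasibility check can be:

    import boto3

    transcribe = boto3.client("transcribe")

    # Hypothetical call recording already uploaded to S3
    transcribe.start_transcription_job(
        TranscriptionJobName="call-recording-poc-001",
        LanguageCode="en-US",
        MediaFormat="mp3",
        Media={"MediaFileUri": "s3://my-poc-bucket/calls/sample-call.mp3"},
    )

    # Check the job status (simplified; a real POC would poll and handle errors)
    job = transcribe.get_transcription_job(TranscriptionJobName="call-recording-poc-001")
    print(job["TranscriptionJob"]["TranscriptionJobStatus"])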

    It is worth noting that ML solutions architecture does not focus on developing new machine learning algorithms, a job best suited for applied data scientists or research data scientists. Instead, ML solutions architecture focuses on identifying and applying ML algorithms to address a range of ML problems such as predictive analytics, computer vision, or natural language processing. Also, the goal of any modeling task here is not to build production-quality models but rather to validate the approach for further experimentation by full-time applied data scientists.

    System architecture design and implementation

    The most important aspect of the ML solutions architect’s role is the technical architecture design of the ML platform. The platform will need to provide the technical capability to support the different phases of the ML cycle and personas, such as data scientists and operations engineers. Specifically, an ML platform needs to have the following core functions:

    Data explorations and experimentation: Data scientists use ML platforms for data exploration, experimentation, model building, and model evaluation. ML platforms need to provide capabilities such as data science development tools for model authoring and experimentation, data wrangling tools for data exploration and wrangling, source code control for code management, and a package repository for library package management.

    Data management and large-scale data processing: Data scientists or data engineers will need the technical capability to ingest, store, access, and process large amounts of data for cleansing, transformation, and feature engineering.

    Model training infrastructure management: ML platforms will need to provide model training infrastructure for different model training needs, using different types of computing resources, storage, and networking configurations. They also need to support different ML libraries or frameworks, such as scikit-learn, TensorFlow, and PyTorch (a simplified example of launching a managed training job follows this list).

    Model hosting/serving: ML platforms will need to provide the technical capability to host and serve the model for prediction generation, in real time, in batch, or both.

    Model management: Trained ML models will need to be managed and tracked for easy access and lookup, with relevant metadata.

    Feature management: Common and reusable features will need to be managed and served for model training and model serving purposes.
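    As a heavily simplified illustration of the model training infrastructure function above, a platform built on AWS might expose managed training jobs through the SageMaker Python SDK; the container image URI, IAM role, and S3 paths below are placeholders, and a real platform would typically wrap this behind its own self-service interface:

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<training-container-image-uri>",          # placeholder training container
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder execution role
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-ml-platform-bucket/models/",
    )

    # Launch a managed training job against data staged in S3
    estimator.fit({"train": "s3://my-ml-platform-bucket/train/"})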

    ML platform workflow automation

    A key aspect of ML platform design is workflow automation and continuous integration/continuous deployment (CI/CD), also known as MLOps. ML is a multi-step workflow that needs to be automated, including data processing, model training, model validation, and model hosting. Automated, self-service infrastructure provisioning is another aspect of automation design. Key components of workflow automation include the following:

    Pipeline design and management: The ability to create different automation pipelines for various tasks, such as model training and model hosting.

    Pipeline execution and monitoring: The ability to run different pipelines and monitor the pipeline execution status for the entire pipeline and each
