Machine Learning Infrastructure and Best Practices for Software Engineers: Take your machine learning software from a prototype to a fully fledged software system

About this ebook

Although creating a machine learning pipeline or developing a working prototype of a software system from that pipeline is straightforward nowadays, the journey toward a professional software system is still extensive. This book provides best practices and recipes that help software engineers transform prototype pipelines into complete software products.
The book begins by introducing the main concepts of professional software systems that leverage machine learning at their core. As you progress, you’ll explore the differences between traditional, non-ML software and machine learning software. The initial best practices will guide you in determining the type of software you need for your product. Subsequently, you will delve into algorithms – their selection, development, and testing – before exploring the infrastructure for machine learning systems and defining best practices for identifying the right data sources and ensuring their quality.
Towards the end, you’ll address the most challenging aspect of large-scale machine learning systems – ethics. By exploring and defining best practices for assessing ethical risks and strategies for mitigation, you will conclude the book where it all began – large-scale machine learning software.

Language: English
Release date: Jan 31, 2024
ISBN: 9781837636945


    Book preview

    Machine Learning Infrastructure and Best Practices for Software Engineers - Miroslaw Staron


    Machine Learning Infrastructure and Best Practices for Software Engineers

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Niranjan Naikwadi

    Publishing Product Manager: Yasir Ali Khan

    Book Project Manager: Hemangi Lotlikar

    Senior Editor: Sushma Reddy

    Technical Editor: Kavyashree K S

    Copy Editor: Safis Editing

    Proofreader: Safis Editing

    Indexer: Hemangini Bari

    Production Designer: Gokul Raj S.T

    DevRel Marketing Coordinator: Vinishka Kalra

    First published: January 2024

    Production reference: 1170124

    Published by

    Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK

    ISBN 978-1-83763-406-4

    www.packtpub.com

    Writing a book with a lot of practical examples requires a lot of extra time, which is often taken from family and friends. I dedicate this book to my family – Alexander, Cornelia, Viktoria, and Sylwia – who always supported and encouraged me, and to my parents and parents-in-law, who shaped me to be who I am.

    – Miroslaw Staron

    Contributors

    About the author

    Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner’s Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.

    I would like to thank my family for their support in writing this book. I would also like to thank my colleagues from the Software Center program who provided me with the ability to develop my ideas and knowledge in this area – in particular, Wilhelm Meding, Jan Bosch, Ola Söder, Gert Frost, Martin Kitchen, Niels Jørgen Strøm, and several other colleagues. One person who really ignited my interest in this area is of course Mirosław Mirek Ochodek, to whom I am extremely grateful. I would also like to thank the funders of my research, who supported my studies throughout the years. I would like to thank my Ph.D. students, who challenged me and encouraged me to always dig deeper into the topics. I’m also very grateful to the reviewers of this book – Hongyi Zhang and Sushant K. Pandey, who provided invaluable comments and feedback for the book. Finally, I would like to extend my gratitude to my publishing team – Hemangi Lotlikar, Sushma Reddy, and Anant Jaint – this book would not have materialized without you!

    About the reviewers

    Hongyi Zhang is a researcher at Chalmers University of Technology with over five years of experience in the fields of machine learning and software engineering. Specializing in machine learning, edge/cloud computing, and software engineering, his research merges machine learning theory and software applications, driving tangible improvements in industrial machine learning ecosystems.

    Sushant Kumar Pandey is a dedicated post-doctoral researcher at the Department of CSE, Chalmers at the University of Gothenburg, Sweden, who seamlessly integrates academia with industry, collaborating with Volvo Cars in Gothenburg. Armed with a Ph.D. in CSE from the esteemed Indian Institute of Technology (BHU), India, Sushant specializes in the application of AI in software engineering. His research advances technology’s transformative potential. As a respected reviewer for prestigious venues such as IST, KBS, EASE, and ESWA, Sushant actively contributes to shaping the discourse in his field. Beyond research, he leverages his expertise to mentor students, fostering innovation and excellence in the next generation of professionals.

    Table of Contents

    Preface

    Part 1: Machine Learning Landscape in Software Engineering

    1

    Machine Learning Compared to Traditional Software

    Machine learning is not traditional software

    Supervised, unsupervised, and reinforcement learning – it is just the beginning

    An example of traditional and machine learning software

    Probability and software – how well they go together

    Testing and evaluation – the same but different

    Summary

    References

    2

    Elements of a Machine Learning System

    Elements of a production machine learning system

    Data and algorithms

    Data collection

    Feature extraction

    Data validation

    Configuration and monitoring

    Configuration

    Monitoring

    Infrastructure and resource management

    Data serving infrastructure

    Computational infrastructure

    How this all comes together – machine learning pipelines

    References

    3

    Data in Software Systems – Text, Images, Code, and Their Annotations

    Raw data and features – what are the differences?

    Images

    Text

    Visualization of output from more advanced text processing

    Structured text – source code of programs

    Every data has its purpose – annotations and tasks

    Annotating text for intent recognition

    Where different types of data can be used together – an outlook on multi-modal data models

    References

    4

    Data Acquisition, Data Quality, and Noise

    Sources of data and what we can do with them

    Extracting data from software engineering tools – Gerrit and Jira

    Extracting data from product databases – GitHub and Git

    Data quality

    Noise

    Summary

    References

    5

    Quantifying and Improving Data Properties

    Feature engineering – the basics

    Clean data

    Noise in data management

    Attribute noise

    Splitting data

    How ML models handle noise

    References

    Part 2: Data Acquisition and Management

    6

    Processing Data in Machine Learning Systems

    Numerical data

    Summarizing the data

    Diving deeper into correlations

    Summarizing individual measures

    Reducing the number of measures – PCA

    Other types of data – images

    Text data

    Toward feature engineering

    References

    7

    Feature Engineering for Numerical and Image Data

    Feature engineering

    Feature engineering for numerical data

    PCA

    t-SNE

    ICA

    Locally linear embedding

    Linear discriminant analysis

    Autoencoders

    Feature engineering for image data

    Summary

    References

    8

    Feature Engineering for Natural Language Data

    Natural language data in software engineering and the rise of GitHub Copilot

    What a tokenizer is and what it does

    Bag-of-words and simple tokenizers

    WordPiece tokenizer

    BPE

    The SentencePiece tokenizer

    Word embeddings

    FastText

    From feature extraction to models

    References

    Part 3: Design and Development of ML Systems

    9

    Types of Machine Learning Systems – Feature-Based and Raw Data-Based (Deep Learning)

    Why do we need different types of models?

    Classical machine learning models

    Convolutional neural networks and image processing

    BERT and GPT models

    Using language models in software systems

    Summary

    References

    10

    Training and Evaluating Classical Machine Learning Systems and Neural Networks

    Training and testing processes

    Training classical machine learning models

    Understanding the training process

    Random forest and opaque models

    Training deep learning models

    Misleading results – data leaking

    Summary

    References

    11

    Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders

    From classical ML to GenAI

    The theory behind advanced models – AEs and transformers

    AEs

    Transformers

    Training and evaluation of a RoBERTa model

    Training and evaluation of an AE

    Developing safety cages to prevent models from breaking the entire system

    Summary

    References

    12

    Designing Machine Learning Pipelines (MLOps) and Their Testing

    What ML pipelines are

    ML pipelines

    Elements of MLOps

    ML pipelines – how to use ML in the system in practice

    Deploying models to HuggingFace

    Downloading models from HuggingFace

    Raw data-based pipelines

    Pipelines for NLP-related tasks

    Pipelines for images

    Feature-based pipelines

    Testing of ML pipelines

    Monitoring ML systems at runtime

    Summary

    References

    13

    Designing and Implementing Large-Scale, Robust ML Software

    ML is not alone

    The UI of an ML model

    Data storage

    Deploying an ML model for numerical data

    Deploying a generative ML model for images

    Deploying a code completion model as an extension

    Summary

    References

    Part 4: Ethical Aspects of Data Management and ML System Development

    14

    Ethics in Data Acquisition and Management

    Ethics in computer science and software engineering

    Data is all around us, but can we really use it?

    Ethics behind data from open source systems

    Ethics behind data collected from humans

    Contracts and legal obligations

    References

    15

    Ethics in Machine Learning Systems

    Bias and ML – is it possible to have an objective AI?

    Measuring and monitoring for bias

    Other metrics of bias

    Developing mechanisms to prevent ML bias from spreading throughout the system

    Summary

    References

    16

    Integrating ML Systems in Ecosystems

    Ecosystems

    Creating web services over ML models using Flask

    Creating a web service using Flask

    Creating a web service that contains a pre-trained ML model

    Deploying ML models using Docker

    Combining web services into ecosystems

    Summary

    References

    17

    Summary and Where to Go Next

    To know where we’re going, we need to know where we’ve been

    Best practices

    Current developments

    My view on the future

    Final remarks

    References

    Index

    Other Books You May Enjoy

    Preface

Machine learning has gained a lot of popularity in recent years. The introduction of large language models such as GPT-3 and GPT-4 has only accelerated the field’s development. These large language models have become so powerful that it is almost impossible to train them on a local computer. However, this is not necessary at all: these models can be used to create new tools without any additional training, because they can be steered through the context window and the prompt.

    In this book, my goal is to show how machine learning models can be trained, evaluated, and tested – both in the context of a small prototype and in the context of a fully-fledged software product. The primary objective of this book is to bridge the gap between theoretical knowledge and practical implementation of machine learning in software engineering. It aims to equip you with the skills necessary to not only understand but also effectively implement and innovate with AI and machine learning technologies in your professional pursuits.

    The journey of integrating machine learning into software engineering is as thrilling as it is challenging. As we delve into the intricacies of machine learning infrastructure, this book serves as a comprehensive guide, navigating through the complexities and best practices that are pivotal for software engineers. It is designed to bridge the gap between the theoretical aspects of machine learning and the practical challenges faced during implementation in real-world scenarios.

    We begin by exploring the fundamental concepts of machine learning, providing a solid foundation for those new to the field. As we progress, the focus shifts to the infrastructure – the backbone of any successful machine learning project. From data collection and processing to model training and deployment, each step is crucial and requires careful consideration and planning.

A significant portion of the book is dedicated to best practices. These practices are not just theoretical guidelines; they are derived from real-life experiences and case studies from my research team’s work in this field. These best practices offer invaluable insights into handling common pitfalls and ensuring the scalability, reliability, and efficiency of machine learning systems.

    Furthermore, we delve into the ethics of data and machine learning algorithms. We explore the theories behind ethics in machine learning, look closer into the licensing of data and models, and finally, explore the practical frameworks that can quantify bias in data and models in machine learning.

    This book is not just a technical guide; it is a journey through the evolving landscape of machine learning in software engineering. Whether you are a novice eager to learn, or a seasoned professional seeking to enhance your skills, this book aims to be a valuable resource, providing clarity and direction in the exciting and ever-changing world of machine learning.

    Who this book is for

    This book is meticulously crafted for software engineers, computer scientists, and programmers who seek practical applications of artificial intelligence and machine learning in their field. The content is tailored to impart foundational knowledge on working with machine learning models, viewed through the lens of a programmer and system architect.

The book presupposes familiarity with programming principles, but it does not demand expertise in mathematics or statistics. This approach ensures accessibility to a broader range of professionals and enthusiasts in the software development domain. If you have no prior experience with Python, you will need to pick up the basics of the language, but the material is structured to help you grasp the essentials quickly and comprehensively. Conversely, if you are proficient in Python but not yet seasoned in professional programming, this book serves as a valuable resource for transitioning into the realm of software engineering with a focus on AI and ML applications.

    What this book covers

    Chapter 1

, Machine Learning Compared to Traditional Software, explores where these two types of software systems are most appropriate. We learn about the software development processes that programmers use to create both types of software, and we also learn about the four classical types of machine learning software – rule-based, supervised, unsupervised, and reinforcement learning. Finally, we learn about the different roles of data in traditional and machine learning software.

    Chapter 2

    , Elements of a Machine Learning System, reviews each element of a professional machine learning system. We start by understanding which elements are important and why. Then, we explore how to create such elements and how to work by putting them together into a single machine learning system – the so-called machine learning pipeline.

    Chapter 3

, Data in Software Systems – Text, Images, Code, and Their Annotations, introduces three data types – images, texts, and formatted text (program source code). We explore how each of these types of data can be used in machine learning, how they should be annotated, and for what purpose. Introducing these three types of data also lets us explore different ways of annotating them.

    Chapter 4

    , Data Acquisition, Data Quality, and Noise, dives deeper into topics related to data quality. We go through a theoretical model for assessing data quality and we provide methods and tools to operationalize it. We also look into the concept of noise in machine learning and how to reduce it by using different tokenization methods.

    Chapter 5

, Quantifying and Improving Data Properties, dives deeper into the properties of data and how to improve them. In contrast to the previous chapter, we work on feature vectors rather than raw data. The feature vectors are already a transformation of the data; therefore, we can change such properties as noise or even change how the data is perceived. We focus on the processing of text, which is an important part of many machine learning algorithms nowadays. We start by understanding how to transform data into feature vectors using simple algorithms, such as bag of words.
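As a quick illustration of this kind of transformation, the following sketch turns two short documents into bag-of-words feature vectors using scikit-learn’s CountVectorizer; the example sentences and variable names are illustrative and not taken from the chapter:

# A minimal bag-of-words sketch using scikit-learn (illustrative data, not from the book)
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the build failed after the last commit",
    "the last commit fixed the failing test",
]

vectorizer = CountVectorizer()
feature_vectors = vectorizer.fit_transform(documents)  # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # vocabulary learned from the documents
print(feature_vectors.toarray())           # each column counts one word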

    Chapter 6

, Processing Data in Machine Learning Systems, dives deeper into the ways in which data and algorithms are entangled. We talk a lot about data in generic terms, but in this chapter, we explain what kind of data is needed in machine learning systems. We explain that all kinds of data are ultimately used in numerical form – either as a feature vector or as more complex feature matrices. Then, we explain the need to transform unstructured data (e.g., text) into structured data. This chapter lays the foundations for going deeper into each type of data, which is the content of the next few chapters.

    Chapter 7

, Feature Engineering for Numerical and Image Data, focuses on the feature engineering process for numerical and image data. We start by going through typical methods such as Principal Component Analysis (PCA), which we used previously for visualization. We then move on to more advanced methods such as t-distributed Stochastic Neighbor Embedding (t-SNE) and Independent Component Analysis (ICA). We end with the use of autoencoders as a dimensionality reduction technique for both numerical and image data.
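To give a flavor of these techniques, here is a minimal PCA sketch with scikit-learn on synthetic data; the chapter itself works with the book’s own datasets:

# A minimal PCA sketch with scikit-learn (synthetic data, for illustration only)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))        # 100 data points, 10 numerical features

pca = PCA(n_components=2)             # keep the two strongest components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # how much variance each component explains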

    Chapter 8

    , Feature Engineering for Natural Language Data, explores the first steps that made the transformer (GPT) technologies so powerful – feature extraction from natural language data. Natural language is a special kind of data source in software engineering. With the introduction of GitHub Copilot and ChatGPT, it became evident that machine learning and artificial intelligence tools for software engineering tasks are no longer science fiction.
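As a small taste of what tokenizers do, the following sketch loads a pre-trained WordPiece tokenizer via the Hugging Face transformers library; the bert-base-uncased checkpoint is an illustrative choice (it is downloaded on first use) and is not necessarily the one used in the chapter:

# A minimal tokenization sketch using Hugging Face transformers
# (the bert-base-uncased checkpoint is an illustrative choice, downloaded on first use)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "def fibRec(n): return n if n < 2 else fibRec(n-1) + fibRec(n-2)"
tokens = tokenizer.tokenize(text)      # WordPiece sub-word tokens
ids = tokenizer.encode(text)           # token IDs, including special tokens

print(tokens)
print(ids)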

    Chapter 9

, Types of Machine Learning Systems – Feature-Based and Raw Data-Based (Deep Learning), explores different types of machine learning systems. We start from classical machine learning models such as random forest and move on to convolutional and GPT models, which are known as deep learning models because they are composed of many layers. They use raw data as input, and their first layers perform feature extraction; they are also designed to progressively learn more abstract features as the input data moves through the model. This chapter demonstrates each of these types of models and progresses from classical machine learning to generative AI models.

    Chapter 10

, Training and Evaluating Classical Machine Learning Systems and Neural Networks, goes a bit deeper into the process of training and evaluation. We start with the basic theory behind different algorithms and then show how they are trained. We begin with classical machine learning models, exemplified by decision trees. Then, we gradually move toward deep learning, where we explore both dense neural networks and some more advanced types of networks.

    Chapter 11

, Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders, explores how generative AI models based on GPT and Bidirectional Encoder Representations from Transformers (BERT) work. These models are designed to generate new data based on the patterns that they were trained on. We also look at the concept of autoencoders, training an autoencoder to generate new images based on the data it was trained on.
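As an illustration of the autoencoder idea, here is a minimal PyTorch sketch that learns to reconstruct its input; the layer sizes, the random batch, and the single training step are placeholders rather than the chapter’s actual setup:

# A minimal autoencoder sketch in PyTorch (architecture and sizes are illustrative)
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)                # a batch of flattened images (random, for illustration)
optimizer.zero_grad()
reconstruction = model(x)
loss = loss_fn(reconstruction, x)      # the model learns to reconstruct its own input
loss.backward()
optimizer.step()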

    Chapter 12

, Designing Machine Learning Pipelines (MLOps) and Their Testing, describes how the main goal of MLOps is to bridge the gap between data science and operations teams, fostering collaboration and ensuring that machine learning projects can be effectively and reliably deployed at scale. MLOps helps to automate and optimize the entire machine learning life cycle, from model development to deployment and maintenance, thus improving the efficiency and effectiveness of ML systems in production. In this chapter, we learn how machine learning systems are designed and operated in practice. The chapter shows how pipelines are turned into a software system, with a focus on testing ML pipelines and their deployment at Hugging Face.
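To hint at what such a pipeline and its testing can look like in code, here is a minimal feature-based scikit-learn pipeline with a simple smoke test; the synthetic data and the assertion are illustrative, not the chapter’s actual tests:

# A minimal feature-based pipeline sketch with a simple smoke test
# (synthetic data; a real pipeline would load and validate production data)
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X, y)

# Smoke test: the pipeline must accept a single sample and return a valid class
prediction = pipeline.predict(X[:1])
assert prediction[0] in (0, 1)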

    Chapter 13

, Designing and Implementing Large-Scale, Robust ML Software, explains how to integrate a machine learning model with a graphical user interface programmed in Gradio and with storage in a database. We use two examples of machine learning pipelines – the defect prediction model from the previous chapters and a generative AI model that creates pictures from a natural language prompt.

    Chapter 14

    , Ethics in Data Acquisition and Management, starts by exploring a few examples of unethical systems that show bias, such as credit ranking systems that penalize certain minorities. We also explain the problems with using open source data and revealing the identities of subjects. The core of the chapter, however, is the explanation and discussion on ethical frameworks for data management and software systems, including the IEEE and ACM codes of conduct.

    Chapter 15

, Ethics in Machine Learning Systems, focuses on bias in machine learning systems. We start by exploring and briefly discussing the sources of bias. We then explore ways to spot biases, how to minimize them, and finally, how to communicate potential biases to the users of our system.
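As one concrete example of spotting bias, the following sketch computes the demographic parity difference between two groups on synthetic predictions; the chapter may rely on different metrics and tooling:

# A minimal sketch of one common bias metric, demographic parity difference
# (synthetic predictions; illustrative only)
import numpy as np

predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0])                  # model decisions (1 = positive outcome)
group       = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # protected attribute

rate_a = predictions[group == "A"].mean()   # positive rate for group A
rate_b = predictions[group == "B"].mean()   # positive rate for group B

print(abs(rate_a - rate_b))                 # 0.0 would indicate demographic parity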

    Chapter 16

, Integrating ML Systems in Ecosystems, explains how packaging ML systems into web services allows us to integrate them into workflows in a very flexible way. Instead of compiling or using dynamically linked libraries, we can deploy machine learning components that communicate over HTTP using JSON payloads. In fact, we have already seen this approach when using the GPT-3 model hosted by OpenAI. In this chapter, we explore the possibility of creating our own Docker container with a pre-trained machine learning model, deploying it, and integrating it with other components.
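The following sketch shows the general shape of such a web service using Flask; the model file name, route, and payload format are assumptions for illustration, not the book’s exact code:

# A minimal Flask web service around a pre-trained model
# (model.joblib, the /predict route, and the payload format are assumed for illustration)
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")   # a previously trained and serialized model (assumed to exist)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                       # e.g., {"features": [0.1, 0.2, 0.3]}
    prediction = model.predict([payload["features"]])  # wrap the single sample in a list
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

A client could then send an HTTP POST request with a JSON body such as {"features": [0.1, 0.2, 0.3]} to the /predict endpoint and receive the prediction as JSON.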

    Chapter 17

    , Summary and Where to Go Next, revisits all the best practices and summarizes them per chapter. In addition, we also look into what the future of machine learning and AI may bring to software engineering.

    To get the most out of this book

    In this book, we use Python and PyTorch, so you need to have these two installed on your system. I used them on Windows and Linux, but they can also be used in cloud environments such as Google Colab or GitHub Codespaces (both were tested).

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Machine-Learning-Infrastructure-and-Best-Practices-for-Software-Engineers. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The model itself is created one line above, in the model = LinearRegression() line.

    A block of code is set as follows:

def fibRec(n):
    if n < 2:
        return n
    else:
        return fibRec(n-1) + fibRec(n-2)

    Any command-line input or output is written as follows:

    >python app.py

    Best practices

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share Your Thoughts

Once you’ve read Machine Learning Infrastructure and Best Practices for Software Engineers, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    Download a free PDF copy of this book

    https://packt.link/free-ebook/978-1-83763-406-4

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1: Machine Learning Landscape in Software Engineering

Traditionally, Machine Learning (ML) was considered to be a niche domain in software engineering. No large software systems used statistical learning in production. This began to change in the 2010s, when recommendation systems started to utilize large quantities of data – for example, to recommend movies, books, or music. With the rise of transformer technologies, the change accelerated. Widely known products such as ChatGPT popularized these techniques and showed that they are no longer niche, but part of mainstream software products and services. Software engineering needs to keep up, and we need to know how to create software based on these modern machine learning models. In this first part of the book, we look at how machine learning changes software development and how we need to adapt to these changes.

    This part has the following chapters:

    Chapter 1

    , Machine Learning Compared to Traditional Software

    Chapter 2

    , Elements of a Machine Learning System

    Chapter 3

, Data in Software Systems – Text, Images, Code, and Their Annotations

    Chapter 4

    , Data Acquisition, Data Quality, and Noise

    Chapter 5

    , Quantifying and Improving Data Properties

    1

    Machine Learning Compared to Traditional Software

Machine learning software is a special kind of software that finds patterns in data, learns from them, and even recreates these patterns on new data. Developing machine learning software is, therefore, focused on finding the right data, matching it with the appropriate algorithm, and evaluating its performance. Traditional software, by contrast, is developed with the algorithm in mind. Based on software requirements, programmers develop algorithms that solve specific tasks and then test them. Data is secondary, although not completely unimportant. Both types of software can co-exist in the same software system, but the programmer must ensure compatibility between them.

    In this chapter, we’ll explore where these two types of software systems are most appropriate. We’ll learn about the software development processes that programmers use to create both types of software. We’ll also learn about the four classical types of machine learning software – rule-based learning, supervised learning, unsupervised learning, and reinforcement learning. Finally, we’ll learn about the different roles of data in traditional and machine learning software – as input to pre-programmed algorithms in traditional software and input to training models in machine learning software.

    The best practices introduced in this chapter provide practical guidance on when to choose each type of software and how to assess the advantages and disadvantages of these types. By exploring a few modern examples, we’ll understand how to create an entire software system with machine learning algorithms at the center.

    In this chapter, we’re going to cover the following main topics:

Machine learning is not traditional software

Probability and software – how well they go together

Testing and evaluation – the same but different

    Machine learning is not traditional software

Although machine learning and artificial intelligence have been around since the 1950s, when Alan Turing introduced the idea, they only became popular with early expert systems such as MYCIN, and our understanding of machine learning systems has changed over time. It was not until the 2010s that we started to perceive, design, and develop machine learning in the same way as we do today (in 2023). In my view, two pivotal moments shaped the landscape of machine learning as we see it today.

The first pivotal moment was the focus on big data in the late 2000s and early 2010s. With the introduction of smartphones, companies started to collect and process increasingly large quantities of data, mostly about our behavior online. One of the companies that perfected this was Google, which collected data about our searches, online behavior, and usage of Google’s operating system, Android. As the volume and velocity of the collected data increased, so did its value and the need for its veracity – together with variety, these make up the five Vs. These five Vs – volume, velocity, value, veracity, and variety – required a new approach to working with data. The classical approach of relational (SQL) databases was no longer sufficient. Relational databases became too slow at handling high-velocity data streams, which gave way to map-reduce algorithms, distributed databases, and in-memory databases. Relational schemas became too constraining for the variety of data, which gave way to NoSQL databases that store documents.

The second pivotal moment was the rise of modern machine learning algorithms – deep learning. Deep learning algorithms are designed to handle unstructured data such as text, images, or music (compared to structured data in the form of tables and matrices). Classical machine learning algorithms, such as regression, decision trees, or random forest, require data in a tabular form. Each row is a data point, and each column is one feature of that data point.
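As a minimal illustration of this tabular form, the following sketch builds a small feature table and trains a classical model on it; the column names and values are made up for the example:

# Classical models expect tabular data: rows are data points, columns are features
# (illustrative data, not from the book)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "lines_of_code": [120, 340, 56, 980],
    "num_commits":   [3, 12, 1, 40],
    "has_defect":    [0, 1, 0, 1],
})

X = data[["lines_of_code", "num_commits"]]   # feature columns
y = data["has_defect"]                       # label column

model = DecisionTreeClassifier().fit(X, y)
print(model.predict(X.iloc[:1]))             # predict for the first row (one data point)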
