ML Workflow Steps: Step 2: Building Dataset

ML Workflow Steps

This document explains step 2 of the ML workflow, as introduced in the first document.

Step 2: Building Dataset


The next step in the machine learning process is to build a dataset that can be used to solve your machine
learning-based problem. Understanding the data needed helps you select better models and algorithms so you
can build more effective solutions.

The most important step of the machine learning process


Working with data is perhaps the most overlooked—yet most important—step of the machine learning process.
In 2017, an O’Reilly study showed that machine learning practitioners spend 80% of their time working with
their data.

You can take an entire class just on working with, understanding, and processing data for machine learning
applications. Good, high-quality data is essential for any kind of machine learning project. Let's explore some
of the common aspects of working with data.

Data Collection:
Data collection can be as straightforward as running the appropriate SQL queries or as complicated as building
custom web scraper applications to collect data for your project. You might even have to run a model over
your data to generate needed labels. Here is the fundamental question:

Does the data you've collected match the machine learning task and problem you have defined?
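
As a rough illustration of the "straightforward SQL query" case, the sketch below uses Python's built-in sqlite3 module. The in-memory database, the houses table, and its columns are hypothetical stand-ins for a real data source. The query keeps only rows that have a label, on the assumption that the task is supervised (predicting price).

```python
import sqlite3

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (id INTEGER PRIMARY KEY, sqft REAL, price REAL)")
conn.executemany(
    "INSERT INTO houses (sqft, price) VALUES (?, ?)",
    [(850.0, 120000.0), (1200.0, 185000.0), (1500.0, None), (2100.0, 310000.0)],
)

# Collect only the rows that carry a label (price): a supervised task
# needs both the feature and the target value.
rows = conn.execute(
    "SELECT sqft, price FROM houses WHERE price IS NOT NULL ORDER BY sqft"
).fetchall()
conn.close()

print(rows)
```

Filtering at collection time is only one option. You could also keep the unlabeled rows and generate labels for them later.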
Data Inspection:
The quality of your data will ultimately be the largest factor that affects how well you can expect your model
to perform. As you inspect your data, look for:

• Outliers
• Missing or incomplete values
• Data that needs to be transformed or preprocessed so it's in the correct format to be used by your
model
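
Two of the checks listed above can be sketched with pandas. The column names and values are made up for illustration: one price is missing, and the sqft column has arrived as text and needs converting before a model can use it.

```python
import pandas as pd

# Hypothetical raw records: "sqft" arrived as text and one price is missing.
raw = pd.DataFrame(
    {
        "sqft": ["850", "1200", "1500", "2100"],
        "price": [120000.0, 185000.0, None, 310000.0],
    }
)

# Count missing values in each column.
missing_counts = raw.isna().sum()

# Convert the text column to a numeric type.
clean = raw.copy()
clean["sqft"] = pd.to_numeric(clean["sqft"])

print(missing_counts)
print(clean.dtypes)
```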
Summary Stats:
Models can make assumptions about how your data is structured.

Now that you have some data in hand, it is good practice to check that your data is in line with the
underlying assumptions of the machine learning model that you chose.

Using statistical tools, you can calculate things like the mean, interquartile range (IQR), and standard
deviation. These tools can give you insights into the scope, scale, and shape of a dataset.
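
A minimal sketch of these statistics with NumPy follows. The sample values are invented, and the 1.5 × IQR cutoff is a common rule of thumb for flagging outliers rather than something this lesson prescribes.

```python
import numpy as np

# Hypothetical sample: six similar values and one much larger value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 40.0])

mean = values.mean()
std = values.std(ddof=1)  # sample standard deviation

# The IQR is the distance between the 25th and 75th percentiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"mean={mean:.2f} std={std:.2f} IQR={iqr:.2f} outliers={outliers}")
```

The single large value pulls the mean well above most of the sample, while the quartiles barely move. This is one reason to look at the IQR alongside the mean and standard deviation.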

Data Visualization:
You can use data visualization to see outliers and trends in your data and to help stakeholders understand your
data.

Look at the following two graphs. In the first graph, some data seems to have clustered into different groups.
In the second graph, some data points might be outliers.
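
Plots of this kind can be produced with matplotlib, as in the sketch below. It does not reproduce the lesson's graphs: the data is synthetic, and the group centers, sample sizes, and the planted value of 25 are all assumptions chosen so that the clusters and the outlier are easy to see.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two groups of points centered near (0, 0) and (5, 5).
group_a = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(30, 2))

# Synthetic one-dimensional sample with one value far from the rest.
sample = np.append(rng.normal(loc=10.0, scale=1.0, size=50), 25.0)

fig, (ax_groups, ax_box) = plt.subplots(1, 2, figsize=(9, 4))

# A scatter plot makes groupings visible.
ax_groups.scatter(group_a[:, 0], group_a[:, 1], label="group A")
ax_groups.scatter(group_b[:, 0], group_b[:, 1], label="group B")
ax_groups.set_title("Clustered data")
ax_groups.legend()

# A box plot draws points far outside the quartiles as separate markers.
ax_box.boxplot(sample)
ax_box.set_title("Possible outlier")

fig.tight_layout()
# fig.savefig("data_overview.png")  # uncomment to write the figure to disk
```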
One
You learned that having good data is key to successfully solving the machine learning problem you have
defined.

Two
To build a good dataset, there are four key aspects to consider when working with your data. First, you
need to collect the data. Second, you should inspect your data to check for outliers and missing or incomplete
values, and to see whether any data reformatting is required. Third, you should use summary statistics to
understand the scope, scale, and shape of the dataset. Finally, you should use data visualizations to check for
outliers and to see trends in your data.

Key terms from this lesson:

• Impute is a common term referring to different statistical tools that can be used to calculate missing
values from your dataset.
• Outliers are data points that are significantly different from other data in the same sample.
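
To make the term "impute" concrete, here is one simple approach sketched with pandas: filling a missing value with the median of the observed values. The prices are hypothetical, and median imputation is only one of several possible strategies, so this is an illustration rather than a recommendation.

```python
import pandas as pd

# Hypothetical column with one missing value.
prices = pd.Series([120000.0, 185000.0, None, 310000.0])

# Series.median() ignores missing values by default.
median_price = prices.median()

# Replace the missing entry with the median of the observed values.
imputed = prices.fillna(median_price)

print(imputed.tolist())
```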

Additional reading
• In machine learning, you use several statistics-based tools to better understand your data. The sklearn
library has many examples and tutorials, including an example that demonstrates outlier detection on a
real dataset.
