Data Cleaning With Power BI (2024)
Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information contained
in this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing or its dealers and distributors, will be held
liable for any damages caused or alleged to have been caused directly or
indirectly by this book.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80512-640-9
To my incredible wife, whose unwavering support and love sustained me
throughout the writing of this book, even as we welcomed our son,
Maximus, into the world. I am grateful for your patience and
encouragement.
Also, to the skilled team at Packt for their guidance and expertise in
shaping this book.
– Gus Frazer
Contributors
Preface
Part 1 – Introduction and Fundamentals
Technical requirements
Understanding DAX
Summary
Questions
Summary
Questions
Data assessment
Data profiling
Data validation
Data transformations
Documentation
Summary
Questions
Technical requirements
Removing duplicates
Splitting columns
Merging columns
Replacing values
Calculated columns
Measures
Calculation group
Considerations
Summary
Questions
Part 2 – Data Import and Query Editor
Technical requirements
Summary
Questions
6
Cleaning Data with Query Editor
Technical requirements
Adding columns
Date/time
Rounding
Pivot/unpivot columns
Merge queries
Workflow
Summary
Questions
Further reading
7
Technical requirements
Structure of M
Summary
Questions
Understanding EDA
Exploring data profiling features in Power BI
Column distribution
Column quality
Column profile
Value distribution
Summary
Questions
Part 3 – Advanced Data Cleaning and Optimizations
Technical requirements
Fuzzy matching
Fill down
Summary
Questions
10
Identifying parameters
Using parameters
Types of parameters
Defining parameters
Writing M code
Summary
Questions
11
M Query Optimization
Technical requirements
Summary
Questions
Further reading
12
Data Modeling and Managing Relationships
Dimensional modeling
Snowflake schema
Intermediate tables
Role-playing dimensions
Aggregating tables
Incremental refreshes
Understanding cardinality
Summary
Questions
Further reading
Part 4 – Paginated Reports, Automations, and OpenAI
13
Preparing Data for Paginated Reporting
Technical requirements
Data preparation
Query
Fields
Options
Filters
Parameters
Summary
Questions
14
Automating Data Cleaning Tasks with Power Automate
Technical requirements
Automating notifications
Summary
Questions
Further reading
15
Making Life Easier with OpenAI
Summary
Questions
Further reading
Putting it together
Assessments
Index
Other Books You May Enjoy
Preface
This book is not just a manual; it is a roadmap to demystify the art of data
cleaning in Power BI. It goes beyond the technicalities, instilling
confidence in you to embark on your data-cleaning journey with assurance.
In an era where data reigns supreme, this guide is not just about learning the
tools; it’s about conquering the challenges that often stifle progress. By the
time you reach the final chapters, the synergy of your newfound knowledge
and the innovative collaboration with OpenAI and ChatGPT will redefine
your approach to data cleaning, making it an intuitive and empowering
experience.
Who this book is for
Ideally, if you have a basic knowledge of BI tools and concepts, then this
book will help you advance your skills in Power BI.
What this book covers
Chapter 4, The Most Common Data Cleaning Operations, teaches you how
to identify and tackle the most common data challenges/corrections. You
will get hands-on as you walk through examples of carrying out the
cleaning steps.
Chapter 5, Importing Data into Power BI, explores the six main
considerations when importing data for analysis in Power BI, which include
metrics that matter the most when identifying how clean your data is.
Chapter 13, Preparing Data for Paginated Reporting, provides you with a
hands-on crash course into the world of paginated reports. It will guide you
through examples of how you can prepare your data for use in Power BI
Report Builder.
Chapter 14, Automating Data Cleaning Tasks with Power Automate, gives
an overview of Power Automate, which is often seen as a great tool and ally
in the Power tools kitbag alongside Power BI. With more and more Power BI
analysts and Microsoft customers beginning to use the other features of the
Microsoft Power tools, this chapter gives you an understanding of how you
might use Power Automate to help with cleaning your data.
Chapter 15, Making Life Easier with OpenAI, provides insight into how
OpenAI and tools such as ChatGPT and Copilot are improving the way we
work with data. It also provides context and examples of how you can
potentially use these technologies to get ahead.
This hands-on guide provides you with a strong foundation of best practices
and practical tips for data cleaning in Power BI. With each chapter, you can
follow along with real-world examples using a test dataset, gaining hands-
on skills and building confidence in your ability to use DAX, Power Query,
and other key tools.
Here is the key software that you will need throughout the book:
Power BI Service
Power Automate
Python
If you are using the digital version of this book, we advise you to type
the code yourself or access the code from the book’s GitHub repository
(a link is available in the next section). Doing so will help you avoid any
potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at
https://github.com/PacktPublishing/Data-Cleaning-with-Power-BI. If
there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos
available at https://github.com/PacktPublishing/. Check them out!
Conventions used
Get in touch
General feedback: If you have questions about any aspect of this book,
email us at [email protected] and mention the book title
in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our
content, mistakes do happen. If you have found a mistake in this book, we
would be grateful if you would report this to us. Please visit
www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on
the internet, we would be grateful if you would provide us with the location
address or website name. Please contact us at [email protected]
with a link to the material.
If you are interested in becoming an author: If there is a topic that you
have expertise in and you are interested in either writing or contributing to a
book, please visit authors.packtpub.com.
Once you’ve read Data Cleaning with Power BI, we’d love to hear your
thoughts! Please click here to go straight to the Amazon review page for
this book and share your feedback.
Your review is important to us and the tech community and will help us
make sure we’re delivering excellent quality content.
Do you like to read on the go but are unable to carry your print books
everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version
of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code
from your favorite technical books directly into your application.
The perks don’t stop there. You can get exclusive access to discounts,
newsletters, and great free content in your inbox daily:
https://packt.link/free-ebook/9781805126409
Although not definitive, it’s generally accepted that cleaning and preparing
data can often account for as much as 50-80% of the overall time spent on a
data visualization project. Power BI provides you with some great tools to
carry this out, so we will dive deeper into what is available in this chapter.
After this chapter, you will understand with confidence which tools are
available within Power BI to help prepare your data for analysis, how to
navigate around Power Query, and what to consider when getting started
with preparing your data.
Technical requirements
Please ensure you have installed Power BI Desktop on your device so that
you can follow along with the instructions and navigation provided in the
chapter.
The Power Query Editor is a powerful tool that allows you to clean and
transform data. It provides a user-friendly interface to perform various data
transformation tasks, such as splitting columns, merging tables, filtering
data, and removing duplicates. It also has several built-in functions to help
you transform your data, such as date and text transformations.
This UI is the hub for cleaning and preparing data within Power BI. It
allows users such as yourself to connect to a wide range of data sources and
apply transformations within the UI. As you begin to clean and prepare
data, Power Query then tracks the steps of your cleaning process.
The language applied behind the scenes when carrying out cleaning steps is
called M. As you interact with the UI within Power Query, Power BI
generates code in the M language to perform the transformations. We will dive
deeper into how you can use this language directly using the Advanced Editor later.
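To make this concrete, here is a minimal sketch of the kind of M that Power Query records as you click through the UI; the file path, step names, and column names here are hypothetical:

```m
let
    // Each UI action is recorded as one named step
    Source = Csv.Document(File.Contents("C:\Data\Products.csv"), [Delimiter = ",", Encoding = 65001]),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    ChangedType = Table.TransformColumnTypes(PromotedHeaders, {{"Price", type number}}),
    // Selecting Remove Duplicates in the UI is recorded as Table.Distinct
    RemovedDuplicates = Table.Distinct(ChangedType),
    // Filtering rows in the UI is recorded as Table.SelectRows
    FilteredRows = Table.SelectRows(RemovedDuplicates, each [Price] > 0)
in
    FilteredRows
```

Each transformation you apply in the UI appends another named step like these, which is exactly the list you see in the Applied Steps pane.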
It’s important to know that there are two ways to access Power Query from
within Power BI. The first is through online experiences such as dataflows,
and the second is within the Power BI Desktop application. It’s important to
note, though, that the techniques learned in this book will allow you to use
Power Query within tools such as Excel, Power Apps, Power Automate,
Azure Data Factory, SQL Server Integration Services (SSIS), SQL
Server Analysis Services (SSAS), and Customer Insights.
Understanding DAX
DAX is a formula and query language that plays a pivotal role in Power BI,
helping users of Power BI to perform complex calculations and analysis on
their data. It’s a language created by Microsoft for their suite of products
and was first introduced in 2009 along with Power Pivot for Excel,
something that was then also incorporated into Power BI. Helping to create
and define custom calculations and formulas goes beyond the capabilities of
traditional Excel functions.
Interestingly, it originated from the need to bridge the gap between
relational database systems and traditional spreadsheet tools. It lowers the
barrier for business analysts who may not be SQL experts by providing a
more user-friendly formula language, which is why DAX has been designed
to work with tabular data models. Microsoft recognized the limitations of
Excel at handling large sets of data and complex calculations, and this led
them to develop DAX, which could handle these with similar
spreadsheet-like expressions. For example, the commonly used CALCULATE
function in DAX is a super-charged version of Excel’s SUMIF function.
Leading on from this, DAX can often have strong implications for the
performance of your reports depending on how well you’ve used it,
especially if the model contains very large amounts of data. Later in this
book, you will learn techniques to optimize DAX on real-world
examples for optimal performance.
Functions: Built in to help you carry out calculations, and often
categorized into different types such as mathematical, statistical, text,
and time intelligence (TI) functions.
Operators: Similar to functions, there are a number of operators used
within common expressions to combine values, such as arithmetic
operators (+, -, *, /), comparison operators (=, <, >, <>,
>=, <=), logical operators (AND, OR, NOT), and more.
Column, measure, and table references: Your expressions will often
call on/refer to data within your data model. Columns are referenced
using the table name followed by the column name in square brackets
(for example, Sales[Amount]), measures are referenced using just square
brackets ([]), and tables are referenced using single quotes (' ')
around their table name.
TIP
We would use the following expression to get the desired results with total
sales by category:
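One hedged sketch of such an expression, assuming a hypothetical Sales table with Category and SalesAmount columns, might look like this:

```dax
Total Sales by Category =
CALCULATE (
    SUM ( Sales[SalesAmount] ),
    ALLEXCEPT ( Sales, Sales[Category] )
)
```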
We will dive deeper into what these functions do later in the book as both
CALCULATE and ALLEXCEPT are key functions to learn. However, you
can see the preceding code includes functions, operators, and table
references.
Where do we begin with data?
As you progress through this book, you will learn how to use these
technologies together to clean and prepare your data for performant data
visualization. However, before diving into some examples and learning how
to actually carry out these transformations, it’s important you pick up a few
best practices on what you should consider before getting started.
Key elements to consider here are what is meant when we say data quality,
why it is important (outside of the obvious reasons), who’s responsible for
it, and how to plan for this data preparation.
Summary
Questions
Data is all around us, and so, subsequently, data quality is also all around us.
If you work in the data space, then you have definitely encountered
data quality issues.
In the world of data analysis and business intelligence (BI), data is the
foundation upon which insights and decisions are made. However, the
quality of the data we work with can greatly impact the accuracy and
reliability of our analyses.
In this chapter, we will explore factors that affect data quality and delve into
why data cleaning is a crucial step in the data preparation process. You will
learn key concepts to ensure the data you work with is clean and accurate
for the analysis you’re looking to carry out. In addition to this, you will also
learn best practices that you can implement within your own business.
After completing this chapter, you will understand with confidence the
factors that contribute to data quality issues. This is critical for those just
starting on their data journey but equally important to review if you’ve been
working with data for many years. These factors will also be referenced in
later chapters, such as Chapter 8, Using Data Profiling for Exploratory
Data Analysis (EDA).
Firstly, before diving into how you can leverage Power BI to clean your
data, it’s important to understand some key basics of what will affect your
data quality.
In general, several factors describe and make up the quality of a dataset for
analysis, which we will dive into further in the following list:
Data accuracy: The extent to which data represents the true values and
attributes it is intended to capture, that is, the degree to which it
aligns with the real-world information it seeks to represent. Factors
such as human errors during data entry, system glitches, or outdated
information can compromise data accuracy.
Data completeness: This describes the degree to which all required data
elements are present in a dataset. Missing or incomplete data can occur
due to data collection errors, system limitations, or data integration
challenges.
Data consistency: The uniformity and coherence of data across different
sources or datasets. Inconsistencies may arise from variations in data
formats, naming conventions, or conflicting data definitions.
Data validity: Refers to the extent to which data conforms to defined
rules, constraints, or standards. Invalid data can result from data entry
errors, data integration issues, or changes in business rules.
Data timeliness: The relevance and currency of data in relation to the
analysis or reporting timeframe. Outdated or stale data can lead to
inaccurate insights and hinder decision-making.
Data entry errors: Mistakes made during manual data entry processes
can introduce errors such as typos, misspellings, or incorrect values.
Human error, lack of training, or inadequate validation mechanisms can
contribute to data entry issues.
Incomplete or missing data: Data may be incomplete or have missing
values due to various reasons, such as data collection processes that fail
to capture all required information, data entry omissions, or system
limitations that prevent data collection.
Data integration challenges: When combining data from multiple
sources or systems, inconsistencies can arise due to differences in data
formats, naming conventions, or data structures. Mismatched or
incompatible data elements can lead to data quality issues.
Data transformation and manipulation: Data transformations, such as
aggregations, calculations, or data conversions, can introduce errors if
not implemented correctly. Issues can arise from improper formulas,
incorrect assumptions, or errors in the data manipulation process.
Data storage and transfer: Unreliable storage systems may lead to data
corruption, loss, or unauthorized access, impacting data quality. Events
such as hardware failures or system crashes can result in data loss,
affecting completeness and accuracy. Delays in data transfer may lead to
latency issues, with outdated or stale data impacting the accuracy of
analyses. Lastly, incompatibility between systems during data transfer
can cause format mismatches and structural issues, introducing
inaccuracies.
Data governance and documentation: Inadequate data governance
practices, including a lack of data standards, data definitions, or
metadata documentation, can lead to misunderstandings or
misinterpretations of data, resulting in data quality problems. Poor
documentation of data lineage makes it challenging to trace the origin of
quality issues and prevent future problems.
Data changes and updates: As data evolves over time, changes in
business rules, system updates, or modifications to data sources can
impact data quality. Data may become outdated, inconsistent, or no
longer aligned with the intended use.
External data sources: When incorporating data from external sources,
such as third-party providers or open datasets, data quality issues may
arise. Inaccurate or unreliable data from external sources can affect the
overall data quality.
Data cleaning plays a vital role in maintaining data integrity and accuracy.
Inaccurate or inconsistent data can lead to flawed analysis, flawed decision-
making, and potential business risks. By recognizing data cleaning as a
shared responsibility, all individuals working with data can contribute to
maintaining the integrity of the data they generate, use, or interact with.
Data cleaning is not just about rectifying existing issues; it also involves
proactively identifying and addressing data quality issues. Employees on
the front lines of data collection, entry, and analysis are often the first to
notice anomalies or inconsistencies. By encouraging a culture where data
issues are shared and addressed promptly, organizations can mitigate the
impact of poor data quality before it cascades into larger problems.
Of course, this book will delve deep into how you can actually clean your
data with Power BI, but it wouldn’t be responsible of us not to provide
some insight into implementing best practices to prevent dirty data.
Define clear data quality standards that align with your organization’s goals
and objectives. These standards should include criteria for accuracy,
completeness, consistency, validity, and timeliness, as discussed earlier in
this chapter.
Summary
In this chapter, we explored factors that affect data quality and why data
cleaning is crucial in the data preparation process. We discussed the
importance of understanding data quality standards and the impact of data
accuracy, completeness, consistency, validity, and timeliness on analyses
and decision-making. We also identified common sources of data quality
issues, such as data entry errors, incomplete or missing data, data
integration challenges, data transformation and manipulation, data storage
and transfer issues, data governance and documentation gaps, data changes
and updates, and external data sources.
In the next chapter, we will cover fundamental concepts and key principles
that form the backbone of effective data cleaning practices. You will learn
essential knowledge and processes to confidently tackle the challenges of
dirty data and transform it into reliable, accurate, and actionable
information.
Questions
In this chapter, we will delve into the fundamental concepts and key
principles that form the backbone of effective data cleaning practices, with
the aim of sharing essential knowledge and processes to confidently tackle
the challenges of dirty data and transform it into reliable, accurate, and
actionable information.
As the previous chapter introduced, poor data quality can lead to people like
yourself needing to clean data before it can be analyzed. Data cleaning is
an indispensable step in the data preparation process, ensuring that the data
we work with is trustworthy, consistent, and fit for analysis. It involves
identifying and rectifying errors, inconsistencies, duplicates, missing
values, and other data anomalies that can hinder the reliability and validity
of our analyses. By implementing sound data cleaning practices, you can
enhance data quality, improve decision-making, and unlock the full
potential of your data.
Throughout this chapter, we will uncover the core principles that guide
successful data cleaning efforts. We will explore the importance of
understanding data context, establishing data quality criteria, implementing
robust data validation techniques, and adopting a systematic and iterative
approach to cleaning data. By internalizing these principles, you will
develop a strong foundation for effectively cleaning and preparing data for
analysis.
By the end of this chapter, you will have developed a better understanding
of the key roles, responsibilities, and processes involved in creating a strong
process for cleaning data within your business for analysis.
In the following section, you will begin to learn about the different roles
and responsibilities linked to the process of cleaning data within your
business.
Analytics (BI) managers: They are responsible for the overall strategy
and implementation of Business Intelligence (BI) initiatives. They play
a crucial role in aligning data cleaning efforts with broader business
goals and objectives.
They also allocate resources, including personnel and technology, to
ensure effective data cleaning processes. They are responsible for
prioritizing data quality initiatives based on organizational needs and
allocating budgets accordingly. They ensure that data stewardship,
analytics, IT, and business units work cohesively to maintain high data
quality standards.
Lastly, BI managers oversee the training and development of staff
involved in data-related tasks. They ensure that teams have the necessary
skills to clean, analyze, and interpret data accurately.
C-suite executives: Executives, including the Chief Executive Officer
(CEO) and Chief Information Officer (CIO), set the organizational
tone for prioritizing data quality. Their commitment to data integrity
influences the entire company culture.
C-suite executives align data quality initiatives with broader business
strategies. They ensure that clean data supports decision-making
processes at the highest level, contributing to the organization’s success.
This, in turn, means they typically oversee investments in technology
infrastructure that supports data quality. This includes approving budgets
for advanced data cleaning tools, ensuring the availability of skilled
personnel, and staying abreast of technological advancements.
Let’s go through a breakdown of how the different C-suite leadership
positions play important roles in ensuring data cleanliness within a
business:
Chief Data Officer (CDO): The CDO oversees data governance
policies and frameworks, ensuring that they align with industry
standards and regulations. They play a key role in establishing and
enforcing data quality standards across the organization.
In addition to governance, by collaborating with other executives and
departments, the CDO ensures that data-related initiatives are integrated
into broader organizational strategies. They work to create a unified
approach to data management.
Chief Information Security Officer (CISO): The CISO is
responsible for ensuring the security of data assets. In the context of
data cleaning, they play a crucial role in implementing measures to
protect sensitive information and maintaining data privacy during the
cleaning process.
Chief Analytics Officer (CAO) or chief data scientist: For
organizations heavily invested in advanced analytics, the CAO or
chief data scientist ensures that data cleaning processes support the
requirements of complex analytical models. They may guide the
development of sophisticated algorithms for more accurate data
cleaning.
Chief Technology Officer (CTO): The CTO oversees the
technological infrastructure supporting data cleaning processes. They
ensure that the organization has the necessary hardware, software, and
cloud solutions to facilitate efficient and effective data cleaning.
Chief Risk Officer (CRO): The CRO assesses risks associated with
data quality issues, including potential financial, legal, or reputational
risks. They work with other leaders to develop risk mitigation
strategies related to data cleaning.
Chief Compliance Officer (CCO): The CCO ensures that data
cleaning practices adhere to industry regulations and compliance
standards. They play a critical role in mitigating legal risks associated
with data quality issues.
Chief Financial Officer (CFO): The CFO oversees budgeting and
resource allocation for data-related initiatives, including data
cleaning. They ensure that necessary financial resources are allocated
to support data quality improvement efforts.
In the next section of this chapter, you will learn how to go about building a
process for cleaning data within your business.
The process of cleaning data involves several key steps that help to form a
systematic approach to ensure comprehensive data cleaning.
While the specific steps may vary depending on the nature of the data and
the organization’s requirements, the following general process provides a
framework for effective data cleaning.
The effective steps to cleaning data follow this flow:
1. Data assessment
2. Data profiling
3. Data validation
4. Data cleaning strategies
5. Data transformation
6. Data quality assurance
7. Documentation
Data assessment
First of all, it’s imperative to assess the quality of data before we get started
with cleaning the data. This may sound obvious; however, tracking this
information will help you later down the line to ensure you have not missed
any data transformations.
Data profiling
This step involves examining the data for patterns, distributions, outliers,
and missing data. Profiling helps identify potential quality issues and
guides the cleaning effort that you will need to carry out later in this process.
Depending on the size of your teams, it can also be beneficial to include any
machine learning engineers within this step to assess how you might be able
to use machine learning techniques to recognize patterns that affect data
quality.
Data validation
Like the previous examples, this responsibility will tend to lie with the data
analysts as they implement data cleaning strategies and business analysts as
they collaborate to ensure that the data aligns with the business objectives.
If you have a data steward, then they might also define the validation steps,
scenarios, and testing processes to ensure fair and accurate validation.
Data transformations
Quality assurance ensures that the cleaned data meets the defined quality
standards and is fit for its intended purpose. Generally, this phase would
involve you carrying out data integrity checks, comparing the cleaned data
with its original source data, and verifying its accuracy. This becomes very
important when you start building out more complex data models in Power
BI, which may improve performance but could impact the data integrity.
Documentation
Gone are the days when people could just build or develop without
documenting what changes were being made. Not doing so causes companies huge
pains later down the line, as others in the business won’t know what changes
have been made to the data and are thus required to start fresh.
Documenting the data cleaning journey is a crucial aspect for this reason.
This involves recording the steps taken, decisions made, and any
assumptions or transformations applied to the data during the cleaning
process. This can often be seen as heavy admin on top of the already
lengthy data cleaning process, but it proves invaluable when faced with
future analyses of similar data, audits on your data, and, lastly, data
governance tracking.
Ensuring you have proper documentation is just the first hurdle. In the next
section, you will learn more about understanding why it’s important to have
good data, not lots of data.
Understanding quality over quantity in
data cleaning
When it comes to data cleaning, quality should always take precedence over
quantity. While large datasets may initially seem enticing, the real value
resides in the precision, dependability, and uniformity of the data. Imagine
having a vast pool of data that is riddled with errors, duplications, and
inconsistencies – the potential insights gleaned from such a dataset would
be marred by inaccuracies and inefficiencies.
Summary
In this chapter, we delved into the fundamentals of data cleaning and
explored key principles to consider when cleaning data. Data cleaning is a
crucial step in the data preparation process, as the quality of the data greatly
impacts the accuracy and reliability of analyses and decision-making.
You learned about seven key principles when it comes to planning and
preparing to clean your data, which not only provide best practices but also
document the impact you’ve had on that data or the business.
In the next chapter, we will dive into the practical aspect of data cleaning
using Power BI. You will be following along as we go through the most
common data cleaning steps within Power BI, providing hands-on
experience to clean and transform data for improved quality and usability.
Questions
Now that you’ve built a strong knowledge of data quality and the
importance of assessing and documenting your data cleaning process, it’s
time to roll your sleeves up and get stuck into some data.
In this chapter, you will be learning how to deal with the most common data
cleaning steps within Power BI, as listed next. For each of these example
topics, you will find a step-by-step walk-through on how to carry out these
transformations yourself.
Removing duplicates
Removing missing data
Splitting columns
Merging columns
Replacing outliers
Creating calculated columns versus measures
By the end of the chapter, you will have built a strong base of foundational
knowledge on how to tackle some of the most common and simplest
transformations that we often see needing to be done when connecting to
data in Power BI.
Technical requirements
You will find the uncleaned dataset to be used during this chapter at the
following link: https://github.com/PacktPublishing/Data-Cleaning-with-
Power-BI.
Removing duplicates
In many cases, as we start working with data, there will often be duplicates
within the data. As we discussed in Chapter 2, Understanding Data Quality
and Why Data Cleaning is Important, there are a number of reasons why
the values in your data may have been duplicated. For example, say we're a
retailer and we accidentally entered two product items for the same product.
We don’t want to have inaccurate numbers for that product by leaving the
duplicate data in, so it’s key that we remove it before we get started with
our analysis.
So, let’s get started. In the following example, we will find, select, and
remove the duplicate in the data:
Now that you have loaded the data into Power BI, we can begin to look at
our data.
To do this, select Table view on the navigation bar on the far left side of
Power BI Desktop, as shown in the following figure. By default, Power BI
likely left you in Report view after you loaded your data as this is the
default view when you open Power BI Desktop.
As we can see in Figure 4.3, the data we have loaded might be the start of
our data model because it resembles a table that we’d use as a dimension
table. It’s important to note, though, that you might not always start with a
dimension table, as you may choose to begin with a fact table.
In Power BI, dimension tables contain descriptive attributes such as
customer names or product categories, providing context to the numerical
data in fact tables, which store quantitative measures such as sales revenue
or quantities sold. This separation of descriptive and numerical data
enhances efficiency, scalability, and flexibility in data analysis. Together,
they form a star or snowflake schema, optimizing storage, retrieval, and
analytical capabilities in BI applications. We will go into more detail on this
later in Chapter 12, Data Modeling and Managing Relationships.
Figure 4.3 – The products table imported into Power BI
As you can see on rows 16 and 36 of the data in the previous screenshot,
there is a duplicate for the Jenga product. We must look to remove this
duplicate as, when it comes to building our model, it could cause
our facts to return inflated values, since Power BI will count the values of
[Price] and [Cost] twice.
There are a number of ways you could do this, but the simplest way is
through the Power Query UI. So, in order to reach this page, you will need
to carry out the following steps:
IMPORTANT NOTE
Avoid using columns that could result in an error. For example, if you were
to select Cost or Price, then this could result in the wrong rows being
removed. This is because Power Query will focus on the values within
that specific column with this method of removing duplicates, resulting in
other products valued at the same price as Jenga also being deleted.
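To make the note concrete, here is a minimal M sketch (with invented sample rows) showing how the column you key on changes what gets removed:

```m
let
    Source = Table.FromRecords({
        [Product_ID = 16, Product_Name = "Jenga", Price = 9.99],
        [Product_ID = 36, Product_Name = "Jenga", Price = 9.99],
        [Product_ID = 17, Product_Name = "Chess", Price = 9.99]
    }),
    // Keying on Product_Name removes only the duplicate Jenga row
    ByName = Table.Distinct(Source, {"Product_Name"}),
    // Keying on Price would also drop Chess, which merely shares Jenga's price
    ByPrice = Table.Distinct(Source, {"Price"})
in
    ByName
```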
IMPORTANT NOTE
In the previous example, we ended up removing the row with the duplicate,
which had the product ID 36. Be cautious of this in the future as this ID
might be used later down the line when you try to connect this dimension to
a fact table, as the fact table might reference the product ID we just
removed.
Next, we have the very common issue of missing data or, as most people
would recognize them, null values. In Chapter 2, Understanding Data Quality
and Why Data Cleaning is Important, we understood the reasons why this
might happen – for example, due to the type of join between two tables,
which might cause many null values to show.
These null values can often either ruin the look of your reporting or
potentially skew the numbers being used or analyzed, so it’s often best we
look to remove these.
In the example of our products table, we can see that we have a row with
blank or 0 values shown in Figure 4.5. If you were viewing this from
within the Power Query Editor, then the blank values would be showing as
null. While this would otherwise be acceptable as we won’t necessarily
see the null product within a visualization, there is a price and cost value
against the null product with the 0 value. This could affect the analysis,
particularly if we were trying to calculate the average product price, as it
would bring the average product price down.
Figure 4.5 – The products table within Power BI highlighting a specific
column called Product_Price
There are a couple of ways in which you can remove blank rows, using the
prebuilt functions shown in Figure 4.6. For most scenarios, if you have an
entire row with null values, then you can resolve this by selecting
Remove Blank Rows.
Figure 4.6 – The drop-down menu when selecting Remove Rows from
within Power Query
In this particular example, though, removing blank rows will not be suitable
as there is a value for Product_ID labeled 37. So, in this example, the
simplest solution would be to use Remove Bottom Rows to remove the
data we do not need:
Figure 4.7 – The UI for the Remove Bottom Rows function within Power
Query
IMPORTANT TIP
While it is the simplest solution for this example, it might not be the best
solution in other datasets. This is because there could be multiple blank
rows and they might not necessarily all be located at the bottom of the
table.
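Where blank rows are scattered throughout a table rather than sitting at the bottom, a more robust alternative is to filter on a column that should never be empty. A minimal sketch in M, with assumed column names:

```m
let
    Source = Table.FromRecords({
        [Product_ID = 36, Product_Name = "Chess", Product_Price = 12.5],
        [Product_ID = 37, Product_Name = null, Product_Price = 0]
    }),
    // Keep only the rows where the product name is populated
    RemovedNulls = Table.SelectRows(Source, each [Product_Name] <> null)
in
    RemovedNulls
```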
If you are following along, here are the steps to remove the error from the
products table:
Splitting columns
When working with data, particularly data extracted directly from source
systems, there often comes a time when we need to split columns to
gain the desired dimensions for our analysis. This might be because the
source system’s software or database stores that data in a particular
format/encoding/arrangement. The most common example of this is
splitting a Date field in order to extract dimensions for [Day],
[Month], and [Year].
In this example, we will connect and open the calendar.xlsx file. This
Excel table includes one column of dates (as shown in Figure 4.8) and is to
be used as a date table within Power BI. In this example, we might need to
extract the individual date components for our analysis. Once connected,
select Transform data to enter Power Query once again.
Figure 4.8 – The Date column within the date table in Power BI
In order to split the columns, we will use the prebuilt function for splitting
columns shown in Figure 4.9. Given the format of the data, it’s best to split
the column using By Delimiter, as the delimiter provides the simplest way to
split this column.
There are many other options for splitting columns, such as using the number of
characters. This would be great where we need to extract the
first letters of a product code, for example.
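For reference, the step that the Split Column by Delimiter dialog generates looks roughly like the following M; the date format and the new column names are assumptions:

```m
let
    Source = Table.FromRecords({[Date = "01/15/2024"], [Date = "02/29/2024"]}),
    // Split the text on each "/" into three new columns
    SplitByDelimiter = Table.SplitColumn(
        Source,
        "Date",
        Splitter.SplitTextByDelimiter("/", QuoteStyle.Csv),
        {"Month", "Day", "Year"}
    )
in
    SplitByDelimiter
```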
Figure 4.9 – The drop-down menu above the split column function (left)
and a close-up of the UI for the Split Column by Delimiter function (right)
Figure 4.10 – The date table with the columns now split into three columns
It’s recommended to then rename the columns to the appropriate fields, as
shown. This can be done by right-clicking on the column headers and
selecting Rename.
Figure 4.11 – The date table with the three columns renamed Month, Day,
and Year
Here, we have learned how to split columns using the functions within
Power Query. Next in this chapter, we will look at how to do the opposite
and merge columns.
Merging columns
Just like in the previous scenario, there are often situations where you need to
merge columns to achieve the desired format. For example, depending on how
date data is stored, you may need to merge the [Day], [Month], and
[Year] columns to achieve a singular Date column. In the following
figure, we can see an example description and configuration of how you
might do this using the prebuilt Merge Columns function in Power
Query:
Figure 4.12 – The drop-down menu above the Merge Columns function
(left) and a close-up of the UI for the Merge Columns function (right)
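Behind that dialog, Power Query records a step along these lines; the separator and column names are assumptions:

```m
let
    Source = Table.FromRecords({[Month = "01", Day = "15", Year = "2024"]}),
    // Combine the three text columns into a single Date column, separated by "/"
    MergedDate = Table.CombineColumns(
        Source,
        {"Month", "Day", "Year"},
        Combiner.CombineTextByDelimiter("/", QuoteStyle.None),
        "Date"
    )
in
    MergedDate
```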
Replacing values
When connecting to and analyzing data, there are often times when we
might find outliers within the data. If we identify that there are values
skewing the data or showing incorrectly, it’s important for us to be able to
replace the data with the correct values.
There are many scenarios where you might need to do this in Power BI.
Here are just some example scenarios:
Going back to our product dataset, we can use the Replace Values function
to fix an error in the product names/categories. In the data, we have a
product name that has been extracted with a misspelling. Row 22 should
read Monopoly, so we need to replace the value with the correct value to
ensure our analysis and reporting show the correct data:
Figure 4.13 – The row mentioned in the example should read “Monopoly”
but currently reads as “Mono poly”
4. Regardless of the step you choose, you will have the option to then select
the new value you’d like this to be replaced with:
Figure 4.15 – The UI for the Replace Values function within Power Query
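The equivalent recorded step uses Table.ReplaceValue; a minimal sketch with an invented sample row:

```m
let
    Source = Table.FromRecords({[Product_ID = 22, Product_Name = "Mono poly"]}),
    // Replace the misspelled text in the Product_Name column only
    ReplacedValue = Table.ReplaceValue(
        Source,
        "Mono poly",
        "Monopoly",
        Replacer.ReplaceText,
        {"Product_Name"}
    )
in
    ReplacedValue
```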
IMPORTANT NOTE
Having learned how to replace values within our data using the functions
within Power Query, let’s look at how to create calculated columns versus
creating measures within Power BI.
When preparing data for analysis in Power BI, we often need to add
additional data to the model (often derived from existing data). This can
often come in the form of using DAX to either create a measure or an
additional column. This could be for a new value we need or a new
dimension. Before understanding which option is best, it’s important to
understand what the difference is between a calculated column and a
measure within Power BI.
The best way to add a new measure or column is to first navigate to Table
view on the far-left toolbar.
The view will automatically jump to the Table tools tab along the top
toolbar, which will then present you with the following options for
calculations: New measure, Quick measure, New column, and New table,
as shown:
Figure 4.16 – The Calculations section within the Table view of Power BI
It’s important to note that if you are working from Model view within
Power BI Desktop, this will look slightly different with the latest
introduction of Calculation group, as you can see in the following figure –
this is described later in the chapter:
Figure 4.17 – The Calculations section within the Model view of Power BI
Calculated columns
The calculated column becomes a part of your data model and can be used
in visuals and calculations.
In the example of our products table, we could add a column to the data
model, which would look like the following:
Figure 4.18 – The products table within Power BI with a specific column
selected named Margin
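A margin column along these lines could be defined with a DAX expression such as the following; the exact table and column names are assumptions based on the figure:

```dax
-- Evaluated once per row of the Products table (row context),
-- and the result is stored in the data model
Margin = Products[Price] - Products[Cost]
```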
Measures
Similar to the calculated columns example, if you select New measure, you
can add a similar DAX expression to calculate the margin, but this will not
add a column with the margin data to the data model. The following
screenshot shows an example of this being added as a measure:
You can then see the difference between the two values on the right-hand
side:
Figure 4.20 – The new measure has been added to the products table within
the Power BI service
Measures can be identified with the calculator logo and calculated columns
can be identified with the table logo to the left of the value.
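A sketch of the equivalent measure, assuming the same hypothetical column names:

```dax
-- Evaluated at query time against the current filter context,
-- so nothing extra is stored in the data model
Margin Measure = SUM ( Products[Price] ) - SUM ( Products[Cost] )
```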
Calculation group
Prior to Microsoft’s latest release, where they added the ability to create
calculation groups within the product, Power BI users would typically need
to use third-party tools such as Tabular Editor. There is much to be said,
though, for why many users still look to leverage the immense power of
tools such as Tabular Editor. Here are just a few examples:
It’s worth noting that the decision to use third-party tools alongside Power
BI depends on individual preferences, specific project requirements, and the
level of expertise of the users involved. Some people may find that the
native Power BI tools, including calculation groups, meet all their needs,
while others may prefer the additional capabilities provided by third-party
tools.
Considerations
There are some considerations you should make when deciding which to
use:
Calculated columns are best suited for calculations that require row-level
context or depend on the relationships between tables.
Measures are ideal for performing aggregations and calculations on your
data.
Calculated columns can consume more memory and storage as the
calculated values are stored for each row in the table.
Simple calculated columns such as the previous Margin example can
also be added with Power Query instead of DAX. It’s often seen as best
practice to add such columns with Power Query; however, of course, the
best solution would be if these columns were created upstream in the
source systems or database.
Measures are memory-efficient as they are not precalculated and don’t
consume additional storage.
Avoid creating too many calculated columns as they can increase the
complexity and size of your data model.
Measures are calculated dynamically based on the filters and slicers that
impact visual contexts. They respond to user interactions and provide
accurate results.
Visual context, specifically, relates to the filters and conditions applied to
a visual or a set of visuals on a report canvas. Contexts influence how
data is presented and can be summarized as follows (see the sketch after
this list):
Row context: Occurs at the individual row level when iterating
through the data table. Established within formulas and calculations
row by row.
Filter context: Set of filters applied to a particular visual element on
the report canvas. This comes from slicers, cross-filters, highlighting,
and so on.
Page context: Refers to filters applied at the entire page level
affecting all visuals on a page.
Drill-through context: Established when users drill through from
one page to another, passing the context from the source page to the
target page.
Use calculated columns when the calculated value needs to be used
across multiple visuals or calculations consistently.
Use measures when you need to calculate values based on the
summarized data displayed in a visual.
These considerations will help you make the correct decisions when it
comes to cleaning your data with measures or adding new columns to the
data.
Summary
In this chapter, you began your journey into the practical aspects of data
cleaning within Power BI. You covered some of the most common data
cleaning steps in Power BI, including removing duplicates, handling
missing data, splitting columns, merging columns, dealing with date formats,
replacing values, and creating calculated columns versus measures.
Lastly, the chapter explored the difference between calculated columns and
measures in Power BI and explained when to use each option and their
respective benefits. Calculated columns are best suited for row-level
calculations, while measures are ideal for aggregations and calculations
based on visual context. The chapter concluded by emphasizing the
considerations when deciding between calculated columns and measures
and providing guidelines to help you make informed choices based on
factors such as memory usage, storage, and the complexity of your data
model.
Now that you have a strong understanding of data quality and the
importance of documenting your data cleaning process, it’s time to roll up
your sleeves and start working with real data. You have gained practical
knowledge and hands-on experience in performing common data cleaning
tasks within Power BI. This sets the foundation for the next chapter, which
focuses on importing data into Power BI and further refining your data for
analysis.
Questions
This part will delve into the intricacies of data import and the potent Query
Editor. You will learn about the key aspects that need to be considered when
importing data and build confidence in working with the Query Editor. You
will also explore the M language, unraveling its syntax and capabilities for
transforming data, before being introduced to data profiling for exploratory
data analysis, thus enhancing data accuracy within Power BI.
Now that we understand the importance of cleaning data and even how to
tackle some of the most common data-cleaning operations, it’s time to start
bringing some data into Power BI.
When it comes to importing data into Power BI, there are six main topics,
as listed next, to consider. Most importantly, for the top three, there are
metrics that help us identify how clean data is prior to us working on it. In
this chapter, you will learn about these topics and how you can use the
Power BI platform to assess these metrics within your own data.
By the end of this chapter, you will be able to confidently assess how clean
the data you’ve imported is and also how to use Power BI to assess it.
Technical requirements
You will find the uncleaned dataset to be used during this chapter at the
following link: https://github.com/PacktPublishing/Data-Cleaning-with-
Power-BI.
When importing data into Power BI, one of the primary concerns is
ensuring data completeness.
Let’s jump into our dataset to understand how complete the data is. For this
analysis, we will connect to the CSV file labeled
AdventureWorks_Products (unclean).csv with the help of the
following steps:
IMPORTANT NOTE
The previous view shows column quality and column distribution metrics.
To activate this, you will need to navigate to the View tab within the Power
Query Editor ribbon. In the Data Preview group, you can then select the
Column Quality and Column Distribution checkboxes to see them.
5. As we assess the data preview in the Power Query Editor, we can then
see there is a gap in the data for the ProductSize column, as shown in
Figure 5.1.
We can see in this figure that 60% of the data for ProductSize is empty,
and as such, the completeness of the data is impacted.
NOTE
If you are not seeing a percentage, then it’s likely because Power Query has
added an additional step called Changed Type to your query. Power Query
may automatically attempt to set the data type of columns when importing
from CSV. In this example, it has likely tried to change the type of the
ProductSize column from Text to Whole Number. Simply remove the
Changed Type step by selecting the X symbol next to the step in the query
settings pane, as illustrated in the following screenshot:
Figure 5.2 – The applied steps within the Power Query Editor
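For reference, the automatically added step looks something like the following in M (sample data and step names assumed); this is what you are deleting when you remove Changed Type:

```m
let
    Source = Table.FromRecords({[ProductSize = "XL"], [ProductSize = null]}),
    // Typing a text column as a whole number turns values such as "XL"
    // into errors and masks the column quality percentages
    ChangedType = Table.TransformColumnTypes(Source, {{"ProductSize", Int64.Type}})
in
    ChangedType
```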
6. Select Close & Apply to save the work done in Power Query.
Understanding the context in which data accuracy matters is crucial for data
analysts and business professionals alike.
In this section, we will delve into two methods to validate data accuracy
and detect errors or inconsistencies. By addressing data accuracy during the
data import process, you will gain confidence in the integrity of your
datasets, enabling you to make well-informed decisions based on
trustworthy insights from Power BI.
Power BI provides several tools to assess data accuracy. One such tool is
conditional formatting, where we can define rules to highlight data points
that fall outside predefined accuracy ranges or thresholds. For example, we
can set rules to flag unusually high or low values in our dataset.
In this example, we will assess the relevance of the data we need using the
same dataset we connected to earlier in this chapter. A simple example of
this would be to understand which columns are necessary and which are
not. This dataset is pretty small, but as you start to work with larger
datasets, performing this check will help keep performance strong and make
it a positive experience working with that data/report.
1. Select the irrelevant column you would like to remove. You can select
multiple columns one by one by holding down Ctrl, or alternatively
select a range of columns by holding down Shift when selecting
columns.
2. Then select Remove columns from the toolbar.
3. Lastly, select Remove Columns from the menu shown:
This will then remove the column and thus reduce the size of your data
model.
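In M, the recorded step is a single call to Table.RemoveColumns; a minimal sketch with assumed column names:

```m
let
    Source = Table.FromRecords({[ProductKey = 1, ProductSize = null, ProductPrice = 9.99]}),
    // Dropping irrelevant columns shrinks the model and speeds up refreshes
    RemovedColumns = Table.RemoveColumns(Source, {"ProductSize"})
in
    RemovedColumns
```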
Power BI’s formatting options allow you to control how data is displayed in
visuals such as charts, tables, and maps. Users can customize color
schemes, font styles, and axis labels to create visually engaging and
informative visuals. Conditional formatting features enable dynamic
changes based on data values, highlighting important data points.
Leveraging these capabilities transforms raw data into compelling
visualizations, effectively communicating key insights to stakeholders.
This can most commonly happen when connecting to and working with
revenue data, particularly if that data needs to be formatted with a
particular currency.
In the following example, let’s prepare the data to format the currency
based on the relevant fields:
1. Let’s prepare some data in the Power Query Editor before we begin.
Select Transform Data from within the Report, Table, or Model views
to then open the Power Query Editor.
2. We can see the data contains two columns labeled [ProductPrice]
and [ProductCost], with {Text} data types, which in some cases
have 3-4 decimal places. In order to ensure this data shows correctly on
our visuals, we need to format these columns as currency.
3. Change the type of the [ProductPrice] and [ProductCost]
columns to {Fixed Decimal Number}. This is a preset type used
mainly for currency values. Do this by selecting the type icon next to the
column header.
4. Now we would ideally like to add a margin to this data. So, let’s do this
by adding a column using the Subtract function. Select the Add
Columns tab in the Power Query Editor.
5. To ensure we are subtracting the data correctly and to prevent us from
seeing a negative number, we need to reorder the columns in the data.
Otherwise, the Subtract function would subtract [ProductPrice]
from [ProductCost]. To do this, select [ProductPrice] and drag
it into the position to the left of the [ProductCost] column.
6. Select the two columns [ProductPrice] and [ProductCost],
then select Subtract from the drop-down menu under Standard in the
From Number ribbon group, as shown. This will add a new column
called Subtraction, which represents our margin value.
Figure 5.4 – The Subtract function within the Power Query Editor
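Steps 3 to 6 are recorded in M roughly as follows; Currency.Type is the M name behind the Fixed Decimal Number type, and the sample values and step names are assumptions:

```m
let
    Source = Table.FromRecords({[ProductPrice = "12.5000", ProductCost = "7.2500"]}),
    // Fixed Decimal Number in the UI corresponds to Currency.Type in M
    ChangedType = Table.TransformColumnTypes(
        Source,
        {{"ProductPrice", Currency.Type}, {"ProductCost", Currency.Type}}
    ),
    // The Subtract button adds a computed column named Subtraction
    InsertedSubtraction = Table.AddColumn(
        ChangedType,
        "Subtraction",
        each [ProductPrice] - [ProductCost],
        Currency.Type
    )
in
    InsertedSubtraction
```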
7. Rename the newly created Subtraction column to ProductMargin by
double-clicking the column header.
8. Select the Home tab in the ribbon and then select Close & Apply to save
the transformations made in the Power Query Editor. This should take
you back to the Table view in Power BI. If it doesn’t, then simply
navigate to the Table view.
9. Here we can see that although the newly created ProductMargin
column was set to the {Fixed Decimal Number} type, there are
values with multiple decimal places showing.
10. Select the ProductMargin column. This will then open a new tab
labeled Column tools on the toolbar, where we can see the formatting
options to transform that data.
11. Select how many decimal places you would like to format this value
to, which in this example will be 2. Notice Auto is still showing
multiple decimal places, which isn’t normal for a currency value. The
following figure shows an example of this configuration:
Figure 5.5 – The Table view for the product table loaded into Power BI
12. Now let’s create a table in the report that includes conditional formatting.
Select Report View in the navigation ribbon.
13. Let’s create a table visual. Select Table from Visual types, as shown in
Figure 5.6. This will create a table template on your report page.
14. With that table selected, add the following data to the table by either
selecting the checkboxes next to these columns or dragging these into
the box under Visual types that says +Add data, as highlighted in the
following screenshot:
1. ProductColor
2. ProductMargin
Figure 5.6 – Visual types and Columns selected in Power BI
15. By default, Power BI will have aggregated our numeric field to SUM,
which for this example is fine. Now, to add conditional formatting to this
visual, we can right-click on the field labeled [Sum of ProductMargin].
16. Then, select Conditional formatting, and finally, let’s select
Background color, as shown. This will open the configuration window
for the background color.
We can now see that the conditional formatting has been applied and is
highlighting negative margins with products that are available in cream and
gray colors. In this example, preparing data for formatting currency is
essential, particularly when working with revenue data. By following the
steps outlined, you can ensure accurate representation and enhance the
clarity of financial information.
Mastering these formatting techniques in Power BI not only refines the
visual appeal of your reports but also elevates the precision and impact of
the insights conveyed to your audience.
7. By selecting Edit, we can then adjust the columns selected for this
particular relationship:
Figure 5.10 – The Edit relationship window within Power BI
In this example, Power BI has selected the correct columns for this
relationship, so we won’t need to make any adjustments here. It does this by
using a feature called relationship autodetect, where Power BI
attempts to create a relationship between two tables you’re loading at the
same time. It uses the column names in the tables you have selected to
identify potential relationships and, if it finds one with high confidence,
it defines the cardinality and cross-filter direction and makes the
relationship active. If it doesn’t identify any similarities with high
confidence, then it doesn’t create a relationship.
These dimension and fact tables are then connected with relationships
usually in the form of a star schema. The term star is used because you
usually start with a fact table in the middle and create relationships to
dimension tables around the fact table, like points of a star, as highlighted in
the following example:
Figure 5.11 – Screenshot of the edit relationships window within Power BI
For those not familiar with star schemas, they are specialized data models
that are typically used when designing data warehouses. This is where you
might be introduced to concepts such as the Kimball method, which are
often referred to in this space. Ralph Kimball is known for his book The
Data Warehouse Toolkit. If you're interested in learning more about this, then
I'd definitely recommend looking into the Kimball method.
We have briefly covered normalization and the use of star schemas when
importing your data into Power BI; however, it’s important to mention that
Power BI can and will also work with denormalized data.
To clarify, denormalization within dimension tables involves incorporating
aggregated values or pre-calculated metrics to improve query performance.
For example, a customer dimension might include a pre-calculated lifetime
order count, so frequently run reports don't need to aggregate the fact table
each time.
Summary
You learned about using Power BI’s data profiling capabilities to assess data
completeness. By visualizing missing values and identifying columns that
demand attention, you gained the skills to ensure a thorough and complete
dataset.
Delving into data consistency, Power BI’s strong data modeling capabilities
emerged as your guiding force. Building table relationships and following
best practices gave you an overview of how to achieve accurate and
consistent analyses.
Recognizing the importance of data relevance, you used Power BI’s data
transformation capabilities to filter and transform data, allowing you to
focus on the correct data to begin your analysis.
Now that we have brought this data into Power BI, we’re ready to get
started with some transformations. In the next chapter, we will explore the
Query Editor in further detail to learn how to transform this data.
Questions
Power BI’s Query Editor is a powerful tool that allows you to connect,
import, transform, and clean your data before visualizing it in Power BI
reports and dashboards. With Query Editor, you can perform various data
cleaning operations, such as removing duplicates, filtering rows, splitting
columns, merging data, and more.
By the end of this chapter, you will know about the essential techniques and
functions for cleaning and transforming data using Query Editor.
Technical requirements
Using the following link, you will find the uncleaned dataset to be used
during this chapter: https://github.com/PacktPublishing/Data-Cleaning-
with-Power-BI.
Understanding the Query Editor interface
To access Query Editor, open Power BI Desktop and navigate to the Home
tab on the ribbon. Click on the Transform data button to launch Query
Editor.
Figure 6.1 – Locating the Transform data button in the Home tab
Once inside Query Editor, you will notice the Power Query ribbon, which
contains various groups of commands and tools to manage and transform
your data, as shown in the following screenshot:
Figure 6.2 – Power Query ribbon with its important functions shown on the
Home tab
The Query Editor navigation pane is on the left-hand side of the window. It
lists all the queries in the current report and provides options to navigate
between queries, as shown in the following screenshot:
Figure 6.3 – Screenshot highlighting two queries within the Power Query
window
In the center of the Query Editor window, you will see the Query preview
pane, which displays a preview of the data after applying the transformation
steps.
IMPORTANT NOTE
As you select different steps in the query settings, the preview will adjust to
show what the data looked like in that transformation step.
The Query Settings pane is located on the right-hand side of the Query
Editor window. It displays the applied steps of data transformations and
provides options to manage query properties and connections (shown next).
This is where we can often add further descriptions to help others
understand the transformations we are adding, and it is also where features
such as query folding can be observed.
When naming the applied steps within Query Settings in Power Query
(Query Editor), it’s essential to follow some best practices to ensure the
clarity, organization, and maintainability of your data transformation
process. There are also some additional best practices that go beyond
naming.
As you have familiarized yourself with the Power Query interface and best
practices for naming and organizing applied steps, you are well equipped to
delve into the next section, where we will explore advanced techniques for
efficient data transformation and delve into the nuances of query folding,
optimizing your Power BI experience.
Adding columns with calculated values enhances your dataset and enables
you to derive new insights. This will be quite a common transformation,
especially when looking to keep the original source column.
Note that this is a different way to add columns to your data from the method
that was highlighted in Chapter 4, The Most Common Data Cleaning
Operations, where we used DAX to add a column. As a reminder, although
there are two ways to add an additional column, you should use this method
when possible, as it's the preferred way to add columns in Power BI if they
cannot be added directly in the source system.
Follow these steps to add columns (a screenshot of the Add Column tab is
shown):
Using existing columns: In Query Editor, you can add a new column
based on existing column values or calculations. Select the Add Column
tab on the ribbon.
This allows you to write M code using Power Query formula language to
create the desired calculated column.
In later chapters, we will cover the M code language, but here is an example
for us to create a new column that depicts whether the product has a color
or not:
= if [ProductColor] = "NA" or [ProductColor] = null then false else true
This returns a new column that shows a value of True if the product has a
color. The nice thing about this custom column function is that you have the
ability to check available columns and also check whether there are any
syntax errors with your code.
Data type conversions are crucial for ensuring accurate calculations and
visualizations. Power BI allows you to transform data types in Query
Editor. Follow these steps to perform data type conversions:
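While the exact steps are UI-driven (select the column, then choose a new
type from the Transform tab or the data type icon in the column header), it
helps to see the M that such a step generates behind the scenes. Here is a
minimal sketch, where the OrderDate and Amount column names are
illustrative assumptions:

// Sketch only – the column names are illustrative assumptions
#"Changed Column Types" = Table.TransformColumnTypes(
    Source,
    {
        {"OrderDate", type date},   // text such as "2017-01-15" becomes a date
        {"Amount", type number}     // text such as "19.99" becomes a number
    }
)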
Date/time
Date/time functions are essential for extracting specific components from
date and time columns or performing date-related calculations. Here’s how
to use date/time functions:
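As a hedged sketch (the column and step names are illustrative
assumptions), extracting a component from a date column in M typically
combines Table.AddColumn with functions such as Date.Year or
Date.Month:

// Sketch only – assumes an OrderDate column of type date
#"Added Order Year" = Table.AddColumn(
    #"Changed Column Types",
    "OrderYear",
    each Date.Year([OrderDate]),
    Int64.Type
)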
Rounding
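Rounding transformations control numeric precision and are available under
the Transform tab's Rounding options. As a minimal sketch (again assuming
the illustrative Amount column from the earlier example), the equivalent M
uses Number.Round:

// Sketch only – rounds the assumed Amount column to 2 decimal places
#"Rounded Amount" = Table.TransformColumns(
    #"Added Order Year",
    {{"Amount", each Number.Round(_, 2), type number}}
)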
Pivot/unpivot columns
Pivoting and unpivoting are essential for transforming your data between
wide and tall formats. Here's how to use these functions (an M sketch of
both operations follows the steps):
Pivot columns:
1. Select the column you want to use as the pivot column.
2. Go to the Transform tab and click on Pivot Column.
3. Choose the values column containing the data to be spread across new
columns.
4. Optionally, you can use an aggregation function to summarize values
if duplicates are found during pivoting.
Unpivot columns:
1. Select the columns you want to unpivot (convert into rows).
2. Go to the Transform tab and click on Unpivot Columns.
Power BI will create two new columns, Attribute and Value, containing
the original column names and values, respectively.
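For reference, here is a hedged M sketch of both operations, where the table
and column names (Product, Month, Sales) are illustrative assumptions:

// Pivot: spreads the values in "Sales" across new columns, one per distinct "Month" value
#"Pivoted Column" = Table.Pivot(Source, List.Distinct(Source[Month]), "Month", "Sales", List.Sum)

// Unpivot: converts every column except "Product" back into Attribute/Value rows
#"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Pivoted Column", {"Product"}, "Attribute", "Value")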
Merge queries
Merging queries allows you to combine data from multiple sources into a
single dataset. This is particularly useful as you begin to build out dataflows
within the Power BI service, which has a similar look and feel to Query
Editor. Here’s how to perform query merging:
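While the UI walks you through choosing the two queries, the matching
columns, and a join kind, a hedged M sketch of the underlying operation
looks like this, assuming hypothetical Orders and Customers queries joined
on a CustomerKey column:

// Sketch only – joins two assumed queries on CustomerKey using a left outer join
#"Merged Queries" = Table.NestedJoin(Orders, {"CustomerKey"}, Customers, {"CustomerKey"}, "Customers", JoinKind.LeftOuter)

// Expand the nested table to bring the required customer columns into the result
#"Expanded Customers" = Table.ExpandTableColumn(#"Merged Queries", "Customers", {"CustomerName"}, {"CustomerName"})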
Purpose:
Data loading and transformation: Power Query is primarily used
for loading and transforming data from various sources before it is
imported into the Power BI data model.
Extract, Transform, Load (ETL): It’s great for tasks such as
cleaning, filtering, shaping, and merging data from different sources.
Interface:
Graphical User Interface (GUI): Power Query uses a graphical user
interface that allows users to apply a series of step-by-step
transformations to the data using a point-and-click interface.
Language:
M Language: Power Query uses the M language for expressing data
transformation logic. While you can perform many tasks using the
GUI, the underlying code is written in M.
Transformations:
Wide range of transformations: Power Query offers a wide range of
built-in transformations and functions that allow you to clean,
reshape, and transform data easily.
Flexibility:
User-friendly: It’s user-friendly and suitable for users with a non-
programming background. You can perform complex transformations
without writing code.
Now to help you decide when to use Power Query and when to use DAX,
I’ve provided a short summary workflow to make this decision simpler:
Typical workflow:
Power Query first: Generally, you use Power Query to load and
transform data from source systems. This includes cleaning data,
filtering, and shaping it to your needs. This also includes adding
columns. There’s an entire tab in Power Query editor devoted to
adding columns.
DAX later: Once the data is loaded into the data model, you use
DAX to create calculations and measures based on the transformed
data.
Integration:
Seamless integration: Power Query and DAX are seamlessly
integrated within Power BI. Power Query handles the ETL process,
and DAX is used for creating calculations and measures based on the
transformed data.
To conclude, Power Query is used for the ETL process and data
preparation, while DAX is used for creating calculations and measures
within the Power BI data model. They work together to provide a
comprehensive solution for data transformation and analysis in Power BI.
Summary
In this chapter on cleaning data with Power BI’s Query Editor, you
embarked on a journey of mastering essential techniques and functions for
transforming and refining your datasets. You began by learning more about
the Query Editor interface, understanding its crucial components such as the
Power Query ribbon, navigation pane, preview pane, and settings pane. You
explored the significance of maintaining clear and descriptive names for
applied steps, ensuring consistency and conciseness, and avoiding
ambiguous terms to enhance the organization and readability of your data
transformation process.
Moving forward, you delved into advanced features such as query folding,
where high-level transformation requests seamlessly translate into low-level
SQL statements, optimizing query execution. Armed with best practices,
you navigated through the Power Query interface, learning the art of
naming, organizing, and cleaning up applied steps.
You then covered data cleaning techniques and functions, such as adding
columns, performing data type conversions, utilizing date/time functions,
applying rounding transformations, merging queries, and combining data
from multiple sources.
You covered the strategic use of Power Query and DAX, recognizing their
distinct roles in the data preparation process. Power Query, with its user-
friendly graphical interface and M language, excels at data loading,
cleaning, and shaping tasks, while DAX is there for defining business logic
within the Power BI data model.
Questions
1. What are the crucial components of the Query Editor interface discussed
in the chapter?
1. Power Query help
2. Navigation compass
3. Power Query ribbon
4. Settings
2. What is query folding, and how does it optimize query execution in
Power BI?
1. Folding multiple tables
2. Duplicate queries in Power Query
3. Translating high-level transformations into low-level SQL statements
4. Executing queries slowly for precision
3. Which technique allows the creation of new data based on existing
columns?
1. Filtering
2. Sorting
3. Adding columns
4. Removing duplicates
4. What determines how records are matched between tables in merging
queries?
1. Query folding
2. Join types
3. Preview settings
4. Ribbon functions
5. What is Power Query used for in the data preparation process?
1. Creating calculated columns
2. Defining complex business logic
3. Loading, cleaning, and shaping data
4. Folding queries
Further reading
As we have seen throughout this book so far, Power BI and Power Query
are powerful tools designed to help transform, clean, and shape data for
effective visualization and analysis. Central to their data transformation
capabilities is a functional, case-sensitive language called M, which serves
as the backbone for performing data transformations within these tools.
By the end of this chapter, you will have built a basic foundation of
knowledge and hands-on experience using M within the advanced editor –
useful for creating complex data transformations that are not easily
achievable through the graphical user interface (GUI). As an outcome of
reading this chapter, you will feel more confident about reviewing and
creating code for use in the advanced editor.
Technical requirements
Using the following link, you will find the uncleaned dataset to be used
during this chapter: https://github.com/PacktPublishing/Data-Cleaning-
with-Power-BI.
M and DAX are the two coding languages within Power BI; however, M is
very different from DAX. It's a functional, case-sensitive language (similar
to F# for those familiar) that employs a sequence of steps to transform data.
In Power Query’s M language, each step in the query editor represents a
transformation or operation applied to the data. While not explicitly defined
as functions, these steps function in a manner similar to functional
programming. The steps operate sequentially, with the output of one step
serving as the input for the next. This sequential arrangement allows for the
creation of a transformation flow where each step builds upon the results of
the preceding ones, enabling a structured and organized approach to data
manipulation and shaping. Let’s take a brief overview of the basic structure
of M that you’ll find in Advanced Editor.
Structure of M
In most cases, nearly every M query begins with let to open the code and
begin reading the query steps. There are exceptions to this in the Advanced
Editor, where you can nest calls to Power Query functions, but this isn't
common and would be seen as bad practice. This part of the code is used to
define variables and store intermediate results. Variables are defined by using
the following syntax:
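let
    // variable_name = expression
    Price = 10,
    Quantity = 5,
    Total = Price * Quantity
in
    Total

Each variable holds the result of an expression, and later steps can reference
earlier ones by name.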
Identifiers
Depending on which area of Power Query you’re looking at, you might
refer to the following as a step or a variable, more commonly known as
identifiers. Identifiers are names given to variables, functions, and other
entities.
These are often recognizable in the Advanced Editor view of Power Query
Editor because they are placed in double quotes and prefixed with a hash
symbol (#).
For example, look at the following screenshot from the Power Query Editor
UI:
Figure 7.1 – Screenshot of the applied steps in Power Query Editor
This would be seen in the M query from the advanced editor as follows:
#"Promoted Headers"
Data source
let
Source = Excel.Workbook(File.Contents("C:\Users\GusFrazer\Downloads\products.xlsx"), null, true)
In this example, we have opened the code with let and have then
introduced our first step. Source is the name we have given to this step or
variable, which, as the name suggests, is there to represent where the data
has come from. Typically, this is already created for you when you connect
to your data via the GUI. However, it’s useful to understand what’s
happening here:
Excel.Workbook() tells Power Query to connect to an Excel
workbook and its contents
File.Contents("C:\Users\GusFrazer\Downloads\products.xlsx") then tells Power Query where to locate the workbook,
and the Contents function reads the binary contents of the file
null is the optional useHeaders argument; leaving it as null means
Power Query does not automatically promote the first row to column
headers at this stage (which is why a separate Promoted Headers step
appears later)
The true parameter is the optional delayTypes argument, which tells
Power Query to delay automatic type detection on the loaded data
Products_Sheet = Source{[Item="Products",Kind="Sheet"]}[Data]
Query steps
Now that you have connected to the source data, you will want to begin to
define a series of query steps to transform the data you’ve connected to.
Common steps would include filtering, sorting, merging, grouping, and
pivoting.
In the previous example, we would typically see the following code when
we open Advanced Editor:
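let
    Source = Excel.Workbook(File.Contents("C:\Users\GusFrazer\Downloads\products.xlsx"), null, true),
    Products_Sheet = Source{[Item="Products",Kind="Sheet"]}[Data],
    #"Promoted Headers" = Table.PromoteHeaders(Products_Sheet)
in
    #"Promoted Headers"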
For example, if you have a variable with spaces or special characters in its
name, like the previous example "Promoted Headers", you can use
the # symbol at the beginning of its name within double quotes
(#"Promoted Headers") to reference it in subsequent steps. This
enables a structured approach to data manipulation, and variables can either
be a single word like "Source" or include spaces and special characters,
with the # symbol aiding in their identification and utilization within the
query.
We will cover the other functions seen in the previous code, such as
PromoteHeaders, in the next section.
in expressions
An in expression is used to specify the final result of the query. It helps in
separating the variable definitions from the main computation. The in
expression comes after all the let expressions and indicates what the final
output of the script should be.
In our example, if we were happy with the data and just wanted to return
this as is, we would use the following expression to end the code. This
signifies that we want to return the output from the last step in the let
series:
in
#"Promoted Headers"
Literals
In Power Query, literals refer to constant values that are directly written
within the M query and represent specific data types. These values are not
variables or references to other data sources but are fixed, unchanging
values embedded directly into the M language code.
Here are just a few of the common types of literals in Power Query:
Table 7.1 – Table referencing the literal types available in Power Query
along with some examples
Literals are useful when you need to include specific constant values
directly in your Power Query code. They provide a way to define fixed data
points or structures without the need for external data sources or dynamic
calculations.
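As a quick hedged sketch, here are some common literal values written
directly in M:

let
    myNumber  = 123.45,                    // number literal
    myText    = "Hello, World",            // text literal
    myLogical = true,                      // logical literal
    myNull    = null,                      // null literal
    myList    = {1, 2, 3},                 // list literal
    myRecord  = [Name = "Gus", Age = 30],  // record literal
    myDate    = #date(2024, 1, 1)          // date literal
in
    myRecord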
As you can imagine from the name of this chapter and the book, the biggest
use case for M is to help with cleaning and transforming data. In addition to
this, though, it can be used to help with standardizing formats for data being
loaded to ensure accuracy and integrity. Merging and appending data from
different data sources becomes seamless with M, enabling you to
consolidate data for your analysis.
In the next section of this chapter, we will look at some of the most
common functions used within M. However, another great use for this
language is when you need to do some custom complex calculations as it
allows you to go beyond the built-in functions from the GUI.
IMPORTANT NOTE
Lastly, like other coding languages, you can add comments to different
parts of your code. Adding comments to M is a great practice to enhance
readability and understanding, especially when sharing your code with
others. Comments provide explanations and context about what each step is
doing. This is particularly important when you start building long complex
queries into your code. Comments can be initiated by starting a new line
with the // characters. When you do this within Advanced Editor, it will
typically show the comments in green to highlight that these are not coded
values.
let
    //Load the Excel workbook from the specified file path
    Source = Excel.Workbook(File.Contents("C:\Users\GusFrazer\Downloads\products.xlsx"), null, true),
    Products_Sheet = Source{[Item="Products",Kind="Sheet"]}[Data],
    //Promote the headers from the first row of the Products sheet
    #"Promoted Headers" = Table.PromoteHeaders(Products_Sheet)
in
    #"Promoted Headers"
    //Final step where the transformed data is output
In this section, you learned more about the foundation and structure of the
M language used within Power Query to transform data. As you progress to
the next section, you will build on this knowledge with techniques to filter
and sort your data with M.
Filtering and sorting are essential data transformation tasks that help you
extract relevant information from large datasets and organize it in a
meaningful way.
First of all, filtering data from your analysis typically involves selecting
certain rows from a dataset based on certain conditions. M has a function
named Table.SelectRows for this exact purpose. As it suggests, it
allows you to specify a condition within the argument that determines
which rows should be retained.
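For example, a minimal sketch that keeps only products costing more than
$50 (the threshold here is an illustrative assumption) would look like this:

#"Filtered Products" = Table.SelectRows(#"Promoted Headers", each [Product_Cost] > 50)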
We could also add a sorting function to this so that we can sort the data in a
particular manner. To do this, we would typically use the Table.Sort
function, which takes the table to sort and a list of sorting criteria, as shown:
Table.Sort(#"Filtered Products", {{"Product_Cost", Order.Ascending}})
If you are trying this on the products data, you'll notice that the filter
function will return an error. This is because when you load the data into
Power Query from Excel, it doesn't recognize the Product_Cost
column as a number but instead as a text type, due to the $ symbol in the
data. The error occurs because we are trying to apply a number filter to a
text column.
This leads us to the next section because in order to remove this error, we
will need to carry out some transformations on the data first.
Transforming data with M
As mentioned earlier, the first issue we can see in the data that might
prevent us from filtering is that the values for cost and price contain a $
character, which leads Power BI to read them as text values. So our first
port of call should be to remove this character from the columns.
Now, of course, you could use the Split column function in the Power
Query UI but it’s important to understand what M code is created behind
the scenes from using such buttons. Using M will also help reduce the steps
you need to get to the desired goal. This will particularly help when you’re
looking to script more complex queries in M later in your data journey.
We have labeled this step pretty clearly to highlight what this variable is
doing; however, we could also add some commentary to this to provide
further context that this is not only removing the currency symbol but also
transforming the value into a number type.
As we need to transform more than one column, we will need to create two
separate arguments within this Table.TransformColumns function.
This is referred to as the list of transformations and can be seen enclosed in
curly brackets {}.
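The resulting step looks like this:

#"Remove Currency" = Table.TransformColumns(
    #"Promoted Headers",
    {
        {"Product_Cost", each Number.From(Text.AfterDelimiter(_, "$")), type number},
        {"Product_Price", each Number.From(Text.AfterDelimiter(_, "$")), type number}
    }
)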
To decipher what this code is doing, it’s actually best to read it backward.
The Text.AfterDelimiter(_, "$") section of the code is a
function that extracts the text portion after the dollar symbol for each value
within the Product_Cost column. The Number.From() function is
then used to convert the extracted text into a numeric value. Lastly, for
assurance, we include the type specifier number to specify that the results
of the transformation should have a data type number.
Now that we have removed the currency symbol and converted the
columns to a number type, we have one more issue that might cause a
problem: the bottom row, which contains null values and errors.
In order to eliminate this problem, we can use the following code to remove
the last row from the dataset, which happens to be our blank row:
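#"Removed Bottom Rows" = Table.RemoveLastN(#"Remove Currency", 1)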
Now that we have removed the bottom row, we can apply the filter code we
used previously to create the following M code:
let
    Source = Excel.Workbook(File.Contents("C:\Users\GusFrazer\Downloads\products.xlsx"), null, true),
    Products_Sheet = Source{[Item="Products",Kind="Sheet"]}[Data],
    #"Promoted Headers" = Table.PromoteHeaders(Products_Sheet),
    //the following function removes the $ sign and converts the columns to a number type
    #"Remove Currency" = Table.TransformColumns(
        #"Promoted Headers",
        {
            {"Product_Cost", each Number.From(Text.AfterDelimiter(_, "$")), type number},
            {"Product_Price", each Number.From(Text.AfterDelimiter(_, "$")), type number}
        }
    ),
    //the next function removes the row at the bottom of the table
    #"Removed Bottom Rows" = Table.RemoveLastN(#"Remove Currency", 1),
    //the following carries out the filter on the Product_Cost column (the threshold is illustrative)
    #"Filtered Products" = Table.SelectRows(#"Removed Bottom Rows", each [Product_Cost] > 50)
in
    #"Filtered Products"
IMPORTANT NOTE
As you can see, it’s important to add descriptions using comments here.
This will also help if you need to share this with others so they can
understand what has been built and why.
Now that you have a basic understanding of the M language, you will begin
to look at how you can optimize your M with the use of parameters in the
next section.
To effectively work with different data sources, you need a good grasp of
M’s capabilities for more dynamic and flexible data transformations. It’s
essential to understand how M can be used to customize and parameterize
data source connections. In short, this will help you to enhance reusability
and facilitate easy adjustments to queries without manual code
modification, thus saving you time later down the line.
In the following sections, you will learn about how to actually go about
creating parameters with the help of examples for you to follow along and
implement in your own environment.
Creating parameters and variables
Now that we have created that parameter, we can then add it to our M by
referencing the parameter where the file path previously was. The following
example shows just the beginning of an M query that we have adjusted to
include the reference of the parameter we created:
let
Source = Excel.Workbook(File.Contents(DatasourcePath), null, true)
This now means that, should we need to change the data source to, say, a
production environment, we could simply swap the file path for the
production data source by updating the parameter's value.
Now, there are other steps you can perform to add more logic to your code,
such as the following examples.
Example 1 – using parameters in your SQL server
connection
Suppose you have data stored in different SQL server databases, and you
want to use parameters to adjust the connection. First, we would start by
creating a custom query function by right-clicking in the query pane and
selecting a blank query as shown:
Figure 7.4 – Screenshot of the menu when right-clicking in the query pane
within Power Query
Let’s see a breakdown of this example and what is happening during the
query:
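As a hedged sketch of what such a function might look like (the
ServerName and DatabaseName parameter names are illustrative
assumptions), a custom query function taking the server and database as
inputs could be written as:

// Hypothetical custom function – invoke it with different server/database values per environment
(ServerName as text, DatabaseName as text) as table =>
let
    Source = Sql.Database(ServerName, DatabaseName)
in
    Source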
If you are using deployment pipelines in the Power BI service, you can then
set up rules that adjust the parameters when you deploy from test to
production workspaces, for example, as shown in the following screenshot.
Data source rules are also available when using this service, but these only
allow you to adjust the server name for the connection and only support
certain data sources.
Figure 7.5 – Screenshot of the menu when configuring deployment rules
within a deployment pipeline in the Power BI service
If you didn’t use parameters for this, you would need to manually go into
your M query to adjust the server name and table name from within
Sql.Database. You could also potentially do this in the Power Query
GUI, but again, this would be manually done during deployment to
production workspaces.
There are often times when you will need to change the filter of a query to
obtain different results. Now, it would be a lot simpler if you could do this
without either editing the query or making slightly different copies of the
same query. In this example, we will change the data source filter from
CSV to Excel. You can then use a parameter and conditional logic to select
the appropriate data source function.
let
    DataSourceType = "CSV", // or "Excel"
    ConnectToDataSource =
        if DataSourceType = "CSV" then
            Csv.Document(File.Contents("C:\Local\Files\data.csv")) // hypothetical local path
        else
            // Web.Contents (not File.Contents) is required for a URL; the address is hypothetical
            Excel.Workbook(Web.Contents("https://example.com/data.xlsx"))
in
    ConnectToDataSource
If you create a parameter as seen in the earlier steps of this section, you
would then be able to use the parameter in the DataSourceType
variable. This could be set to a list of values that included CSV and Excel,
as shown in the following configuration:
Figure 7.7 – Screenshot of the new parameter configuration named
Datasource Filter
Now let’s evolve the previous code and build upon it. To follow along,
ensure you have downloaded the Excel file named
AdventureWorks_Sales_2017.xlsx from GitHub. In this example,
we are going to connect to a CSV file (labeled
AdventureWorks_Sales_2017.CSV) in the GitHub repo if the
parameter is set to CSV. Otherwise, our query will open the local Excel file:
let
    ConnectToDataSource =
        if DataSourceType = "CSV" then
            let
                csvurl = "https://raw.githubusercontent.com/PacktPublishing/Data-Cleaning-with-Power-BI/main/AdventureWorks_Sales_2017.CSV",
                csvcontent = Text.FromBinary(Web.Contents(csvurl)),
                csvtable = Csv.Document(csvcontent, [Delimiter=","])
            in
                csvtable
        else
            let
                // the local file path and sheet name below are assumptions for illustration
                Excelworkbook = Excel.Workbook(File.Contents("C:\Users\GusFrazer\Downloads\AdventureWorks_Sales_2017.xlsx"), null, true),
                Sales_Sheet = Excelworkbook{[Item="Sales",Kind="Sheet"]}[Data],
                #"Promoted Headers" = Table.PromoteHeaders(Sales_Sheet),
                #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers", {}) // add detected column types here
            in
                #"Changed Type"
in
    ConnectToDataSource
7. Select Done.
In this example, you can see we introduced nested let statements to load
CSV data directly from a specified URL as well as to load and transform an
Excel workbook.
The CSV nested let statement takes the URL, retrieves the binary content,
and converts it to text. The last variable then parses this into a table using
the Csv.Document function.
Now it isn’t actually necessary to add the Excel Workbook nested let
statement; however, if you simply used the following code, Power Query
would not know that you would actually want to extract the data from the
sheet within the file:
else
    Excel.Workbook(File.Contents("C:\Users\GusFrazer\Downloads\AdventureWorks_Sales_2017.xlsx"))
So, you would see the following output when you switch the parameter to
Excel:
Figure 7.8 – Screenshot of the output using the previous code when the
parameter created is set to Excel
By using the nested let statement for Excel, we can then add additional
logic to the query, which then opens the specific sheet we want to view, and
carry out the transformations that are usually carried out by default when
opening an Excel sheet using Power Query.
This result will show the same data but you will have the option to specify
from the drop-down parameter whether you are using a local copy or the
version hosted on GitHub.
The last example is one I see used often, particularly with companies
looking to combine financial files or exported data files.
Suppose you have a folder containing several CSV files that you want to
combine into a single table. You might want to create a parameter to filter
the data for a specific date (another very common use case for a parameter).
To do this, let’s first walk through how you might do this in the Power
Query UI and then inspect the query in Advanced Editor to add our
parameters.
To follow along, ensure you have downloaded and extracted the folder
named AdventureWorksSales from within the ZIP file named
AdventureWorksSales.zip from GitHub:
4. The next window will allow us to specify the path to the folder
containing our CSVs. Here, we can either type out the file path, select
Browse to navigate to it, or create a new parameter using the dropdown
under the folder path. Select Browse… and then navigate to the folder you
have downloaded. Select OK.
Figure 7.10 – Screenshot of the user entering the folder path into the
connector
Figure 7.11 – Screenshot of the preview of the files within the selected
folder
6. This will then open the Combine files configuration window to specify
the sample file for transformation. The Combine files transform
identifies the file format (text, Excel, or JSON) and enables the extraction
of a specific object from the first file. It automatically creates an example
query containing the extraction steps, linked to a function query that
parameterizes the file/binary input. The function query is then applied to
each row of the original query (such as the Folder query), and the
extracted data is expanded as top-level columns. Select OK, which will
then create a number of queries automatically.
7. Under the Other queries folder, select AdventureWorksSales.
By selecting the AdventureWorksSales query, you can then see the output
from the functions that have created a table from the data in the CSV files
from the folder you selected. The following example includes a column
called Source.Name, which takes the CSV filename to identify which
file that row was extracted from.
As we can see from the Query Settings window, there are a number of
steps that have been carried out on this table. However, before we dive into
this, it’s worth highlighting what Power Query has created in the
background as we were loading this data, which can be seen on the left-
hand side under the queries window. These are created in the folder named
Helper Queries:
Sample File: This query selects the first file from within the folder
that we specified.
Parameter1: Power Query created a binary parameter automatically
with a current and default value of Sample File. It is used to select
the file in the previous query.
Transform File: This is a custom function that is updated directly
from the query called Transform Sample File. This function is
then used in later queries to carry out the transformations performed on
the sample file on all the files in the folder.
Transform Sample File: This is a query that uses the parameter
to open the sample file. There is, by default, one transformation
applied (promoted headers). Here, you would be able to add further
transformations that would then be applied to all the files in the folder.
Here is a further description of what each of these steps is carrying out and
the M query behind this to help you develop your knowledge of M:
Source:
Description: Retrieves information about all files in the specified
folder.
M query:
Source = Folder.Files("C:\Users\GusFrazer\Downloads\AdventureWorksSales")
Filtered hidden files:
Description: Excludes hidden files from the retrieved file
information/folder.
M query:
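#"Filtered Hidden Files" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true)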
Renamed columns:
Description: Renames columns for clarity of which file the data is
coming from.
M query:
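#"Renamed Columns" = Table.RenameColumns(#"Filtered Hidden Files", {{"Name", "Source.Name"}})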
Now, this was a great introduction to how Power Query intuitively creates
the steps needed to combine these files into one table, creating parameters,
functions, and other queries along the way to make it simple for users like
yourself to get started.
let
    FolderPathParameter = "C:\Users\GusFrazer\Downloads\AdventureWorksSales",
    Source = Folder.Files(FolderPathParameter),
    // parse each binary file as a CSV and promote its header row before combining
    CombineFiles = Table.Combine(
        List.Transform(Source[Content], each Table.PromoteHeaders(Csv.Document(_)))
    ),
    #"Changed Type" = Table.TransformColumnTypes(
        CombineFiles,
        {
            {"ProductKey", Int64.Type},
            {"CustomerKey", Int64.Type},
            {"TerritoryKey", Int64.Type},
            {"OrderLineItem", Int64.Type},
            {"OrderQuantity", Int64.Type}
        }
    )
in
    #"Changed Type"
5. Select Done.
6. In short, this M query creates a folder query that loads all CSV files from
the specified folder and combines them into a single table. The output
will be identical to the other query you created earlier in this section
named AdventureWorksSales (where we used the UI to select a folder
path and combine the files); however, you have created this with a much
smaller query.
We have only touched the surface of parameters and variables, but as you
can see, parameters are a key part of the M language. In this section, you
saw a number of good examples of how parameters can help you switch
between different data sources, but there are many applications where you
can incorporate this into your own analysis.
Summary
Overall, you gained valuable insights into the power of M for working with
diverse data sources, customizing connections, and efficiently transforming
data in preparation for analysis and visualization in Power BI.
In the next chapter, we will begin to delve deeper into the functionalities
within Power BI that can aid and assist you with carrying out Exploratory
Data Analysis (EDA).
Questions
Imagine having a tool that not only helps you explore your data but also
illuminates the path to better decision-making. Power BI’s data profiling
features do just that. In this chapter, we will guide you through the process
of activating and using these features effectively.
Understanding EDA
Exploring the data profiling features in Power BI
Reviewing column quality, distribution, and profile
Turning data profiles into high-quality data
By the end of this chapter, you’ll be armed with the skills and knowledge
needed to elevate your data analysis game with Power BI’s data profiling
capabilities.
Understanding EDA
With this basic understanding of EDA, let’s see what features within Power
BI will help you carry out EDA on your data.
These data profile views can be accessed by opening Power Query and
selecting the View tab.
As we can see in Figure 8.1, within the Data Preview section of the View
tab, there are three column-specific previews, which are Column quality,
Column distribution, and Column profile:
In the next sections of this chapter, we will dive deeper into what each
section can provide to help you profile your data.
In the following sections, we will explore each view and how they report
information.
Column distribution
The Column distribution view in Power BI and Power Query allows you
to explore the distribution of values within a specific column. It provides
visualization and statistics that help you understand how data is spread
across different ranges or categories.
Let's see an example from the dataset provided in this book's GitHub
repository. Here, we have only selected Column distribution and have
then zoomed in on a particular column called OrderQuantity.
The example shows that there are 16 distinct values within the quantity
data, with most of the records being within the first 2 bars. We will later
dive deeper into the insights of this when we activate the column profile.
This view also lets us know about unique values, that is, values that appear
only once; these can indicate potential outliers that might skew the readings.
Column quality
The Column quality view assesses the quality of data in a specific column.
It helps you identify data issues, such as missing values, data type
inconsistencies, and potential problems that may affect your analysis. If we
activate Column quality in the previous view, we can see the following:
Figure 8.3 – Column quality and distribution for the column named
OrderQuantity
Column profile
Lastly, the Column profile view in Power BI and Power Query combines
information from both the Column distribution and Column Quality
views to provide a comprehensive profile of a specific column.
When switching on the data profile within Power Query, you can expect the
Column statistics and Value distribution windows to open at the bottom
of your screen:
Figure 8.4 – View of the column profile for column named OrderQuantity
On the left, we can see a number of statistics that help us gain an insight
into the profile of the data within this column. The availability of these
statistics can vary depending on the nature of the data in that particular
column.
Count (total number of rows): This statistic shows that there are 1,000
rows in the column, which means your dataset contains 1,000 data points
for this particular variable. It is worth noting here that when Power
Query is loading a sample of data, it normally loads the first 1,000 rows,
so the dataset could actually contain more rows.
Error: The absence of any Error values (a count of 0) suggests that
there are no data quality issues or errors in this column. All the data
appears to be correctly formatted.
Empty: Similarly to Error, a count of 0 for Empty indicates that there
are no missing or blank values in this column. Every row in the column
contains data.
Distinct: The Distinct count of 16 tells you that there are 16 unique
values in this column. This is helpful for understanding the variety of
data within the column.
Unique: The Unique count of 5 indicates that there are 5 values that
occur only once in the column. This provides insight into the diversity of
values in the data.
NaN (Not a Number): The count of 0 for NaN values suggests that
there are no values in the column that are classified as Not a Number.
This is typically relevant for numeric columns and indicates that all data
points are numeric.
Zero: The count of 0 for Zero values indicates that there are no data
points with a value of 0 in this column.
Min (minimum value): The minimum value of 1 is the smallest value
present in the column. This statistic provides insight into the lower range
of data values.
Max (maximum value): The maximum value of 26 is the largest value in
the column, giving you an understanding of the upper range of data
values.
Average (mean): The average of 2.537 represents the arithmetic mean of
all values in the column. It provides a measure of the central tendency.
Standard Deviation: The standard deviation of approximately 2.046
signifies the dispersion or spread of data values around the mean. A high
standard deviation suggests high variability in the data.
Even and Odd: These statistics categorize the data into even and odd
values. You can see that there are 457 even values and 543 odd values,
providing insights into the distribution of these two types of values in the
column.
Now you have built a good foundation of knowledge of what the column
quality, distribution, and profile views within Power Query are and how
they can be used to gauge the cleanliness of your data. In the next section,
you will dive deep into how they can be used to clean your data.
Along with the capability to provide insights on what this data represents,
Power BI has now also made it easier to act on the insights provided. These
features aim to help cleanse and transform the data faster so you can gain
insights quicker. We’ll review them in the next sections.
You can leverage the quick actions shown in the previous screenshot to
clean and prepare the data you are analyzing more efficiently. Keep in mind
that it might not always be the correct action to take; it depends on your
analysis. For example, if you needed to see how many repeat orders there
were within a sales order, it might not make sense to remove the duplicate
line item numbers.
Value distribution
From within the Column profile view, we can see a large breakdown
histogram of a column that we have selected. This view can be grouped on
a number of different properties that are accessible in the top-right corner.
Similar to the previous section of this chapter, if you hover over a particular
value, you will see the following information:
Figure 8.6 – The value distribution window within Power Query
highlighting smart actions to filter the data to the value selected
Power Query will then add a filter step within your query:
Figure 8.8 – The Applied Steps window within Power Query detailing the
step added by selecting the equals action from Figure 8.7
Summary
In this chapter, you were given an overview of Power BI’s data profiling
capabilities that can help with EDA. Beginning with an introduction to
EDA’s core concepts, you navigated the data profiling landscape within
Power BI, mastering the interpretation and utilization of column quality,
distribution, and profile views.
You also learned how to use the Column distribution view to explore data
distributions and identify unique values and potential outliers. The Column
quality view allows you to assess data quality, including validity, errors,
and empty values, ensuring data integrity. Furthermore, the Column profile
view combines data distribution and quality insights into a comprehensive
profile of each column, offering a holistic understanding of your data.
In the next chapter, you’ll further enhance your data preparation skills,
enabling you to tackle even more complex data challenges and optimize
your datasets for robust analysis and visualization in Power BI.
Questions
So far in this book, we have covered several techniques that range from
basic to intermediary. Now, we are ready to progress to the next stage of
this book, where we will explore more advanced techniques for cleaning,
preparing, and enhancing data for analysis within Power BI and its toolkit.
Using Power Query Editor from within Dataflow Gen1 – fuzzy matching
and fill down
Using R and Python scripts
Using ML to clean data
By the end of this chapter, you will have learned how to build complex
functions within Power Query. In addition to this, you will also learn how to
begin bringing other advanced languages such as R or Python into your data
model.
To follow along with the AutoML example and walk-through, you will
require access to Dataflow Gen1 and a Power BI premium workspace.
Once R is installed, you will need to run the following code in your R
console to ensure the necessary packages are installed:
> install.packages("dplyr")
To run Python scripts in Power BI Desktop, you will need to install Python
on your local machine. You can download Python from the Python website:
https://www.python.org/. Further details on installing Python are also
available at https://learn.microsoft.com/en-us/power-bi/connect-
data/desktop-python-scripts.
Data for the examples used in this chapter can be found in the following
GitHub repo: https://github.com/PacktPublishing/Data-Cleaning-with-
Power-BI/blob/main/online_shoppers_intention.csv.
Fuzzy matching
Fuzzy matching uses a similarity score to determine how closely two strings
match. It assigns a similarity score between 0 (no similarity) and 1 (perfect
match) based on the comparison of two strings. The higher the similarity
score, the more likely it is that the strings represent the same entity.
There are a number of use cases for how fuzzy matching can help when
preparing data but here are just a few of them:
Merging duplicate records: Fuzzy matching can help you identify and
merge duplicate records, even when there are slight variations in the data
Standardizing text: If your data includes variations of the same text (for
example, “USA” and “United States of America”), fuzzy matching can
standardize it
Handling misspelled names: When dealing with names or product titles
with spelling variations, fuzzy matching can identify and link similar
entries
It’s important to note that features of fuzzy matching can only be used from
within the Power BI web service. Here is a walk-through of how this can be
implemented in Power BI:
Figure 9.2 – Screenshot of the cluster menu from within Power BI/Power
Query
7. Configure settings for the cluster: Adjust the name of the new column.
Here, you can also configure the fuzzy cluster by adjusting the similarity
threshold.
8. Review the results: In the following figure, you can review the results
of the newly created column:
Figure 9.3 – Preview of the data after adding a new column and configuring
the cluster option within Power BI/Power Query
In this example, we can see that by using fuzzy matching, we have been
able to use smart technology within Power Query to automate the correction
in the data.
IMPORTANT NOTE
It is crucial that the transformation table you use includes the columns
named From and To. Without these named columns, Power Query will not
recognize the table as a transformation table, resulting in no
transformation.
Fill down
Data gaps are common and can lead to inaccurate analyses. The fill down
technique in Power BI’s Power Query Editor from within Dataflow Gen1
can help fill in these gaps, creating a consistent and complete dataset.
Fill down is an operation that copies the value from the previous row to fill
in missing values in the current row. This is particularly useful when you
have a series of data points with occasional gaps.
Here are some use cases where fill down might be best applicable:
Time series data: When working with time series data, fill down can
populate missing values in a time sequence, ensuring the continuity of
your data
Categorization: If you have categorical data where values should
remain consistent until changed, fill down can help maintain the
continuity of categories
Calculations: In situations where you need to perform calculations on
missing values, fill down can prepare the data for mathematical
operations
1. Select your data: Open the Power Query Editor from within Dataflow
Gen1, and select the column with missing values (in this example, we
can see that the Wheels column has a number of rows with missing
values). We know that the data is looking at cars, so we can assume that
all the cars in this category of data have four wheels, making it an ideal
example to use the fill feature. Navigate to the Transform ribbon on the
toolbar and choose the Fill Down option. You can also choose to fill up
from this button.
2. Configure fill down: Choose the desired options, such as the direction
(down or up) and the handling of null values. In this example, we will
select Down to fill down. Once selected, Power Query will apply the fill
down operation to your data, and the gaps will be filled automatically.
Figure 9.4 – Screenshot of the fill down function available within the Power
Query Editor view from the Dataflow
Compared to fill down, fill up works in a similar manner but in the opposite
direction: populating rows with values from the row below.
Here are just some useful best practice tips when working with fuzzy
matching and fill down within the Power Query Advanced Editor:
Data backup: Always create a backup of your data before applying any
changes, especially when using fuzzy matching, as it may result in data
consolidation that cannot be undone.
If you are in Power BI Desktop, you could simply save a local copy of
your file or use integrations such as GitHub to help with rolling back to a
previous version of your changes. From dataflows, though, this is not as
simple: although you can export your query to JSON, you can't exactly
save a local copy of the query. So, one method I have always found safest
is to duplicate the query before making adjustments. That way, you always
have a rollback copy of the query should you need it.
To do this, follow these steps:
1. Duplicate the query: In the Power Query Editor, duplicate the query
you are working on by right-clicking on the query you’d like to back
up. From the menu that appears (shown in the following figure),
select Duplicate. This creates a copy of your data transformation
steps without affecting the original data.
Figure 9.5 – Screenshot of the menu that appears from right-clicking a
query within the Power Query Editor from within a dataflow
2. Rename the original query: Right-click on the original query and use a
clear and distinct name, indicating that it is a backup or a copy. This
helps in easily identifying and distinguishing the backup from the main
query. For example, a name such as
mtcars_ROLLBACK_GF_05022024 provides clear detail that this is
a rollback version of the query and when it was created.
3. Disable the loading of the backup query: To prevent the dataflow from
duplicating efforts or impacting performances, it’s best to disable the
loading of the rollback copy of the query. The Enable load option
allows users to control whether the results of a specific query should be
loaded into the data model of Power BI or other destinations. By default,
Power Query loads query results into the data model for analysis, but
clicking on Enable load (as shown in Figure 9.5) enables the option to
disable loading. This is also useful for performing intermediate data
transformations or cleaning without adding unnecessary data to the final
dataset. This feature helps manage the size and content of the Power BI
data model.
As you have learned throughout this book, Power BI’s Power Query Editor
from within a dataflow or Power BI Desktop is a versatile tool that enables
users to shape, transform, and cleanse data before it’s used for analysis and
visualization.
While the built-in capabilities that have been highlighted in the previous
sections are robust, sometimes you may encounter data manipulation tasks
that require more advanced processing. This is where the ability to run
custom scripts in languages such as R and Python comes to the rescue.
Before jumping into how you might go about doing this, it’s worth
understanding why you might use these languages. Custom data scripts
enable users to extend the capabilities of Power BI by leveraging the rich
ecosystems of R and Python.
Although this chapter will not go into depth about how to code in these
languages, here are some scenarios where you might consider using custom
scripts:
Complex transformations: When your data transformation needs to go
beyond the standard capabilities of the Power Query M formula
language, R or Python scripts provide a way to perform complex
operations.
Statistical analysis: If you need to run statistical analyses, predictive
modeling, or machine learning algorithms on your data within Power BI,
R or Python scripts are your allies.
Third-party libraries: Access to the extensive libraries available in R
and Python can be a game-changer. From natural language processing to
image analysis, you can tap into a vast array of pre-built functions and
packages.
Data integration: When you want to integrate data from various
sources, databases, or web services, using custom scripts allows you to
fetch and process the data seamlessly.
There are many advantages to using R and Python scripts in Power BI,
from access to rich statistical libraries to reusable, shareable code.
Let’s walk through a basic example of how to use R and Python scripts
within Power BI’s Power Query Editor. It’s important to note that, in this
section, we will need to jump back into Power BI Desktop as dataflows do
not currently support the use of R or Python scripts. In this example, we’ll
use the built-in mtcars sample dataset to carry out some simple
transformations using R and Python. It's important to note that these are just
to demonstrate how to leverage these languages in Power BI; the simple
transformations in this example could also be done using Power Query. The
code has also been added to the GitHub repo should you wish to copy it
from there (given that Python is indentation-sensitive, copying directly from
the snippets may otherwise introduce errors).
Figure 9.6 – Screenshot of the Get data menu from within Power BI
Desktop
2. Enter the following code: This will open a dataset that is available by
default within R, which we can use for an example:
data(mtcars)
3. Select Transform data: You need to select the Transform data option,
which will open the Power Query Editor, showing the following
preview:
Figure 9.7 – Screenshot of the data preview within Power Query
# R Script Example
library(dplyr)
# 'dataset' is the input table that Power BI passes to the R script
output <- dataset %>%
  group_by(gear) %>%
  summarize(AverageMPG = mean(mpg))
6. Apply the script: Click on the OK button to apply the R script, and the
result will be returned as a new summarized table.
7. Select privacy: For the R script to work properly in the Power BI
service, all data sources need to be set to public.
The output of this script will transform the data showing the average mpg
value per gear category, as shown:
As you can see from this example, this has returned a simplified view of the
data calculating that those cars with four gears have the optimum mpg
value. It’s important to note that the simple grouping transformations we
carried out in the previous example can also be done in Power Query using
the grouping functions. The key aim is to give you an introduction to how
to actually leverage your R scripts in Power BI.
The steps to use Python for the same transformations in Power BI are as
follows:
1. Repeat steps 1–3 from the previous example: Connect to the same data
as the previous example.
2. Invoke Python script: Under the Transform tab, choose Run Python
Script.
TIP
3. Write Python script: Write your Python script to perform the desired
transformation. To match the R example, you could use Python to
calculate the average MPG per gear category:
import pandas as pd
# group the input table by gear and compute the mean mpg per group
dataset = pd.DataFrame(dataset).groupby('gear', as_index=False).agg(AverageMPG=('mpg', 'mean'))
4. Apply the script: Click on the OK button to apply the Python script,
and the result will be returned as a new summarized table.
5. Select privacy: For the Python scripts to work properly in the Power BI
service, all data sources need to be set to public (out of the Public,
Organizational, or Private options).
Figure 9.9 – Screenshot of the privacy levels from the Python script
IMPORTANT NOTE
Privacy levels in Power Query define the sensitivity of data sources and
control their visibility and interaction with each other. When a data source
is set to Private, it contains confidential information, and access is
restricted to authorized users. Organizational privacy levels allow data
sources to fold into private and other organizational sources, with visibility
limited to a trusted group. Public privacy levels apply to files, internet
sources, and workbooks, allowing data to fold into other sources while
being visible to everyone in the business. Examples of the Public privacy
level include freely available data from the Azure Marketplace or
information from a public web page stored in a local file.
To highlight, though, setting this to Public doesn’t mean that your data will
be made accessible to anyone and everyone outside your company. If you
set a data source to the Public privacy level, it means that the data from
that source can be shared and combined with other data sources, but it
doesn’t imply that the data itself is accessible externally. The privacy levels
primarily control the interaction and integration of data within the Power
BI environment. Regardless of the privacy level, the actual security and
access control of the data are managed by the underlying data sources and
the Power BI service.
6. Open result table: The script will return the view shown in the
following figure. Select the Table link next to the result value to
complete the script.
Figure 9.10 – Screenshot of the results from the Python script
The results that match the results from the R script are shown here:
Figure 9.11 – Screenshot of the results from the Python script after selecting
Table from within the Value column
R and Python scripts provide Power BI users with an expanded toolkit for
data transformation and preparation. This feature opens up opportunities for
advanced data manipulation, statistical analysis, and machine learning
within the familiar Power BI environment. As you become proficient in
using scripts, you’ll unlock the full potential of your data and create more
insightful and sophisticated reports and visualizations.
Data cleaning is often the first step in the data preparation process.
Anomalies or outliers in the data can skew analysis results and compromise
the quality of reports. Power BI’s built-in anomaly detection feature can
automatically identify and flag data points that deviate significantly from
the norm. Power BI’s anomaly detection leverages machine learning
algorithms to detect data points that are statistically different from the rest.
Users can set the sensitivity level to control the number of anomalies
detected.
Power BI also offers AutoML, which allows you to build machine learning
models directly within dataflows. Let's put this into practice:
1. Create a new dataflow: From within a Premium workspace in the Power
BI service, select New and then Dataflow.
2. Select Define new table: Out of the four presented options, select
Define new table.
3. Select Text/CSV: You will be presented with many connector options to
connect to your data or simply upload a file. Here, we can select
Text/CSV and then use the following file path linked to the GitHub
repo: https://raw.githubusercontent.com/PacktPublishing/Data-Cleaning-
with-Power-BI/main/online_shoppers_intention.csv.
Select Next to continue.
Figure 9.13 – Screenshot of the Choose data source window from within a
new dataflow in the Power BI service
4. Select Transform Data: This will open the Power Query view from
within the dataflow. Here, you can choose to carry out further EDA or
cleaning on the file before we add AutoML to this.
5. Select Save & Close: This will save your dataflow as is.
6. Name your dataflow: Here, we will be prompted to add a name to the
dataflow, as shown:
Figure 9.14 – Screenshot of the Save your dataflow window from within the
Power BI service
7. Next, you will be presented with a view from within the dataflow, which
shows the tables currently being returned from this dataflow. It's
important to note that the dataflow may contain more queries than the
tables shown here if some tables/queries have the Enable load option
unselected. An example of the actions that will appear on the right-hand
side of the dataflow is shown here:
Figure 9.15 – Screenshot of the actions available to apply against a
particular table or query within a dataflow
Before continuing with the example steps, you might be interested in what
each of these actions does. Here is a breakdown summary for you to better
understand the outcome of each button:
Edit Table: This will allow you to edit or transform the table, effectively
reopening the Power Query window from within the dataflow.
Apply ML Model: This will allow you to apply the ML model to the
selected table, which you will be doing shortly.
Properties: This will open the Properties window for this particular
table, including, most importantly, the description of the table. Here, you
can add further detail on what information is held in this table.
Incremental Refresh: This will open the window shown as follows,
which will allow you to configure Incremental refresh on this particular
table of the dataflow. This can be extremely useful when you have large
data tables within your dataflow. Note that you will need a Date column
in your table in order for this to work.
Figure 9.16 – Screenshot of the Incremental refresh window within a
dataflow where users can set up and manage their incremental refreshes
8. Select Apply ML Model and then select Add a machine learning
model: In the dataflow editor, click on Add a machine learning model
to begin our model.
9. Select tables and columns: Next, select the tables and columns we would
like to use within our model predictions, as shown. In this example, let’s
try to predict whether someone will revisit at the weekend.
Figure 9.18 – Screenshot of the model creation window highlighting which
table and column we would like this model to apply to
10. Choose the model: The next window will present you with the option to
select which classification model to use based on the column you
selected in the previous step. Here, we only have the option to select the
Binary Prediction classification because the field we selected,
Weekend, is a true/false field. You can continue by selecting true in
the Choose a target outcome field and selecting Next.
Figure 9.19 – Screenshot of Choose a model from within the model creation
wizard
Figure 9.21 – Screenshot of the final window to name your model and set
the allotted training time for the model
13. Review and apply: Review the suggested settings, adjust if needed, and
apply them to your model to complete the setup. It’s important to note
that the training of the model doesn’t start until you click on the Save
and Train button.
CONNECTIVITY ISSUES
If you get errors from within your dataflow, this might be due to your
connections. To adjust this, simply select Edit the dataflow; this will
surface warnings across the new ML tables suggesting that you need to edit
credentials to access the tables. Alternatively, this can be done by
selecting the Settings gear in the header bar and then selecting Power BI
settings. This will open the following settings page, where you can then
navigate to the Dataflows tab. Here, you will see the settings for your
particular dataflow, including credentials. Then, simply edit the
connections that have warning signs, as shown in the following figure.
Figure 9.22 – Screenshot of the settings window for the dataflow we have
named Online Visitors detailing connection credential issues
14. Return to the dataflow: Now, return to the dataflow and select Machine
learning models, as shown, to see the models we have created for this
dataflow. Here, you will see that the model status has now been changed
to Trained.
Figure 9.23 – Screenshot showing the model created and the status, which
is Trained
Once the machine learning model training has completed and the time allotted
for learning has passed, you will be able to open the training report,
which provides details on the results of the model training.
15. Select View Training Report: The second action button will allow you
to view the training report of the model you just created. The following
example shows the WeekendVisitors model.
Figure 9.24 – Screenshot showing the model training report
This report allows you to review performance and statistics for the model
created, with the ability to explore the top predictors as well as the options
to adjust the probability thresholds of your model from recall to precision.
If you are new to machine learning, you may not have come across terms
such as predictors or probability thresholds before. If that's the case,
take note of the next paragraphs.
Training details provide a further breakdown of how the model was trained.
Having reviewed the model, we are now in a position to apply the model to
our dataflow entity (the original table we loaded). You have to select Apply
model to invoke the model. This will present the following screen and
allow you to specify which dataflow entity to apply this model to:
Figure 9.25 – Screenshot of the confirmation window to apply the model to
our original table
By selecting Save and apply, Power BI creates two new tables, with the
suffix enriched WeekendVisitors (or whatever you have named
your model). This is to differentiate between the original table and the
enriched table of data we have just created.
In this case, applying the model to the Online Visitors table creates
the following:
online_shoppers_intention enriched
WeekendVisitors, which includes the predicted output from the
model. The following example shows the prediction of TRUE for a visit
at the weekend as well as a scoring for this.
Figure 9.26 – Screenshot shows the results from the predictions with an
outcome and prediction score
online_shoppers_intention enriched
WeekendVisitors explanations, which contains top record-
specific influencers for the prediction.
Applying the prediction model to our original table creates the preceding
new tables and adds four columns to the original columns that were in the
table: Outcome, PredictionScore, PredictionExplanation,
and ExplanationIndex, each with a WeekendVisitor prediction
prefix.
In this example, in the top row, you can see that although a visitor didn’t
visit at the weekend, the prediction model has predicted that they would
visit at the weekend with a score of 54. This data can then be used within a
visualization to tell a further story about the prediction data.
Although we won’t deep dive into the queries within this section, it’s
important to note that newly created dataflow tables use M queries in the
background to action the ML model. The following is an example
screenshot of this from the table named
online_shoppers_intention enriched WeekendVisitors:
Figure 9.27 – Screenshot of the M query created for the
online_shoppers_intention enriched WeekendVisitors table
Here, we can see that the previous query then invokes a function that was
created earlier in the background called WeekendVisitors.Score.
This function consists of a more complex query to actually run the model
and apply a score, as we can see in the following example:
Figure 9.28 – Screenshot of the M query created to output the
WeekendVisitors.Score function
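While the generated code differs per model, a minimal sketch of the pattern,
using the table and function names from the screenshots (the internals of
the generated query are assumed here), looks like this:
let
    Source = online_shoppers_intention,
    // The AutoML-generated function scores each row, appending the
    // outcome, prediction score, and explanation columns
    #"Applied Model" = WeekendVisitors.Score(Source)
in
    #"Applied Model"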
In this section, you took your first steps in creating machine learning
models and applying them to your data. In the final section, you will learn
more about using AI insights within your data analysis and preparation.
AI Insights provides pre-trained models that cover various use cases, such
as sentiment analysis, image recognition, and text analytics. Users can
easily integrate these models into their Power BI reports. In order to use this
feature of Power BI, you will need a Premium license. Let's walk through an
example that uses Text Analytics to detect the language of social media posts:
1. From Power BI Desktop, select Get data and then select Text/CSV.
2. Enter the following URL and select Open:
https://raw.githubusercontent.com/PacktPublishing/Data-Cleaning-with-
Power-BI/main/Twitter%20Tweets.csv
3. Select Transform data to open the Power Query Editor.
4. Select Add Column on the ribbon toolbar. You will see AI Insights on
the far right-hand side of the ribbon, as shown:
Figure 9.29 – Screenshot of the Add Column ribbon in Power Query Editor
5. In the Power Query Editor, select the column you want to enhance with AI
Insights, and then choose the relevant AI model. In this case, we would
like to use Text Analytics, as we want to analyze the language of the
posts. This will open the AI Insights configuration window for Text
Analytics.
6. Select Detect language from the configuration window for Text
Analytics, as shown. Notice by default that it has already selected the
column named content. In the lower-left corner, you can also adjust
which Premium workspace you are using to run AI Insights.
Figure 9.30 – Screenshot of the Text Analytics AI Insights configuration
window in Power Query Editor
7. Data enhancement: The AI model will then process the data point and
generate additional insights, which will be added to your data as shown
in the following example:
Figure 9.31 – Screenshot of the output from using AI Insights to detect the
language of the social posts column in Power Query Editor
From these simple transformations, you could now add a further layer to
your social media analytics. Similar to the previous example in Figure 9.27,
Power Query has used M behind the scenes to connect to Cognitive Services
and carry out the AI Insights analysis. The following figure shows the M
query generated by Power Query to do this:
Summary
The chapter began by introducing the power of this tool and highlighted
two critical techniques: fuzzy matching, which identifies and consolidates
similar strings within your data, and fill down, which fills gaps in your
dataset with values from the previous row. We also outlined some best
practices for using these tools, emphasizing data backup, sensitivity
adjustment, regular validation, documentation, and the iterative nature of
data cleaning.
The chapter also introduced the concept of using custom data scripts in
languages such as R and Python, illustrating their benefits for complex
transformations, statistical analysis, third-party libraries, and data
integration.
By learning the advanced data cleaning and preparation techniques presented
in this chapter, you have significantly enhanced your skill set for working
with data in Power BI, acquiring proficiency in employing advanced tools
and methodologies to address common challenges in data quality and
consistency.
In the next chapter, you will delve into the creation of custom functions to
automate complex data manipulations and further enhance data preparation
capabilities.
Questions
1. What are the two essential techniques discussed in the chapter for
cleaning and preparing data using the Query Editor in Power BI?
1. Fuzzy matching and fill up
2. Data profiling and sorting
3. Fuzzy matching and fill down
4. Data imputation and statistical analysis
2. In the context of fuzzy matching, what is the similarity score range, and
what does it indicate?
1. Range from 1 to 10, indicating similarity strength
2. Range from 0 to 100, indicating confidence level
3. Range from 0 to 1, indicating no to perfect similarity
4. Range from -1 to 1, indicating negative to positive correlation
3. When is the fill down technique in Power BI’s Query Editor particularly
useful?
1. When you want to skip data gaps
2. When dealing with categorical data
3. When you need to perform calculations on filled values
4. When working with time series data and maintaining data continuity
4. What is a crucial best practice emphasized when working with fuzzy
matching and fill down in Power BI?
1. Occasionally document the steps taken during data cleaning
2. Create and load multiple versions of the data
3. Regularly refresh your data
4. Regularly validate the results of data cleaning efforts and maintain
documentation
5. What is the primary purpose of using scripts in languages such as R and
Python in Power BI?
1. To replace Power Query
2. To add complexity to data models
3. To extend the capabilities of Power BI by leveraging external
ecosystems
4. To simplify the data cleaning process
6. In the steps for using R in Power BI, what is the purpose of the Run R
Script option in the Power Query Editor from within Dataflow Gen1?
1. To download R packages
2. To write R scripts
3. To visualize R outputs
4. To run machine learning models
7. What are the benefits of using R and Python scripts in Power BI for data
preparation?
1. Limited data manipulation capabilities
2. Integration with external data sources
3. Only suitable for simple transformations
4. Exclusively designed for statistical analysis
8. Which built-in machine learning feature in Power BI is used for
identifying and addressing outliers in data?
1. AutoML
2. AI Insights
3. Anomaly Detection
4. Predictive Modeling
9. What is the purpose of AutoML in Power BI?
1. Manual data transformation
2. Automated data preparation with machine learning suggestions
3. Exclusive focus on statistical analysis
4. Integration of pre-built machine learning models
10. What are some common use cases for AI Insights in Power BI?
1. Fuzzy matching and fill down
2. Feature engineering and data type conversion
3. Sentiment analysis, image recognition, and text analytics
4. Anomaly detection and statistical analysis
10
In this chapter, we will learn how to create custom Power Query functions
in Power BI, covering the planning process, parameters, and the actual
creation of the functions. We will start by understanding data requirements
and defining the function's purpose and expected output. Then, we will
learn about the different types of parameters and how to use them to make
functions more flexible and reusable. Finally, we will see, step by step,
how to write M code functions and how to test and debug them.
The first step in planning for a custom function is to clearly define the
problem you want to solve. What specific data transformation or
manipulation task do you need to perform regularly?
For example, you might need to calculate the rolling average of sales data
or create a custom date hierarchy. Identifying the problem is essential as it
serves as the foundation for building your custom function.
Identifying parameters
Parameters are the variables or values that your custom function will take as
input. They allow you to make your custom function versatile and
applicable to a range of scenarios. When identifying parameters, consider
the following (a short sketch follows this list):
Data inputs: What data or columns will your function operate on?
Thresholds and criteria: Are there specific thresholds or criteria that
users should be able to customize?
Data types: Will your function handle different data types, and how
should it behave with each type?
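As a minimal sketch, each of these considerations can surface as a typed
function parameter; the function name, column name, and threshold below are
all illustrative assumptions rather than anything from a real dataset:
let
    // Hypothetical sketch: a data input (inputTable), a customizable
    // threshold (thresholdValue), and explicit data types on both
    FilterByThreshold = (inputTable as table, thresholdValue as number) as table =>
        Table.SelectRows(inputTable, each [Discount] > thresholdValue)
in
    FilterByThreshold
Invoking FilterByThreshold(Sales, 0.2) would then return only the rows
whose Discount value exceeds 20%.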
Once you have defined the problem and parameters, having clear objectives
for your custom function is crucial. What do you aim to achieve with this
function, and what are the expected outcomes? Defining objectives helps
you stay focused and ensures that your custom function meets your data
transformation goals.
Using parameters
Power Query offers several types of parameters that you can use in your
custom functions:
Defining parameters
Let’s put the skills we’ve taken from the previous two sections in planning
and using parameters into practice as we begin by looking at a set of data.
In this scenario, we will be looking at scraping some data from a website
and then applying some transformations. This data will show us the top
racing drivers of all time as well as their first or last wins (most recent
wins). The problem here is that I only want to pull in one year and location
column. To get around this, it would be ideal if we could create a parameter
that can be used to select whether we want to pull in first or last win data.
The objective will be to then create custom columns that are dynamic and
change depending on which parameter is selected. Let’s get started by
opening Power BI and executing the following steps:
1. Selecting the connector: Select Get data and then select the Web
connector as shown:
Figure 10.2 – Screenshot of the Get data dropdown within Power BI
2. Input the URL: In the prompt shown next, enter the following URL and
then select OK:
https://en.wikipedia.org/wiki/List_of_Formula_One_Grand_Prix_winners
Figure 10.3 – Screenshot of the URL inputted into the web connector
window
3. This will bring up the Access Web Content window, where you can
configure authentication to connect to the web content. For this example,
we will use an anonymous authentication. Click on Connect to proceed.
4. Select the tables and transform the data: Once selected, the web
connector will scrape the website URL for possible data entries and
tables. The Navigator window (shown in the next image) will then
appear with possible table options from scanning the web URL. On the
left, you can see all the possible and suggested tables found within the
web page. On the right, you can see a preview of this as well as the
ability to switch the web page view, so you can also review this. In this
example, you are going to select Formula One Grand Prix Winners
and then select Transform Data.
Figure 10.4 – Screenshot of the Navigator window showing tables available
from scanning the web page
6. Use space as delimiter: Given the data, you want to extract two
columns from one, so in this example, we will select Left-most
delimiter to avoid creating multiple columns, as you can see:
8. Repeat the steps for the Last Win column: Now, repeat the previous
steps for the final columns. You’ll notice that when you use the split
columns feature, it automatically adds a step to change the type of
columns created by the split. This is to prevent an error from occurring
by leaving un-typed columns in Power Query.
9. Select Advanced Editor: We now want to rename the four columns we
have created. There are a number of ways we could do this; for example,
we could adjust the previous M code used to split the columns, add four
individual rename steps, or code a single rename step that adjusts all the
columns. In this example, you will take the last option: creating a rename
step in Advanced Editor.
10. Add the M code: The following M code should be added on the last line
before the in keyword (a sketch of such a step follows). Make sure to add
a comma (,) at the end of the preceding line and also adjust the value
after the in keyword:
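A minimal sketch of such a rename step, assuming the default names that
Power Query assigns to split columns (the step name and the original column
names here are assumptions), might look like this:
#"Renamed Columns" = Table.RenameColumns(#"Changed Type1", {
    {"First win.1", "First Win Year"},
    {"First win.2", "First Win Location"},
    {"Last win.1", "Last Win Year"},
    {"Last win.2", "Last Win Location"}})
Remember to change the value after in to #"Renamed Columns" so the query
returns the renamed table.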
11. Selecting parameter: On the Home tab within Power Query, we can
select Manage Parameters in order to create our parameter. There will
be a dropdown that allows you to select New Parameter.
12. Create your parameter: Given that we would like two different values
to select from, you will need to add these within the parameter settings:
1. Name the parameter as First/Last Selector.
2. Provide a description of what the parameter has been created for.
3. We can leave Type as Any.
4. In the Suggested Values dropdown, select List of values. This will
allow you to add the values you wish to use for this. Add the First
and Last values into the list, as shown in the following screenshot, to
be able to then select one of these values as a default value.
5. Default Value should be set as First.
6. Current Value should be set as First.
Figure 10.9 – Screenshot of the Manage Parameters window with inputs
selected
13. Add custom columns for first and last wins: Now that we have added
the parameter to achieve the output we would like, we can begin by
creating a custom column that will return the year:
1. Select Advanced Editor: Instead of using the Custom Column dialog,
we are going to script this directly in Advanced Editor. This is
because we also want to add a dynamic column title that changes
depending on which value of the parameter has been selected.
2. Add the following code to the bottom line of the code in Advanced
Editor, making sure to add a comma on the previous line to ensure
there are no errors (and taking care with casing, as M code is case
sensitive). The code dynamically changes both the title and the
contents of the column depending on which value of the parameter has
been selected:
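A minimal sketch of what that could look like, building on the renamed
columns from earlier and referencing the parameter as #"First/Last
Selector" (the step names and logic here are illustrative):
#"Added Dynamic Year" = Table.AddColumn(#"Renamed Columns",
    #"First/Last Selector" & " Win Year",
    each if #"First/Last Selector" = "First" then [First Win Year] else [Last Win Year]),
#"Added Dynamic Location" = Table.AddColumn(#"Added Dynamic Year",
    #"First/Last Selector" & " Win Location",
    each if #"First/Last Selector" = "First" then [First Win Location] else [Last Win Location])
Because the column titles are built from the parameter value, switching the
parameter automatically renames the new columns as well as changing their
contents.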
14. Remove unwanted columns: Now that we have created the dynamic
columns, as a last step, you will need to remove the previous columns.
As this step is happening last, it will not affect how the dynamic columns
work. To do this, select the columns as shown, right-click, and select
Remove Columns:
Figure 10.10 – Screenshot of the user selecting the columns to be removed
15. Switch parameter: Select First/Last Selector from the query list to
open the parameter as shown:
16. Select First from the dropdown: This will adjust the parameter value.
17. Select Formula One Grand Prix winners Query: This will present the
output of changing the parameter. As you’ll see, this will change the
columns in the data to show First Win Year and First Win Location.
The data is now prepared and ready for analysis. You have created a
dynamic table/query that can now be quickly transitioned to show a
different set of data at the switch of a parameter. In the next section, you
can review some of the best practices for using parameters in the future.
By adhering to these best practices, you can enhance the clarity, usability,
and reliability of your custom functions within Power BI, ultimately
improving the overall user experience.
Writing M code
After writing the M code for your custom function, it's vital to thoroughly
test and debug it. Test your function with different datasets and parameter
values to ensure it works as expected. Debugging tools in Power Query,
such as the formula bar and error messages, can help identify and fix issues
in your code. One of the most common bugs occurs when adding new code
in Advanced Editor and forgetting to add a comma at the end of the
preceding step.
Documentation
This has been mentioned a couple of times already, but it is a pivotal
concept across all tools that help you prepare data: proper documentation is
key to ensuring that users can understand and use your custom function
effectively. Provide clear instructions on how to use the function, including
explanations of parameters and expected outcomes. Good documentation
promotes the adoption of your custom function among users.
Let’s build upon the example in the previous section to create custom
functions. Two methods stand out when creating a custom function: using
the parameterized query you have created as a function, or starting a
function from a blank query. In the following example, we will cover both
scenarios.
Let’s assume that I need to create multiple copies of the table with the
different parameters selected. I could do this by using the entire query we
created previously as a function. To do this, follow these steps:
1. Right-click the Formula One Grand Prix winners query in the query
list and select Create Function….
2. Name the function First/LastWinsToTable and select OK.
3. The new function will now appear in the query list, as shown:
Figure 10.12 – Screenshot of the function created using the original query
As we can see in the query, our parameter is currently set to Last. If you
wanted to invoke our function to create a new table with last wins
highlighted, you could select Invoke from the function.
Figure 10.13 – Screenshot of the function created named
First/LastWinsToTable
4. Pressing Invoke will run the function query with the parameter selected
and output a new table/query that can then be renamed Last Wins by
right-clicking on the query name and selecting Rename.
Figure 10.14 – Screenshot of the table created by invoking that function
with the parameter set to Last
This example of creating a function can be really useful when you need to
apply the same transformations multiple times. The most common case is
when merging CSVs, where the parameter could be the file path, allowing
you to clean multiple files simply by pointing the parameter at the new file
and invoking the custom function; a sketch of such a function follows. This,
of course, works provided that the CSVs share a similar column structure.
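As a minimal sketch, assuming the files share headers and comma delimiters
(the function name, options, and cleaning steps here are illustrative):
let
    CleanSalesCsv = (filePath as text) as table =>
        let
            // Load the CSV from the supplied path
            Source = Csv.Document(File.Contents(filePath), [Delimiter = ",", Encoding = 65001]),
            // Use the first row as column headers
            Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
            // ... the same cleaning steps would then follow here
        in
            Promoted
in
    CleanSalesCsv
Invoking CleanSalesCsv with a new file path would then return that file
cleaned with the same steps.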
This is great, but let’s now take a look at how you might create your own
custom function from a blank query. Let’s build on the work you have done
so far with this data to assess each driver’s wins per active season. Given
the format of the data, we will need to start by doing some cleaning of the
data using the following steps:
1. Duplicate the query you have built named Formula One Grand
Prix winners.
2. Rename it Driver Wins.
3. Create a new group within the query page to separate this from the
previous example. Simply right-click on the query you just created,
select Move to Group, and then select New Group… from the sub-
menu.
4. Name the group CustomFunctionFromBlank and select OK.
5. Select the columns we would like to keep, which in this example will be
as follows:
1. Rank
2. Driver
3. Wins
4. Seasons active
6. From the Home ribbon, select Remove columns and then select
Remove other columns from the submenu.
7. Before we start creating a function, let’s first ensure the Seasons
active column is showing the correct values. We can see that some
entries are missing the end-year values (likely because those drivers are
still active). To correct this, we will use M from within Advanced Editor
(this could also be made into a custom function). Select Advanced
Editor from the Home ribbon.
8. Before adding the following, ensure you add a comma to the end of the
previous step.
9. Add the following code in the Advanced Editor window. This uses
nested expressions to split the column named Seasons active, then
uses if statements to extract the year ranges from the column; if the
end of a range is blank, it populates it with the year 2024. The logic at
the end then combines everything back together:
AdjustedTable = Table.TransformColumns(#"Removed Other Columns",
    {{"Seasons active", each
        let
            // Split the text by commas
            YearsList = Text.Split(_, ", "),
            // Process each item in the list
            AdjustedYearsList = List.Transform(YearsList, each
                if Text.Contains(_, "–") then
                    let
                        StartYear = Text.BeforeDelimiter(_, "–"),
                        EndYearText = Text.AfterDelimiter(_, "–"),
                        EndYear = if EndYearText = "" then "2024" else EndYearText,
                        AdjustedValue = StartYear & "–" & EndYear
                    in
                        AdjustedValue
                else _),
            // Combine the adjusted list back into a single text value
            AdjustedValue = Text.Combine(AdjustedYearsList, ", ")
        in
            AdjustedValue
    }})
10. Then, adjust the variable after the final in from #"Removed Other
Columns" to AdjustedTable. This should look like the following
screenshot:
Figure 10.15 – An example of the query used in Advanced Editor to correct
the missing years
11. The resulting output should have shaped the final column to ensure that
any drivers that are still active have the year 2024 shown. In the
following screenshot, we can see that Lewis Hamilton now shows 2007-
2024 instead of 2007-.
Figure 10.16 – Screenshot of the Seasons active column without the
missing years in the ranges
12. Next, in order to figure out the average wins by active seasons, we need
to cleanse the data to extract the number of years active. To do this, let’s
create a custom function from scratch so that we can then extract the
number of years a driver has been active.
13. From the Home ribbon, select New query and then select Blank query.
This will open a new query in the query window.
14. Rename this query YearsActiveFunction.
15. Select Advanced Editor so you can begin to build the function. As this
is a blank query, you will see the following prepopulated by default:
let
Source = ""
in
Source
16. Remove these lines to start from a completely blank query and then type
the following to define the function (note the => that begins the
function body):
let
    // Define the custom function to calculate years active
    GetTotalYears = (textValue as text) as number =>
17. Now that we have defined the function, we will begin the
transformations within a nested let. The first transformation we will
need to do is to split out the year ranges as we did previously. Enter the
following code:
let
    // Split the text by comma and trim each entry
    yearsList = List.Transform(Text.Split(textValue, ","), Text.Trim),
18. Next, add the following lines so you can define a function to calculate
the number of years for each entry:
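A minimal sketch of such a helper, assuming each range uses an en dash
between the start and end years (the helper's name and logic are
assumptions):
    // Calculate the number of years covered by a single entry,
    // treating a lone year such as "1994" as one season
    getYears = (range as text) as number =>
        if Text.Contains(range, "–") then
            Number.From(Text.AfterDelimiter(range, "–"))
                - Number.From(Text.BeforeDelimiter(range, "–")) + 1
        else
            1,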
19. Now that our function is calculating the number of years for each entry,
we need to add logic to pull this back together for each driver. Add the
following lines:
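Again as a sketch, summing the helper's output across all of the driver's
ranges gives the total (variable names assumed):
    // Apply the helper to each year range and sum the results
    totalYears = List.Sum(List.Transform(yearsList, getYears))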
20. All that’s left to do is to close the let expression. Enter the following:
in
totalYears
in
GetTotalYears
21. Now that you have completed your code, select Done. This should now
show the following within the query for the function:
Figure 10.17 – Screenshot of the function created named
YearsActiveFunction
22. Now that you have created your custom function, return to the query you
named Driver Wins. Once you have navigated to the query, select the
Seasons active column.
23. Select Add column from the ribbon and then select Invoke custom
function.
24. Next, you will need to name the column you are creating, so, in this
example, let’s call it Years Active. Then, select the function you
would like to use. Select the function we just created called
YearsActiveFunction.
This will cause another configuration item to appear, which specifies the
input for the function. In this example, as you already selected the
Seasons active column, this should already appear as shown in the
following. Select OK to invoke the function.
This will now create a new column that shows the number of years that the
driver has been an active racer, which could later be used in a measure to
calculate the average wins per active season. Thus, you can figure out
which driver has the best all-round stats.
Summary
In the next chapter, you will learn more about the techniques and tips on
how you can optimize your M code.
Questions
1. What is the first step in planning for a custom function in Power Query?
1. Writing M code
2. Testing and debugging
3. Defining the problem
4. Creating parameters
2. What role do parameters play in custom functions?
1. Identifying issues in the code
2. Enhancing documentation
3. Making functions flexible and adaptable
4. Debugging the function
3. Why is it important to provide default values for parameters in custom
functions?
1. To increase the complexity of the function
2. To make the function less user-friendly
3. To improve the overall user experience
4. To avoid testing and debugging
4. What is crucial for defining the structure of a custom function?
1. Complex M code
2. Writing extensive documentation
3. Choosing a descriptive name
4. Setting default parameter values
11
M Query Optimization
Like with any coding language, there are different ways in which you can
write out your code, meaning there are often strategies and techniques that
can help you achieve the optimum performance of your query. In this
chapter, you will dive deep into four key tips for optimizing your M
queries:
Filtering and reducing data
Using native M functions
Creating custom functions
Optimizing memory usage
Technical requirements
To follow the instructions in this chapter, you will need to connect to the
data using the following URL:
https://raw.githubusercontent.com/PacktPublishing/Data-Cleaning-with-
Power-BI/main/Retail%20Store%20Sales%20Data.csv.
For example, the following query builds a small table of test scores and
then uses a custom function to calculate both a normal average and a
weighted average (the sample values are purely illustrative):
let
    // A small table of test scores and weights for illustration
    Scorecard = Table.FromList(
        {"Maths,80,0.5", "Science,70,0.3", "History,90,0.2"},
        Splitter.SplitTextByDelimiter(","),
        {"Test", "Score", "Weight"}
    ),
    #"Changed Type" = Table.TransformColumnTypes(Scorecard,
        {{"Score", type number}, {"Weight", type number}}),
    // A custom function that appends the weighted values, the plain
    // average, and the weighted average to the supplied table
    CalculateWeightedAverage = (table as table) as table =>
        let
            WeightedValues = Table.AddColumn(table, "WeightedValue",
                each [Score] * [Weight], type number),
            NormalAverage = Table.AddColumn(WeightedValues, "Average",
                each List.Average(Table.Column(table, "Score")), type number),
            Result = Table.AddColumn(NormalAverage, "WeightedAverage",
                each List.Sum(Table.Column(WeightedValues, "WeightedValue"))
                    / List.Sum(Table.Column(table, "Weight")), type number)
        in
            Result
in
    CalculateWeightedAverage(#"Changed Type")
This will add two columns to our table of test scores, which include the
average and the weighted average score:
In the next section, you will build upon this knowledge as you learn more
about how to optimize your M code when filtering and reducing data.
When we look at filtering data within the Power Query Editor, we could, of
course, begin manually using the UI to filter and reduce the data in multiple
steps, as we saw in Chapter 4, The Most Common Data Cleaning
Operations. This works; however, it will create many different steps within
the Applied Steps pane, which can make the query seem cluttered and
complex.
To do this, we are going to build upon what you learned in Chapter 4 and
Chapter 7 to create more complex and logical functions within your M
code. In the following steps, you will learn three different styles of filter
logic to apply and optimize your M code.
In order to simulate the data preparation we are going to carry out, you will
be presented with three different problem statements from your colleague.
Your job is to prepare the data for your colleague to carry out their analysis.
Problem statement 1
Colleague: We would like to run an analysis on the data, but the scope of
this analysis should only include sales orders within the year 2023 and that
have received more than a 20% discount. Please clean and prepare the data
so I can analyze just that data.
#"ProblemStatement1" = Table.SelectRows(#"C
in
#"ProblemStatement1"
The previous code will add one step within the applied steps rather than two
individual filter steps. This can make the overall query more efficient.
We will now take a look at the next problem statement from our colleague.
Problem statement 2
#"ProblemStatement1" = Table.SelectRows(#"C
ProductIDFilterList = {"OFF-BI-10001120","OFF-B
#"ProblemStatement2" = Table.SelectRows(#"P
in
#"ProblemStatement2"
In this example, you learned to use lists to create a specific list of values
that we will then use in the filter. This is great, but we can probably build
upon this with the knowledge of parameters that we gained in the previous
chapter. This will help with the scenario should the colleague need to adjust
the given list of products. The steps are as follows:
3. Now that you have created the parameter (named SelectedProducts
in this example), you can adjust the following lines in the Advanced
editor:
ProductIDFilterList = {"OFF-BI-10001120", "OFF-BI-..."},
    #"ProblemStatement2" = Table.SelectRows(#"ProblemStatement1", each
        List.Contains(ProductIDFilterList, [Product ID]))
in
    #"ProblemStatement2"
so that the hardcoded list is driven by the parameter instead, with the
parameter holding a comma-separated string of product IDs:
ProductIDFilterList = SelectedProducts,
    // Split the parameter's comma-separated text into a list of IDs
    ProductIDFilterListTable = Text.Split(ProductIDFilterList, ","),
    #"ProblemStatement2" = Table.SelectRows(#"ProblemStatement1", each
        List.Contains(ProductIDFilterListTable, [Product ID]))
in
    #"ProblemStatement2"
The output shown next returns a table of data with only the specified
products, selected for the particular year of 2023 and with a discount greater
than 20% applied:
Figure 11.5 – The result of adding the code to the advanced editor, with the
table filtered to specific year, discount, and products
Having completed this, you can look at the final request from your
colleague in the next example.
Problem statement 3
In this example, you are going to build upon the knowledge from the
previous chapter to build a custom function within your M code that will
handle this filtering logic for us, keeping only the rows whose profit is
above the average profit for that product:
GetAvgProfit = (productID as text) as number =>
        let
            // The helper's name and the column names are assumed
            AvgProfit = List.Average(Table.SelectRows(#"Changed Type",
                each [Product ID] = productID)[Profit])
        in
            AvgProfit,
    #"FilteredTable" = Table.SelectRows(
        #"Changed Type",
        each [Profit] > GetAvgProfit([Product ID]))
in
    #"FilteredTable"
The previous example would achieve the desired result; however, there is
another way to achieve this, which would generally perform better and thus
be more optimized. Using the previous custom function means that it must
iterate this function on every row within the data. An alternative to this
would be to use the Group By function instead of filtering rows based on
a custom function. The Group By operation can calculate the average
profit amount for each product ID in a single step.
To do this, remove the query you added in step 2 and replace it with the
following snippet:
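A minimal sketch of that Group By approach, assuming the same Product ID
and Profit column names as before, might look like the following:
    // Calculate the average profit per Product ID in one grouped step
    AvgProfitByProduct = Table.Group(#"Changed Type", {"Product ID"},
        {{"AvgProfit", each List.Average([Profit]), type number}}),
    // Join the per-product averages back onto the detail rows
    MergedWithAvg = Table.NestedJoin(#"Changed Type", {"Product ID"},
        AvgProfitByProduct, {"Product ID"}, "Avg", JoinKind.LeftOuter),
    ExpandedAvg = Table.ExpandTableColumn(MergedWithAvg, "Avg", {"AvgProfit"}),
    // Keep only the rows whose profit beats their product's average
    #"FilteredTable" = Table.SelectRows(ExpandedAvg, each [Profit] > [AvgProfit])
in
    #"FilteredTable"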
This optimized version reduces unnecessary iterations over the data and
improves performance by calculating the average profit amount per product
ID using the Group By operation instead of the custom function.
In this section, you have learned and explored how you can go about
optimizing the M code queries you create by filtering the data to a smaller
size, thus achieving better performance when it comes to analyzing the data.
In the next section, you will build upon this with other native functions that
will help you optimize your M code.
Using native M functions
Power Query ships with a large library of native M functions, and reaching
for them first will usually produce simpler and faster queries than writing
your own logic. For instance, let’s say you need to standardize product
names by converting them to title case. Instead of writing custom code, you
can utilize the Text.ToTitleCase function, making your query more
concise and performant.
let
    Source = ..., // Your data source
    StandardizedData = Table.TransformColumns(Source,
        {{"ProductName", Text.ToTitleCase, type text}})
in
    StandardizedData
ProductName: This is the name of the column in your data that you
want to transform
Text.ToTitleCase: This is a built-in function in Power Query that
converts the text in the specified column to title case
In this section, you learned about how you can begin to use the
Table.TransformColumns and Text.ToTitleCase functions to
transform your data. Next, you will learn more about how you can use
functions such as Table.Buffer to improve the efficiency and
performance of your code.
Optimizing memory usage
The Table.Buffer function loads a table into memory once and stops it
from being re-evaluated in later steps, which can speed up queries that
reference the same table repeatedly. On the other hand, though, there are
some potential drawbacks to be aware of. One significant downside is that
using Table.Buffer can actually slow down performance in certain
scenarios.
One reason for this is that it loads the entire table into memory at once. For
very large datasets, this can consume a significant amount of memory
resources, potentially leading to memory pressure and slower overall
performance, especially if your system doesn’t have enough available
memory to handle the entire dataset efficiently.
The following example shows how you could use this function in practice:
let
    Source = ..., // Your data source
    TransformationSteps = ..., // Your transformation steps
    BufferedTable = Table.Buffer(TransformationSteps)
in
    BufferedTable
In this code, we use the Table.Buffer function to load the dataset into
memory only once, reducing memory duplication. This optimization
minimizes memory usage, resulting in faster query execution.
This can be done using code such as the following in your M query:
let
    Source = ..., // Your data source
    SplitTable = Table.Split(Source, 4) // Split into tables of up to 4 rows each
in
    SplitTable
In this code, we use the Table.Split function to divide a large table into
smaller partitions, enabling parallel processing. Each partition is processed
simultaneously, which can often result in significantly reducing query
execution times. It must be said, though, that using Table.Split in this
code on its own won’t inherently reduce query execution times, but it’s a
step towards enabling parallel processing, which can improve performance
in certain scenarios. The effectiveness of parallel processing depends on
various factors such as the nature of the data, the operations or
transformations being performed, and the capabilities of the underlying
hardware.
Now that you have an understanding of these two functions, let’s work on
using them within an example dataset. In this example, we will also
introduce you to the session diagnostic tool within the Power Query editor
and how it can be used to measure the impact of your query and query
optimization. This will provide a high-level introduction to the diagnostic
tool, but if you would like to learn more about the tool and its outputs, then
you can find a link in the Further reading section.
8. Click on Done to return to the Power Query editor. You should see that a
number of steps have been added within the applied steps:
Figure 11.6 – The query settings showing the additional transformation
steps added to this query
9. Select Tools from the ribbon. This will show us the available diagnostic
tools.
10. Select Diagnostic Options. This will open the option settings for the
diagnostic query. By default, this will have all options checked for
Diagnostic Level and Additional Diagnostics. We don’t need all of this
for this simple test, so uncheck all but the Aggregated option:
Figure 11.7 – The configuration for the diagnostic options settings
19. This will return a single row for the query we refreshed with the min
(start time) and the max (end time) of the refresh preview. For ease, let’s
adjust the column types from Date/Time to Time by selecting both
columns and right-clicking. Then, hover over Change type and select
Time.
20. Let’s add a new column to show the total time. To do this, navigate to
Add column in the ribbon and select Custom column.
21. Give the new column a name such as Total Time (S) and enter the
following formula:
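The exact column names depend on how the diagnostic aggregation named
them, so the ones below are assumptions; the key is to take the difference
between the two time columns and convert it to seconds with
Duration.TotalSeconds:
Duration.TotalSeconds([End Time] - [Start Time]) // column names assumed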
22. Click on OK. This will now show the time taken in seconds to refresh
the query. In my case, it took 1.08407 seconds, as shown in the
following screenshot, but this might be different for you. Keep in mind
that this data is pretty small and only has a handful of optimization steps,
so naturally it is pretty quick already. However, if you were working
with larger amounts of data, this could take minutes or even hours.
Figure 11.9 – The results from the diagnostic on the query named
StandardQuery
26. Now, let’s attempt to optimize this query using Table.Split. Select
the query you created called OptimizedQuery.
27. In this example, we are going to need a function so we can perform the
transformations to our split tables before we recombine them. To do this,
repeat step 23 to create a new blank query called
SplitTableTransformations. This function will look to carry
out the same transformations we did on StandardQuery.
28. Open the Advanced editor and paste the following code:
let
    // The same transformation steps used in StandardQuery go here,
    // ending with a step named #"Filtered Rows"
    #"Filtered Rows" = ...
in
    #"Filtered Rows"
Then, back in OptimizedQuery, split the table into partitions, apply the
function to each partition, and recombine and buffer the result (the page
size here is illustrative):
splittable = Table.Split(Source, 100000),
    ApplyTransformations = List.Transform(splittable, each SplitTableTransformations(_)),
    CombineTables = Table.Combine(ApplyTransformations),
    buffertables = Table.Buffer(CombineTables)
in
    buffertables
Notice that you added Table.Buffer at the end of the previous query. If
you were to put it earlier in the query (for example, before the
Table.Split function), you could expect to see very different results.
The following screenshot shows the result of implementing this function
earlier when working with this smaller data. As you can see, it resulted in
the query running 12% slower than the original query.
Figure 11.11 – The results from the diagnostic on the query named
OptimizedQuery that uses Table.Buffer earlier in the query before the
variable named splittable
Now, let’s revisit this example and try to optimize this query by using just
the Table.Split function.
38. Select the query named OptimizedQuery (NoTB) and open the
Advanced editor.
39. This time, we will not use the Table.Buffer function to load our
table into memory. Paste the following into the query with a comma after
the previous step:
ApplyTransformations = List.Transform(splittable, each SplitTableTransformations(_)),
    CombineTables = Table.Combine(ApplyTransformations)
in
    CombineTables
40. Click on Done. This will now return the same results as before.
41. Repeat steps 33-37 with the query named OptimizedQuery (NoTB)
selected. Rename the new query created from the diagnostic to
OptimizedQuery (NoTB) Diagnostic.
After applying your diagnostic transformation function, you should see
the results shown next. Here, we can see that by splitting the tables and
applying our function to the partitions, we have been able to reduce the
refresh to 0.6569053 seconds. Now, of course, this is looking at a small
dataset, but that has allowed us to refresh nearly 40% faster than our
standard query.
Figure 11.12 – The results from the diagnostic on the query named
OptimizedQuery (NoTB) which only uses Table.Split
On the flip side, we saw that using the Table.Buffer function to load
the table into memory actually caused a slower refresh in this instance,
highlighting that you should always be careful when choosing when to use
Table.Buffer.
Summary
In the next chapter, we will dive into the critical topic of managing data
relationships in Power BI and using them to clean data effectively.
Questions
1. What were the four key tips for optimizing M queries discussed in this
chapter?
1. Sorting, filtering, grouping, summarizing
2. Filtering and reducing data, using native M functions, creating custom
functions, optimizing memory usage
3. Conditional formatting, joins, aggregations, pivoting
4. Concatenation, union, splitting, transforming
2. In the custom function for calculating the weighted average, what are the
parameters of the function, and how is the weighted average calculated?
1. Parameters: table, weights, values; the weighted average is calculated
by summing the weighted values and dividing by the total weight
2. Parameters: table, columns; the weighted average is calculated by
multiplying values by weights
3. Parameters: values, weights, total; the weighted average is calculated
by summing the values and dividing by the weights
4. Parameters: rows, weights, values; the weighted average is calculated
by summing the values and multiplying by the weights
3. How can Table.Buffer optimize the performance of your query?
1. It reduces the number of columns in a table
2. It increases memory duplication for faster processing
3. It loads a table into memory once, reducing memory duplication in
subsequent steps
4. It removes unnecessary rows from a table
4. What does the Table.Split function do in the context of parallel
query execution?
1. Combines multiple tables into one
2. Splits a table into smaller partitions for parallel processing
3. Removes duplicate rows from a table
4. Filters rows based on a specified condition
Further reading
Table.Buffer: https://learn.microsoft.com/en-us/powerquery-
m/table-buffer
Table.Split: https://learn.microsoft.com/en-us/powerquery-m/table-
split
Diagnostic Tool Power Query:
https://learn.microsoft.com/en-us/power-query/query-diagnostics
https://learn.microsoft.com/en-us/power-query/record-query-
diagnostics
https://learn.microsoft.com/en-us/power-query/read-query-diagnostics
12
After learning about how to write efficient and performant code, we must
now take the next step in preparing and cleaning data for our analytics. This
chapter is dedicated to comprehensively understanding and managing data
relationships within Power BI to ensure clean and reliable data for effective
decision-making. It addresses the often-overlooked issue of dirty data
arising from poorly designed data models.
For those confused about why data modeling is part of this book, it’s crucial
to understand that without proper structuring and optimization of data,
analysis and visualization in Power BI would be inefficient and unreliable.
The desired Power BI semantic model often necessitates transformations to
ensure data accuracy, consistency, and performance, thus emphasizing the
indispensable role of data modeling in the process of data cleaning.
By the end of this chapter, you will have gained high-level insights into data
modeling and relationship management, all while learning practical skills
that guarantee the highest standards of data cleanliness within Power BI
projects.
In Chapter 5, Importing Data into Power BI, you were introduced to some
key topics when it comes to importing data, such as dimension modeling,
star schemas, and normalization/denormalization. This chapter will build
upon this with further knowledge of the basics, introducing concepts such
as bidirectional cross-filtering, cardinality, and other best practices.
In Power BI, we have two main methods of bringing data into our model:
importing data and using DirectQuery. Let’s have a brief overview of each
to help get you up to speed on the difference between them.
Importing
Importing refers to the method of loading and storing data within a Power
BI file itself. When you import data, Power BI brings a copy of the data into
its internal storage engine, the VertiPaq engine. This engine compresses and
optimizes the data, enabling fast query performance for analytical
workloads. Importing is suitable for scenarios where data volumes are
manageable within the Power BI file, and users can benefit from the
performance gains provided by the in-memory storage and compression.
Naturally, this option comes with its pros and cons, which I’ve summarized
for you here.
Example scenario
DirectQuery
This method of connecting to data comes with its benefits and challenges
also, which I’ve summarized next.
Example scenario
It’s important to note that you are able to switch between these modes but
only in one direction. So you would be able to switch a DirectQuery table to
an imported table but not the other way around.
Dimensional modeling
Consider the example that was used in Chapter 5, Importing Data into
Power BI, for a sales data scenario. You might have a mixture of the
following tables to form a model:
Now, it can be very common for people who are new to tools such as Power
BI to simply think about creating one large table with no relationships,
which, again, as highlighted in previous chapters, is fine when using a small
set of data but is not best practice or scalable. To help you understand this
further, let’s go through a summary of the pros and cons of dimension
modeling within Power BI.
Snowflake schema
Pros:
Cons:
Intermediate tables
Consider a scenario where you have a star schema with a central fact table
called Sales and two dimension tables, Products and Categories.
Now, if each product can belong to multiple categories, and each category
can include multiple products, you have a many-to-many relationship
between Products and Categories. To resolve this, you might
introduce an intermediate table, say ProductCategories, that connects
the Products and Categories tables.
Adding calendars and date tables has become a common practice for several
reasons, chief among them being consistent, reliable time-based analysis.
Role-playing dimensions
Don’t be alarmed – you’ve not stepped into a game. In Power BI, there are
such things as role-playing dimensions. However, it’s important to note that
this is not just a Power BI thing; it’s a universal data modeling technique
that’s used in the Kimball methodology we introduced in Chapter 5,
Importing Data into Power BI. Of course, you could dive quite deeply into
this topic, so we will provide a brief overview to help you establish a basic
understanding.
In conclusion, the choice between how you approach this depends on the
specific use case. Generally speaking, using the “one date dimension and
establishing multiple relationships with the fact table” approach is
recommended for its advantages in efficiency and cleanliness, although
careful evaluation is necessary based on the model’s size and requirements.
Aggregating tables
Pros:
Cons:
Storage overhead: Aggregated tables consume additional storage space,
as you are storing both the detailed and aggregated versions of the data
Maintenance: You need to manage the update process for aggregated
tables whenever the underlying data changes, ensuring that the
aggregated values remain accurate
Data quality assurance: Make sure your data is clean and free from
errors before creating the aggregated table. These errors could lead to
inaccurate aggregated results.
Identify aggregation level: Understand the reporting requirements and
identify the appropriate aggregation levels. Then decide which
dimensions and measures are essential for summarization and which can
be removed to reduce the size/storage.
Handling missing data: Decide on a strategy for handling missing or
null values in your data, whether that is to remove these values entirely
or return them as zero, for example. Aggregations can be affected if
there are gaps in your data, so address missing values appropriately.
In this chapter, we will not be walking through how you go about creating
aggregations, but you can refine your practical skills in this area by
following the recommended material highlighted at the end of this chapter.
Incremental refreshes
Consider a scenario where you have a dataset that grows daily, but your
reports only need the most recent data. With an incremental refresh, you can
configure Power BI to refresh only the data that is new or has been updated
since the last refresh, minimizing the refresh time.
Pros:
Cons:
In the next section, you will learn more about what exactly bidirectional
cross-filtering (BDCF) is.
This example represents financial transactions in accounts that can have one
or more customers associated with them. This can be complex and require a
snowflake design because the relationship between sales transactions and
customers would likely result in a many-to-many relationship: one customer
could have many transactions and, in reverse, a transaction could have
multiple customers attached. Currently, the model works by allowing us to
see the relationships between the tables; however, let’s try to do some
analysis on this.
6. When you navigate to the Model View to inspect the relationships within
the model, you’ll see arrows denoting how tables are filtered. Notably,
the arrow indicating the relationship between AccountCustomer and
Account is pointing in the incorrect direction. By default, the model
filters from the “1” side to the “Many” side of the relationship, which
doesn’t align with our objective of accessing transactions associated
specifically with customer accounts. Now, there are two ways we can
look to fix this, one of which is to use DAX measures and the other is to
use BDCF. Let’s explore how we would use this here. Select Manage
relationships from the tabs. This will open the window to manage
relationships, where all existing relationships can be seen and adjusted.
Select the relationship from Table AccountCustomer to Table
Account.
7. Select Edit….
8. Adjust the Cross filter direction dropdown from Single to Both, as
shown:
Figure 12.7 – Screenshot of the Edit relationship window within Power BI
9. Select OK to complete the relationship edit and select Close to exit the
relationship manager.
10. Now navigate back to the Report View.
Here, you should see that the table has now been adjusted and we can see
the values assigned to each of the customers we have, as shown:
We have been able to achieve these results because, with the help of BDCF,
the model was able to push the filters from the Customer table through
AccountCustomer to the Account table.
These best practices empower you to strike the right balance, ensuring that
your data model not only meets your analytical needs but also maintains
optimal performance and data integrity over time. As you navigate the
complexities of bidirectional cross-filtering, adherence to these guidelines
will pave the way for a robust and reliable Power BI reporting environment.
Now you have formed a foundational knowledge of what BDCF actually
means when setting up relationships, in the next section, you will begin to
learn about selecting the right cardinality for your data analysis.
In this section, we’ll explore what cardinality is, why it’s essential, and how
to make the right choices to ensure your data model is both accurate and
high-performing. Let’s start!
Understanding cardinality
Cardinality is all about defining the nature of the relationship between two
tables. It answers questions such as “For each row in Table A, how many
corresponding rows exist in Table B?” This understanding guides Power BI
in navigating relationships when aggregating data, ensuring that the
relationships are appropriately established to reflect the real-world
connections in your dataset. There are four main types of cardinality:
one-to-one (1:1), one-to-many (1:N), many-to-one (N:1), and many-to-many (N:N).
Selecting the correct cardinality largely depends on the nature of your data
and the relationships you’re trying to represent. Here are some guidelines to
help you choose the right cardinality:
Understand your data: Before defining cardinality, thoroughly
understand your data. Know the nature of the relationships between
tables and how data is related. This understanding is essential for making
informed choices.
Start with 1:N: In most cases, one-to-many relationships are the safest
choice. They cover scenarios where each item in one table can be related
to multiple items in another table. If you’re unsure, begin with 1:N and
adjust as needed.
Use 1:1 sparingly: One-to-one relationships should be employed only
when it’s crucial to restrict data. They are less common and should be
used sparingly due to their potential to complicate the data model.
Be prepared to adjust: The cardinality setting is not set in stone. As
you work with your data and build reports, be ready to revisit and adjust
the cardinality if necessary. Data models can evolve, and your initial
assumptions may change.
For example, suppose you have a Sales fact table containing
information about individual sales transactions, and a Customer
dimension table providing details about each customer.
Initially, you might establish a relationship between the Sales table
and the Customer table based on the CustomerID column in both
tables. This creates a one-to-many relationship, as each customer can
have multiple sales transactions. This setup is appropriate when you
want to analyze sales data at the customer level and you have a direct
relationship between each sale and a specific customer.
However, let’s say you also want to analyze sales data at a regional
level, and your Customer dimension table includes a Region
column. In this case, you might want to aggregate the sales data at the
regional level, treating each region as a single entity.
To achieve this, you could switch the cardinality of the relationship
between the Sales table and the Customer table from one-to-
many to many-to-one (or many-to-many). By doing so, you’re
indicating that each sale is associated with a specific region, and you
want to aggregate the sales data based on the regions in the
Customer dimension table.
Switching cardinality in this way allows you to change the granularity
of your analysis. You can now analyze sales data at both the customer
and regional levels, leveraging the flexibility of your data model to
gain insights at different levels of detail.
Testing and validation: Before finalizing your cardinality choices,
thoroughly test your data model. Ensure that your queries provide
accurate results and that the reports are performing well. Make
adjustments based on your testing.
In the following section, you will delve deeper into techniques you can take
on while working on complex datasets within Power BI.
Power BI is an excellent tool for data visualization and analysis, but what
happens when you’re dealing with big data?
In this section, we’ll explore best practices for handling big data within
Power BI, ensuring that you can still unlock valuable insights without
compromising performance.
When dealing with big data in Power BI, several challenges emerge, such as
slow refreshes, memory and model-size pressure, and sluggish query response times.
To help you work with big data in Power BI, concentrate on a handful of key
areas: data import strategy, data model design, query optimization, and
performance monitoring and testing.
Big data presents unique challenges when working with Power BI, but with
the right strategies and best practices, you can harness its potential. By
optimizing data import strategies, carefully designing your data model,
focusing on query optimization, and staying vigilant with performance
monitoring and testing, you can create Power BI reports that provide
valuable insights even when dealing with large and complex datasets.
Handling big data within Power BI requires a combination of thoughtful
design and continuous optimization to strike the balance between
performance and data richness.
Again, resources on data tuning and load testing can be found in the Further
reading section at the end of the chapter.
In this section, we’ll delve into best practices for identifying and avoiding
circular references in your Power BI models, ensuring data clarity and
smooth report generation.
Understanding circular references
A circular reference occurs when two or more objects in a model depend on
each other, creating a loop that Power BI cannot resolve. A common way to
create one is with a calculated table that you later try to relate back to
its source table. For example, consider the following calculated table:
Clothing Products =
FILTER(
VALUES('Product'),
'Product'[Category]="Clothing"
)
4. Press Enter – this will create a table that has all your clothing products
listed.
5. To create the relationship between the tables, try selecting
ProductKey from the Clothing Products table and dragging it onto
the ProductKey field in the Product table. This will cause an error message
to appear, because the new relationship would introduce a circular
dependency. The fix is to redefine the calculated table with
ALLNOBLANKROW instead of VALUES, which removes the dependency on the blank
row that the relationship adds:
Clothing Products =
FILTER(
    ALLNOBLANKROW('Product'),
    'Product'[Category]="Clothing"
)
Identify and resolve circular references early by using the following best
practices:
While Power BI offers a robust set of features for data analysis, the
responsibility of creating well-structured, non-circular models lies with the
data modeler. Avoiding circular references is an essential step in achieving
accurate and trustworthy insights from your data.
Summary
In this chapter, we delved into the intricate world of data modeling and
managing relationships within Power BI. It provided a brief overview of
and introduction to the pivotal role well-structured data modeling plays in
ensuring clean and reliable data for informed decision-making.
We started by exploring/recapping the basics of dimension modeling in
Power BI, bidirectional cross-filtering, understanding its power, identifying
potential errors and bottlenecks, and adopting best practices to use it
effectively. We also comprehensively covered the concept of cardinality
with an emphasis on its impact on data cleanliness and performance.
Later, we learned how to make the right choices to create accurate and high-
performing data models, with insights into challenges and best practices for
optimizing performance and managing vast data. Lastly, we explored the
complexities of avoiding circular references, gaining strategies and best
practices to ensure data clarity.
In the next chapter, you will learn about essential strategies for maintaining
data quality and consistency throughout the transformation process. It will
cover topics such as evaluating whether Power BI is the right platform,
creating a transformation plan, maintaining consistent naming conventions,
and documenting changes.
Questions
Further reading
In this last part of the book, you will explore paginated reports, automation
through Power Automate, and the transformative potential of OpenAI. You
will learn how to prepare data for paginated reporting in Power BI and
leverage Power Automate as a powerful ally in data cleaning tasks. Later,
you will discover how OpenAI technologies, such as ChatGPT, are
revolutionizing data work, providing insights into simplifying tasks and
staying ahead in the dynamic landscape of data management. This part
serves as a bridge between traditional data practices and cutting-edge
technologies, offering a holistic understanding of the evolving landscape of
data management.
Across this book, we have learned about a number of best practices on how
to leverage Power BI to clean and prepare your data for analysis. In this
chapter, we will delve into how we can apply that knowledge, using Power
BI Report Builder to connect, prepare, and report on data with paginated
reports. Power BI Report Builder is a powerful authoring tool within the
Microsoft Power BI suite designed for creating paginated reports. Unlike
the traditional Power BI reports that are interactive and dynamic, paginated
reports are highly formatted, pixel-perfect, and optimized for printing or
generating PDFs. This part of the platform allows users to design and
generate paginated reports with precision.
This introductory chapter will provide insight into how you should clean,
prepare, and connect to your data if you wish to use that data for paginated
reporting in the future.
While the learning in this book will take you far, rounding out your
knowledge with paginated reports will leave you more experienced
than many in the industry. Although based on somewhat older
technology, paginated reports are still widely used across many industries
and are now very much integrated into the Fabric platform.
Technical requirements
To follow along in this chapter, please ensure that you download and install
Power BI Report Builder. You can get more details from
https://www.microsoft.com/en-us/download/details.aspx?id=58158.
Ensure that you also download the following file to be used in the report:
https://github.com/PacktPublishing/Data-Cleaning-with-Power-BI/blob/main/AdventureWorksLogo.jpg.
When you first open Power BI Report Builder, you’ll immediately notice
that, although there are some similarities to Power BI, it is for the most part
vastly different. It is often described as having a more wizard-like feel to
report creation. The following screenshot will help you get your
bearings around the Power BI Report Builder UI as well as provide a
summary of what each section does:
Figure 13.1 – The Power BI Report Builder UI
IMPORTANT TIP
If you are not seeing the boxes as shown, simply navigate to the View
ribbon and ensure all the boxes are checked.
Let’s review the components of Report Builder:
In the next section, we will dive deeper into how you actually go about
connecting to data from within Power BI Report Builder.
Power BI Report Builder offers a variety of data source options to meet the
diverse needs of its users – whether that is connecting directly to the data
source or connecting to a data source that has already been curated and
published to the Power BI service. Here is an example screenshot of the
connection types available when creating a data source with Power BI
Report Builder:
Figure 13.2 – Data Source Properties showing connection types available
Generally, there are not as many data connectors within Report Builder; as
such, it is recommended to publish your data model to the Power BI service
using Power BI Desktop if the data source is not included in the list of
supported connectors.
While Report Builder can serve as a quick solution for direct data retrieval
into paginated reports, investing in a comprehensive data model is crucial
for maximizing analytical potential and ensuring data integrity in the long
run.
Data sources:
A data source in Power BI Report Builder represents the connection
to the external data repository or system from which the report
retrieves its information
It defines the location and type of the data, such as a SQL Server
database, an Excel workbook, or a Power BI dataset
A data source contains the necessary information to establish a
connection, such as server details, database names, and authentication
credentials
Power BI Report Builder allows users to create multiple data sources
within a report, enabling the integration of data from diverse locations
Datasets:
A dataset is a specific subset of data that is retrieved from a data
source for use in a report
It is essentially the result of a query or set of queries applied to the
data source to extract the relevant information needed for the report
Datasets can be seen as virtual tables that store the selected data in a
format that is optimized for reporting
Users define queries and filters within a dataset to shape the data
according to their reporting requirements
Multiple datasets can be created within a report, each pulling data
from a different data source or applying unique transformations to the
same data source
Now that we have connected to our data source, we are ready to begin
preparing and creating datasets ready for our analysis. To do so, create a
new dataset by right-clicking in the Report Data window and selecting
Add Dataset.
This will open a new window labeled Dataset Properties (Figure 13.7).
Here, we can select a table or write a query to define what data we want to
include in the dataset:
Figure 13.7 – The Dataset Properties window in Power BI Report Builder
Before going ahead, let’s review what each property tab controls within the
dataset.
Query
Data source: Specifies the details of the data source, including server
name, database name, and authentication method.
Query: Allows users to input or edit the query that fetches data from the
data source.
Query Designer…: Helps with creating a dataset without manually
typing out a DAX or MDX query. This works by selecting the
dimensions/KPIs/measures from the list of available fields and dragging
them into the table to add them to the query. As you’re getting started, I would
highly recommend using this and then reviewing the query it generates
to understand how it is built.
Time out: Specifies the maximum time allowed for the query to execute
before timing out.
Fields
The Fields section is responsible for managing and organizing the fields or
columns retrieved from the data source. When you create a dataset using
Query Designer, you’ll find that the fields here automatically populate with
the ones you selected in the designer. The names might not always be easy
to understand, as characters such as spaces or symbols would have been
replaced by underscores.
Field Name: Displays a list of all field names assigned to the fields
you’ve selected from the data source.
Field Source: Specifies where the field is being pulled from – that is, the
table name and field name. If it’s a measure, it will appear without a
table name.
Add Fields: You can add or delete query fields here as well as add your
own calculated fields for any measures that are not available in your
dataset already.
IMPORTANT TIP
Make sure to check the field names here before getting started. Any changes
you make to the field names later will not be pushed through to the
locations in the report where the field names are used, which can be very
time-consuming to then replace, as you will need to go and manually
change them in every place they are used.
Options
The Options section allows users to configure various settings related to the
dataset’s behavior and performance. It includes settings for Collation, and
other aspects of dataset behavior, such as case sensitivity. It’s important to
note this might show different options depending on the type of data you’re
connecting to.
Filters
The Filters section enables users to apply filters to the dataset, limiting the
data retrieved based on specified conditions. The following screenshot
shows an example of a filter created for an example dataset. As you can see,
the filter is looking to filter the [Segment] column to equal SALES.
Expressions can be used here in order to create more complex filtering
options.
Figure 13.8 – Screenshot of the Filters view from within Dataset Properties
in Power BI Report Builder
Filters help in refining the dataset to include only the relevant information
for the report. It’s important to point out that filters are applied locally after
data has been retrieved from the source. In other words, the filter is not
pushed down to the data source, the way query folding would do in Power
Query. This can have some important performance implications if users are
not aware. For this reason, it is often better to add parameters/filters directly
to the query rather than use the Filters feature from within Dataset
Properties.
Parameters
3. Next, we have the option to either manually create our own query
expression or use Query Designer to build the query. The following
figure details where this is visible in Dataset Properties:
Figure 13.10 – The Query section within Dataset Properties in Power BI
Report Builder
This is where you can add your own query or use Query Designer to help
with this.
5. Now, we can begin to build our query by visually dragging and dropping
the columns or measures we need from this dataset.
6. For this example, drag in the following:
1. AdventureWorks_Sales_2017-2019:
1. Model Name
2. OrderNumber
3. OrderDate
2. Measures:
1. Total Order Price
2. Total Order Cost
7. Add a filter to the query at the top of the window. Here, you will specify
what filter you would like. In this example, let’s add a filter on
OrderDate as we would like to see all orders for a particular date. Add
the relevant values for Dimension, Hierarchy, Operator, and Filter
Expression, so it mirrors the following configuration. Ensure that you
tick the Parameter box so this automatically creates a parameter to be
used in the report:
Figure 13.12 – The filter being applied within Query Designer in Power BI
Report Builder
Now that we have saved our selection, you will see that Query Designer has
actually generated the following query expression for our desired dataset.
This simple query specifies where to find our selected dimensions and
measures within our selected data source:
EVALUATE SUMMARIZECOLUMNS(
    'AdventureWorks_Sales_2017-2019'[Model Name],
    'AdventureWorks_Sales_2017-2019'[OrderNumber],
    'AdventureWorks_Sales_2017-2019'[OrderDate],
    RSCustomDaxFilter(@AdventureWorksSales20172019OrderDate, EqualToCondition, ['AdventureWorks_Sales_2017-2019'].[OrderDate], DateTime),
    "Total Order Price", [Total Order Price],
    "Total Order Cost", [Total Order Cost]
)
NOTE
This can often provide a great foundation that can be adapted should you
need to add more complex logic to your query. One example would be to
add an if logic so that the dataset returns one measure by default and
another if a particular parameter is selected.
In the next section, you will learn more about how you can go about adding
filters and parameters when connecting to data from Power BI Report
Builder.
Date range filters: Filtering data based on a specific date range allows
users to view information within a selected timeframe, supporting trend
analysis or comparison over periods
Category filters: Filters based on categories or other criteria enable
users to drill down into specific segments of the data, providing a
detailed view of particular subsets
Let’s build upon this by looking at how we can add further parameters and
filter logic to our query. To refine which data we bring into our report,
perform the following steps:
IMPORTANT TIP
If you are looking to create a parameter from a field in your data, you
should look to create a dataset that only pulls in that field or column. Then,
you can select Get values from a query. This then pulls available values
for your parameter directly from your data. Be careful not to select a
column with thousands of values as this could affect your performance.
6. Click on Add.
7. Adjust the label and value to Revenue.
8. Repeat this process to add another field with the label and value set to
Cost.
9. Select the Default values tab. Here, you can specify which value you
would like to be set by default when the report is run.
10. Adjust the Default value configuration to Specify values. Then, select
Add and select Revenue from the dropdown.
11. Click on OK to close the Report Parameter properties window. You’ll
now see your parameter appear in the Parameters folder.
12. Right-click on the dataset you created called
DS_AdventureWorksTestDataset and select Dataset
Properties.
IMPORTANT TIP
Be advised that if you begin adjusting the query that was created by Query
Designer, you won’t be able to access Query Designer after.
13. Reformat the query for simpler reading, as shown next. Be sure to add
=" at the beginning as Query Designer doesn’t add this originally:
="
EVALUATE SUMMARIZECOLUMNS(
'AdventureWorks_Sales_2017-2019'[Model Name],
'AdventureWorks_Sales_2017-2019'[OrderNumber],
'AdventureWorks_Sales_2017-2019'[OrderDate],
RSCustomDaxFilter(@AdventureWorksSales20172019O
14. Now, remove the last three rows – the two measure lines and the closing
parenthesis:
""Total Order Price"", [Total Order Price],
""Total Order Cost"", [Total Order Cost]
)
15. Replace this with the following code, which uses the parameter created
earlier to switch which measure the Sales/Cost field returns:
""Sales/Cost"", IF(@P_KPI_Selector = ""Revenue"", [Total Order Price], [Total Order Cost])
)"
16. Review the screen as shown to confirm you have entered the same code
as shown in the previous step and select Validate Query to have Report
Builder check your query. Notice that Query Designer… has now been
grayed out:
You have just created a new parameterized field in your dataset that can be
used in our report to dynamically switch the data we pull into our query.
Row groups and column groups in paginated reports are essential features
that provide a structured way to organize and present data. They play a
crucial role in preparing and analyzing data for reporting.
Let’s explore how row groups and column groups help users in this process.
It’s important to understand why you should use row groups and column
groups when creating reports in Power BI Report Builder. Let’s dive deeper
into more information on why they are important and the use cases for
them:
Row groups:
Importance:
Hierarchical structure: Row groups allow users to create a
hierarchical structure in the report based on the values in a specific
column. This is particularly useful for representing data in a nested
or grouped manner.
Logical organization: By grouping rows based on certain criteria
(for example, category and date range), users can logically
organize data, making it easier to understand and analyze.
Use cases:
Product category grouping: For a sales report, users can create
row groups based on product categories. This results in a report
organized by category, providing a clear breakdown of sales for
each product category.
Column groups:
Importance:
Cross-tabular presentation: Column groups allow users to
present data in a cross-tabular format. This is valuable when users
need to view data across multiple dimensions simultaneously.
Comparative analysis: Users can use column groups to compare
data across different categories, periods, or other dimensions,
facilitating comparative analysis.
Use cases:
Quarterly revenue comparison: In a financial report, column
groups can be created to represent revenue figures for each quarter
horizontally. This enables users to compare revenue performance
across different quarters.
Let’s put this knowledge to the test and begin creating a report that utilizes
row and column groups:
Figure 13.16 – The Images folder within the report data window of Power
BI Report Builder
3. Inserting a table: Now that we have added the title and company logo,
it’s time to build the report. We are going to create a simple table that
provides a summary of our orders using Row Groups/Column Groups
to help organize the report. To do this, follow these steps:
1. Select the Insert tab on the toolbar.
2. Select Table, and from the dropdown that appears, select Table
Wizard.
3. Choose Dataset, select our dataset named
DS_AdventureWorksTestDataset, and then select Next.
4. This will take you to the Arrange fields window, as shown. This is
really useful as you can add row groups right from the start. Add the
following items by dragging them from the Available fields section
to the respective sections to mirror the configuration in the following
figure. Make sure to set the aggregation for SalesCost to Sum.
Then, click on Next.
Figure 13.18 – Screenshot of the Arrange fields window within Table
Wizard
6. This will bring up the preview of the report item before it is then added
to the report.
7. Select Finish after viewing the preview, which will insert the table into
the report.
8. Now, move the table you created into a better space on the report (as it
may have been inserted on top of the title) by selecting the table and
dragging it into position. Alternatively, you can do this by selecting the
table. This should open the Tablix properties on the right-hand side.
Scroll to the bottom and you should see the Position property folder
shown. Here, you can specify the coordinates of where you would like
the top left of the table to be located.
Figure 13.20 – The Properties window zoomed in on the Position properties
IMPORTANT NOTE
4. Cosmetic changes: Now we have the table within the report, we need to
apply some changes to ensure the report will show correctly when we
run it. To do this, follow these steps:
1. Select the cell containing the column header for Model_Name. This
will be labeled something like Textbox1. Selecting this cell will
show the Textbox properties in the right-hand view.
IMPORTANT TIP
Selecting the middle of the textbox within a tablix will cause you to
select/edit the content of the textbox and thus will show the Selected text
properties on the right-hand side.
To ensure you select the textbox itself and thus avoid showing the textbox
properties, either select the edge of the textbox or select it by right-clicking
instead of left-clicking.
2. You will now adjust the following properties to ensure this is showing
nicely:
Position | Size: Adjust the size of the textbox to 70mm, 6mm, as shown:
5. Adjust Padding from 2pt to 0pt, as shown. This will remove the
padding applied around each cell’s edges, which might obstruct the view.
=IIF(Parameters!P_KPI_Selector.Value="Rev
8. Add currency formatting to ensure that price and cost are output in the
correct format. To do this, we will start by selecting the cells that contain
values for Price and Costs.
9. Then, in the properties window to the right, select Number and then
select Format. Enter the following format in this property box. This will
ensure that values show with a currency of £ and that negative values are
shown in brackets:
£#,#;(£#,#)
10. Adjust the visibility of Row Group. If you were to run your report now,
it’d be great; however, you would have to adjust the grouping toggles as
this is currently set to hide for your grouping called Model_Name. To
adjust this, select the row group called Model_Name.
11. Right-click and select Group Properties. This will open the Group
Properties window similar to the Group by function in Power Query,
which shows you the details on the grouping of the data.
12. To adjust visibility, select the Visibility tab on the left-hand side. Now,
adjust the toggle by selecting Show. This will result in the Model
Name column being visible when you run the report.
13. Select OK to close the Group Properties window.
5. Save your report, navigate to the Home ribbon, and select Run. This
will run your report with the OrderDate parameter already set to
01/01/2019. The following example shows the output from the report.
Expand the dates by selecting the plus toggle to the left of the date field.
Figure 13.24 – The output after running the report
Overall, by leveraging row groups, we have been able to aggregate our data
into the report. These row/column groups are powerful tools in paginated
reports that help users organize, analyze, and present data in a structured
and meaningful way. They contribute to the clarity of information, support
aggregated analysis, and enhance the overall effectiveness of paginated
reports for data preparation and analysis in a similar way that the Group By
function works in Power Query.
Summary
Questions
In this chapter, we’ll delve into what Power Automate is and explore how it
can be strategically employed to streamline and automate the cleaning and
preparation of data, ensuring that your Power BI reports are not just visually
compelling but also built on a foundation of consistently clean and updated
data.
By the end of this chapter, you will have an understanding of how Power
Automate can be used to clean and prepare data and how it can automate
the process, and also gain hands-on experience creating your own flows to
prepare data.
Technical requirements
In this chapter, you will require access to Power Automate either via cloud
or desktop. This will then also need to be linked to your Power BI
subscription, which we will highlight in walkthrough examples.
For instance, you can set up a trigger that initiates a data cleaning process
whenever new data is ingested into your data source. This ensures that your
data is consistently cleaned and prepared whenever there are updates,
saving you time and ensuring the reliability of your reports.
Having understood how triggers can be used and leveraged to kickstart the
action flows within Power Automate, you’re now ready to move on to the
next section, where we will learn more about what you can do once a flow
is triggered.
Automating notifications
Figure 14.2 – Screenshot of the Start from blank selection within Power
Automate
Figure 14.3 – Screenshot of the Power Query trigger from within the Power
Query UI
4. You will also have the option to configure how often this trigger polls to
check whether a refresh has occurred. Continue with the trigger set to 3-
minute intervals and specify the time zone as shown:
Figure 14.4 – Screenshot of the Power Query trigger from within the Power
Query UI where you can specify the recurrence intervals
Figure 14.5 – Screenshot of the trigger created in the previous step with the
option to add an action
2. Search for and select the Notifications connector. Choose an action that
suits your needs, for example, Notifications – Send me an email
notification.
5. Configure the notification:
1. Configure the notification action with the necessary details. This may
include the subject, the message body, and other relevant information:
1. Within the same flow you created in the previous step, select Edit in the
toolbar:
2. Now that we have created the trigger in the previous example (where the
trigger was a dataflow refresh completing), we can select Add an action
and use the Refresh a dataflow action to determine which dataflow
should be refreshed next in our sequence. In this example, let’s select the
named mtcars dataflow you created in Chapter 9, Advanced Data
Cleaning Techniques:
Figure 14.8 – Screenshot of the Refresh a dataflow window in Power
Automate
5. Select this action, then choose the workspace and the mtcars dataflow
within it.
6. Now that our trigger is keeping watch on when the refresh completes, we
can set up the next action to refresh the semantic model we would like to
sequence within our selected workspace. Select New step and then select
the Refresh a dataset action as shown (you can also save and switch back
to the new designer if you wish):
Figure 14.10 – Screenshot of the Refresh a dataset action used in Power
Automate flow
Having now learned how you can create automated and sequenced refreshes
using Power Automate flows, this will prove very useful as you look to
automate some of the actions within your service. In the next section, you
will explore techniques for creating snapshots or temporary tables within
Power BI.
Complex calculations:
Scenario: You need to perform intricate calculations that involve
multiple steps, conditions, or aggregations.
Use of temporary table: Create a temporary table to break down the
complex calculation into manageable stages. Store intermediate
results in the temporary table before deriving the final output.
Data segmentation:
Scenario: You want to segment your data based on specific criteria
for focused analysis.
Use of temporary table: Generate temporary tables to filter and
segment data subsets based on predefined conditions. Each temporary
table represents a distinct segment, allowing for more targeted
analysis.
Performance optimization:
Scenario: Your dataset is extensive, and complex queries are slowing
down report generation.
Use of temporary table: Create summary or aggregated temporary
tables to pre-calculate and store key metrics. Use these precomputed
values in your reports to significantly improve query performance.
Historical analysis:
Scenario: You need to analyze data changes over time or compare
historical trends.
Use of temporary table: Generate temporary tables to capture
historical snapshots of your data at specific points in time. This
enables you to perform historical analyses without altering your main
dataset.
Scenario analysis:
Scenario: You want to explore various hypothetical scenarios without
affecting your original data.
Use of temporary table: Create temporary tables to duplicate your
dataset and apply changes or adjustments for scenario analysis. This
allows you to assess the potential impact of different scenarios
without modifying the original data.
Data enrichment:
Scenario: You have additional data sources that can enrich your
existing dataset.
Use of temporary table: Merge or append temporary tables with
supplemental data to enhance the richness of your analysis. This
approach keeps your original dataset intact while providing a more
comprehensive view.
Intermediate results storage:
Scenario: Your analysis involves multiple iterative steps and you
want to store intermediate results.
Use of temporary table: Create temporary tables to store
intermediate results during each step of your analysis. This aids in
troubleshooting and validation and facilitates the ability to pick up the
analysis from specific points.
Security and sensitivity:
Scenario: You are working with sensitive data and want to restrict
access to specific subsets.
Use of temporary table: Generate temporary tables to filter and store
subsets of data with restricted access. This ensures that only
authorized users can access sensitive information within the
temporary table.
Custom aggregations:
Scenario: You need to perform custom aggregations or derive
specialized metrics.
Use of temporary table: Create temporary tables to store the results
of custom aggregations or calculated metrics. This simplifies the
integration of these specialized metrics into your reports.
Data transformation iterations:
Scenario: Your data requires multiple transformations and you want
to iterate through various scenarios.
Use of temporary table: Use temporary tables to store data at
different stages of transformation. This facilitates iterative
development and allows you to refine the transformation steps
without affecting the original dataset.
In this example, we will walk through a scenario where you might need to
leverage the Power Platform to create snapshots of your data. This is
particularly useful if your data can change but you have no method of
tracking the variances.
In the first step, you are going to set up the trigger for your flow as a
Recurrence action. This will typically be set up to run at the end or
beginning of the day to get a snapshot of yesterday’s data. However, this
can change depending on how often your data is updated.
In this example, we have set up the trigger to recur once every day at
00:30. This can be seen and confirmed in the following preview:
Figure 14.11 – Screenshot of the Recurrence function within Power
Automate
DEFINE
VAR CurrentYear = YEAR(TODAY())
VAR DateFilterTable =
    FILTER(
        ALL('FiscalDate'[Date]),
        YEAR('FiscalDate'[Date]) = CurrentYear
    )
VAR SnapshotTable =
    SUMMARIZECOLUMNS(
        'FiscalDate'[Date],
        'Store'[StoreNumber],
        Sales[Category],
        Sales[SubCategory],
        Sales[OrderNumber],
        District[DM],
        DateFilterTable,
        "Sales Order Value", 'Sales'[Total Sales],
        "Average Order Value", 'Sales'[Avg $ / Unit]
    )
EVALUATE
SnapshotTable
This code is broken down into three parts, which have been summarized for
your reference:
The DEFINE section: This is the beginning of the DAX query and
defines variables that can be used later in the query. The CurrentYear
variable captures the current year, and the DateFilterTable variable uses it
to build a filter restricted to that year:
FILTER(
    ALL('FiscalDate'[Date]),
    YEAR('FiscalDate'[Date]) = CurrentYear
)
The SnapshotTable variable: This uses SUMMARIZECOLUMNS to group the
snapshot by date, store, category, subcategory, order number, and district
manager, applies DateFilterTable, and adds the two measure columns.
The EVALUATE section: This returns the contents of SnapshotTable as the
query’s result.
To do this, you will need to specify the site address and folder path for the
SharePoint site in which you’d like the CSVs stored. As this will be running
on a daily basis, it’s important that we give these files a dynamic naming
convention, so we can identify the files when we need to pull them into
Power BI after. This is best done by using an expression within our file
name. To do so, we can add the following expression within the file name
(if you’re in the new designer view, then you will need to select the fx
button that appears at the end of the file name box – this will cause the
expression box to appear on the right side):
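An expression along the following lines returns yesterday’s date as a string
(formatDateTime, addDays, and utcNow are standard Power Automate functions;
the exact date format is your choice):
formatDateTime(addDays(utcNow(), -1), 'yyyy-MM-dd')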
This now means that every time the flow is run, it will save the file with
yesterday’s date.
Then, by adding .csv to the end of the file name, we can ensure that the
file will be saved as a CSV file as shown:
Figure 14.13 – Screenshot of the Create file function within Power
Automate
Lastly, we determine the file content, which is where we select the outputs
from the previous step using dynamic content. To select the output of the
previous step, select the box under File Content and then select the output
from the dynamic content menu (under the dynamic content from the last
step). If you are using the new designer, you will need to select the
lightning bolt button at the end of the box to open this view. Save and test
your flow.
IMPORTANT NOTE
Using Power Automate to save a CSV file based on a DAX query does have
some important limitations that could impact the data you’re trying to save.
The limit is that a maximum of 100,000 rows or 1,000,000 values can be
retrieved in your query, or 15 MB of data (whichever is hit first). Keep this
in mind as you may need to filter the query to within these limitations or use
aggregations in your query to reduce the amount of data loaded to within
the limits. If you need to create temporary tables of large datasets, then you
might find it better to use other data or analytics services from within
Azure, for example.
3. Here you can give full access to others as well as manage the
connections used within your flow to other services, such as Power BI.
With these best practices, you fortify your approach to data cleaning in
Power Automate, ensuring not just efficiency but also reliability in your
processes. By embracing these best practices, you not only enhance the
efficiency of your data cleaning processes but also future-proof your Power
Automate endeavors with a foundation of meticulous planning, resilience,
security, and knowledge accessibility.
Summary
Embracing the power of automation can help you elevate your data
preparation practices and unlock the full potential of your Power BI reports.
In the next and final chapter, you will learn how to begin using new and
trending technologies, such as OpenAI and ChatGPT, to help you prepare
and analyze data. Given how much of a storm this technology has caused
already, learning how this can be used to clean your data will most
definitely help you and your business get ahead.
Questions
Further reading
OpenAI is taking over the world, with senior stakeholders across industries
asking how it can be integrated into their businesses. Many teams also
worry about whether this technology will make their roles obsolete. This
chapter provides context and insight as to how OpenAI will only help and
make life easier when it comes to cleaning data in Power BI.
By the end of this chapter, you will have built upon your knowledge of the
latest trending technology and how it might be used within your workflows
to clean, prepare, and analyze data using Power BI.
When we begin looking at use cases for where OpenAI can begin to assist
us in preparing data, we see that there are three key areas:
Cleaning textual data: Azure OpenAI shines in parsing and
normalizing textual data. Whether it’s standardizing product names or
cleaning unstructured text, the language model’s contextual
understanding enhances the efficiency of text-related transformations in
Power BI.
Identifying anomalies and outliers: Leveraging Azure OpenAI’s
natural language processing, Power BI users can develop anomaly
detection models that understand the context of data points. This goes
beyond statistical methods, enabling the identification of anomalies
based on contextual relevance.
Data imputation strategies: Azure OpenAI’s language capabilities
extend to intelligent data imputation. By analyzing surrounding data
points, the model can intelligently predict and fill in missing values,
contributing to more robust datasets for Power BI analysis.
As we progress through some examples in the next section, you might wish
to follow along within your own instance. Before doing this though, you
will need to create your own Azure OpenAI instance. To do this, follow the
next steps (alternatively, if you already have access to an Azure portal, then
skip to step 3):
8. Then, to proceed with creating an Azure OpenAI service, you will need
to complete the application form in the configuration window.
9. Once approved, you will be able to complete the configuration by
clicking on Next and then Review + Submit to create your OpenAI
service.
10. Once configured, you will be brought to the OpenAI service window as
shown. Now this will have details on your new service as well as the
configurations that you can apply, such as access controls, encryptions,
monitoring, and more. There are some useful links in the Further
reading section if you wish to learn more about these features.
Figure 15.3 – The Get Started options from Azure OpenAI services within
the Azure portal
11. Click on Go to Azure OpenAI Studio to get started with creating some
AI models.
This section opened your eyes to how Azure OpenAI models can be used to
innovate; the next section aims to build on that with ideas about how it can
be used to optimize your coding in M.
Complete the next steps to follow along with the example in your own
environment:
1. Click on Deployment on the left-hand side and then select Create new
deployment to get started.
2. Select a model from the base models, in this case, select gpt-35-turbo,
and then name your deployment. In this example, name it
CleaningDataOpenAI.
3. Select Chat playground from the Azure OpenAI Studio home screen or
from the Playground tab on the left of your screen. This should open the
following view:
Figure 15.4 – Screenshot of the chat playground from within Azure OpenAI
User: Hi, I’m working on a Power BI project, and I need to transform some
data using M queries. I have a table with sales data, and I want to calculate
the total sales for each product category. Can you help me with the M query
for that?
ChatGPT: Certainly! You can use the Table.Group function to group the data by
product category and sum the sales. Here’s an example M query:
let
    Source = YourDataSource,
    GroupedTable = Table.Group(Source, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}})
in
    GroupedTable
User: Thanks! That’s a great starting point. What if I want to filter out
products with sales less than $1,000 before calculating the total sales?
ChatGPT: You can add a filtering step with Table.SelectRows before the
grouping. Here’s the updated M query:
let
    Source = YourDataSource,
    FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
    GroupedTable = Table.Group(FilteredTable, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}})
in
    GroupedTable
This code filters out rows where sales are less than $1,000 before grouping
the data by product category and calculating the total sales.
User: Perfect! This is really helpful. One more thing – what if I want to sort
the results in descending order based on total sales?
ChatGPT: You can add a sorting step using the Table.Sort function. Here’s
the modified M:
let
    Source = YourDataSource,
    FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
    GroupedTable = Table.Group(FilteredTable, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}}),
    SortedTable = Table.Sort(GroupedTable, {{"TotalSales", Order.Descending}})
in
    SortedTable
This code sorts the grouped table in descending order based on the
“TotalSales” column.
This interaction represents a simple example of how those who can leverage
the power of tools such as OpenAI and ChatGPT will be able to quickly
upskill in areas such as coding. It has to be said, though, that while this is
still in its infancy, it’s important to always test and validate the answers
provided before implementing them in production. Also, ensure that you
take precautions when using the publicly available ChatGPT model to avoid
sharing sensitive data publicly. If you would like to use sensitive data or
you want to ensure that requests are given within a secured governed
environment, make sure to use the ChatGPT model within your own Azure
OpenAI instance.
As we begin to explore the use cases where this technology can be most
effective, there are a number of clear early winners:
Lastly, it’s worth mentioning that how you prompt these models is crucially
important. In the previous example, we didn’t specify what type of data
source we were using in our query. As such, the model hasn’t provided an
insight or overview that using Table.Buffer on a data source
supporting query folding will cause it to break the fold. Again, this is not so
much of a problem if Table.Buffer is placed at the end of your query
for smaller datasets, but it is a problem if you add it nearer to the beginning
of the query, like in the previous example.
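To illustrate the pitfall, here is a minimal sketch, assuming a hypothetical
SQL source (the server, database, table, and column names are illustrative):
let
    Source = Sql.Database("ServerName", "SalesDb"),
    Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
    // Table.Buffer loads the whole table into memory and breaks query folding,
    // so the filter below can no longer be pushed back to the server.
    Buffered = Table.Buffer(Sales),
    Filtered = Table.SelectRows(Buffered, each [Sales] >= 1000)
in
    Filtered
Moving Table.Buffer after the filter (or omitting it entirely) would let
Table.SelectRows fold to the source as a WHERE clause.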
In the next section, you will learn about Microsoft Copilot, how to set up a
Power BI instance with Copilot activated, and also how you can use this
new AI technology to help clean and prepare your data.
This is great, but how can it help you clean and prepare data for analysis?
Well, Copilot can be leveraged on multiple data tools from within the
Microsoft Fabric platform. For those who are not aware, Power BI has now
become part of the Fabric platform. Depending on what type of license you
have for Power BI, you might already have access to this. Any customers
with Premium capacity licensing for the Power BI service would have
automatically been given access to Microsoft Fabric, and more importantly,
Copilot.
That being said, currently, Copilot has only been made available to
customers with a P1 (or above) Premium capacity or a Fabric license of F64
(or above), which is the equivalent licensing available directly from the
Azure portal.
If you would like to follow along with the next example, you will need to
set up a Fabric capacity within your Azure portal. Don’t worry, you can
pause this service when it’s not being used to ensure you are only charged
for the time you’re using it. Alternatively, follow the steps to see the
outcome:
1. Log in to the Azure portal that you set up in the previous section of this
chapter.
2. Select the search bar at the top of the page and type in Microsoft
Fabric. Select the service in the menu that appears below the search
bar, which should take you to the page where you can manage your
capacities.
3. Select Create a Fabric capacity. Note that you will need to use an
organizational account in order to create a Fabric capacity as opposed to
a personal account. You can sign up for a Microsoft Fabric trial for your
organization within the window. Further details on how to do this are
provided here: https://learn.microsoft.com/en-us/power-bi/enterprise/service-admin-signing-up-for-power-bi-with-a-new-office-365-trial.
4. Select the subscription and resource group you would like to use for this
Fabric capacity.
5. Then, under capacity details, you can enter your capacity name. In this
example, you can call it cleaningdata.
6. The Region field should populate with the region of your tenant, but you
can change this if you like. However, this may have implications on
performance, which it should warn you about with a message.
7. Set the capacity to F64.
8. Then, click on Review + create.
9. Review the terms and then click on Create, which will begin the
deployment of your capacity.
10. Once deployed, select Go to resource to view your Fabric capacity.
Take note that this will be active once deployed. Make sure to return
here after testing to pause or delete your Fabric capacity to prevent
yourself from getting charged for this service.
Now you will need to ensure you have activated the Copilot settings
from within your Fabric capacity. To do this, go to
https://app.powerbi.com/admin-portal/ to log in and access the admin
portal.
IMPORTANT TIP
If you can’t see the Tenant settings tab, then you will need to ensure you
have been set up as an admin within your Microsoft 365 admin center. If
you have just created a new account, then you will need to set this up.
Follow the next links to assign roles:
• https://learn.microsoft.com/en-us/microsoft-365/admin/add-users/assign-admin-roles
• https://learn.microsoft.com/en-us/fabric/admin/microsoft-fabric-admin
11. Scroll to the bottom of Tenant settings until you see the Copilot and
Azure OpenAI service (preview) section as shown:
Figure 15.9 – The tenant settings from within Power BI
12. Ensure both settings are set to Enabled and then click on Apply.
Now that you have created your Fabric capacity, let’s jump into an example
of how we can use Copilot to help with the cleaning of data. As we have
created a new capacity, you will have to create a new workspace that uses
this new capacity:
1. Navigate back to Workspaces using the left navigation bar. Then, select
New Workspace.
2. Name your workspace CleaningData(Copilot), then select the
dropdown for advanced configuration settings.
3. Ensure you have selected Fabric capacity in the license mode, which in
turn will have selected your capacity below, and then select Apply. You
have now created your capacity!
4. Now let’s use Fabric to create a new dataflow using the latest update of
Dataflow Gen2. Select New from within the workspace and then select
More options.
5. This will navigate you to a page with all the possible actions to create
items within your Fabric workspace. Under Data Factory, select
Dataflow Gen2.
6. This will load a Dataflow Gen2 instance called Dataflow 1. On the top
row, you should now see the Copilot logo within the Home ribbon as
highlighted:
Figure 15.10 – The ribbon within a Dataflow Gen2 instance
7. Select Copilot to open the Copilot window on the right-hand side of the
page. As you have not connected to any data, it will prompt you to select
get data.
8. Select Text/CSV and then enter the following into the File path or URL
box:
https://raw.githubusercontent.com/PacktPublishing/Data-Cleaning-with-Power-BI/main/Retail%20Store%20Sales%20Data.csv
9. Leave the rest of the settings as their defaults and click on Next.
10. This will then open a preview of the file data. Click on Create to load
this data into your Dataflow Gen2 instance. You will see that the Copilot
window will have now changed to prompt you as to what you would like
to do (if it hasn’t, then simply close the Copilot window and reopen):
Figure 15.11 – Data loaded into Dataflow Gen2
11. In this example, we can see that the data includes a column called Order
Date but we don’t have a field for the fiscal year. Enter the following
prompt to ask Copilot to help with the transformation:
There's a column in the data named Order Date,
which shows when an order was placed. However, I
need to create a new column from this that shows
the Fiscal Year. Can you extract the year from
the date and call this Fiscal Year? Set this new
column to type number also.
12. Proceed using the arrow key or press Enter. Copilot will then begin
working on your request. As you will see in the resulting output, the
model has added a function (or step) called Custom to the query that we
had selected.
13. Scroll to the far side and you will see that this has added a new column
called Fiscal Year.
14. Now add the following prompt to narrow down our data and press Enter:
Can you now remove all columns leaving me with
just Order ID, Order Date, Fiscal year,
category, and Sales?
15. This will then add another function or step called Choose columns.
Finally, add the following prompt to aggregate this data and press Enter:
Can you now group this data by Category, Fiscal
year, and aggregated by Sum of Sales?
As you can see, Copilot has now added another function called Custom
1 to the applied steps in this query, resulting in this table:
Figure 15.12 – The results from asking Copilot to transform the data
To view the M query that Copilot has added, select Advanced editor,
which will show the functions that Copilot has added for you:
Figure 15.13 – The resulting M query created by Copilot to carry out the
request transformations to clean the data
In this example, you explored the new technologies available with Copilot
and how they help to transform the data using tools such as Dataflow Gen2.
While Azure OpenAI opens new avenues for data cleaning in Power BI,
challenges such as API rate limits and cost considerations should be
navigated. Robust error handling, caching strategies, and efficient resource
management are crucial aspects to address potential roadblocks.
This section was just an introduction to the high-level challenges I’ve seen
in the short time that OpenAI has been around. However, if you’re
interested in learning more about these challenges and how to overcome
them, then I highly recommend taking a look at the articles added to the
Further reading section of this chapter.
By incorporating these workarounds, you can navigate the complexities of
data cleaning and preparation, maximizing the potential of innovative
technologies while mitigating potential pitfalls in your data-driven journey.
Summary
In summary, the key lessons from this chapter include the potential benefits
of integrating OpenAI into Power BI for data cleaning, the specific use
cases where it can be most effective, and the challenges that need to be
navigated for successful implementation. You now know how to optimize
Power Query transformations, handle large datasets, and address ethical and
compliance considerations. These lessons are crucial for practitioners
seeking to enhance data cleaning processes and analytics in Power BI.
Questions
1. What are the three key areas where Azure OpenAI can assist in
preparing data?
1. Optimizing query plans, handling large datasets, and dynamic query
adjustments
2. Cleaning textual data, identifying anomalies and outliers, and data
imputation strategies
3. Error handling strategies, guidance on complex transformations, and
caching strategies
4. Security and confidentiality, integration with existing workflows, and
model explainability
2. In the example M code conversation between a user and ChatGPT, what
step was taken to filter out products with sales less than $1,000?
1. The Table.Buffer function
2. A group by transformation
3. A filtering step before a group by transformation
4. A sorting step using the Table.Sort function
3. What challenge is associated with the dynamic nature of data
requirements when using AI models for data cleaning?
1. Model explainability
2. Over-reliance on AI recommendations
3. Ensuring query security and confidentiality
4. Adapting quickly to shifting requirements
4. What is one of the potential pitfalls when using AI recommendations
without critical scrutiny?
1. Over-reliance on AI recommendations
2. Lack of model explainability
3. Ensuring query security and confidentiality
4. Handling large datasets
Further reading
Putting it together
As you reach the end of this book, it’s important to reflect on the journey
you have been on since the first chapter. You’ve embarked on a
transformative journey through the intricate realm of cleaning and preparing
data with Power BI.
Consider this not as an endpoint, but a stepping stone into a vast landscape
of possibilities. The techniques and insights you’ve learned in this book will
serve as the foundation for your ongoing exploration into the dynamic
world of data analytics.
DAX: With DAX being such a critical part of Power BI, I recommend
building upon your knowledge of how to leverage DAX to build your
measures and more. Hands-On Business Intelligence with DAX
(https://www.packtpub.com/product/hands-on-business-intelligence-
with-dax/9781838824303) can help you build upon this.
M: As we have highlighted many times in this book, Power Query and
M are critical when it comes to data cleaning and preparation in Power
BI. This book has introduced you to lots of concepts, so I’d recommend
building on this with a book such as The Definitive Guide to Power
Query (M) (https://www.packtpub.com/product/the-definitive-guide-to-
power-query-m/9781835089729).
Data modeling: This book introduced data modeling as a key factor to
consider when you’re cleaning and preparing your data. Build upon the
concepts you were introduced to in Chapter 12, Data Modeling and
Managing Relationships, with a book such as Expert Data Modeling
with Power BI – Second Edition
(https://www.packtpub.com/product/expert-data-modeling-with-power-
bi-second-edition/9781803246246).
Leveraging R and Python: We introduced some basics on how you can
shape and enhance your data by using R and Python. A dedicated title
such as Extending Power BI with Python and R
(https://www.packtpub.com/product/extending-power-bi-with-python-
and-r/9781801078207) could help you grow this knowledge with further
examples and use cases.
Power BI management: I’d recommend learning more about the
management of Power BI deployment with a book such as Mastering
Microsoft Power BI – Second Edition
(https://www.packtpub.com/product/mastering-microsoft-power-bi-
second-edition/9781801811484).
The world of data awaits your exploration, and with the tools acquired here,
you are well equipped to navigate its depths and extract meaningful
insights.
Assessments
Chapter 1 – Introduction to Power BI Data Cleaning
1. B – 50-80%
2. D – Power Query, data modeling, DAX formulas
3. C – Data transformation and preparation
4. C – As a formula language for creating calculations and measures
5. B – To bridge the gap between relational databases and spreadsheet tools
6. D – It can be used for both calculations and querying within Power BI
7. B – It enhances clarity and reduces ambiguity
Chapter 9 – Advanced Data Cleaning Techniques
1. C – Fuzzy matching and fill down – They are the two essential
techniques discussed in the chapter for cleaning and preparing data using
the Query Editor in Power BI
2. C – Range from 0 to 1, indicating no to perfect similarity – In the context
of fuzzy matching, the similarity score ranges from 0 to 1, from no
similarity at all to a perfect match
3. D – When working with time series data and maintaining data continuity
– The fill down technique in Power BI’s Query Editor is particularly
useful in this scenario
4. D – Regularly validate the results of data cleaning efforts and maintain
documentation – This is a crucial best practice emphasized when
working with fuzzy matching and fill down in Power BI
5. C – To extend the capabilities of Power BI by leveraging external
ecosystems – This is the primary purpose of using custom data scripts in
languages such as R and Python in Power BI
6. B – To write R scripts – The Run R Script option in the Power Query
Editor from within Power BI Desktop is used to write R scripts
7. B – Integration with external data sources – This is the benefit of using R
and Python scripts in Power BI for data preparation
8. C – Anomaly Detection – The Anomaly Detection built-in machine
learning feature in Power BI is used for identifying and addressing
outliers in data
9. B – Automated data preparation with machine learning suggestions –
The purpose of AutoML in Power BI is to provide automated data
preparation with machine learning suggestions
10. C – Sentiment analysis, image recognition, and text analytics – They are
some of the most common use cases for AI Insights in Power BI
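The similarity score in answer 2 maps directly onto the Threshold option of
M's fuzzy join. Below is a minimal sketch, assuming two hypothetical
inline tables; the names and the 0.8 threshold are illustrative only, not an
example from the book:
let
    // Two small hypothetical tables with inconsistently spelled names
    Products = #table({"ProductName"}, {{"Coca-Cola"}, {"Mountain Dew"}}),
    Sales = #table(
        {"Product", "Amount"},
        {{"Coca Cola", 120}, {"Mountian Dew", 80}}
    ),
    // Threshold is the similarity cut-off: 0 matches almost anything,
    // 1 requires an exact match; 0.8 is Power Query's default
    Matched = Table.FuzzyNestedJoin(
        Sales, "Product",
        Products, "ProductName",
        "MatchedProduct",
        JoinKind.LeftOuter,
        [Threshold = 0.8]
    )
in
    Matched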
Chapter 10 – Creating Custom Functions
in Power Query
Index
As this ebook edition doesn't have fixed pagination, the page numbers
below are hyperlinked for reference only, based on the printed edition of
this book.
A
aggregated tables
cons 207
considerations 207
pros 207
AI
AI insights
examples 146
anomaly detection
examples 134
implementing 135-145
automation
creating 210-212
using 209
B
business users 21
C
calculated columns 41
considerations 44
versus measures 41
Calculation group 43
calendars
cardinality, types
ChatGPT
column groups
importance 250
using 250
columns
merging 38
splitting 36-38
C-suite executives 21
custom function
debugging 164
documentation 164
testing 164
D
data
cleaning, in Power BI 4, 5
filtering 176-183
organizing 250
process, initiating 9
reducing 176-183
structuring 250
DAX
as formula language 6
as query language 7
column 8
functions 8
measure 8
operators 8
table references 8
data analysts 20
data cleaning
data assessment 23
data transformations 24
data validation 24
defining 20
documentation 25
process, building 22
quality assurance 24
strategies 24
date/time functions 75
pivot/unpivot columns 75
query merging 76
rounding functions 75
Dataflow Gen1
data formatting
assessing 56-61
data in Power BI
data modeling 4
fundamentals 198
data normalization 61
assessing 61-63
data quality 12
best practices 15
sources 12, 13
establishing 15, 16
data relevance
assessing 55, 56
datasets 236
data steward 20
data timeliness 12
data validity 12
date tables
denormalization
assessing 61-63
considerations 66
cons 201
pros 201
dimension tables 200
cons 199
pros 199
drill-through context 44
duplicates
removing 30-33
benefits 108
F
fact tables 200
filter context 44
filters
using 246-249
functions 84, 85
H
high-quality data
I
identifiers 82
importing 198
cons 198
pros 198
incremental refreshes
cons 208
considerations 208
pros 208
in expression 85
IT professionals 20
L
large datasets
handling 218
let keyword 82
literals 85, 86
M
usage 216
using 133
measures 42
considerations 44
considering 185
memory usage
Microsoft Copilot
using 288-293
miles per gallon (mpg) 130
missing data
removing 33-36
M language
data sources 90
structure 82
use cases 86
M queries
combining 98-103
N
native M functions
using 183, 184
notifications
automating 261-264
O
one-to-many relationships
usage 215
one-to-one relationships
usage 215
OpenAI
P
page context 44
paginated reports
importance 232
parameters
creating 90, 91
defining 155
Power Automate
Power BI
data, cleaning 4, 5
Power Query 5, 6
integration 79
workflow 78
predictors 143
Python
Python scripts
benefits 129
using 128
Q
query
creating 76
merging 76
merging options 76
query steps 84
R
relationship autodetect 63
roles
row context 44
row groups
importance 250
use cases 250
using 250
R Project
R scripts
benefits 129
using 128
S
snapshots, of data
cons 203
pros 203
snowflaking 202
assessing 61-63
benefits 65
structure, M language 82
functions 84, 85
identifiers 82
in expression 85
let keyword 82
literals 85, 86
query steps 84
T
Table.Buffer
using 187-194
tables
aggregating 207
Table.Split
using 187-194
tablix 253
Tabular Editor 43
advantages 43
V
values
replacing 38-40
replacing, scenarios 39
variables
creating 90
packtpub.com
Subscribe to our online digital library for full access to over 7,000 books
and videos, as well as industry-leading tools to help you plan your personal
development and advance your career. For more information, please visit
our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks
and videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published,
with PDF and ePub files available? You can upgrade to the eBook version
at packtpub.com, and as a print book customer, you are entitled to a
discount on the eBook copy. Get in touch with us at
[email protected] for more details.
At www.packtpub.com, you can also read a collection of free technical
articles, sign up for a range of free newsletters, and receive exclusive
discounts and offers on Packt books and eBooks.
If you enjoyed this book, you may be interested in these other books by
Packt:
Microsoft Power BI Performance Best Practices
Bhavik Merchant
ISBN: 978-1-80107-644-9
Now that you’ve finished Data Cleaning with Power BI, we’d love to hear
your thoughts! If you purchased the book from Amazon, please click here
to go straight to the Amazon review page for this book and share your
feedback, or leave a review on the site you purchased it from.
Your review is important to us and the tech community and will help us
make sure we’re delivering excellent quality content.
Do you like to read on the go but are unable to carry your print books
everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version
of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code
from your favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts,
newsletters, and great free content in your inbox daily.