Shivakumar R. Goniwada

Introduction to Datafication
Implement Datafication Using AI and ML
Algorithms
Shivakumar R. Goniwada
Gubbalala, Bangalore, Karnataka, India

ISBN 978-1-4842-9495-6 e-ISBN 978-1-4842-9496-3


https://doi.org/10.1007/978-1-4842-9496-3

© Shivakumar R. Goniwada 2023

Standard Apress

The use of general descriptive names, registered names, trademarks,


service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general
use.

The publisher, the authors, and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress


Media, LLC, part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY
10004, U.S.A.
This book is dedicated to those who may need access to the resources and
opportunities many take for granted. May this book serve as a reminder
that knowledge and learning are powerful tools that can transform lives
and create new opportunities for those who seek them.
Introduction
The motivation to write this book goes back to the words of Swami
Vivekananda: “Everything is easy when you are busy, but nothing is
easy when you are lazy," and "Take up one idea, make that one idea your life, dream of it, think of it, live on that idea."
Data is increasingly shaping the world in which we live. The
proliferation of digital devices, social media platforms, and the Internet
of Things (IoT) has led to an explosion in the amount of data generated
daily. This has created new opportunities and challenges for everyone
as we seek to harness the power of data to drive innovation and
improve decision making.
This book is a comprehensive guide to the world of datafication and
its development, governing process, and security. We explore
fundamental principles and patterns, analysis frameworks, techniques
to implement artificial intelligence (AI) and machine learning (ML)
algorithms, models, and regulations to govern datafication systems.
We will start by exploring the basics of datafication and how it
transforms the world, and then delve into the fundamental principles
and patterns and how data are ingested and processed with an
extensive data analysis framework. We will examine the ethics,
regulations, and security of datafication in a real scenario.
Throughout the book, we will use real-world examples and case
studies to illustrate key concepts and techniques and provide practical
guidance in sentiment and behavior analysis.
Whether you are a student, analyst, engineer, technologist, or
someone simply interested in the world of datafication, this book will
provide you with a comprehensive understanding of datafication.
Any source code or other supplementary material referenced by the
author in this book is available to readers on GitHub
(https://github.com/Apress). For more detailed information, please visit http://www.apress.com/source-code.
Acknowledgments
Many thanks to my mother, S. Jayamma, and late father, G.M. Rudrapp,
who taught me the value of hard work, and to my wife, Nirmala, and
daughter, Neeharika, without whom I wouldn’t have been able to work
long hours into the night every day of the week. Last but not least, I’d
like to thank my friends, colleagues, and mentors at Mphasis,
Accenture, and other corporations who have guided me throughout my
career.
Thank you also to my colleagues Mark Powers, Celestin Suresh John,
Shobana Srinivasan, and other Apress team members for allowing me
to work with you and Apress, and to all who have helped this book
become a reality. Thank you to my mentors Bert Hooyman and Abubacker Mohamed, and thanks to my colleague Raghu Pasupuleti for providing key inputs.
Table of Contents
Chapter 1: Introduction to Datafication
What Is Datafication?
Why Is Datafication Important?
Data for Datafication
Datafication Steps
Digitization vs. Datafication
Types of Data in Datafication
Elements of Datafication
Data Harvesting
Data Curation
Data Storage
Data Analysis
Cloud Computing
Datafication Across Industries
Summary
Chapter 2: Datafication Principles and Patterns
What Are Architecture Principles?
Datafication Principles
Data Integration Principle
Data Quality Principle
Data Governance Principles
Data Is an Asset
Data Is Shared
Data Trustee
Ethical Principle
Security by Design Principle
Datafication Patterns
Data Partitioning Pattern
Data Replication
Stream Processing
Change Data Capture (CDC)
Data Mesh
Machine Learning Patterns
Summary
Chapter 3: Datafication Analytics
Introduction to Data Analytics
What Is Analytics?
Big Data and Data Science
Datafication Analytical Models
Content-Based Analytics
Data Mining
Text Analytics
Sentiment Analytics
Audio Analytics
Video Analytics
Comparison in Analytics
Datafication Metrics
Datafication Analysis
Data Sources
Data Gathering
Introduction to Algorithms
Supervised Machine Learning
Linear Regression
Support Vector Machines (SVM)
Decision Trees
Neural Networks
Naïve Bayes Algorithm
K-Nearest Neighbor (KNN) Algorithm
Random Forest
Unsupervised Machine Learning
Clustering
Association Rule Learning
Dimensionality Reduction
Reinforcement Machine Learning
Summary
Chapter 4: Datafication Data-Sharing Pipeline
Introduction to Data-Sharing Pipelines
Steps in Data Sharing
Data-Sharing Process
Data-Sharing Decisions
Data-Sharing Styles
Unidirectional, Asynchronous Push Integration Style
Real-Time and Event-based Integration Style
Bidirectional, Synchronous, API-led Integration Style
Mediated Data Exchange with an Event-Driven Approach
Designing a Data-Sharing Pipeline
Types of Data Pipeline
Batch Processing
Extract, Transform, and Load Data Pipeline (ETL)
Extract, Load, and Transform Data Pipeline (ELT)
Streaming and Event Processing
Change Data Capture (CDC)
Lambda Data Pipeline Architecture
Kappa Data Pipeline Architecture
Data as a Service (DaaS)
Data Lineage
Data Quality
Data Integration Governance
Summary
Chapter 5: Data Analysis
Introduction to Data Analysis
Data Analysis Steps
Prepare a Question
Prepare Cleansed Data
Identify a Relevant Algorithm
Build a Statistical Model
Match Result
Create an Analysis Report
Summary
Chapter 6: Sentiment Analysis
Introduction to Sentiment Analysis
Use of Sentiment Analysis
Types of Sentiment Analysis
Document-Level Sentiment Analysis
Aspect-Based Sentiment Analysis
Multilingual Sentiment Analysis
Pros and Cons of Sentiment Analysis
Pre-Processing of Data
Tokenization
Stop Words Removal
Stemming and Lemmatization
Handling Negation and Sarcasm
Rule-Based Sentiment Analysis
Lexicon-Based Approaches
Sentiment Dictionaries
Pros and Cons of Rule-Based Approaches
Machine Learning–Based Sentiment Analysis
Supervised Learning Techniques
Unsupervised Learning Techniques
Pros and Cons of the Machine Learning–Based Approach
Best Practices for Sentiment Analysis
Summary
Chapter 7: Behavioral Analysis
Introduction to Behavioral Analytics
Data Collection
Behavioral Science
Importance of Behavioral Science
How Behavioral Analysis and Analytics Are Processed
Cognitive Theory and Analytics
Biological Theories and Analytics
Integrative Model
Behavioral Analysis Methods
Funnel Analysis
Cohort Analysis
Customer Lifetime Value (CLV)
Churn Analysis
Behavioral Segmentation
Analyzing Behavioral Analysis
Descriptive Analysis with Regression
Causal Analysis with Regression
Causal Analysis with Experimental Design
Challenges and Limitations of Behavioral Analysis
Summary
Chapter 8: Datafication Engineering
Steps of AI and ML Engineering
AI and ML Development
Understanding the Problem to Be Solved
Choosing the Appropriate Model
Preparing and Cleaning Data
Feature Selection and Engineering
Model Training and Optimization
AI and ML Testing
Unit Testing
Integration Testing
Non-Functional Testing
Performance
Security Testing
DataOps
MLOps
Summary
Chapter 9: Datafication Governance
Importance of Datafication Governance
Why Is Datafication Governance Required?
Datafication Governance Framework
Oversight and Accountability
Model Risk, Risk Assessment, and Regulatory Guidance
Roles and Responsibilities
Monitoring and Reporting
Datafication Governance Guidelines and Principles
Ethical and Legal Aspects
Datafication Governance Action Framework
Datafication Governance Challenges
Summary
Chapter 10: Datafication Security
Introduction to Datafication Security
Datafication Security Framework
Regulations
Organization Concerns
Governance and Compliance
Business Access Needs
Datafication Security Measures
Encryption
Data Masking
Penetration Testing
Data Security Restrictions
Summary
Index
About the Author
Shivakumar R. Goniwada
is an author, inventor, chief enterprise
architect, and technology leader with
over 23 years of experience architecting
cloud-native, data analytics, and event-
driven systems. He works at Accenture and leads a highly experienced enterprise technology and cloud architect team. Over the years, he has led
many complex projects across industries
and the globe. He has ten software
patents in cloud computing, polyglot
architecture, software engineering, data
analytics, and IoT. He authored a book on
Cloud Native Architecture and Design. He
is a speaker at multiple global and in-
house conferences. Shivakumar has earned Master Technology
Architecture, Google Professional, AWS, and data science certifications.
He completed his executive MBA at the MIT Sloan School of
Management.
About the Technical Reviewer
Mohan H M
is a technical program manager and
research engineer (HMI, AI/ML) at
Digital Shark Technology, supporting the
research and development of new
products, promotion of existing
products, and investigation of new
applications for existing products.
In the past, he has worked as a
technical education evangelist and has
traveled extensively all over India
delivering training on artificial intelligence, embedded systems, and
Internet of Things (IoT) to research scholars and faculties in
engineering colleges under the MeitY scheme. He has also worked as an assistant professor at the T. John Institute of Technology.
Mohan holds a master’s degree in embedded systems and VLSI design from Visvesvaraya Technological University. He earned his
Ph.D. on the topic of non-invasive myocardial infarction prediction
using computational intelligence techniques from the same university.
He has been a peer reviewer for technical publications, including BMC
Informatics, Springer Nature, Scientific Reports, and more. His research
interests include computer vision, IoT, and biomedical signal
processing.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer
Nature 2023
S. R. Goniwada, Introduction to Datafication
https://doi.org/10.1007/978-1-4842-9496-3_1

1. Introduction to Datafication
Shivakumar R. Goniwada1
(1) Mantri Tranquil, #I-606, Gubbalala, Bangalore, Karnataka, India

A comprehensive look at datafication must first begin with its


definition. This chapter provides that and details why datafication plays
a significant role in modern business and data architecture.
Datafication has profoundly impacted many aspects of society,
including business, finance, health care, politics, and education. It has
enabled companies to gain insights into consumer behavior and
preferences, health care to improve patient outcomes, finance to
enhance consumer experience and risk and compliance, and educators
to personalize learning experiences.
Datafication helps you to take facts and statistics gained from
myriad sources and give them domain-specific context, aggregating and
making them accessible for use in strategy building and decision
making. This improves sales and profits, health outcomes, and influence over public policy.
Datafication is the process of turning data into a usable and
accessible format and involves the following:
Collecting data from myriad sources
Organizing and cleaning the data
Making it available for analysis
Analyzing the data by using artificial intelligence (AI) and machine learning (ML) models
Developing a deeper understanding of the datafication process and
its implications for individuals and society is essential. This requires a
multidisciplinary approach that brings together stakeholders from
various fields to explore the challenges and opportunities of
datafication and to develop ethical and effective strategies for managing
and utilizing data in the digital age.
This chapter will drill down into the particulars and explain how datafication benefits industries across the board. We will cover the following topics:
What is datafication?
How is datafication embraced across industries?
Why is datafication important?
What are elements of datafication?

What Is Datafication?
Datafication involves using digital technologies such as the cloud, data
products, and AI/ML algorithms to collect and process vast amounts of
data on human behavior, preferences, and activities.
Datafication converts various forms of information, such as text, images, audio recordings, comments, claps, and likes/dislikes, into a curated format so that the data can be easily analyzed and processed by multiple algorithms. This involves extracting relevant data from social media, hospitals, and the Internet of Things (IoT). These data are organized
into a consistent format and stored in a way that makes them accessible
for further analysis.
Everything around us, from finance, medical, construction, and
social media to industrial equipment, is converted into data. For
example, you create data every time you post to social media platforms
such as WhatsApp, Instagram, Twitter, or Facebook, and any time you
join meetings in Zoom or Google Meet, or even when you walk past a
CCTV camera while crossing the street. The notion differs from
digitization, as datafication is much broader than digitization.
Datafication can help you to understand the world more fully than
ever before. New cloud technologies are available to ingest, store,
process, and analyze data. For example, marketing companies use
Facebook and Twitter data to determine and predict sales. A digital twin uses data from industrial equipment to analyze the behavior of the machine.
Datafication also raises important questions about privacy, security,
and ethics. The collection and use of personal data can infringe on
individual rights and privacy, and there is a need for greater
transparency and accountability in how data are collected and used.
Overall, datafication represents a significant shift in how we live, work,
and act.

Why Is Datafication Important?


Datafication enables organizations to transform raw data into a format
that can be analyzed and used to gain insights, make informed business
decisions, improve patients’ health, and streamline supply-chain
management. This is crucial for every industry seeking to improve in today’s data-driven world. By using the processed data, organizations can
identify trends, gain insight into customer behavior, and discover other
key performance indicators using analytics tools and algorithms.

Data for Datafication


Data is available everywhere, but what type of data you require for
analysis in datafication is crucial and helps you to understand hidden
values and challenges. Data can come from a wide range of sources, but
the specific data set will depend on the particular context and the goal
of the datafication process.
Today, data are created not only by people and their activities in the
world, but also by machines. The amount of data produced is almost
out of control.
For example:
Social media data such as posts and comments are structured data
that can be easily analyzed for sentiment and behavior. This involves
extracting text from the posts and comments and identifying and
categorizing any images, comments, or other media that are part of it.
In the medical context, datafication might involve converting medical
records and other patient information into structured data that can
be used for analysis and research. This involves extracting
information about diagnoses, treatments, and other medical reports.
In the e-commerce context, datafication might involve converting
users’ statistics and other purchase information into structured data
that can be used for analysis and recommendations.
In summary, data can come from a wide range of sources, and how it
is used will depend on the specific context and goals of the datafication
process.
Data constantly poses new challenges in terms of storage and
accessibility. The need to use all of this data is pushing us to a higher level of technological advancement, whether we like it or not.
Datafication requires new forms of integration to uncover large
hidden values from extensive collections that are diverse, complex, and
of a massive scale. According to Kepios (https://kepios.com/), there were 4.80 billion social media users worldwide as of April 2023, or 59.0 percent of the world population, and approximately 227 million new users join every year.
The following are a few statistics regarding major social media
applications as of the writing of this book:
Facebook has 3.46 billion monthly visitors.
YouTube’s potential advertising reach is 7.55 billion people (monthly
average).
WhatsApp has at least 3 billion monthly users.
Instagram’s potential advertising reach is approximately 2.13 billion
people.
Twitter’s possible advertising reach is approximately 2.30 billion
people.

Datafication Steps
For datafication, as defined by DAMA (Data Management Association),
you must have a clear set of data, well-defined analysis models, and
computing power. To obtain a precise collection of data, relevant
models, and required computing power, one must follow these steps:
Data Harvesting: This step involves obtaining data in a real-time
and reliable way from various sources, such as databases, sensors,
files, etc.
Data Curation: This step involves organizing and cleaning the data
to prepare it for analysis. You need to ensure that the data collected
are accurate by removing errors, inconsistencies, and duplicates with
a standardized format.
Data Transformation: This step involves converting data into a
suitable format for analysis. This step helps you transform the data
into a specific form, such as dimensional and graph models.
Data Storage: This step involves storing the data after
transformation in storage, such as a data lake or data warehouse, for
further analysis.
Data Analysis: This step involves using statistical and analytical
techniques to gain insights from data and identify trends, patterns,
and correlations in the data that help with predictions and
recommendations.
Data Dissemination: This step involves sharing the dashboards,
reports, and presentations with relevant stakeholders.
Cloud Computing: This step provides the necessary infrastructure
and tools for the preceding steps.
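To make these steps concrete, the following is a minimal Python sketch that chains them into a toy pipeline. The function names, the hard-coded records, and the list standing in for a warehouse are illustrative assumptions, not part of any particular tool.

# A minimal sketch of the datafication steps as a simple pipeline.
# The function names and data shapes are illustrative assumptions.

def harvest():
    """Data harvesting: pull raw records from a source (hard-coded here)."""
    return [
        {"user": "u1", "text": "Great product!!", "likes": "12"},
        {"user": "u2", "text": None, "likes": "3"},
        {"user": "u1", "text": "Great product!!", "likes": "12"},  # duplicate
    ]

def curate(records):
    """Data curation: drop duplicates and records with missing text."""
    seen, curated = set(), []
    for r in records:
        key = (r["user"], r["text"])
        if r["text"] and key not in seen:
            seen.add(key)
            curated.append(r)
    return curated

def transform(records):
    """Data transformation: convert fields to analysis-friendly types."""
    return [{"user": r["user"], "text": r["text"].lower(), "likes": int(r["likes"])}
            for r in records]

def store(records, warehouse):
    """Data storage: append to a stand-in 'warehouse' (a list here)."""
    warehouse.extend(records)

def analyze(warehouse):
    """Data analysis: a trivial metric, average likes per post."""
    return sum(r["likes"] for r in warehouse) / len(warehouse)

warehouse = []
store(transform(curate(harvest())), warehouse)
print(f"Average likes: {analyze(warehouse):.1f}")  # Data dissemination: report the result

In a real system, each function would be backed by the ingestion, storage, and analytics services described in the following sections.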

Digitization vs. Datafication


For a better understanding of datafication, it can be helpful to contrast
it with digitization. This may help you to better visualize the
datafication process.
Digitization is a process that has taken place for decades. It entails
the conversion of information into a digital format; for example, music
to MP3/MP4, images to JPG/PNG, manual banking process to mobile
and automated web process, manual approval process to automatic
BPM workflow process, and so on.
Datafication, on the other hand, involves converting data into a
usable, accessible format. This consists of collecting data from various
sources and formats, organizing and cleansing it, and making it
available for analysis. The primary goal of datafication is to help the
organization make data-driven decisions, allowing it to gain insights
and knowledge from the data.
Datafication helps monitor what each person does. It does so with
advanced technologies that can monitor and measure things
individually.
In digitization, you convert many physical forms into digital forms, which are accessible to a computer. Similarly, in datafication, you ingest activities and behavior and convert them into a virtual structure that can be used within formal systems.
However, many organizations realize that more than simply
processing data is needed to support business disruption. It requires
quality data and the application of suitable algorithms. Modern
architecture and methodologies must be adopted to address these
challenges to create datafication opportunities.

Types of Data in Datafication


The first type of data is content, which can be user likes, comments on
blogs and web forums, visible hyperlinks in the content, user profiles
on social networking sites, news articles on news sites, and machine
data. The data format can be structured or unstructured.
The second type of data is the behavior of objects and the runtime
operational parameters of industrial systems, buildings, and so forth.
The third type is time series data, such as stock price, weather, or
sensor data.
The fourth type of data is network structured data, such as
integrated networked systems in an industrial unit, such as coolant
pipes and water flow. This data type is beneficial because it provides for
overall media analysis, entire industrial function, and so on.
The fifth type is personal data, such as your health, fitness, sleep time, chat conversations, smart home data, and health monitoring device data.

Elements of Datafication
As defined by DAMA, Figure 1-1 illustrates the seven critical elements
of the datafication architecture used to develop the datafication
process. Datafication will only be successful if at least one of the steps is
included.
Figure 1-1 Data elements

Data Harvesting
Data harvesting is extracting data from a given source, such as social
media, IoT devices, or other various data sources.
Before harvesting any data, you need to analyze it to identify the
source and software tools needed for harvesting.
First, the data is undesirable if it is inaccurate, biased, confidential, or irrelevant. Harvested information tends to be more objective and reliable than familiar data sources. However, the disadvantage is that it is difficult to know the users’ demographic and psychological variables for social media data sources.
Second, harvesting must be automatic, real-time, and streaming, and it must be able to handle large-scale data sources efficiently.
Third, the data are usually fine-grained and available in real time. Text mining techniques are used to preprocess raw text, text processing techniques are used to preprocess essential texts, and video processing techniques are used to preprocess photos and videos for further analysis.
Fourth, the data can be ingested in real time or in batches. In real-time ingestion, each data item is imported as soon as the source changes it. When data are ingested in batches, the data elements are imported in discrete chunks at periodic intervals.
Various data harvesting methods can be used depending on the data
source type, as follows:
IoT devices typically involve collecting data from IoT sensors and
devices using protocols such as MQTT, CoAP, HTTP, and AMQP.
Social media platforms such as Facebook, Twitter, LinkedIn,
Instagram, and others use REST API, streaming, Webhooks, and
GraphQL.
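As an illustration of the IoT case above, the following is a minimal Python sketch that subscribes to an MQTT topic using the paho-mqtt client (1.x-style callback API; newer versions require a CallbackAPIVersion argument). The broker address and topic name are hypothetical placeholders.

# A minimal sketch of harvesting IoT sensor data over MQTT with paho-mqtt.
import json
import paho.mqtt.client as mqtt

def on_message(client, userdata, message):
    # Each message payload is assumed to be a JSON sensor reading.
    reading = json.loads(message.payload.decode("utf-8"))
    print(f"{message.topic}: {reading}")
    # In a real pipeline, the reading would be pushed to a stream or store here.

client = mqtt.Client()  # paho-mqtt 1.x style; 2.x needs a CallbackAPIVersion
client.on_message = on_message
client.connect("broker.example.com", 1883)      # hypothetical broker
client.subscribe("factory/line1/temperature")   # hypothetical topic
client.loop_forever()

A social media source would follow the same shape, with a REST or streaming API client replacing the MQTT subscription.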

Data Curation
Data curation organizes and manages data collected through ingestion
from various sources. This involves organizing and maintaining data in
a way that makes it accessible and usable for data analysis. This
involves cleaning and filtering data, removing duplicates and errors,
and properly labeling and annotating data.
Data curation is essential for ensuring that the data are accurate,
consistent, and reliable, which is crucial for data analysis.
The following are the few steps involved in data curation:
Data Cleaning: Once the data is harvested, it must be cleaned to
remove errors, inconsistencies, and duplicates. This involves
removing missing values, correcting spelling errors, and
standardizing data formats.
Data Transformation: After the data has been cleaned, it needs to
be transformed into a format suitable for analysis. This involves
aggregating data, creating new variables, and so forth. For example,
you might have a data set of pathology reports with variables that
include such elements as patient ID, date of visit, test ID, test
description, and test results. You want to transform this data set into
a format that shows each patient’s overall health condition. To do this, you need to alter the harvested data for analysis with transformations such as creating a new variable for test category, aggregating test data per patient per year, summarizing data by group (e.g., hemoglobin), and so on.
Data Labeling: Annotating data with relevant metadata, such as
variable names and data descriptions.
Data Quality Test: In this step, you need to ensure the data is accurate and reliable by using various checks, such as statistical tests.
The overall objective of data curation is to reduce the time it takes
to obtain insight from raw data by organizing and bringing relevant
information together for further analysis.
The steps involved in data curation are organizing and cataloging
data, ensuring data quality, preserving data, and providing access to
data.
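The following is a minimal pandas sketch of the cleaning and transformation steps described above, applied to a toy pathology data set. The column names and values are illustrative assumptions.

# A minimal pandas sketch of data curation: cleaning and transformation.
import pandas as pd

raw = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P2", "P2"],
    "visit_date": ["2023-01-10", "2023-01-10", "2023-02-05", "2023-03-11", None],
    "test_desc":  ["Hemoglobin", "Hemoglobin", "hemoglobin", "Glucose", "Glucose"],
    "result":     [13.5, 13.5, 11.2, 105.0, None],
})

# Data cleaning: drop duplicates and rows with missing values,
# and standardize the test description format.
clean = raw.drop_duplicates().dropna(subset=["visit_date", "result"]).copy()
clean["test_desc"] = clean["test_desc"].str.strip().str.title()

# Data transformation: aggregate results per patient and test group.
summary = (clean
           .groupby(["patient_id", "test_desc"])["result"]
           .agg(["mean", "count"])
           .reset_index())
print(summary)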

Data Storage
Data storage stores actual digital data on a computer with the help of a
hard drive, solid-state drive, and related software to manage and
organize the data.
Data storage is the actual physical storage of datafication data. More than 2.5 quintillion bytes of data are created daily, and roughly 2 MB of data are created every second for every person. These numbers come from users searching for content on the internet, browsing social media networks, posting blogs, photos, comments, and status updates, watching videos, downloading images, streaming songs, and so on.
To make a business decision, the data must be stored in a way that is
easier to manage and access, and it is essential to protect data against
cyber threats.
For IoT, the data need to be collected from sensors and devices and
stored in the cloud.
Several types of database storage exist, including relational
databases, NoSQL databases, in-memory databases, and cloud
databases. Each type of database storage has advantages and
disadvantages, but the best choice for datafication is cloud databases
that involve data lakes and warehouses.
Data Analysis
Data analysis refers to analyzing a large set of data to discover different
patterns and KPIs (Key Performance Indicators) of an organization. The
main goal of analytics is to help organizations make better business
decisions and future predictions. Advanced analytics techniques such as
machine learning models, text analytics, predictive analytics, data
mining, statistics, and natural language processing are used. With these
ML models, you uncover hidden patterns, unknown correlations,
market trends, customer preferences, feedback about your new FMCG
(Fast Moving Consumer Goods) products, and so on.
The following are the types of analytics that you can process using
ML models:
Prescriptive: This type of analytics helps to decide what action should be taken and examines data to answer questions such as “What should be done?” or “What can we do to make our product more attractive?” This helps to find answers to various problems, such as where to focus a treatment.
Predictive: This type of analytics helps to predict the future or what
might happen, such as emphasizing the business relevance of the
resulting insights and use cases, such as sales and production data.
Diagnostic: This type of analytics helps to analyze past situations,
such as what went wrong and why it happened. This helps to
facilitate correction in the future; for example, weather prediction
and customer behavior.
Descriptive: This type of analytics helps to analyze current and
future use cases, such as behavioral analysis of users.
Exploratory: This type of analytics involves visualization.
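As a small illustration of the predictive type listed above, the following Python sketch fits a simple linear regression to toy monthly sales figures and projects the next month. The numbers are invented for illustration and are not real sales data.

# A minimal sketch of predictive analytics with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5], [6]])   # feature: month index
sales  = np.array([120, 135, 150, 160, 172, 185])   # target: units sold

model = LinearRegression().fit(months, sales)
next_month = model.predict(np.array([[7]]))[0]
print(f"Projected sales for month 7: {next_month:.0f} units")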

Cloud Computing
Cloud computing is the use of computing resources delivered over the
internet and has the potential to offer substantial opportunities in
various datafication scenarios. It is a flexible delivery platform for data,
computing, and other services. It can support many architectural and
development styles, from extensive, monolithic systems to sizeable
virtual machine deployments, nimble clusters of containers, a data
mesh, and large farms of serverless functions.
The primary services of cloud offerings for data storage are as
follows:
Data storage as a service
Streaming services for data ingestion
Machine learning workbench for analysis

Datafication Across Industries


Datafication is a valuable resource for businesses and organizations
seeking to gain insights into customer behavior, market trends, patient health and healing progress, and more.
Datafication is the process of converting various types of data into a
standardized format that can be used for analysis and decision making
and has become increasingly important across industries as a means of
leveraging data.
In the health-care industry, datafication is used to improve patient
outcomes and reduce costs. By collecting and analyzing patient data,
including pathology tests, medical histories, vital signs, and lab results,
health-care providers are able to optimize treatments and improve
patient care.
In the finance industry, datafication is used to analyze financial data,
such as transaction history, risk, fraud management, personalized
customer experience, and compliance.
In the manufacturing industry, datafication is used to analyze
production data, machine data to improve the production process,
digital twins, etc.
In the retail industry, datafication is used to analyze customer
behavior and preferences to optimize pricing strategies and
personalized customer experience.

Summary
Datafication is the process of converting various types of data and
information into a digital format that can easily be processed and
analyzed. With datafication, you can increase your organization’s
footprint by using data effectively for decision making. It helps to
improve operational efficiency and provides input to the manufacturing
hub to develop new products and services.
Overall, data curation is the key component of effective datafication,
as it ensures that the data is accurate, complete, and reliable, which is
essential for making decisions and gleaning meaningful insights.
In this chapter, I described datafication and discussed the types of data involved in datafication, datafication steps, and datafication elements. The next chapter provides more details on the principles, patterns, and methodologies needed to realize datafication.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
S. R. Goniwada, Introduction to Datafication
https://doi.org/10.1007/978-1-4842-9496-3_2

2. Datafication Principles and Patterns


Shivakumar R. Goniwada1
(1) Mantri Tranquil, #I-606, Gubbalala, Bangalore, Karnataka, India

Principles are guidelines for the design and development of a system. They reflect the level of
consensus among the various elements of your system. Without proper principles, your
architecture has no compass to guide its journey toward datafication.
Patterns are tried-and-tested solutions to common design problems, and they can be used as a starting point for developing a datafication system.
The processes involved in datafication are to collect, analyze, and interpret the vast amount
of information from a range of sources, such as social media, Internet of Things (IoT) sensors,
and other devices. The principles and patterns underlying datafication must be understood to
ensure that it benefits all.
The patterns are reusable solutions to commonly occurring problems in software design.
These patterns provide a template for creating designs that solve specific problems while also
being flexible to adapt to different contexts and requirements.
This chapter provides you with an overview of the principles and patterns shaping the
development of datafication. It will examine the ethical implication of these technologies for
society. Using these principles and patterns, you can develop datafication projects that are fair
and transparent and that perform well.

What Are Architecture Principles?


A principle is a law or a rule that must be, or usually is, followed when making critical architectural decisions. The architecture and design principles of datafication play a crucial role
in guiding the software architecture work responsible for defining the datafication direction.
While following these principles, you must also align with the existing enterprise’s regulations,
primarily those related to data and analytics.
The data and analytics architecture principles are a subset of the overall enterprise
architecture principles that pertain to the rules surrounding your data collection, usage,
management, integration, and analytics. Ultimately, these principles keep your datafication
architecture consistent, clean, and accountable and help to improve your overall datafication
strategy.

Datafication Principles
As mentioned in the previous chapter, datafication analyzes data from various sources, such as
social media, IoT, and other digital devices. For a successful and streamlined datafication
architecture, you must define principles related to data ingestion, data streaming, data quality,
data governance, data storage, data analysis, visualization, and metrics. These principles ensure
that data and analytics are used in a way that is aligned with the organization’s goals and
objectives.
Examples of datafication principles include the use of accurate and up-to-date data, the use
of a governance framework, the application of ethical standards, and the application of quality
rules.
The following are a few principles that help you design the datafication process:

Data Integration Principle


Before big data and streaming technology, data movement was simple. Data moved linearly from
static structured databases and static APIs to data warehouses. Once you built an integration
pipeline in this stagnant world, it operated consistently because data moved like trains on a
track.
In datafication, data have upended the traditional train track–based approach in favor of a modern, smart city traffic signal–based approach. To move data at the speed of business and
unlock the flexibility of modern data architecture, the integration must be handled such that it
has the ability to monitor and manage performance continually. For modern data integration,
your data movement must be prepared for the following:1
Be capable of doing streaming, batch, API-led, and micro-batch processing
Support structured, semi-structured, and unstructured data
Handle schema and semantic changes without affecting the downstream analysis
Respond to changes from sources and application control
The following principles will help you design modern data integration. For example, in
health-care data analysis, you need to integrate various health-care systems in the hospitals,
such as electronic medical records and insurance claims data. In financial data analysis, to
generate trends and statistics of financial performance, you need to integrate various data
systems, such as payment processors, accounting systems, and so forth.
Design for Both Batch and Streaming: While you are building for social media and IoT,
which capitalize on streaming and API-led data, you must account for the fact that these data
often need to be joined with or analyzed against historical data and other batch sources
within an enterprise.
Structured, Semi-structured, and Unstructured Data: Data integration combines data from
multiple sources to provide a unified view. To achieve this, data integration software must be
able to support this.
Handle Schema and Semantics Changes: In data integration, it is common for the schema and semantics of the data to change over time as new data sources are added or existing sources are modified. These changes affect the downstream analysis, making it difficult to maintain the data’s integrity and the analysis’s accuracy. It is essential to use a flexible and
extensible data integration architecture to handle this. You can use data lineage tools to
achieve this.
Respond to Changes from Sources: In data integration, responding to the source side
requires technical and organizational maturity. Using CDC (Change Data Capture) and APIs
(Application Programming Interface) and implementing the best change management ensures
that data integration is responsive, efficient, and effective.
Use Low-Code/No-Code Concepts: Writing custom code to ingest data from the source into your data store has been commonplace. Low-code and no-code integration tools can reduce this custom effort and make pipelines easier to build and maintain.
Sanitize Raw Data upon Data Harvest: Storing raw inputs invariably leads you to have
personal data and otherwise sensitive information posing some compliance risks (use only
when it is needed). Sanitizing data as close to the source as possible makes data analytics
productive.
Handle Data Drift to Ensure Consumption-Ready Data: Data drift refers to the process of
data changing over time, often in ways that are unpredictable and difficult to detect. This drift
can occur for many reasons, such as changes in the data source, changes in data processing
algorithms, or changes in the system’s state. This kind of drift can impact the quality and
reliability of data and analytics. Data drift increases costs, causes delays in time to analysis,
and leads to poor decisions based on incomplete data. To mitigate this, you need to analyze
and choose the right tools and software to detect and react to changes in the schema and keep
data sources in sync.
Cloud Design: Designing integration for the cloud is fundamentally different from designing for on-premises systems. Enterprises often put raw data into object stores without knowing the
end analytical intent. Legacy tools for data integration often lack the level of customization
and interoperability needed to take full advantage of cloud services.
Instrument Everything: Obtaining end-to-end insight into data systems will be challenging.
End-to-end instrumentation helps to manage data movements. This instrumentation is
needed for time series analysis of a single data flow to tease out changes over time.
Implement the DataOps Approach: Traditional data integration was suitable for the
waterfall delivery approach but may not work for modern-day engineering principles. Modern
dataflow tools provide an integrated development environment for continuous use throughout the dataflow life cycle.

Data Quality Principle


Ensuring you have high-quality data is central to the data management platform. The principle
of data quality management is a set of fundamental understandings, standards, rules, and values.
It is the core ingredient of a robust data architecture. Data quality is critical for building an
effective datafication architecture. Well-governed, high-quality data helps create accurate
models and robust schemas.
There are five characteristics of data quality, as follows:
Accuracy: Is the information captured in every detail?
Completeness: How comprehensive is the data?
Reliability: Does the data contradict other trusted resources?
Relevance: Do you need this data?
Timeliness: Is this data obsolete or up-to-date, and can it be used for real-time analysis?
To address these issues, several steps can be taken to improve the quality, as follows:
Identify the type of source and assess its reliability.
Check the incoming data for errors, inconsistencies, and missing values.
Use data cleaning techniques to fix any issues and improve the quality.
Use validation rules to ensure data is accurate and complete. This could be an iterative
approach.
Monitor regularly and identify changes.
Apply data governance and management process.
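The following is a minimal pandas sketch of such checks (completeness, duplicates, and validity rules) on a toy data set. The column names and validation thresholds are illustrative assumptions rather than a full data quality framework.

# A minimal sketch of simple data quality checks with pandas.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age":         [34, -1, 27, None],
    "country":     ["IN", "US", "US", "UK"],
})

report = {
    "missing_values": int(df.isna().sum().sum()),
    "duplicate_ids":  int(df["customer_id"].duplicated().sum()),
    "invalid_ages":   int(((df["age"] < 0) | (df["age"] > 120)).sum()),
}
print(report)  # e.g., {'missing_values': 1, 'duplicate_ids': 1, 'invalid_ages': 1}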

Data Quality Tools


Data quality is a critical capability of datafication, as the accuracy and reliability of data are
essential for an accurate outcome. These tools and techniques can ensure that data is correct,
complete, and consistent and can help identify and remediate quality issues. There are various
tools and techniques to address data quality. Here are a few examples:
Data Cleansing tools: These help you identify and fix errors, inconsistencies, and missing
values.
Data Validation tools: These tools help you to check data consistency and accuracy.
Data Profiling tools: These will provide detailed data analysis such as data types, patterns,
and trends.
Data Cataloging tools: These tools will create a centralized metadata repository, including
data quality metrics, data lineage, and data relationships.
Data Monitoring and Alerting tools: These track data quality metrics and alert the
governance team when quality issues arise.

Data Governance Principles


Data are an increasingly significant asset as organizations implementing datafication move up
the digital curve as they focus on big data and analytics. Data governance helps organizations
better manage data availability, usability, integrity, and security. It involves establishing policies
and procedures for collecting, storing, and using data.
In modern architecture, especially for datafication, data are expected to be harvested and
accessed anywhere, anytime, on any device. Satisfying these expectations can give rise to
considerable security and compliance risks, so robust data governance is needed to meet the
datafication process.
Data governance is about bringing data under control and keeping it secure. Successful data
governance requires understanding the data, policies, and quality of metadata management, as
well as knowing where data resides. How did it originate? Who has access to it? And what does it
mean? Effective data governance is a prerequisite to maintaining business compliance,
regardless of whether that compliance is self-imposed by an organization or comes from global
industry practices.
Data governance includes how data delivery and access take place. How is data integrity
managed? How does data lineage take place? How is data loss prevention (DLP) configured?
How is security implemented for data?
Data governance typically involves the following:
Establish the data governance team and define its roles and responsibilities.
Develop a data governance framework that includes policies, standards, and procedures.
Data consistency across user behavior ensures completeness and accuracy in generating
required KPIs (Key Performance Indicators).
Identify critical data assets and classify them according to their importance.
Define compliance matrices like GDPR, etc.
Fact-based decisions based on advanced analytics become real-time events, and data governance ensures data veracity, which builds the confidence an organization needs to achieve the real-time goal for decision making.
Consider using data governance software such as Alation, Collibra, Informatica, etc.

Data Is an Asset
Data is an asset that has value to organizations and must be managed accordingly. Data is an
organizational resource with real measurable value, informing decisions, improving operations,
and driving business growth. Organizations’ assets are carefully managed, and data are equally
important as physical or digital assets. Quality data are the foundation of the organization’s
decisions, so you must ensure that the data are harvested with quality and accuracy and are
available when needed. The techniques used to measure data value are directly related to the accuracy of the outcome of the decision, and that accuracy depends on the quality, relevance, and reliability of the data used in the decision-making process. The common techniques are data quality assessment, data relevance analysis, cost-benefit analysis, impact analysis, and different forms of analytics.

Data Is Shared
Different organizational stakeholders will access the datafication data to analyze various KPIs.
Therefore, the data can be shared with relevant teams across an organization. Timely access to
accurate and cleansed data is essential to improving the quality and efficiency of an
organization’s decision-making ability. The speed of data collection, creation, transfer, and
assimilation is driven by the ability of an organization’s process and technology to capture social
media or IoT sensor data.
To enable data sharing, you must develop and abide by a common set of policies, procedures,
and standards governing data management and access in the short and long term. It would be
best if you had a clear blueprint for data sharing; there should not be any compromise of the
confidentiality and privacy of data.

Data Trustee
Each data element in a datafication architecture has a trustee accountable for its quality. As the
degree of data sharing grows and business units within an organization rely upon information, it
becomes essential that only the data trustee makes decisions about the content of the data. In
this role, the data trustee is responsible for ensuring that the data used are following applicable
laws, regulations, or policies and are handled securely and responsibly. The specific
responsibilities of a data trustee will vary depending on the type of data being shared and the
context in which it is being used.
The trustee and steward are different roles. The trustee is responsible for the accuracy and
currency of the data, while the steward may be broader and include standardization and
definition tasks.

Ethical Principle
Datafication focuses on and analyzes social media, medical, and IoT data. Working with these data requires respect for human dignity, which involves considering the potential consequences of data use and ensuring that it is handled fairly, responsibly, and transparently. This principle reflects the fundamental
ethical requirement that people be treated in a way that respects their dignity and autonomy as
human individuals. When analyzing social media and medical data, we must remember that data
also affects, represents, and touches people. Personal data are entirely different from any
machine’s raw data, and the unethical use of personal data can directly influence people’s
interactions, places in the community, personal product usage, etc. It would be best if you
considered various laws across the globe to meet ethics needs while designing your system.
There are various laws in place globally; here are a few:
GDPR Principles (Privacy): Its focus is protecting, collecting, and managing personal data; i.e., data about individuals. It applies to all companies and organizations in the EU and companies outside of Europe that hold or otherwise process personal data. The following are a few guidelines from the GDPR. For more details, refer to https://gdpr-info.eu/:
Fairness, Lawfulness, Transparency: Personal data shall be processed lawfully, fairly, and
transparently about the data subject.
Purpose Limitation: Personal data must be collected for specified, explicit, and legitimate
purposes and not processed in an incompatible manner.
Data Minimization: Personal data must be adequate, relevant, and limited to what is
necessary for the purpose they are processed.
Accuracy: Personal data must be accurate and, where necessary, kept up to date.
Integrity and Confidentiality: Data must be processed with appropriate security of the
personal data, including protection against unauthorized and unlawful processing.
Accountability: Data controllers must be responsible for, and able to demonstrate, compliance.
PIPEDA (Personal Information Protection and Electronic Documents Act): This applies
to every organization that collects, uses, and disseminates personal information. The following
are the statutory obligations of PIPEDA; for more information, visit
https://www.priv.gc.ca/:
Accountability: Organizations are responsible for personal information under its control and
must designate an individual accountable for compliance.
Identifying Purpose: You must specify the purpose for which personal information is
collected.
Consent: You must obtain the knowledge and consent of the individual for the collection.
Accuracy: Personal information must be accurate, complete, and up to date.
Safeguards: You must protect personal information.
Human Rights and Technology Act: The U.K. government proposed this act. It would
require companies to conduct due diligence to ensure that their datafication system does not
violate human rights and to report any risk or harm associated with the technology. You can find
more information at https://www.equalityhumanrights.com/. The following are a few
guidelines:
Human Rights Impact Assessment: Conduct a human rights impact assessment before
launching new services.
Transparency and Accountability: You must disclose information about technology services,
including how you collect the data and the algorithms you use to make decisions affecting
individual rights.
Universal Guidelines for AI: This law provides a set of guidelines for AI/ML and was
developed by IEEE (Institute of Electrical and Electronics Engineers). These guidelines include
transparency, accountability, and safety. You can find more information at
https://thepublicvoice.org/ai-universal-guidelines/. The following are a few
guidelines:
Transparency: AI should be transparent in its decision-making process, and the data and algorithms used in AI should be open and explainable.
Safety and Well-being: AI should be designed to ensure the safety and well-being of individuals and society.
There are various laws available for each country, and we suggest following the laws and
compliance requirements before processing any data for analysis.

Security by Design Principle


Security by design also means privacy by design and is a concept in which security and privacy
are considered fundamental aspects of the design.
This principle emphasizes the importance of keeping security and privacy at the core of a
product system.
The following practices help with the design and development of a datafication architecture:2
Minimize Attack Surface Area: Restricts a user’s access to services.
Establish Secure Defaults: Strong security rules on registering users to access your services.
The Principle of Least Privilege: The user should have minimum privileges needed to
perform a particular task.
The Principle of Defense Depth: Add multiple layers of security validations.
Fail Securely: Failure is unavoidable and therefore you want it to fail securely.
Don’t Trust Services: Do not trust third-party services without implementing a security
mechanism.
Separation of Duties: Prevent individuals from acting fraudulently.
Avoid Security by Obscurity: There should be sufficient security controls in place to keep your application safe without hiding core functionality or source code.
Keep Security Simple: Avoid the use of very sophisticated architecture when developing
security controls.
Fix Security Issues Correctly: Developers should carefully identify all affected systems.

Datafication Patterns
Datafication is the process of converting various aspects of otherwise invisible data into digital data that can be analyzed and used for decision making. As I explained in Chapter 1, “Introduction to Datafication,” datafication has become increasingly prevalent in recent years, as advances in technology have made it easier to collect, store, and analyze large amounts of data.
The datafication patterns are the common approaches and techniques used in the process of
datafication. These patterns involve the use of various technologies and methods, such as digitization, aggregation, visualization, AI, and ML, to convert data into useful insights.
By understanding these patterns, you can effectively store, collect, analyze, and use data to
drive decision making and gain a competitive edge. By leveraging these patterns, you optimize
storage operations.
Each solution is stated so that it gives the essential fields of the relationships needed to solve
the problem, but in a very general and abstract way so that you can solve the problem for
yourself by adapting it to your preferences and conditions.
Patterns have the following characteristics:
They can be seen as building blocks of more complex solutions.
They function as a common language used by technology architects and designers to describe solutions.3

Data Partitioning Pattern


Partitioning allows a table, index, or index-organized table to be subdivided into smaller chunks,
where each chunk of such a database object is called a partition. This is often done for reasons of
efficiency, scalability, or security.
Data partitioning divides the data set and distributes the data over multiple servers or
shards. Each shard is an independent database, and collectively the shards make up a single
database. The partitioning helps with manageability, performance, high availability, security,
operational flexibility, and scalability.
Data partitioning addresses the following scale-related issues:
High query rates exhausting the CPU capacity of the server
Larger data sets exceeding the storage capacity of a single machine
Working set sizes larger than the system’s RAM, thus stressing the I/O capacity of disk drives
You can use the following strategies for database partitioning:
Horizontal Partitioning (Sharding): Each partition is a separate data store, but all partitions
have the same schema. Each partition is known as a shard and holds a subset of data.
Vertical Partitioning: Each partition holds a subset of the fields for items in the data store.
These fields are divided according to how you access the data.
Functional Partitioning: Data are aggregated according to how each bounded context in the
system uses it.
You can combine multiple strategies in your application. For example, you can apply
horizontal partitioning for high availability and use a vertical partitioning strategy to store based
on data access.
The database, either RDBMS or NoSQL, provides different criteria to shard the database.
These criteria are as follows:
Range or interval partitioning
List partitioning
Round-robin partitioning
Hash partitioning
Round-robin partitioning is a data partitioning strategy used in distributed computing
systems. In this strategy, data is divided into equal-sized partitions or chunks and assigned to
different nodes in a round-robin fashion. It distributes the rows of a table among the nodes. In
range, list, and hash partitioning, an attribute “partitioning key” must be chosen from among the
table attributes. The partition of the table rows is based on the value of the partitioning key. For
example, if there are three nodes and 150 records to be partitioned into chunks of 50 records each, the first chunk is assigned to the first node, the second chunk to the second node, and so on. Once each node has been assigned a chunk of data, the assignment starts again from the beginning, with the next chunk going to the first node, and so on.
Range partitioning is a partitioning strategy where data is partitioned based on a specific
range of values. For example, suppose you have a large data set of patient records, and you want to partition the data based on age group. To do this, first you determine the minimum and maximum ages in the data set and then divide the range of ages into equal intervals, each representing a partition.
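The following Python sketch illustrates hash and range partitioning of records across a fixed number of partitions. The record layout, partition count, and age boundaries are illustrative assumptions.

# A minimal sketch of hash and range partitioning strategies.
import hashlib

NUM_PARTITIONS = 3

def hash_partition(key: str) -> int:
    """Assign a record to a partition by hashing its partitioning key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def range_partition(age: int) -> int:
    """Assign a patient record to a partition by age range."""
    if age < 30:
        return 0
    if age < 60:
        return 1
    return 2

patients = [{"id": "P1", "age": 25}, {"id": "P2", "age": 47}, {"id": "P3", "age": 71}]
for p in patients:
    print(p["id"], "hash->", hash_partition(p["id"]), "range->", range_partition(p["age"]))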

Data Replication
Data replication is the process of copying data from one location to another location. The two
locations are generally located on different servers. This kind of distribution satisfies the failover
and fault tolerance characteristics.
Replication can serve many nonfunctional requirements, such as the following:
Scalability: Can handle higher query throughput than a single machine can handle
High Availability: Keeping the system running even when one or more nodes go down
Disconnected Operations: Allowing an application to continue working when there is a
network problem
Latency: Placing data geographically closer to users so that users can interact with the data
faster
In some cases, replication can provide increased read capacity as the client can send read
operations to different servers. Maintaining copies of data in different nodes and different
availability zones can increase the data locality and availability of the distributed application.
You can also maintain additional copies for dedicated purposes, such as disaster recovery, reporting, or backup.
There are two types of replication:
Leader-based or leader-follower replication
Quorum-based replication
These two types of replication support full data replication, partial data replication, master-
slave replication, and multi-master replication.4

Stream Processing
Stream processing is the real-time processing of data streams. A stream is a continuous flow of
data that is generated by a variety of sources, such as social media, medical data, sensors, and
financial transactions.
Stream processing helps consumers query continuous data streams to detect conditions quickly, in near real-time mode instead of batch mode (for example, in payment processing, the AML (Anti-Money Laundering) system raises an alert if it finds anomalies in transactions). How a condition is detected varies depending on the type of source and the use case.
There are several approaches to stream processing, including stream processing application
frameworks, application engines, and platforms. Stream processing allows applications to
exploit a limited form of parallel processing more easily. The application that supports stream
processing can manage multiple computational units without explicitly managing allocation,
synchronization, or communication among those units. The stream processing pattern simplifies
parallel software and hardware by restricting the parallel computations that can be performed.
Stream processing operates on data through ingestion, aggregation, transformation, enrichment, and analytics.
As shown in Figure 2-1, for each input source, the stream processing engine operates in real
time on the data source and provides output in the target database.

Figure 2-1 Stream processing

The output is delivered to a streaming analytics application and added to the output streams.
The stream processing pattern addresses many challenges in the modern architecture of
real-time analytics and event-driven applications, such as the following:
Stream processing can handle data volumes much larger than traditional batch data processing systems can handle.
Stream processing easily models the continuous flow of data.
Stream processing decentralizes and decouples the infrastructure.
The typical use cases of stream processing will be examined next.

Social Media Data Use Case


Let’s consider real-time sentiment analysis. Sentiment analysis is the process of analyzing text, video, and other data to determine the attitude expressed in it. Consider an e-commerce platform that sells smartphones. The company wants to monitor public opinion about different smartphone brands on social media and respond quickly to any negative feedback or complaints. To do this, you set up a stream processing pipeline to continuously monitor social media platforms for mentions of the various brands. The system can use natural language processing (NLP) techniques to perform sentiment analysis on the text of the posts and classify each mention as positive, negative, or neutral.
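A toy sketch of this classification step in R is shown below; the sample posts and the tiny word lists are invented, and a real pipeline would use a proper NLP library or a trained sentiment model rather than simple word counting.

# hypothetical stream of social media posts mentioning the brand
posts <- c("I loved the new phone, great camera",
           "Battery life is terrible, very disappointed",
           "Just bought the phone today")

positive_words <- c("loved", "great", "excellent", "good")
negative_words <- c("terrible", "disappointed", "bad", "hated")

classify_post <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

sapply(posts, classify_post)   # positive, negative, neutral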

IoT Sensors
Stream processing is used for real-time analysis of data generated by IoT sensors. Consider a boiler at a chemical plant with a network of IoT sensors that monitor environmental conditions such as temperature and humidity. The company wants to use the data generated by the boiler to optimize the chemical process and detect potential issues in real time. To do this, you need stream processing to continuously analyze the sensor data. The system uses ML algorithms to detect anomalies or patterns in the data and triggers alerts when certain thresholds are met.
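The sketch below illustrates one simple anomaly check in R, a rolling z-score over temperature readings; the readings, window size, and threshold are hypothetical, and a production system would run such logic inside a streaming engine with more robust models.

# hypothetical stream of boiler temperature readings (degrees C)
temps <- c(81, 80, 82, 81, 83, 82, 95, 82, 81)

window    <- 5    # number of recent readings used as the baseline
threshold <- 3    # alert when a reading deviates more than 3 standard deviations

for (i in (window + 1):length(temps)) {
  baseline <- temps[(i - window):(i - 1)]
  z <- (temps[i] - mean(baseline)) / sd(baseline)
  if (abs(z) > threshold) {
    cat(sprintf("ALERT: reading %d (%.1f C) is %.1f sd from the baseline\n",
                i, temps[i], z))
  }
}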

Geospatial Data Processing


Stream processing is also used for real-time analysis of data generated by GPS tracking. Consider a shipping company that wants to optimize its fleet management operations by tracking the location and status of each container in real time. You can use GPS trackers on the containers to collect location data, which is then streamed to a stream processing system for real-time analysis.
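As a small illustration, the R sketch below computes the great-circle (haversine) distance between a streamed container position and a port and flags arrival inside a geofence; the coordinates and the 5 km radius are made up for the example.

# haversine distance in kilometers between two latitude/longitude points
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

port      <- c(lat = 1.2644, lon = 103.8400)   # hypothetical port location
container <- c(lat = 1.2700, lon = 103.8500)   # latest streamed GPS fix

dist_km <- haversine_km(container["lat"], container["lon"], port["lat"], port["lon"])
if (dist_km < 5) cat("Container is inside the 5 km port geofence\n")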

Change Data Capture (CDC)


CDC is a replication solution that captures database changes as they happen and delivers them to a target database. CDC fits well in modern cloud architectures because this pattern is a highly efficient way to move data from the source to the target, which helps generate real-time analytics. CDC is a part of the ETL (extract, transform, and load) process, where data are extracted from the source, transformed, and loaded into target resources, such as a data lake or a data warehouse, as shown in Figure 2-2.

Figure 2-2 Change data capture (CDC)

CDC extracts data in real time as the source changes. The data are sent to a staging area before being loaded into the data warehouse or the data lake. In this process, data transformation occurs in chunks. The load process places data into the target source, where it can be analyzed with the help of algorithms and BI (Business Intelligence) tools.
There are many techniques available to implement CDC depending on the nature of your
implementation. They include the following:
Timestamp: The Timestamp column in a table represents the time of the last change; any
data changes in a row can be identified with the timestamp.
Version Number: The Version Number column in a table represents the version of the last
change; all data with the latest version number are considered to have changed.
Triggers: Write a trigger for each table; the triggers in a table log events that happen to the
table.
Log-based: Databases store all changes in a transaction log to recover the committed state of
the database. CDC reads the changes in the log, identifies the modification, and publishes an
event.
The preferred approach is the log-based technique. Today, many databases offer a stream of data-change logs and expose the changes as events.5
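To make the simplest of these techniques concrete, the R sketch below emulates timestamp-based CDC on an in-memory table; the orders table, its last_updated column, and the checkpoint value are hypothetical, and log-based CDC would instead read the database’s transaction log through a connector.

# hypothetical source table with a last_updated timestamp column
orders <- data.frame(
  order_id     = 1:4,
  amount       = c(120, 85, 300, 42),
  last_updated = as.POSIXct(c("2023-05-01 10:00:00", "2023-05-01 12:30:00",
                              "2023-05-02 09:15:00", "2023-05-02 18:45:00"))
)

# timestamp of the previous sync (the CDC checkpoint)
last_checkpoint <- as.POSIXct("2023-05-02 00:00:00")

# capture only rows changed since the last sync, then advance the checkpoint
changed_rows    <- orders[orders$last_updated > last_checkpoint, ]
last_checkpoint <- max(orders$last_updated)

changed_rows   # rows to deliver to the target (data lake or warehouse)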

Data Mesh
To keep pace with modern data-driven business disruption, you need the right set of technologies in place. Organizations are implementing data lake and data warehouse strategies for datafication, which is a sound approach. Nevertheless, these implementations have limitations, such as the centralization of domains and domain ownership. Rather than concentrating all domains centrally, you need a decentralized approach. To implement decentralization, you can adopt the data mesh concept, which provides a new way to address these common problems.
The data mesh helps create a decentralized data governance model where teams are
responsible for the end-to-end ownership of data, from its creation to consumption. This
ownership includes defining data standards, creating data lineages, and ensuring that the data
are accurate, complete, and accessible.
The goal of the data mesh is to enable organizations to create a consistent and trusted data infrastructure by breaking monolithic data lakes and silos into smaller, more decentralized, domain-oriented parts, as shown in Figure 2-3.6
Figure 2-3 Data mesh architecture
To implement the data mesh, the following principles must be considered:
Domain-oriented decentralized data ownership and architecture
Data as a product
Self-service infrastructure as a platform
Federated computational governance

Machine Learning Patterns


Building a production-grade machine learning (ML) model is a discipline that takes ML methods proven in engineering practice and applies them to day-to-day business problems. In datafication, a data scientist should take advantage of these tried and proven methods to address recurring issues. The following are a few essential patterns that can help you build effective datafication solutions.

Hashed Feature
The hashed feature is a technique used to reduce the dimensionality of input data while
maintaining the degree of accuracy in the model’s predictions. In this pattern, the original input
features are first hashed into a smaller set of features using a hash function.
The hash function maps input data of arbitrary size to a fixed-size output. The output is typically a string of characters or a sequence of bits that represents the input data in a compact way, and a good hash function makes it unlikely that two different inputs produce the same hash value (a collision).
The hashed feature is useful when you are working with high-dimensional input data because it helps you reduce the computational and memory requirements of the model while still achieving good accuracy.
The hashed feature component in ML transforms a stream of English text into a set of integer
values, as shown in Table 2-1. You can then pass this hashed feature set to an ML algorithm to
train a text analytics model.

Table 2-1 English Text to Integer Values

Comments Sentiment
I loved this restaurant. 3
I hated this restaurant. 1
This restaurant was excellent. 3
The taste is good, but the ambience is average. 2
The restaurant is a good but too crowded place. 2
Internally, the hashing component creates a dictionary, an example of which is shown in
Table 2-2.

Table 2-2 Dictionary

N-gram Value
This restaurant 3
I loved 1
I hated 1
I love 1
Ambience 2

The hashing feature transforms each categorical input value into an integer index by applying a hash function, which maps the input to a fixed-size output. The resulting hash value can be a positive or negative integer, depending on the input data. To convert the hash value to a positive index, the hashed feature takes the absolute value of the hash value, which ensures the resulting index is always positive.
For example, suppose you have a categorical feature “fruit” with the values “orange”, “apple”, and “watermelon.” You apply a hash function to each value to generate a hash value, which can be a positive or negative integer. You then take the absolute value of the hash value and use the modulo operator to map the result to a fixed range of values. Suppose you want to use 100 buckets; you take the modulo of the absolute hash value with 100 to get an index in the range [0, 99]. If the hash value is negative, taking the absolute value ensures that the resulting index is still in the range [0, 99].
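A minimal R sketch of this bucketing idea follows; the toy hash (a sum of character codes) merely stands in for a real hashing function such as MurmurHash, and the fruit values and bucket count are just examples.

# toy hash: sum of character codes; real systems use MurmurHash or similar
toy_hash <- function(value) sum(utf8ToInt(value))

n_buckets   <- 100
hash_bucket <- function(value) abs(toy_hash(value)) %% n_buckets

fruits <- c("orange", "apple", "watermelon")
sapply(fruits, hash_bucket)   # each value maps to a bucket index in [0, 99]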
Using a hash function makes it possible to handle large categorical input data with high cardinality. By using a hash function to map the categorical values to a small number of hash buckets, you can reduce the dimensionality of the input space and improve the efficiency of the model. However, using too few hash buckets can lead to collisions, where different input values are mapped to the same hash bucket, resulting in loss of information. Therefore, it is important to choose a good hash function and a suitable number of buckets to minimize collisions and ensure that the resulting feature vectors are accurate.
If a new restaurant opened in your area and its team launched a campaign on social media, but no historical data existed for this restaurant, the hashing feature could still be used to make predictions. The new restaurant can be hashed into one of the existing hash buckets based on its characteristics, and the prediction for that hash bucket can be used as a proxy for the new restaurant.

Embeddings
Embeddings are a learnable data representation that maps high-cardinality data to a low-dimensional space in a way that preserves the information relevant to the learning problem. These embeddings help identify the properties of the input features related to the output label, because the input feature’s data representation directly affects the final model’s quality. While handling structured numeric fields is straightforward, disparate data such as video, images, text, and audio require meaningful numerical representations to train models; this pattern helps you handle these disparate data types.
One-hot encoding is a common way to represent categorical input variables. Nevertheless, for disparate data, one-hot encoding of high-cardinality categorical features leads to a sparse matrix that does not work well with ML models, and it treats categorical values as independent, so it cannot capture the relationships between different values.
Embeddings solve the problem by representing high-cardinality data densely in a lower
dimension by passing the input data through an embedding layer that has trainable weights.
This helps capture close relationships between high-dimensional variables in a lower-
dimensional space. The weights to create the dense representation are learned as part of the
optimization model.
The tradeoff of this model is the loss of information involved in moving from a high-
cardinality representation to a low-dimensional representation.
The embedding design pattern can be used in text embeddings in classification problems
based on text inputs, image embeddings, contextual language models, and training an
autoencoder for image embedding where the feature and the label are the same.
Let us take the same restaurant example: you have a hundred thousand diverse restaurants and perhaps ten thousand users. In this example, the goal is to recommend restaurants to users based on their preferences.
Input: 100,000 restaurants and 10,000 users; task: recommend restaurants to users.
I have put the restaurants’ names in order, with Asian restaurants to the left, African restaurants to the right, and the rest in the center, as shown in Figure 2-4.

Figure 2-4 List of restaurants

Here I have considered only the dimension that separates restaurants by cuisine; there are many other dimensions you could consider, like vegetarian, dessert, coffee, and so on.
Let us add restaurants to the x-axis and y-axis as shown in Figure 2-5, with the x-axis covering Asian and African restaurants and the y-axis covering European and Latin American restaurants.
Figure 2-5 Restaurants along x-axis and y-axis
The similarity between restaurants is now captured by how close these points are. Here, I am representing only two dimensions. Two dimensions may not be enough; a third dimension, or more, may be needed to capture everything about the restaurants.
d-Dimensional Embeddings: Assume user interest in restaurants can be roughly explained by d aspects; each restaurant then becomes a d-dimensional point, where the value in each dimension represents how much the restaurant fits that aspect, and the embeddings can be learned from data.
Learning Embeddings in a Deep Network: No separate training process is needed. The embedding layer is just a hidden layer; supervised information (e.g., users went to the same restaurant twice) tailors the learned embeddings for the desired task, and the hidden units discover how to organize the items in d-dimensional space in a way that best optimizes the final objective.
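A rough sketch of such an embedding layer, here using the keras package in R (an assumption for illustration, not something this chapter prescribes), is shown below; the vocabulary size, embedding dimension, and output task are illustrative.

library(keras)

n_restaurants <- 100000   # size of the restaurant vocabulary
embedding_dim <- 16       # d, the number of learned aspects

# the embedding layer is just a hidden layer trained jointly with the network
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = n_restaurants, output_dim = embedding_dim,
                  input_length = 1) %>%
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")   # e.g., will the user revisit?

model %>% compile(optimizer = "adam", loss = "binary_crossentropy")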
If you want restaurant recommendations, you want these embeddings tailored toward that recommendation task. The matrix shown in Figure 2-6 is a classic collaborative-filtering style of input: there is one row for each user and one column for each restaurant, and a mark in a cell indicates that the user has visited that restaurant.

Figure 2-6 Restaurants matrix

Input Representation:
For this example, you need to build a dictionary mapping each feature to an integer from 0 to the number of restaurants minus 1.
Efficiently represent the sparse vector as just the restaurants at which the user dined; this representation is shown in the figure.
You can use these input data to identify ratings based on where users dined and provide recommendations.
Selecting How Many Embedding Dimensions:
Higher-dimensional embeddings can more accurately represent the relationships between input values.
However, having more dimensions increases the chance of overfitting and leads to slower training.
Embeddings can be applied to social media data, video, audio, text, images, and so on.

Feature Cross
The feature cross pattern combines multiple features to create new, composite features that can capture complex interactions between the original features. This increases the representational power of the model by introducing non-linear relationships between features.
The feature cross pattern can be applied to neural networks, decision trees, linear
regression, and support vector machines.
There are different ways to implement the feature cross pattern, depending on the algorithm and the problem at hand.
The first approach is manually creating new features by combining multiple existing features using mathematical operations. For example, in a credit risk prediction task, you might multiply the applicant’s income by their credit score. The disadvantage of this approach is that it is time-consuming.
The second approach is automating feature engineering by using algorithms to automatically generate new features from the existing ones. For example, an AutoML library can be used to generate new features based on relationships between features in each set. The disadvantage of this approach is that a large number of generated features can be difficult to interpret.
The third approach is neural network–based and uses neural networks to learn the optimal
feature interactions directly from the data. For example, a deep neural network may include
multiple layers that combine the input features in a non-linear way to create new, higher-level
features.
The benefits of using feature cross patterns are as follows:
They improve model performance by capturing complex interactions between features that
may not be captured by individual features alone.
They can use automation to generate new features from existing ones, and reduce the need for
manual feature engineering.
They help to improve the interpretability of the model by capturing meaningful interactions
between features.
The drawbacks of feature cross patterns are as follows:
Feature crosses can increase the number of features in the model, which can lead to the curse
of dimensionality and overfitting.
Feature crosses can increase the computational complexity of the model, making it harder to
train and scale.
Let’s consider a simple example of using a feature cross for linear regression in R with the caret package. The sketch that follows is illustrative: the credit data frame and its columns (income, credit_score, risk_score) are hypothetical, and the formula term income * credit_score expands to the two base features plus their interaction, which is the crossed feature.

library(caret)
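
# hypothetical credit-risk data
set.seed(42)
credit <- data.frame(
  income       = runif(500, 20000, 120000),
  credit_score = runif(500, 300, 850)
)
# synthetic target that depends on the income x credit_score interaction
credit$risk_score <- 2e-5 * credit$income + 0.01 * credit$credit_score +
  1e-8 * credit$income * credit$credit_score + rnorm(500)

# income * credit_score expands to income + credit_score + income:credit_score
model <- train(risk_score ~ income * credit_score,
               data = credit, method = "lm")
summary(model$finalModel)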