Introduction to Datafication
Implement Datafication Using AI and ML
Algorithms
Shivakumar R. Goniwada
Gubbalala, Bangalore, Karnataka, India
The publisher, the authors, and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
1. Introduction to Datafication
Shivakumar R. Goniwada1
(1) Mantri Tranquil, #I-606, Gubbalala, Bangalore, Karnataka, India
What Is Datafication?
Datafication involves using digital technologies such as the cloud, data
products, and AI/ML algorithms to collect and process vast amounts of
data on human behavior, preferences, and activities.
Datafication converts various forms of information, such as texts,
images, audio recordings, comments, claps, and likes/dislikes, into a
curated format that can be easily analyzed and processed by multiple
algorithms. This involves extracting relevant data from sources such as
social media, hospitals, and the Internet of Things (IoT). The data are
organized into a consistent format and stored in a way that makes them
accessible for further analysis.
Everything around us, from finance, medicine, construction, and
social media to industrial equipment, is converted into data. For
example, you create data every time you post to social media platforms
such as WhatsApp, Instagram, Twitter, or Facebook, any time you
join meetings on Zoom or Google Meet, and even when you walk past a
CCTV camera while crossing the street. Datafication differs from
digitization; it is a much broader notion.
Datafication can help you understand the world more fully than
ever before. New cloud technologies are available to ingest, store,
process, and analyze data. For example, marketing companies use
Facebook and Twitter data to determine and predict sales, and a digital
twin uses data from industrial equipment to analyze the machine's
behavior.
Datafication also raises important questions about privacy, security,
and ethics. The collection and use of personal data can infringe on
individual rights and privacy, and there is a need for greater
transparency and accountability in how data are collected and used.
Overall, datafication represents a significant shift in how we live, work,
and act.
Datafication Steps
For datafication, as defined by DAMA (the Data Management Association),
you must have a clear set of data, well-defined analysis models, and
computing power. To obtain a precise collection of data, relevant
models, and the required computing power, follow these steps (a minimal
end-to-end sketch appears after the list):
Data Harvesting: This step involves obtaining data in a real-time
and reliable way from various sources, such as databases, sensors,
files, etc.
Data Curation: This step involves organizing and cleaning the data
to prepare it for analysis. You need to ensure that the data collected
are accurate by removing errors, inconsistencies, and duplicates with
a standardized format.
Data Transformation: This step involves converting data into a
suitable format for analysis. This step helps you transform the data
into a specific form, such as dimensional and graph models.
Data Storage: This step involves storing the data after
transformation in storage, such as a data lake or data warehouse, for
further analysis.
Data Analysis: This step involves using statistical and analytical
techniques to gain insights from data and identify trends, patterns,
and correlations in the data that help with predictions and
recommendations.
Data Dissemination: This step involves sharing the dashboards,
reports, and presentations with relevant stakeholders.
Cloud Computing: This step provides the necessary infrastructure
and tools for the preceding steps.
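To make these steps concrete, here is a minimal end-to-end sketch in base R. The data frame, column names, and aggregation are illustrative assumptions, not part of any particular toolchain; in practice the harvested extract would come from a file, a database, or an API.

# Minimal sketch of the datafication steps in base R. The data frame stands in
# for a harvested extract; in practice it would come from read.csv() or an API.
raw <- data.frame(                                  # data harvesting (simulated)
  user_id   = c(1, 1, 2, 2, 2),
  post_date = c("2023-01-05", "2023-01-05", "2023-02-10", "2023-03-01", NA),
  likes     = c(12, 12, 40, 7, 3)
)

curated <- unique(na.omit(raw))                     # data curation: drop duplicates and missing rows
curated$post_date <- as.Date(curated$post_date)     # standardize formats

curated$month <- format(curated$post_date, "%Y-%m") # data transformation: derive an analysis variable
monthly <- aggregate(likes ~ user_id + month, data = curated, FUN = sum)

store <- file.path(tempdir(), "curated_store.csv")  # data storage: persist for later analysis
write.csv(monthly, store, row.names = FALSE)

summary(monthly$likes)                              # data analysis: simple descriptive statistics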
Elements of Datafication
As defined by DAMA, Figure 1-1 illustrates the seven critical elements
of the datafication architecture used to develop the datafication
process. Datafication will be successful only if these elements are
addressed.
Figure 1-1 Data elements
Data Harvesting
Data harvesting is extracting data from a given source, such as social
media, IoT devices, or various other data sources.
Before harvesting any data, you need to analyze it to identify the
source and software tools needed for harvesting.
First, harvested data may be of little value if it is inaccurate, biased,
confidential, or irrelevant; when those issues are addressed, harvested
information can be more objective and reliable than traditional data
sources. However, the disadvantage is that it is difficult to know users'
demographic and psychological attributes for social media data sources.
Second, harvesting must be automatic, real-time, and streaming-capable,
and it must handle large-scale data sources efficiently.
Third, the data are usually fine-grained and available in real time.
Text mining techniques are used to preprocess raw text, and image and
video processing techniques are used to preprocess photos and videos for
further analysis.
Fourth, the data can be ingested in real time or in batches. In real-time
ingestion, each data item is imported as soon as the source changes it.
In batch ingestion, the data elements are imported in discrete chunks at
periodic intervals.
Various data harvesting methods can be used depending on the data
source type, as follows:
IoT devices typically involve collecting data from IoT sensors and
devices using protocols such as MQTT, CoAP, HTTP, and AMQP.
Social media platforms such as Facebook, Twitter, LinkedIn,
Instagram, and others are harvested through REST APIs, streaming
endpoints, webhooks, and GraphQL (a small REST-based sketch follows this
list).
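As a small illustration of the REST-based harvesting mentioned in the previous item, the following sketch pulls JSON from an HTTP endpoint with the httr and jsonlite packages and flattens it into a data frame. The endpoint URL, token, and query parameters are placeholders; real platforms require their own authentication, pagination, and rate-limit handling.

# Hypothetical REST harvesting sketch; the URL and token are placeholders.
library(httr)
library(jsonlite)

resp <- GET("https://api.example.com/v1/posts",
            add_headers(Authorization = "Bearer <token>"),
            query = list(since = "2023-01-01", limit = 100))
stop_for_status(resp)                                  # fail fast on HTTP errors

posts <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
posts_df <- as.data.frame(posts)                       # flatten for downstream curation
head(posts_df)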
Data Curation
Data curation organizes and manages the data collected through ingestion
from various sources so that it is accessible and usable for data
analysis. It involves cleaning and filtering data, removing duplicates
and errors, and properly labeling and annotating data.
Data curation is essential for ensuring that the data are accurate,
consistent, and reliable, which is crucial for data analysis.
The following are the main steps involved in data curation:
Data Cleaning: Once the data is harvested, it must be cleaned to
remove errors, inconsistencies, and duplicates. This involves
removing missing values, correcting spelling errors, and
standardizing data formats.
Data Transformation: After the data has been cleaned, it needs to
be transformed into a format suitable for analysis. This involves
aggregating data, creating new variables, and so forth. For example,
you might have a data set of pathology reports with variables such as
patient ID, date of visit, test ID, test description, and test results.
You want to transform this data set into a format that shows each
patient's overall health condition. To do this, you alter the harvested
data with transformations such as creating a new variable for test
category, aggregating test data per patient per year, and summarizing
data by group (e.g., hemoglobin).
Data Labeling: Annotating data with relevant metadata, such as
variable names and data descriptions.
Data Quality Test: In this step, you ensure the data is accurate and
reliable by applying checks such as statistical validation.
The overall objective of data curation is to reduce the time it takes
to obtain insight from raw data by organizing and bringing relevant
information together for further analysis.
The steps involved in data curation are organizing and cataloging
data, ensuring data quality, preserving data, and providing access to
data.
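The following is a minimal base R sketch of the cleaning, transformation, labeling, and quality-test steps described above, applied to a made-up pathology extract; the column names and the range check are illustrative assumptions.

# Hypothetical pathology extract; column names are assumptions for illustration.
labs <- data.frame(
  patient_id = c(101, 101, 102, 102, 102),
  visit_date = c("2023-01-10", "2023-01-10", "2023-02-01", "2023-03-15", "2023-03-15"),
  test_desc  = c("Hemoglobin", "Hemoglobin", "hemoglobin ", "Glucose", "Glucose"),
  result     = c(13.5, 13.5, 12.1, NA, 5.4)
)

# Data cleaning: drop duplicates and missing results, standardize text formats
labs <- unique(labs)
labs <- labs[!is.na(labs$result), ]
labs$test_desc <- trimws(tolower(labs$test_desc))

# Data transformation: aggregate per patient, per year, and per test group
labs$year <- format(as.Date(labs$visit_date), "%Y")
per_patient <- aggregate(result ~ patient_id + year + test_desc, data = labs, FUN = mean)

# Data labeling: attach human-readable metadata for downstream users
attr(per_patient, "description") <- "Mean lab result per patient, year, and test"

# Data quality test: a simple range check on the curated values
stopifnot(all(per_patient$result > 0))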
Data Storage
Data storage keeps digital data on physical media, such as hard drives
and solid-state drives, together with software to manage and organize
the data.
Data storage is the physical persistence layer of datafication. More
than 2.5 quintillion bytes of data are created daily, and roughly 2 MB
of data is created every second for every person. These numbers come
from users searching content on the internet, browsing social media
networks, posting blogs, photos, comments, and status updates, watching
videos, downloading images, streaming songs, and so on.
To support business decisions, the data must be stored in a way that is
easy to manage and access, and it is essential to protect the data
against cyber threats.
For IoT, the data need to be collected from sensors and devices and
stored in the cloud.
Several types of database storage exist, including relational
databases, NoSQL databases, in-memory databases, and cloud
databases. Each type of database storage has advantages and
disadvantages, but the best choice for datafication is cloud databases
that involve data lakes and warehouses.
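As a small illustration of storing curated data in a data lake, the sketch below writes a data frame to Parquet, a columnar format widely used in lake storage. It assumes the arrow package, and the folder layout is an illustrative convention, not a requirement.

# Assumes the arrow package; the lake path and file name are illustrative.
library(arrow)

curated <- data.frame(
  patient_id = c(101, 102),
  year       = c("2023", "2023"),
  result     = c(13.5, 12.1)
)

dir.create("datalake/curated", recursive = TRUE, showWarnings = FALSE)
write_parquet(curated, "datalake/curated/labs_2023.parquet")   # columnar storage

# Analysts can later read the stored data back for analysis
labs_back <- read_parquet("datalake/curated/labs_2023.parquet")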
Data Analysis
Data analysis refers to analyzing a large set of data to discover different
patterns and KPIs (Key Performance Indicators) of an organization. The
main goal of analytics is to help organizations make better business
decisions and future predictions. Advanced analytics techniques such as
machine learning models, text analytics, predictive analytics, data
mining, statistics, and natural language processing are used. With these
ML models, you uncover hidden patterns, unknown correlations,
market trends, customer preferences, feedback about your new FMCG
(Fast Moving Consumer Goods) products, and so on.
The following are the types of analytics that you can process using
ML models:
Prescriptive: This type of analytics helps decide what action should be
taken; it examines data to answer questions such as, What should be
done? or What can we do to make our product more attractive? It helps
find answers to problems such as where to focus treatment.
Predictive: This type of analytics helps predict what might happen in
the future, emphasizing the business relevance of the resulting
insights; typical use cases include forecasting sales and production
(a small sketch follows this list).
Diagnostic: This type of analytics helps to analyze past situations,
such as what went wrong and why it happened. This helps to
facilitate correction in the future; for example, weather prediction
and customer behavior.
Descriptive: This type of analytics summarizes current and historical
data, such as behavioral analysis of users.
Exploratory: This type of analytics involves visualizing the data to
discover patterns worth deeper analysis.
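Here is the predictive sketch referenced above: a linear model fitted in base R on made-up monthly advertising-spend and revenue figures, then used to predict next month's revenue. The numbers are purely illustrative.

# Illustrative predictive-analytics sketch on made-up sales data.
sales <- data.frame(
  month    = 1:12,
  ad_spend = c(10, 12, 11, 15, 14, 18, 20, 19, 22, 25, 24, 28),
  revenue  = c(110, 118, 115, 140, 138, 160, 175, 170, 190, 210, 205, 230)
)

model <- lm(revenue ~ ad_spend, data = sales)          # fit on historical data
summary(model)

# Predict next month's revenue for a planned ad spend of 30 units
predict(model, newdata = data.frame(ad_spend = 30))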
Cloud Computing
Cloud computing is the use of computing resources delivered over the
internet and has the potential to offer substantial opportunities in
various datafication scenarios. It is a flexible delivery platform for data,
computing, and other services. It can support many architectural and
development styles, from extensive, monolithic systems to sizeable
virtual machine deployments, nimble clusters of containers, a data
mesh, and large farms of serverless functions.
The primary services of cloud offerings for data storage are as
follows:
Data storage as a service
Streaming services for data ingestion
Machine learning workbench for analysis
Summary
Datafication is the process of converting various types of data and
information into a digital format that can easily be processed and
analyzed. With datafication, you can increase your organization’s
footprint by using data effectively for decision making. It helps to
improve operational efficiency and provides input to the manufacturing
hub to develop new products and services.
Overall, data curation is the key component of effective datafication,
as it ensures that the data is accurate, complete, and reliable, which is
essential for making decisions and gleaning meaningful insights.
In this chapter, I described datafication and discussed the types of
data involved in datafication, datafication steps, and datafication
elements. The next chapter provides more details on the principles,
patterns, and methodologies used to realize datafication.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
S. R. Goniwada, Introduction to Datafication
https://doi.org/10.1007/978-1-4842-9496-3_2
2. Datafication Principles and Patterns
Principles are guidelines for the design and development of a system. They reflect the level of
consensus among the various elements of your system. Without proper principles, your
architecture has no compass to guide its journey toward datafication.
Patterns are tried-and-tested solutions to common design problems, and they can be used as
a starting point for developing a datafication architecture.
The processes involved in datafication collect, analyze, and interpret vast amounts of
information from a range of sources, such as social media, Internet of Things (IoT) sensors,
and other devices. The principles and patterns underlying datafication must be understood to
ensure that it benefits all.
Patterns are reusable solutions to commonly occurring problems in software design. They
provide a template for creating designs that solve specific problems while remaining flexible
enough to adapt to different contexts and requirements.
This chapter provides an overview of the principles and patterns shaping the development of
datafication and examines the ethical implications of these technologies for society. Using
these principles and patterns, you can develop datafication projects that are fair, transparent,
and performant.
Datafication Principles
As mentioned in the previous chapter, datafication analyzes data from various sources, such as
social media, IoT, and other digital devices. For a successful and streamlined datafication
architecture, you must define principles related to data ingestion, data streaming, data quality,
data governance, data storage, data analysis, visualization, and metrics. These principles ensure
that data and analytics are used in a way that is aligned with the organization’s goals and
objectives.
Examples of datafication principles include the use of accurate and up-to-date data, the use
of a governance framework, the application of ethical standards, and the application of quality
rules.
The following are a few principles that help you design the datafication process:
Data Is an Asset
Data is an asset that has value to organizations and must be managed accordingly. Data is an
organizational resource with real, measurable value; it informs decisions, improves operations,
and drives business growth. An organization's assets are carefully managed, and data is just as
important as physical or digital assets. Quality data is the foundation of the organization's
decisions, so you must ensure that the data is harvested with quality and accuracy and is
available when needed. The techniques used to measure the value of data are directly related to
the accuracy of the decision's outcome, which in turn depends on the quality, relevance, and
reliability of the data used in the decision-making process. Common techniques are data quality
assessment, data relevance analysis, cost-benefit analysis, impact analysis, and different forms
of analytics.
Data Is Shared
Different organizational stakeholders will access the datafication data to analyze various KPIs.
Therefore, the data can be shared with relevant teams across an organization. Timely access to
accurate and cleansed data is essential to improving the quality and efficiency of an
organization’s decision-making ability. The speed of data collection, creation, transfer, and
assimilation is driven by the ability of an organization’s process and technology to capture social
media or IoT sensor data.
To enable data sharing, you must develop and abide by a common set of policies, procedures,
and standards governing data management and access in the short and long term. You should
have a clear blueprint for data sharing, and there should be no compromise of the
confidentiality and privacy of the data.
Data Trustee
Each data element in a datafication architecture has a trustee accountable for its quality. As the
degree of data sharing grows and business units within an organization rely upon information, it
becomes essential that only the data trustee makes decisions about the content of the data. In
this role, the data trustee is responsible for ensuring that the data are used in accordance with
applicable laws, regulations, and policies and are handled securely and responsibly. The specific
responsibilities of a data trustee will vary depending on the type of data being shared and the
context in which it is used.
The trustee and the steward are different roles. The trustee is responsible for the accuracy and
currency of the data, while the steward's role may be broader and include standardization and
definition tasks.
Ethical Principle
Datafication focuses on and analyzes social media, medical, and IoT data. This principle centers
on human dignity, which involves considering the potential consequences of using such data and
ensuring that it is used fairly, responsibly, and transparently. It reflects the fundamental
ethical requirement that people be treated in a way that respects their dignity and autonomy as
human individuals. When analyzing social media and medical data, we must remember that the data
also affects, represents, and touches people. Personal data is entirely different from a
machine's raw data, and the unethical use of personal data can directly influence people's
interactions, place in the community, personal product usage, and so on. You should consider the
various laws across the globe to meet ethics requirements while designing your system.
There are various laws in place globally; here are a few:
GDPR Principles (Privacy): Its focus is protecting, collecting, and managing personal data;
i.e., data about individuals. It applies to all companies and organizations in the EU and to
companies outside of Europe that hold or otherwise process personal data. The following are a few
guidelines from the GDPR. For more details, refer to https://gdpr-info.eu/:
Fairness, Lawfulness, Transparency: Personal data shall be processed lawfully, fairly, and
transparently about the data subject.
Purpose Limitation: Personal data must be collected for specified, explicit, and legitimate
purposes and not processed in an incompatible manner.
Data Minimization: Personal data must be adequate, relevant, and limited to what is
necessary for the purposes for which they are processed.
Accuracy: Personal data must be accurate and, where necessary, kept up to date.
Integrity and Confidentiality: Data must be processed with appropriate security of the
personal data, including protection against unauthorized and unlawful processing.
Accountability: Data controllers are responsible for, and must be able to demonstrate, compliance with these principles.
PIPEDA (Personal Information Protection and Electronic Documents Act): This applies
to every organization that collects, uses, and disseminates personal information. The following
are the statutory obligations of PIPEDA; for more information, visit
https://www.priv.gc.ca/:
Accountability: Organizations are responsible for personal information under their control and
must designate an individual accountable for compliance.
Identifying Purpose: You must specify the purpose for which personal information is
collected.
Consent: You must obtain the knowledge and consent of the individual for the collection.
Accuracy: Personal information must be accurate, complete, and up to date.
Safeguards: You must protect personal information.
Human Rights and Technology Act: The U.K. government proposed this act. It would
require companies to conduct due diligence to ensure that their datafication system does not
violate human rights and to report any risk or harm associated with the technology. You can find
more information at https://www.equalityhumanrights.com/. The following are a few
guidelines:
Human Rights Impact Assessment: Conduct a human rights impact assessment before
launching new services.
Transparency and Accountability: You must disclose information about technology services,
including how you collect the data and the algorithms you use to make decisions affecting
individual rights.
Universal Guidelines for AI: This framework provides a set of guidelines for AI/ML and was
put forward by The Public Voice coalition. These guidelines include transparency,
accountability, and safety. You can find more information at
https://thepublicvoice.org/ai-universal-guidelines/. The following are a few of the
guidelines:
Transparency: AI should be transparent in its decision-making process, and the data and
algorithms used in AI should be open and explainable.
Safety and Well-being: AI systems should be designed to ensure the safety and well-being of
individuals and society.
Each country has its own laws, and we suggest reviewing the applicable laws and compliance
requirements before processing any data for analysis.
Datafication Patterns
Datafication is the process of converting aspects of life that were previously invisible into
digital data that can be analyzed and used for decision making. As I explained in Chapter 1,
"Introduction to Datafication," datafication has become increasingly prevalent in recent years,
as advances in technology have made it easier to collect, store, and analyze large amounts of data.
Datafication patterns are the common approaches and techniques used in the process of
datafication. These patterns involve the use of various technologies and methods, such as
digitization, aggregation, visualization, AI, and ML, to convert data into useful insights.
By understanding these patterns, you can effectively collect, store, analyze, and use data to
drive decision making and gain a competitive edge; by leveraging them, you can also optimize
storage and operations.
Each solution is stated so that it describes the essential relationships needed to solve the
problem, but in a general and abstract way, so that you can solve the problem for yourself by
adapting the solution to your own preferences and conditions.
Patterns have the following characteristics:
They can be seen as building blocks of more complex solutions.
They function as a common language used by technology architects and designers to describe
solutions.
Data Replication
Data replication is the process of copying data from one location to another; the two
locations are generally on different servers. This kind of distribution supports failover
and fault tolerance.
Replication can serve many nonfunctional requirements, such as the following:
Scalability: Can handle higher query throughput than a single machine can handle
High Availability: Keeping the system running even when one or more nodes go down
Disconnected Operations: Allowing an application to continue working when there is a
network problem
Latency: Placing data geographically closer to users so that users can interact with the data
faster
In some cases, replication can provide increased read capacity as the client can send read
operations to different servers. Maintaining copies of data in different nodes and different
availability zones can increase the data locality and availability of the distributed application.
You can also maintain additional copies for dedicated purposes, such as disaster recovery,
reporting, or backup.
There are two types of replication:
Leader-based or leader-follower replication
Quorum-based replication
These two types of replication support full data replication, partial data replication, master-
slave replication, and multi-master replication.
Stream Processing
Stream processing is the real-time processing of data streams. A stream is a continuous flow of
data that is generated by a variety of sources, such as social media, medical data, sensors, and
financial transactions.
Stream processing helps consumers query continuous data streams to detect conditions quickly,
in near real time rather than in batch mode (for example, in payment processing, the AML
(Anti-Money Laundering) system raises an alert if it finds anomalies in transactions). How the
condition is detected varies depending on the type of source and the use case.
There are several approaches to stream processing, including stream processing application
frameworks, application engines, and platforms. Stream processing allows applications to
exploit a limited form of parallel processing more easily. The application that supports stream
processing can manage multiple computational units without explicitly managing allocation,
synchronization, or communication among those units. The stream processing pattern simplifies
parallel software and hardware by restricting the parallel computations that can be performed.
Stream processing operates on data through ingestion, aggregation, transformation, enrichment,
and analytics.
As shown in Figure 2-1, for each input source, the stream processing engine operates in real
time on the data source and provides output in the target database.
The output is delivered to a streaming analytics application and added to the output streams.
The stream processing pattern addresses many challenges in the modern architecture of
real-time analytics and event-driven applications, such as the following:
Stream processing can handle data volumes much larger than other data processing
systems can handle.
Stream processing easily models the continuous flow of data.
Stream processing decentralizes and decouples the infrastructure.
The typical use cases of stream processing will be examined next.
IoT Sensors
Stream processing is used for real-time analysis of data generated by IoT sensors. Consider
a boiler at a chemical plant with a network of IoT sensors that monitor environmental
conditions such as temperature and humidity. The company wants to use the data generated by
the boiler to optimize the chemical process and detect potential issues in real time. To do
this, you need stream processing to continuously analyze the sensor data. The system uses ML
algorithms to detect anomalies or patterns in the data and triggers alerts when certain
thresholds are met (a minimal sketch follows).
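A minimal sketch of this idea follows: a rolling z-score check over simulated boiler temperature readings that raises an alert when a threshold is crossed. The window size and threshold are illustrative assumptions, and a production system would consume a live stream rather than an in-memory vector.

# Simulated boiler temperature stream with an injected spike (illustrative).
set.seed(42)
temps <- c(rnorm(200, mean = 80, sd = 1.5), 95, rnorm(20, mean = 80, sd = 1.5))

window <- 50       # rolling window of recent readings (assumption)
threshold <- 4     # alert when a reading is more than 4 sd from the window mean

for (i in (window + 1):length(temps)) {
  recent <- temps[(i - window):(i - 1)]
  z <- (temps[i] - mean(recent)) / sd(recent)
  if (abs(z) > threshold) {
    message(sprintf("ALERT: reading %d = %.1f C (z = %.1f)", i, temps[i], z))
  }
}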
Change Data Capture
Change Data Capture (CDC) extracts data in real time from the source system as it changes. The
data are sent to a staging area before being loaded into the data warehouse or the data lake.
In this process, data transformation occurs in chunks. The load process places the data into
the target store, where it can be analyzed with the help of algorithms and BI (Business
Intelligence) tools.
There are many techniques available to implement CDC depending on the nature of your
implementation. They include the following:
Timestamp: A timestamp column in a table records the time of the last change, so any changed
rows can be identified by their timestamps (see the sketch after this list).
Version Number: The Version Number column in a table represents the version of the last
change; all data with the latest version number are considered to have changed.
Triggers: Write a trigger for each table; the triggers in a table log events that happen to the
table.
Log-based: Databases store all changes in a transaction log to recover the committed state of
the database. CDC reads the changes in the log, identifies the modification, and publishes an
event.
The preferred approach is the log-based technique. Today, many databases offer a stream of
data-change logs and expose them as events.
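The following sketch illustrates the simplest of these techniques, the timestamp approach, using the DBI package against an in-memory SQLite table that stands in for the source system. The table, column names, and watermark are illustrative assumptions; log-based CDC would instead be configured in the database or in a dedicated CDC tool.

# Timestamp-based CDC sketch; the orders table and updated_at column are
# illustrative stand-ins for a real source system.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")      # stand-in for the source database
dbWriteTable(con, "orders", data.frame(
  order_id   = 1:4,
  amount     = c(120, 75, 310, 42),
  updated_at = c("2023-05-20 10:00:00", "2023-06-02 09:30:00",
                 "2023-06-03 14:45:00", "2023-05-28 16:10:00")
))

last_sync <- "2023-06-01 00:00:00"                   # watermark from the previous run
changed <- dbGetQuery(con,
  "SELECT * FROM orders WHERE updated_at > ?",
  params = list(last_sync))

print(changed)                                       # only rows changed since the watermark
new_watermark <- max(changed$updated_at)             # carry forward for the next run
dbDisconnect(con)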
Data Mesh
Modern, data-driven business disruption requires the right set of technologies to support it.
Organizations are implementing data lake and data warehouse strategies for datafication, which
is good thinking. Nevertheless, these implementations have limitations, such as the
centralization of domains and domain ownership. Concentrating all domains centrally is not the
best solution; you need a decentralized approach. To implement decentralization, you can adopt
the data mesh concept, which provides a new way to address these common problems.
The data mesh helps create a decentralized data governance model where teams are
responsible for the end-to-end ownership of data, from its creation to consumption. This
ownership includes defining data standards, creating data lineages, and ensuring that the data
are accurate, complete, and accessible.
The goal of the data mesh is to enable organizations to create a consistent and trusted data
infrastructure by breaking monolithic data lakes and silos into smaller, more decentralized,
domain-oriented parts, as shown in Figure 2-3.
Figure 2-3 Data mesh architecture
To implement the data mesh, the following principles must be considered:
Domain-oriented decentralized data ownership and architecture
Data as a product
Self-service infrastructure as a platform
Federated computational governance
Hashed Feature
The hashed feature is a technique used to reduce the dimensionality of input data while
maintaining a reasonable degree of accuracy in the model's predictions. In this pattern, the
original input features are hashed into a smaller set of features using a hash function.
A hash function maps input data of arbitrary size to a fixed-size output, typically a string
of characters or a sequence of bits that represents the input compactly. A good hash function
makes it computationally infeasible to produce the same hash value from two different inputs.
The hashed feature is useful when you are working with high-dimensional input data because it
helps reduce the computational and memory requirements of the model while still achieving
reasonable accuracy.
The hashed feature component in ML transforms a stream of English text into a set of integer
values, as shown in Table 2-1. You can then pass this hashed feature set to an ML algorithm to
train a text analytics model.
Table 2-1 Example comments and sentiment scores
Comments                                          Sentiment
I loved this restaurant.                          3
I hated this restaurant.                          1
This restaurant was excellent.                    3
The taste is good, but the ambience is average.   2
The restaurant is a good but too crowded place.   2
Internally, the hashing component creates a dictionary, an example of which is shown in
Table 2-2.
Table 2-2 Example n-gram dictionary
N-gram            Frequency
This restaurant   3
I loved           1
I hated           1
I love            1
Ambience          2
The hashing feature transforms a categorical input into an index by applying a hash function
that maps the input to a fixed-size integer output. The resulting hash value can be a positive
or negative integer, depending on the input data. To convert the hash value to a positive
index, the hashed feature takes the absolute value of the hash value, which ensures the
resulting index is always positive.
For example, suppose you have a categorical feature "fruit" with the values "orange," "apple,"
and "watermelon." You apply a hash function to each value to generate a hash value, which can
be a positive or negative integer. You then take the absolute value of the hash value and use
the modulo operator to map the resulting index to a fixed range of values. Suppose you want to
use 100 buckets; you take the modulo of the absolute hash value with 100 to get an index in
the range [0, 99]. If the hash value is negative, taking the absolute value ensures that the
resulting index is still in the range [0, 99].
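A minimal base R sketch of this bucketing scheme follows. The polynomial hash and the choice of 100 buckets are illustrative assumptions; a real pipeline would typically use a library hash function such as MurmurHash.

# Illustrative hashing-trick sketch: map categorical values to 100 buckets.
hash_bucket <- function(value, buckets = 100) {
  h <- 0
  for (code in utf8ToInt(value)) {
    h <- (h * 31 + code) %% 2147483647   # simple polynomial hash, kept in integer range
  }
  abs(h) %% buckets                      # absolute value, then modulo into [0, buckets - 1]
}

fruits <- c("orange", "apple", "watermelon")
sapply(fruits, hash_bucket)              # each fruit lands in one of the 100 buckets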
Using a hash function handles large categorical input data with high cardinality. By using
a hash function to map the categorical values to a small number of hash buckets, you can reduce
the dimensionality of the input space and improve the efficiency of the model. However, using
too few hash buckets can lead to collisions, where different input values are mapped to the
same hash bucket, resulting in a loss of information. Therefore, it is important to choose a
good hash function and a suitable number of buckets to minimize collisions and ensure that the
resulting feature vectors are accurate.
If a new restaurant opened in your area and the restaurant team launched a campaign on
social media but there was no historical value existing for this restaurant, the hashing feature
could still be used to make predictions. The new restaurant can be hashed into one of the
existing hash buckets based on its characteristics, and the prediction for that hash bucket can be
used as a proxy for the new restaurant.
Embeddings
Embeddings are a learnable data representation that maps high-cardinality data into a lower-
dimensional space in such a way that the information relevant to the learning problem is
preserved. Embeddings help identify the properties of the input features related to the output
label, because the data representation of the input features directly affects the quality of
the final model. While handling structured numeric fields is straightforward, disparate data
such as video, images, text, and audio require training to obtain a meaningful numerical
representation; this pattern helps you handle these disparate data types.
Usually, one-hot encoding is a common way to represent categorical input variables.
Nevertheless, in disparate data the one-hot encoding of high-cardinality categorical features
leads to a sparse matrix that does not work well with ML models and treats categorical variables
as being independent, so we cannot capture the relationship between different variables using
one-hot encoding.
Embeddings solve the problem by representing high-cardinality data densely in a lower
dimension by passing the input data through an embedding layer that has trainable weights.
This helps capture close relationships between high-dimensional variables in a lower-
dimensional space. The weights to create the dense representation are learned as part of the
optimization model.
The tradeoff of this model is the loss of information involved in moving from a high-
cardinality representation to a low-dimensional representation.
The embedding design pattern can be used in text embeddings in classification problems
based on text inputs, image embeddings, contextual language models, and training an
autoencoder for image embedding where the feature and the label are the same.
Let us take the same restaurant example: you have a hundred thousand diverse restaurants and
ten thousand users, and you want to recommend restaurants to the respective users based on
their preferences.
Input: 100,000 restaurants and 10,000 users; task: recommend restaurants to users.
I have put the restaurants' names in order, with Asian restaurants to the left, African
restaurants to the right, and the rest in the center, as shown in Figure 2-4.
I have considered one dimension for separating the restaurants by cuisine; there are many other
dimensions you could consider, such as vegetarian, dessert, decadent, coffee, etc.
Let us add restaurants to the x-axis and y-axis as shown in Figure 2-5, with the x-axis for
Asian and African restaurants and the y-axis for European and Latin American restaurants.
Figure 2-5 Restaurants along x-axis and y-axis
The similarity between restaurants is now captured by how close these points are. Here, I am
representing only two dimensions; more dimensions may be needed to capture everything about
the restaurants.
d-Dimensional Embeddings: Assume user interest in restaurants can be roughly explained
by d aspects; each restaurant then becomes a d-dimensional point, where the value in each
dimension represents how much the restaurant fits that aspect, and the embeddings can be
learned from data.
Learning Embeddings in a Deep Network: No separate training process is needed; the embedding
layer is just a hidden layer. Supervised information (e.g., users who went to the same
restaurant twice) tailors the learned embeddings to the desired task, and the hidden units
discover how to organize the items in d-dimensional space so as to best optimize the final
objective.
If you want restaurant recommendations, the embeddings should be aimed toward that task. The
matrix shown in Figure 2-6 is a classic collaborative-filtering input: there is one row for
each user and one column for each restaurant, and a mark indicates that the user has visited
the restaurant.
Input Representation:
For the preceding example, you need to build a dictionary mapping each feature to an integer
from 0 to the number of restaurants minus 1.
Efficiently represent the sparse vector as just the restaurants at which the user dined; this
representation is shown in the figure.
You can use these input data to identify ratings based on user dining and provide
recommendations.
Selecting How Many Embedding Dimensions:
Higher-dimensional embeddings can more accurately represent the relationships between
input values.
Having more dimensions increases the chance of overfitting and leads to slower training.
The embeddings can be applied to dense social media, video, audio, text, images, and so on.
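As a sketch of what an embedding layer can look like in code, the following uses the keras package for R to map 100,000 restaurant IDs into a 16-dimensional space that feeds a simple rating predictor. The layer sizes, optimizer, and the existence of suitable training data are assumptions; the point is only that the embedding weights are learned along with the rest of the network.

# Sketch only: assumes the keras R package and a working TensorFlow backend.
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 100000,   # one row per restaurant ID
                  output_dim = 16,      # d = 16 dimensional embedding (assumption)
                  input_length = 1) %>%
  layer_flatten() %>%
  layer_dense(units = 1)                # predict, e.g., a rating or visit propensity

model %>% compile(optimizer = "adam", loss = "mse")

# restaurant_ids and ratings would come from the user-restaurant matrix:
# model %>% fit(restaurant_ids, ratings, epochs = 5)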
Feature Cross
The feature cross pattern combines multiple features to create new, composite features that can
capture complex interactions between the original features. It is always good to increase the
representation power of the model by introducing non-linear relationships between features.
The feature cross pattern can be applied to neural networks, decision trees, linear
regression, and support vector machines.
There are different ways to implement the feature cross pattern, depending on the algorithm
and the problem at hand.
The first approach is manually creating new features by combining multiple existing features
using mathematical operations. For example, a credit risk prediction task might multiply the
applicant's income by their credit score. The disadvantage of this approach is that it is
time-consuming.
The second approach is automating feature engineering by using algorithms to automatically
generate new features from the existing ones. For example, an AutoML library can be used to
generate new features based on relationships between features in each set. The disadvantage of
this approach is that a large number of generated features can be difficult to interpret.
The third approach is neural network–based and uses neural networks to learn the optimal
feature interactions directly from the data. For example, a deep neural network may include
multiple layers that combine the input features in a non-linear way to create new, higher-level
features.
The benefits of using feature cross patterns are as follows:
They improve model performance by capturing complex interactions between features that
may not be captured by individual features alone.
They can use automation to generate new features from existing ones, and reduce the need for
manual feature engineering.
They help to improve the interpretability of the model by capturing meaningful interactions
between features.
The drawbacks of feature cross patterns are as follows:
Feature crosses can increase the number of features in the model, which can lead to the curse
of dimensionality and overfitting.
Feature crosses can increase the computational complexity of the model, making it harder to
train and scale.
Let's consider a simple example of using a feature cross for linear regression in R with the
caret package.
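The following is a minimal sketch under assumed data: the credit figures are made up, and the income-by-credit-score interaction is created directly in the model formula.

library(caret)

# Made-up credit data: income, credit score, and an observed default loss.
credit <- data.frame(
  income       = c(30, 45, 60, 80, 55, 70, 40, 90, 65, 50),
  credit_score = c(600, 650, 700, 720, 640, 710, 610, 750, 690, 630),
  loss         = c(9.1, 7.2, 4.8, 3.1, 7.9, 4.2, 8.7, 2.5, 5.0, 7.6)
)

# The term income:credit_score is the feature cross (interaction feature).
model <- train(loss ~ income + credit_score + income:credit_score,
               data = credit,
               method = "lm",
               trControl = trainControl(method = "cv", number = 5))

model$finalModel                      # inspect the learned coefficients
predict(model, newdata = data.frame(income = 75, credit_score = 705))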