IDS Notes
UNIT – 1
Data Science:
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data. It
combines elements of statistics, machine learning, computer science, and domain-specific
knowledge to analyse and interpret complex data sets. Data science techniques are used to
solve complex problems, make informed decisions, and drive business strategy.
Data science is a field that deals with unstructured, structured, and semi-structured
data.
Data science is the combination of statistics, mathematics, programming, and
problem-solving skills.
Data scientists help companies make data-driven decisions to improve their business.
Big data: refers to extremely large and complex datasets that cannot be easily managed,
processed, or analysed using traditional data processing tools or methods. These datasets
typically involve massive volumes of data, generated at high velocity, and of varying types
(structured, unstructured, and semi-structured).
Data Science Hype: Data science enables companies not only to understand data from
multiple sources but also to enhance decision making. As a result, data science is widely
used in almost every industry, including health care, finance, marketing, banking, city
planning, and more.
Lack of clear definitions: Basic terminology like "Big Data" and "data science" lacks
clear definitions, leading to ambiguity.
Lack of respect for researchers: Many researchers in academia and industry have
been working in this field for years, but their contributions are sometimes
overlooked.
Crazy hype: The hype surrounding data science can lead to exaggeration and
unrealistic expectations, increasing the noise-to-signal ratio.
Statisticians' perspective: Statisticians feel that they have been studying the "Science
of Data" for a long time and may feel overshadowed by the rise of data science.
Perception of data science: Some people question whether data science is truly a
science or more of a craft, leading to debates about its legitimacy.
The hype around data science stems from its potential to revolutionize industries and drive business
success through predictive analytics, machine learning, and data-driven decision-making. Companies
across various sectors are investing heavily in data science initiatives to gain a competitive edge and
unlock new insights from their data.
However, there is also a fair amount of skepticism and caution surrounding the hype of data science.
Critics argue that the field is overhyped and that businesses may not always see the expected ROI
from their data science investments. Additionally, there are concerns about data privacy and ethical
implications of using data science techniques to make decisions that impact individuals and society
as a whole.
Overall, while the hype around data science is well-founded in its potential to drive innovation and
growth, it is important for businesses to approach data science initiatives with a clear strategy and
ethical considerations in mind.
Datafication:
Datafication is a buzzword, i.e., a currently popular and important term.
Datafication involves transforming social actions into quantified online data for real-time
tracking and predictive analysis.
Latest technologies enable new ways to capture and utilize data from daily activities.
Datafication is a technological trend that turns aspects of life into computerized data, driving
organizations to become data-driven enterprises.
It refers to rendering daily interactions into a data format for social use.
Organizations utilize data for critical business processes, decision-making, and strategies.
Total control over data storage, extraction, manipulation, and utilization is crucial for
organizational survival in a data-oriented landscape.
Examples of data generation include phone calls, SMS, social media interactions, financial
transactions, and video surveillance. This astronomical amount of data has information
about our identity and our behaviour.
Datafication goes beyond digitization and encompasses a vast amount of data containing
information about identity and behavior.
Marketers analyze social media data to predict sales, showcasing the benefits of data
analytics across various industries and company sizes.
Statistical Inference:
Definition and Purpose
Statistics is a branch of mathematics concerned with the collection, analysis,
interpretation, and presentation of numerical data.
Its main purpose is to make accurate conclusions about a larger population using a
limited sample.
Types of Statistics
1. Descriptive Statistics:
Summarizes or describes the characteristics of a data set.
Consists of measures of central tendency (mean, median, mode), measures of
variability (variance, standard deviation), and frequency distribution.
2. Inferential Statistics:
Draws conclusions about a population by examining random samples.
Goal is to make generalizations about a population based on sample data.
Uses analytical tools to make inferences about population parameters (e.g.,
population mean) from sample statistics (e.g., sample mean).
Difference Between Descriptive and Inferential Statistics
Descriptive statistics describe the data set, whereas inferential statistics use that
data to make predictions and allow generalizations about the population based on
sample data.
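The distinction can be illustrated with a short Python sketch. The simulated population of heights and the 400-point sample are invented for illustration, and the 1.96 factor assumes a normal approximation:

```python
import math
import random

random.seed(1)

# A simulated population we would normally never observe in full
# (e.g. heights in cm) -- an assumption for illustration.
population = [random.gauss(170, 8) for _ in range(100_000)]

# Descriptive statistics: summarize the sample we actually have.
sample = random.sample(population, 400)
sample_mean = sum(sample) / len(sample)
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in sample) / (len(sample) - 1))

# Inferential statistics: use the sample to estimate the unseen population
# mean, with a rough 95% confidence interval (normal approximation).
margin = 1.96 * sample_sd / math.sqrt(len(sample))
ci = (sample_mean - margin, sample_mean + margin)
```

The sample mean and standard deviation are descriptive (they summarize the sample itself); the confidence interval is inferential (it is a claim about the population the sample came from).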
Statistical Inference
Inference means making conclusions or guesses about something.
Statistical inference involves making conclusions about the population using various
statistical analysis techniques applied to sample data.
Population:
· A complete collection of the objects or measurements is called the population. In other
words, everything in the group we want to learn about, i.e., the entire set of items from
which data is drawn in a statistical study, is termed the population.
· It can be a group of individuals or a set of items.
· The population size is usually denoted by N.
The number of citizens living in the State of Rajasthan represents a population of the state
Sample:
· A sample is a small amount taken out of the population; it is a smaller group that
represents the bigger population.
· A sample is an unbiased (balanced) subset of the population used to represent the
whole data.
· A sample is the group of elements actually participating in the survey or study.
· A sample is a representation of manageable size.
· Samples are collected and statistics are calculated from the sample so that one can
make inferences or extrapolations about the population.
· This process of collecting information from the sample is called sampling.
· The sample size is denoted by n.
Statistical Modeling:
Statistical modeling is a method used to analyze and interpret data in order to make
predictions or draw conclusions. It involves the use of mathematical and statistical
techniques to create a simplified representation of a complex system or process. Statistical
models are often used in various fields such as economics, biology, finance, and social
sciences to understand relationships between variables, make predictions about future
outcomes, or test hypotheses. Popular statistical modeling techniques include linear
regression, logistic regression, time series analysis, and survival analysis.
Training in statistical modeling involves using a portion of the available data to estimate the
parameters of the model. This is typically done through methods such as Maximum
Likelihood Estimation or Bayesian inference. The trained model can then be used to make
predictions on new data or infer relationships between variables in the dataset.
Evaluation in statistical modeling refers to assessing the performance of the model on the
data it was trained on, as well as on new data that was not used in the training process. This
is done by comparing the model's predictions to the actual values in the dataset. Evaluation
helps to determine how well the model is able to generalize to new data and make reliable
predictions.
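The train-then-evaluate workflow described above can be sketched in plain Python. The simulated dataset and the straight-line model are assumptions for illustration; under Gaussian noise, the ordinary least-squares fit coincides with the maximum likelihood estimate:

```python
import random

random.seed(42)

# Simulated data: y depends linearly on x, plus Gaussian noise.
# The true slope (3.0) and intercept (5.0) are assumptions for illustration.
data = [(x, 3.0 * x + 5.0 + random.gauss(0, 2)) for x in range(40)]

# Training portion vs held-out evaluation portion.
train, test = data[:30], data[30:]

# "Training": estimate the parameters by ordinary least squares, which is
# the maximum likelihood estimate under Gaussian noise.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

# "Evaluation": mean squared error of the model's predictions on data
# it was never trained on.
test_mse = sum((slope * x + intercept - y) ** 2 for x, y in test) / len(test)
```

Evaluating on the held-out points, rather than only on the training portion, is what reveals how well the fitted model generalizes.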
Types of Regression Algorithms:
o Simple Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
Linear Regression: Predicting continuous variables based on relationships between
independent and dependent variables.
Decision Tree Regression: Building a tree-like structure to predict outcomes based on
features.
Purpose: To predict continuous variables and understand relationships between variables.
Probability:
o The word 'probability' means the chance of a particular event
occurring.
o Probability denotes the possibility of something happening.
o It is a mathematical concept that predicts how likely events are
to occur. Probability values are expressed between 0 and 1.
o Probability is defined as the degree to which something is likely
to occur.
Probability Distribution
A probability distribution is a function used to give the probability of all possible values that
a random variable can take. It describes the probability of different outcomes of a variable.
It provides probabilities for each possible outcome of a random experiment.
Probability distributions can be represented using graphs or tables.
There are two main types: discrete and continuous probability distributions.
Types of Probability Distribution:
1. Discrete Probability Distributions:
Gives the probability of a discrete random variable having a specified value.
Represents data with a finite countable number of outcomes.
Example: Rolling a fair die.
2. Continuous Probability Distributions:
Has a range of values that are infinite and uncountable.
Represents data with infinite possibilities.
Example: Time.
Probability distributions are fundamental in understanding and analyzing random
phenomena and are essential in various fields such as statistics, mathematics, and data
science.
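The fair-die example above is a discrete distribution with six equally likely outcomes, which can be written out directly (using exact fractions so the probabilities sum to exactly 1):

```python
from fractions import Fraction

# Discrete probability distribution: a fair six-sided die.
# Each of the six outcomes has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# A valid distribution assigns probabilities that sum to exactly 1.
total = sum(die.values())

# Expected value (mean) of the distribution: sum of outcome times probability.
expected = sum(face * p for face, p in die.items())
```

A continuous distribution, by contrast, cannot be listed outcome by outcome like this; probabilities are instead assigned to intervals of values.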
Overfitting:
Overfitting occurs when a model learns the details and noise in the training data to the
extent that it negatively impacts the model's performance on new, unseen data. In other
words, the model is too complex and captures the noise in the data rather than the
underlying pattern.
Some common reasons for overfitting include:
Using a model that is too complex for the amount of data available
Training the model for too long, causing it to memorize the training data
Using features that are irrelevant or noisy
To address overfitting, you can try the following techniques:
Use simpler models with fewer parameters
Regularize the model by adding penalties to the model parameters
Use cross-validation to evaluate the model's performance on multiple subsets of the
data
Use feature selection techniques to only include relevant features
Increase the amount of training data
By taking steps to prevent overfitting, you can ensure that your model generalizes well to
new data and provides accurate predictions.
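A minimal illustration of overfitting, assuming a toy dataset where y is roughly 2x plus noise: a "model" that memorizes every training point achieves zero training error but carries the noise into its predictions, while a simpler one-parameter model captures the underlying pattern:

```python
import random

random.seed(0)

# Toy data: y is roughly 2*x plus bounded noise (an assumed ground truth).
train = [(x, 2 * x + random.uniform(-1, 1)) for x in range(10)]
test = [(x, 2 * x + random.uniform(-1, 1)) for x in range(10)]

# Overfit "model": a lookup table that memorizes every training point,
# noise included.
memorized = {x: y for x, y in train}

# Simpler model: a single least-squares slope through the origin.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def mse(predict, data):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

train_mse_memorized = mse(lambda x: memorized[x], train)  # exactly 0
train_mse_simple = mse(lambda x: slope * x, train)        # small but nonzero
test_mse_memorized = mse(lambda x: memorized[x], test)    # the noise comes back
test_mse_simple = mse(lambda x: slope * x, test)
```

Zero training error from the lookup table is not a sign of a good model: on fresh draws its predictions repeat the memorized noise, which is exactly the overfitting failure mode described above.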
Attributes
Attributes are characteristics or qualities that describe an object, person, or entity. They
define the properties of something and can be used to identify or differentiate it from other
similar things. Attributes can be physical, such as color or size, or abstract, such as
personality traits or values. They play a crucial role in categorizing, organizing, and
understanding the world around us.
A data set is a collection of data objects and their attributes. An attribute is a property or
characteristic of an object, e.g., eye colour of a person, temperature, etc. An attribute is
also known as a variable, field, characteristic, or feature. A collection of attributes
describes an object; an object is also known as a record/row, point, case, sample, entity,
or instance.
Types of Attributes:
Understanding the types of attributes is crucial in the data preprocessing stage, as it helps
differentiate between different kinds of data and guides how the data should be processed.
Here's a breakdown of the types of attributes:
1. Qualitative Attributes:
Nominal Attributes (N): Nominal attributes are related to names and represent
categories or states. There is no inherent order among the values of nominal
attributes.
Example: Colors (Red, Blue, Green)
Binary Attributes (B): Binary attributes have only two values or states.
Example: Yes/No, True/False, Affected/Unaffected
Symmetric Attribute: Both values of a symmetric attribute are equally important.
Example: Gender (Male/Female)
Ordinal Attributes (O): Ordinal attributes have values with a meaningful sequence or
ranking, but the magnitude between values is not known.
Example: Education Level (High School, Bachelor's, Master's)
2. Quantitative Attributes:
Numeric Attributes: Numeric attributes are measurable quantities represented in
integer or real values. They can be further categorized into:
Interval-Scaled Attribute: Values have interpretable differences, but there is
no true zero point. Addition and subtraction can be performed.
Example: Temperature in Celsius
Ratio-Scaled Attribute: Values have a fixed zero point, allowing for
meaningful ratios and computations.
Example: Weight in kilograms
Discrete Attributes: Discrete data have finite values and can be either numerical or
categorical. These attributes have a finite or countably infinite set of values.
Example: Number of children in a family
Mean:
Mean is the sum of all the values in the data set divided by the number of values in the data
set. It is also called the Arithmetic Average. Mean is denoted as x̅ and is read as x bar.
Median:
A Median is a middle value for sorted data. The sorting of the data can be done either in
ascending order or descending order. A median divides the data into two equal halves.
If the number of terms is even, the median is the average of the two middle values of the
sorted data: Median = [(n/2)th term + ((n/2) + 1)th term] / 2.
Mode:
A mode is the most frequent value or item of the data set. A data set can generally have one
or more than one mode value. If the data set has one mode then it is called “Uni-modal”.
Similarly, If the data set contains 2 modes then it is called “Bimodal” and if the data set
contains 3 modes then it is known as “Trimodal”. If the data set consists of more than one
mode then it is known as “multi-modal”(can be bimodal or trimodal). There is no mode for a
data set if every number appears only once.
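All three measures can be computed with Python's standard statistics module (the small data set below is invented for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 3, 9, 5]

mean = statistics.mean(data)        # sum of values / number of values = 45 / 9
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # most frequent value(s); [3] here

# With an even number of terms, the median averages the two middle values.
even_median = statistics.median([1, 2, 3, 4])  # (2 + 3) / 2
```

multimode returns a list because a data set can be uni-modal, bimodal, or multi-modal, as described above.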
Range:
The range is the difference between the highest and lowest values in a
dataset.
It provides a simple measure of dispersion but can be influenced by
outliers.
Formula: Range = Maximum value - Minimum value.
Interquartile Range (IQR):
IQR is the range of the middle 50% of values in a dataset, less affected by
outliers.
It is calculated as the difference between the third quartile (Q3) and the first
quartile (Q1).
Formula: IQR = Q3 - Q1
Consider the following data set with 13 observations (1, 2, 3, 5, 7, 8, 11, 12, 15,
15, 18, 18, 20):
First, we want to find the 25th percentile, so k = 25.
We have 13 observations, so n = 13.
(nk)/100 = (25 × 13)/100 = 3.25, which is not an integer, so we use the rule for the
non-integer case: take the (j + 1)th observation, where j is the largest integer
less than (nk)/100.
j = 3 (the largest integer less than (nk)/100, that is, less than 3.25).
Therefore, the 25th percentile is the ( j + 1)th or 4th observation, which has the
value 5.
We can follow the same steps to find the 75th percentile:
(nk)/100 = (75*13)/100 = 9.75, not an integer.
j = 9 (the largest integer less than 9.75).
Therefore, the 75th percentile is the 9 + 1 or 10th observation, which has the
value 15.
Therefore, the interquartile range is (15 − 5) or 10.
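The percentile rule from this worked example can be sketched in Python. The integer (nk)/100 case does not occur in the example, so the convention used for it below (averaging two neighbouring observations) is an assumption:

```python
import math

def percentile(sorted_data, k):
    """kth percentile of already-sorted data, following the rule above:
    when nk/100 is not an integer, take the (j + 1)th observation,
    where j is the largest integer less than nk/100."""
    n = len(sorted_data)
    pos = n * k / 100
    if pos != int(pos):
        j = math.floor(pos)
        return sorted_data[j]  # (j + 1)th observation; lists are 0-indexed
    # Integer case: average the pos-th and (pos + 1)th observations
    # (one common convention -- an assumption here).
    j = int(pos)
    return (sorted_data[j - 1] + sorted_data[j]) / 2

data = [1, 2, 3, 5, 7, 8, 11, 12, 15, 15, 18, 18, 20]
q1 = percentile(data, 25)  # 4th observation = 5
q3 = percentile(data, 75)  # 10th observation = 15
iqr = q3 - q1              # 15 - 5 = 10
```

Running this on the 13 observations above reproduces the hand calculation: Q1 = 5, Q3 = 15, IQR = 10.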
Variance and Standard Deviation:
Variance:
Variance measures the average squared deviation of each data point
from the mean.
It gives a sense of how much the individual values in a dataset vary from
the mean.
Population Variance: σ² = Σ(xᵢ − μ)² / N
Sample Variance: s² = Σ(xᵢ − x̄)² / (n − 1)
Standard Deviation:
Standard deviation is the square root of the variance.
It represents the average deviation of data points from the mean.
Population Standard Deviation: σ
Sample Standard Deviation: s
Graphic Displays:
1. Histograms:
Histograms display the frequency distribution of a continuous
variable by dividing it into intervals (bins) and representing the
frequency of values within each interval by the height of bars.
They provide insights into the shape, center, and spread of the
data distribution.
2. Scatter Plots:
Scatter plots are used to visualize the relationship between two
continuous variables.
Each point on the plot represents a single observation, with one
variable on the x-axis and the other on the y-axis.
3. Line Charts:
Line charts are useful for displaying trends over time or other
ordered categories.
They connect data points with lines to show the overall pattern or
trend in the data.
4. Bar Charts:
Bar charts represent categorical data with rectangular bars whose
lengths are proportional to the values they represent.
They are commonly used to compare the frequencies or
proportions of different categories.
5. Pie Charts:
Pie charts display categorical data as slices of a circle, with each
slice representing a proportion of the whole.
They are useful for illustrating the composition of a dataset or the
distribution of categories.
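The binning idea behind histograms can be sketched without any plotting library, drawing each bin's frequency as a plain-text bar (the values and bin width are invented for illustration):

```python
from collections import Counter

values = [1.2, 1.9, 2.3, 2.7, 3.1, 3.4, 3.6, 4.8, 5.0, 5.9]
bin_width = 1.0

# Assign each value to the interval [b, b + 1) it falls into.
bins = Counter(int(v // bin_width) for v in values)

# Draw one '#' per value in the bin: the bar length is the frequency.
for b in sorted(bins):
    low, high = b * bin_width, (b + 1) * bin_width
    print(f"[{low:.1f}, {high:.1f}): {'#' * bins[b]}")
```

The same binned counts are what a charting library draws as bar heights in a real histogram.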
UNIT – 3
R FACTOR
In R, a factor is a special data structure designed to represent categorical data. It goes
beyond simply storing character strings or integers. Factors are particularly useful for
variables with a limited set of predefined values, like:
Gender (Male, Female, Non-binary)
Color (Red, Green, Blue)
Day of the week (Monday, Tuesday, Wednesday, etc.)
Creating a Factor in R
You can create a factor using the factor() function. It takes a vector as input and converts
it into a factor object. For example, factor(c("Red", "Blue", "Red")) creates a factor whose
levels are "Blue" and "Red".
Key Points to Remember
Factors can store both character vectors and integer vectors.
They are particularly useful for data with a limited number of unique values.
Factors are internally stored as integers with labels for easy manipulation and
analysis.
Data reduction is the process of transforming large datasets into smaller ones, while
preserving the most important information. This is done for a number of reasons, including:
Reduced storage requirements: Smaller datasets take up less storage space, which
can save money and make data management easier.
Faster processing times: Smaller datasets can be processed by computers more
quickly, which can be important for tasks such as data mining and machine learning.
Improved accuracy: By removing irrelevant or redundant data, data reduction can
sometimes improve the accuracy of machine learning models.
UNIT – 5
Data Reduction:
Data reduction refers to the process of reducing the amount of data while preserving its
essential characteristics.
This is often necessary in fields such as data mining, machine learning, and signal processing,
where large datasets can be computationally expensive to process or may contain redundant
or irrelevant information.
There are several strategies for data reduction, including wavelet transforms, principal
components analysis (PCA), and attribute subset selection.
Sampling:
In data reduction, sampling refers to the technique of selecting a subset of data
points from a larger dataset to represent the whole. It's a powerful tool for
dealing with large datasets, offering several key benefits:
The effectiveness of sampling in data reduction depends on selecting the right
type of sample and ensuring it's representative of the entire dataset.
Random Sampling: Every data point has an equal chance of being
selected, ensuring an unbiased representation of the entire dataset.
Sample Size: A larger sample size generally provides a more accurate
representation, but there's a trade-off between size and efficiency.
Cluster Sampling: The data is first divided into clusters based on
similarity, and then a random sample of clusters is chosen. This can be
helpful when dealing with naturally occurring groups within the data.
Overall, sampling is a crucial technique in data reduction. It allows you to
work with manageable-sized data subsets while still gaining valuable
insights from the larger dataset.
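Random sampling can be sketched in a few lines of Python (the simulated population is an assumption for illustration): the mean of a representative sample tracks the population mean closely while using only a fraction of the data.

```python
import random

random.seed(7)

# Simulated population of 10,000 measurements (invented for illustration).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Simple random sampling: every point has an equal chance of selection.
sample = random.sample(population, 500)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

# The 5% sample's mean stays close to the full population's mean,
# which is what makes the reduced dataset useful for analysis.
```

Increasing the sample size tightens this agreement, at the cost of the efficiency gains that motivated sampling in the first place.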
Data Cube Aggregation:
Data cube aggregation is a powerful technique used in data reduction
specifically for multidimensional datasets. Here's how it helps reduce data size
while preserving valuable information:
Concept of Data Cubes:
Imagine a dataset with multiple dimensions or attributes (e.g., sales data with
product category, region, and time period). A data cube represents this data as
a multidimensional structure, similar to a cube with each dimension as an axis.
Aggregation in Data Cubes:
Data cube aggregation involves precalculating summaries (often using
functions like sum, average, count, minimum, maximum) of the data for various
combinations of dimensions. This essentially condenses the data into a more
compact representation.
Reduced Storage Requirements: By storing only the aggregated values
for different dimension combinations, data cube aggregation significantly
reduces the storage space needed compared to storing every single data
point.
Faster Data Retrieval: When you need to analyze a specific subset of the
data (e.g., total sales for a particular product category in a specific
region), you can quickly retrieve the precomputed aggregation from the
data cube, saving processing time.
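The precomputation idea can be sketched with plain dictionaries standing in for the cube (the sales records are invented for illustration): each (category, region) cell, each category roll-up, and the grand total are aggregated once up front.

```python
from collections import defaultdict

# Toy sales records: (product category, region, amount) -- invented data.
sales = [
    ("electronics", "north", 100),
    ("electronics", "south", 150),
    ("clothing",    "north", 80),
    ("clothing",    "north", 40),
    ("clothing",    "south", 60),
]

# Precompute the "cube": totals for each (category, region) cell,
# each category roll-up, and the grand total.
by_cat_region = defaultdict(int)
by_cat = defaultdict(int)
grand_total = 0
for cat, region, amount in sales:
    by_cat_region[(cat, region)] += amount
    by_cat[cat] += amount
    grand_total += amount

# A query like "total clothing sales in the north" is now a single lookup
# instead of a scan over every record.
north_clothing = by_cat_region[("clothing", "north")]
```

Only the aggregated totals need to be stored and queried afterwards, which is where the storage and retrieval savings described above come from.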
Data Visualization:
Data visualization is the art of taking complex data and transforming it into a visual format that's
easy to understand. Imagine a giant spreadsheet filled with numbers - data
visualization turns that into charts, graphs, maps, or other visuals that make
the information clear and accessible.
Here's the key idea:
Data: Raw information, like numbers, statistics, or text.
Visualization: Turning that data into a visual format (charts, graphs, etc.).
Easy to Understand: Making the information clear and accessible, even for
people without a data background.