Data, Measurements, and Data Preprocessing Summer Term 2024

1.1. The table below, generated with R [1], shows the distribution of birth years within each marital status.
Absurd Alone Divorced Married Single Together Widow YOLO
1893 0 0 0 0 1 0 0 0
1899 0 0 0 0 0 1 0 0
1900 0 0 1 0 0 0 0 0
1940 0 0 0 0 1 0 0 0
1941 0 0 0 1 0 0 0 0
1943 0 0 2 2 1 0 2 0
1944 0 0 1 4 1 0 1 0
1945 0 0 0 3 2 2 1 0
1946 0 0 0 7 3 6 0 0
1947 0 0 0 3 2 9 2 0
1948 0 0 1 9 2 6 3 0
1949 0 0 3 11 1 7 8 0
1950 0 0 4 7 2 12 4 0
1951 0 0 3 18 9 9 4 0
1952 0 0 8 12 14 16 2 0
1953 0 0 3 12 6 12 2 0
1954 0 0 11 21 6 10 2 0
1955 0 0 5 19 10 13 2 0
1956 0 0 4 19 10 19 3 0
1957 1 0 3 18 7 12 2 0
1958 0 1 5 10 13 21 3 0
1959 0 0 8 21 5 14 3 0
1960 0 0 6 16 6 18 3 0
1961 0 0 3 15 8 9 1 0
1962 0 0 8 11 11 12 2 0
1963 0 0 9 13 11 9 3 0
1964 0 0 6 20 10 5 1 0
1965 0 0 7 33 14 18 2 0
1966 0 0 5 13 14 16 2 0
1967 0 0 8 14 9 13 0 0
1968 0 0 14 17 7 12 1 0
1969 0 0 3 30 11 22 5 0
1970 0 0 12 28 20 15 2 0
1971 0 0 13 33 15 22 4 0
1972 0 0 4 40 18 17 0 0
1973 0 1 10 37 8 13 3 2
1974 0 0 7 34 11 17 0 0
1975 0 0 9 36 19 16 3 0
1976 0 0 15 31 14 29 0 0
1977 0 0 5 26 9 12 0 0
1978 0 0 7 31 14 24 1 0
1979 0 0 5 16 14 18 0 0
1980 0 0 2 21 10 6 0 0
1981 0 0 2 14 10 13 0 0
1982 0 0 0 16 11 18 0 0
1983 0 0 2 21 8 11 0 0
1984 0 0 0 18 12 8 0 0
1985 0 0 0 15 13 4 0 0
1986 0 0 2 19 11 10 0 0
1987 0 0 3 7 13 4 0 0
1988 0 1 1 12 10 5 0 0
1989 0 0 2 11 11 6 0 0
1990 0 0 0 11 7 0 0 0
1991 0 0 0 2 9 4 0 0
1992 0 0 0 4 6 3 0 0
1993 1 0 0 0 4 0 0 0
1994 0 0 0 0 1 2 0 0
1995 0 0 0 0 5 0 0 0
1996 0 0 0 2 0 0 0 0

[1] Note that I utilized AI for the programming aspect of this assignment, given my novice status in programming.
Despite this reliance, it's crucial to note that I meticulously analyzed every step of the process.

Continuing with this dataset, the information is presented in the form of a box plot.

According to the table of distribution and the boxplot these are the findings:

• Absurd, Alone, and YOLO: These categories contain only a handful of customers each. Among
them, Absurd spans the widest range of birth years.
• Divorced: Concentrated in the middle-aged group, with few outliers.
• Married and Together: Similar age distributions, centered on the middle-aged group, with few
outliers.
• Single: Broad age range, slightly skewed towards younger ages, with outliers in the older age group.
• Widow: Concentrated towards older ages, with few outliers in the younger age group.
• Count of customers: Married has the highest customer count, followed by Together and Single.
Absurd, Alone, and YOLO have the lowest counts.
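
The counts and age ranges discussed above can be derived directly from the raw records. The following is a minimal sketch in Python (the report itself used R); the records are hypothetical stand-ins, not the actual dataset, and the variable names are illustrative.

```python
from collections import defaultdict

# Hypothetical (birth_year, marital_status) records; not the real dataset.
customers = [
    (1957, "Absurd"), (1993, "Absurd"),
    (1965, "Married"), (1972, "Married"), (1975, "Married"),
    (1970, "Single"), (1987, "Single"),
    (1949, "Widow"),
]

# Group birth years by marital status.
years_by_status = defaultdict(list)
for birth_year, status in customers:
    years_by_status[status].append(birth_year)

# Per category: number of customers and (earliest, latest) birth year.
summary = {
    status: {"count": len(years), "range": (min(years), max(years))}
    for status, years in years_by_status.items()
}

for status, stats in sorted(summary.items()):
    print(status, stats["count"], stats["range"])
```

Sorting the categories by their count makes it easy to check which marital status is the largest before interpreting the box plot.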

From the analysis of the data, we can conclude that:

• The analysis reveals distinct age distributions within each marital status category, providing
valuable insights for targeted marketing strategies.
• Marital status categories like Divorced, Married, and Together show concentrated age groups,
suggesting potential marketing approaches tailored to these demographics.
• Categories with limited data points like Absurd, Alone, and YOLO may require further
investigation or different marketing strategies due to their lower customer counts.
• Understanding the age distribution within each category allows for more precise targeting of
customer segments, enhancing the effectiveness of marketing campaigns.


Describing the Data:

Educational Categories: Identify and list the different levels of education represented in the dataset.
Based on the information provided, it appears the educational levels are categorized as "2nd Cycle,"
"Basic," "Graduation," "Master," and "PhD."

Frequency Distribution: Provide the frequency count for each education category to understand the
distribution. From the information given, we have:

• 2nd Cycle: Approximately 200
• Basic: Around 100
• Graduation: More than 800
• Master: Roughly 400
• PhD: More than 400

Exploring the Data:

Visualization: Below is the bar plot of the Distribution of Education:


In summary, the analysis of education distribution within the dataset indicates a diverse representation
of educational backgrounds among customers. The majority have completed their undergraduate
education ("Graduation"), with a significant presence of highly educated individuals with "Master" and
"PhD" qualifications. These insights can inform targeted marketing strategies tailored to the
preferences and behaviors of educated consumers. Further exploration into correlations with other
variables could enhance the depth of analysis. Leveraging this knowledge effectively can drive customer
engagement and contribute to the success of marketing efforts for the retail group.


Based on the provided data, SP (which I take to stand for Spain) has the most web purchases.
Since Spain has the highest number of web purchases, it indicates that there is a potential market
opportunity in Spain for the company's online sales. Enhancing marketing strategies targeted specifically
at the Spanish market could yield significant benefits, such as increased sales and customer


To find out what “the average customer” looks like, I analyzed all numerical and non-numerical
variables using R.
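
One common way to build such a profile is to take the mean of each numerical variable and the most frequent value of each non-numerical variable. The sketch below illustrates this in Python rather than R; the records and column names are hypothetical, and the report does not state which summary statistics were actually used.

```python
from statistics import mean, mode

# Hypothetical customer records; column names are illustrative only.
customers = [
    {"birth_year": 1965, "marital_status": "Married", "web_visits": 4},
    {"birth_year": 1972, "marital_status": "Married", "web_visits": 6},
    {"birth_year": 1970, "marital_status": "Single", "web_visits": 5},
]

profile = {}
for column in customers[0]:
    values = [row[column] for row in customers]
    if all(isinstance(v, (int, float)) for v in values):
        profile[column] = mean(values)   # numerical: arithmetic mean
    else:
        profile[column] = mode(values)   # non-numerical: most frequent value

print(profile)
```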

Based on the data provided, we can compile the characteristics of the average customer profile:


- Age: Born around 1969, which makes the average customer about 55 years old in 2024.

- Marital Status: Most likely married, although a diverse range of marital statuses is represented.

Education:

- Most commonly holding an undergraduate degree ("Graduation"), with smaller proportions holding
Master, PhD, 2nd Cycle, or Basic qualifications.

Geographic Location:

- Based predominantly in Spain, with significant proportions from Saudi Arabia, Canada, Australia,
India, Germany, the United States, and smaller numbers from other regions.

Customer Behavior:

- Been a customer since around July 10, 2013, indicating a long-standing relationship with the company.

- Makes purchases both online and in-store, with a slightly higher frequency of in-store purchases.

- Engages with the company's website around 5 times on average.

- Last purchase occurred around 49 days ago.


The average customer based on the provided data is a middle-aged individual, most likely married.
They are well-educated, most commonly holding an undergraduate degree ("Graduation"), and come from
diverse geographic locations, with Spain being the primary market. They have been loyal customers for
several years, engaging with the company's online and offline channels. Customizing marketing strategies
to align with customer preferences and behaviors can significantly boost engagement and sales for the
company, particularly by focusing on online campaigns to optimize the digital shopping experience.
Although in-store shopping remains slightly more prevalent, the cost-effectiveness of online shopping
suggests a clear opportunity for the company to bolster its online marketing efforts.


The 4th Campaign stood out as the most successful, while both the 3rd and 5th campaigns achieved
equal levels of success.


• Mean production time, Line 1: 4.937433
• Mean production time, Line 2: 5.100732
• Mean production time, Line 3: 4.960518

The production line with the shortest mean production time is Line 1.

Why Production Line 1:

Based on the past production times per unit of each production line, Line 1 is recommended for
fulfilling the new customer order of 1000 units promptly. The decision rests on Line 1 having the
shortest mean production time per unit of the three lines (4.937433, versus 5.100732 for Line 2 and
4.960518 for Line 3). Choosing Line 1 therefore makes on-time delivery most likely, since production is
expected to be quicker than on Lines 2 and 3. Meeting the customer's needs promptly also supports
customer satisfaction and minimizes production delay.
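
The selection can be reproduced with a few lines of arithmetic. This Python sketch (the report used R) takes the three reported means and assumes, as a simplification, that the expected time for the whole order is the mean time per unit multiplied by the order size; the time unit is not stated in the report.

```python
# Mean production times per unit, as reported above (time unit not stated).
mean_times = {"Line 1": 4.937433, "Line 2": 5.100732, "Line 3": 4.960518}

order_size = 1000  # units in the new customer order

# Pick the line with the smallest mean time per unit.
best_line = min(mean_times, key=mean_times.get)

# Expected total time per line if the whole order runs on that line
# (assumption: total time scales linearly with the mean time per unit).
expected_totals = {line: t * order_size for line, t in mean_times.items()}

for line, total in expected_totals.items():
    print(f"{line}: about {total:.1f} time units for {order_size} units")
print("Shortest expected time:", best_line)
```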



• Metric selection: We use the coefficient of variation [2] (CV) to assess variability in production
times across the production lines.

• Calculation: The CV is the standard deviation of the production times divided by the mean
production time (computed in R).

• Selection: The production line with the lowest CV is chosen because its production times are the
most consistent relative to their mean, which makes its production time the most predictable.

• Explanation: Line 3 was chosen because it has the lowest CV, which means it has the least
relative variation, making it more reliable for meeting just-in-time production needs.

[2] The coefficient of variation (CV) expresses variability relative to the mean, so it allows data sets
to be compared regardless of their scale.
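
As an illustration of the calculation, the following Python sketch (not the R code used for the report) computes the CV for two hypothetical production lines and picks the one with the lowest value. It uses the sample standard deviation; whether the report used the sample or population version is not stated.

```python
from statistics import mean, stdev

def coefficient_of_variation(times):
    """Sample standard deviation divided by the mean."""
    return stdev(times) / mean(times)

# Hypothetical production times per unit; not the report's data.
production_times = {
    "Line A": [4.0, 5.0, 6.0, 5.0, 5.0],
    "Line B": [4.9, 5.0, 5.1, 5.0, 5.0],
}

cv_by_line = {
    line: coefficient_of_variation(times)
    for line, times in production_times.items()
}

# The line with the lowest CV has the most consistent production times.
most_consistent = min(cv_by_line, key=cv_by_line.get)
print(cv_by_line, most_consistent)
```

Both hypothetical lines have nearly the same mean, so the mean alone would not separate them, while the CV does.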

• Shape and representation: Boxplots represent the distribution of data using quartiles, median,
and outliers, while violin plots provide a more detailed representation by showing the entire
probability density of the data.
• Width of representation: Boxplots typically have fixed widths for each category, whereas violin
plots adjust their width based on the density of the data. This means violin plots can provide
more information about the data distribution, especially when there are multiple modes or
complex shapes.
• Outliers: Boxplots explicitly show outliers as individual points beyond the whiskers, while violin
plots incorporate outliers into the overall density estimation.

Advantages and Disadvantages:

Boxplots:

Advantages:

Simple and easy to interpret.

Clearly show the median, quartiles, and outliers.

Good for comparing the spread and skewness of different distributions.

Disadvantages:

Less detailed compared to violin plots.

May obscure underlying data distribution, especially with small sample sizes or multimodal distributions.

Violin Plots:

Advantages:

Provide a more detailed view of the data distribution, including multimodality and skewness.

Show the entire distribution density.

Can be more informative, especially with larger datasets.

Disadvantages:

Can be more complex and harder to interpret than boxplots.

May be less familiar to some audiences.

Require more computational resources for plotting compared to boxplots.

In summary, boxplots are simpler and more familiar, making them suitable for quick comparisons and
detecting outliers. Violin plots offer a more detailed view of the data distribution, especially useful when
exploring complex or multimodal distributions. The choice between the two depends on the specific
needs of the analysis and the audience's familiarity with the visualization type.
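
The difference can be made concrete with a small numerical example. The Python sketch below uses a hypothetical bimodal sample: it computes the quartiles a boxplot would draw and a simple Gaussian kernel density estimate of the kind a violin plot is based on. The bandwidth of 0.3 is an arbitrary choice for illustration.

```python
import math
from statistics import median, quantiles

# Hypothetical bimodal sample: two tight clusters around 1 and 5.
sample = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]

# What a boxplot draws: the quartiles, with the median as the middle one.
q1, q2, q3 = quantiles(sample, n=4)

def kde(x, data, bandwidth=0.3):
    """Gaussian kernel density estimate at x (the curve a violin plot draws)."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

# The median lies in the empty gap between the two clusters.
density_at_median = kde(median(sample), sample)
density_at_mode = kde(1.0, sample)

print(q1, q2, q3)
print(density_at_median, density_at_mode)
```

The box spans the gap between the two clusters and places the median inside it, although almost no data lies there. The density estimate is close to zero at the median and high at each cluster, which is the information a violin plot adds.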


To investigate the possible connection between crime and quality of life, I will perform the following
steps:

3.1 Data Loading
3.2 Data Preprocessing
3.3 Data Visualization
3.4 Interpretation and Conclusion

3.1 Dataset Cities loaded in R

3.2 Since the task is to investigate the connection between crime and quality of life, we will
focus on these two attributes. We will first explore the relationship between them visually.

From the scatter plot, we can observe the following:

There seems to be a slight negative correlation between crime rating and quality of life,
indicating that cities with higher quality of life tend to have lower crime ratings.
However, there are exceptions, as some cities with high quality of life still have relatively high
crime ratings.
This preliminary analysis suggests that there might be a connection between crime and quality
of life, but further analysis and statistical tests are required to confirm and quantify this
relationship.

Correlation Analysis: The correlation coefficient will indicate the strength and direction of the
linear relationship between crime rating and quality of life. A positive correlation suggests that
higher quality of life is associated with higher crime ratings, while a negative correlation
suggests the opposite.

The correlation computed in R was -0.427. This result supports the idea of a moderate negative
relationship between crime rating and quality of life: higher quality of life tends to be associated
with lower crime ratings.
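
For reference, the Pearson correlation coefficient can be computed as in the following Python sketch. The five data points are hypothetical and are not taken from the Cities dataset, so the resulting value differs from the -0.427 reported above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings for five cities; not the Cities dataset.
crime_rating = [1, 2, 3, 4, 5]
quality_of_life = [9, 7, 8, 4, 5]

r = pearson(crime_rating, quality_of_life)
print(round(r, 3))
```

A value between -1 and 0 indicates that the two ratings tend to move in opposite directions, with values closer to -1 meaning a stronger linear relationship.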

To identify possible quality groups in the dataset, we can use various techniques such as
clustering, binning, or statistical methods. In this case, since we're dealing with quality of life
data, we can use binning to categorize the quality of life values into distinct groups.

Let's use binning to categorize the quality of life values into quality groups. We'll divide the
range of quality of life values into a few bins to represent different levels of quality. We'll then
label these bins as quality groups.

This is the result using R:

Low Moderate High Very High

216 0 0 0

The result shows that all 216 cities in the dataset fall into the "Low" quality group under the
defined binning scheme. No city falls into the "Moderate," "High," or "Very High" quality groups
based on the specified bin breaks.

Interpreting this result:

Relative to the specified bin breaks, every quality of life value lies in the lowest bin. This may
mean that quality of life is low across the cities included, possibly indicating significant
challenges or issues affecting them. It may equally mean that the values cover a narrow range and
that the bin breaks are too wide for this data, in which case breaks derived from the observed
values would be needed to separate the cities into groups.
Further investigation is necessary to establish whether the quality of life values are genuinely low
and how they might relate to other variables such as crime rates.
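
One way to avoid a result in which every city lands in the same bin is to derive the breaks from the data itself, for example from its quartiles. The Python sketch below illustrates this with hypothetical scores; it is an alternative binning scheme, not the one used to produce the table above.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

# Hypothetical quality-of-life scores; not the Cities dataset.
scores = [12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45]

labels = ["Low", "Moderate", "High", "Very High"]

# Breaks taken from the data itself: the three quartile cut points.
breaks = quantiles(scores, n=4)

def quality_group(score):
    """Map a score to its quartile-based group label."""
    return labels[bisect_right(breaks, score)]

group_counts = Counter(quality_group(s) for s in scores)
print(breaks, dict(group_counts))
```

With quartile-based breaks the groups are roughly equal in size by construction, so the labels describe a city's position relative to the other cities rather than an absolute level of quality.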


In R, these steps were undertaken to examine the potential correlation between the increased
volatility of Tesla stock prices and its recent rise in value:

• Import the data of voestalpine and Tesla stock prices from 2015 to 2020.
• Normalize the price data for both stocks to account for differences in their absolute price levels.
• Calculate the fluctuations (e.g., standard deviation) of both normalized stock prices.
• Visualize the fluctuations using an appropriate diagram type, such as a line plot or a bar plot.
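
The normalization and fluctuation steps can be sketched as follows in Python (the report used R). The price series are hypothetical, and dividing each price by the mean of its series is an assumption: the report does not state which normalization was applied.

```python
from statistics import mean, pstdev

def normalized_std(prices):
    """Standard deviation of prices after dividing each price by the series mean."""
    avg = mean(prices)
    return pstdev([p / avg for p in prices])

# Hypothetical closing prices; not the real voestalpine or Tesla series.
prices = {
    "Stock A": [30.0, 33.0, 27.0, 30.0],
    "Stock B": [200.0, 205.0, 195.0, 200.0],
}

fluctuation = {name: normalized_std(series) for name, series in prices.items()}
print(fluctuation)
```

In this example Stock B moves by larger absolute amounts, but Stock A fluctuates more relative to its own price level, which is why normalization matters when comparing stocks with very different prices.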

The calculated standard deviation for voestalpine is higher than for Tesla, which suggests that
voestalpine experienced higher price fluctuations relative to its mean than Tesla during the analyzed
period.

This contradicts the initial assumption that Tesla had the higher price fluctuations. Therefore,
based on the provided data and analysis, it can be concluded that the fluctuation of the Tesla price
was not necessarily an indication of the enormous increase in value in recent weeks.

To determine the most suitable sampling technique for each dataset, we need to consider the
distribution of classes within each dataset. The choice of sampling technique depends on various factors
such as class balance, size of dataset, and the specific goals of the analysis. Here are the
recommendations for each dataset:

Dataset A:

Class Distribution:

Class1: 30%; Class2: 34%; Class3: 1%; Class4: 35%

Sampling Technique Recommendation:

This dataset has a relatively balanced distribution across classes, except for Class 3, which is
significantly underrepresented.

For this dataset, stratified sampling is the most suitable technique. Compared with simple random
sampling, which could by chance draw very few or no Class 3 records, stratified sampling guarantees
that every class, including the underrepresented Class 3, appears in the sample in proportion to its
share of the dataset. This leads to a more reliable and comprehensive analysis.
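
A minimal sketch of proportional stratified sampling in Python is shown below. The 1000-row dataset is hypothetical and only mirrors the class shares of Dataset A; the function name, the 10% sampling fraction, and the rule of keeping at least one item per class are illustrative choices.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Return sampled indices, drawing the same fraction from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    sampled = []
    for indices in by_class.values():
        # Keep at least one item so that rare classes are never dropped.
        k = max(1, round(len(indices) * fraction))
        sampled.extend(rng.sample(indices, k))
    return sampled

# Hypothetical 1000-row dataset with Dataset A's class shares (30/34/1/35 %).
labels = ["Class1"] * 300 + ["Class2"] * 340 + ["Class3"] * 10 + ["Class4"] * 350

sample_indices = stratified_sample(labels, fraction=0.1)
sample_counts = Counter(labels[i] for i in sample_indices)
print(dict(sample_counts))
```

Because each class is sampled separately, the sample keeps the 30/34/1/35 proportions exactly, and Class 3 is guaranteed to appear.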

Dataset B:

Class Distribution:

Class1: 26%; Class2: 23%; Class3: 23%; Class4: 28%

Sampling Technique Recommendation:

With such a balanced class distribution, simple random sampling would already yield a roughly
representative sample. Stratified sampling is still recommended, because it guarantees that each
class is represented proportionally in the sample, which prevents sampling bias and improves the
accuracy of conclusions about the whole population.

Dataset C:

Class Distribution:

Class1: 37%; Class2: 19%; Class3: 37%; Class4: 7%

Sampling Technique Recommendation:

This dataset presents a more imbalanced distribution across classes, with Class 4 being notably
underrepresented.

Given the imbalance, oversampling or undersampling techniques may be required to address the class
imbalance. Techniques like SMOTE (Synthetic Minority Over-sampling Technique) for
oversampling or random undersampling could be applied to balance the classes before sampling.

Alternatively, if the imbalance isn't a concern for the analysis, simple random sampling might suffice.
However, this could lead to insufficient representation of the minority class (Class4).
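
To illustrate the idea of rebalancing, the Python sketch below applies plain random oversampling, which duplicates existing minority rows. This is a simpler stand-in and not SMOTE, which generates new synthetic points instead. The 100-row dataset is hypothetical and only mirrors the class shares of Dataset C.

```python
import random
from collections import Counter

def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical 100-row dataset with Dataset C's class shares (37/19/37/7 %).
rows = (
    [("Class1", i) for i in range(37)]
    + [("Class2", i) for i in range(19)]
    + [("Class3", i) for i in range(37)]
    + [("Class4", i) for i in range(7)]
)

balanced_rows = random_oversample(rows, label_of=lambda row: row[0])
balanced_counts = Counter(label for label, _ in balanced_rows)
print(dict(balanced_counts))
```

Duplicating rows adds no new information about Class 4, so results obtained from the rebalanced data should be interpreted with that limitation in mind.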

Problematic Datasets:

Dataset C has an issue with its imbalanced class distribution, especially with Class 4 being notably
underrepresented. It's important to make sure that Class 4 is well-represented in the sample. Neglecting
this could cause biased results or models trained on the sampled data to perform poorly. In short, while
all datasets could use stratified sampling to some extent, Dataset C needs extra methods to tackle its
imbalanced class distribution, like oversampling or undersampling.

