Report
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
R STUDIO PROJECT
Semester: 212 - Group 03 Class CC02
Contents
1 Introduction 2
2 Theoretical basis 3
2.1 T-test 3
2.1.1 One sample T-test 3
2.1.2 Two samples T-test - Paired samples 3
2.1.3 Two samples T-test - Independent samples 4
2.2 One-way ANOVA 5
2.3 Two-way ANOVA 6
2.4 Prediction model - Multiple Linear Regression 7
3 Activity 1 8
3.1 Import data 8
3.2 Data Cleaning 8
3.3 Data Visualization 9
3.3.1 Descriptive statistics for each variable 9
3.3.2 Graphs: Boxplot 10
3.3.3 Conclusion 11
3.4 T-test: pre.weight & weight6weeks 12
3.5 One-way ANOVA: What is the best diet for weightLoss? 13
3.6 Two-way ANOVA: Do Diet and gender affect weightLoss? 14
4 Activity 2 17
4.1 Import data 18
4.2 Data Cleaning 18
4.3 Data Visualization 19
4.3.1 Descriptive statistics for all attributes 19
4.3.2 Graph: Published CPU Performance By Vendors 19
4.3.3 Graph: Preparation for One-way ANOVA 20
4.4 One-way ANOVA 22
4.5 Prediction model 24
4.5.1 Data transformation 24
4.5.2 Choosing an appropriate model 24
4.5.3 Building the multiple linear regression model 25
4.5.3.a Determining variables 25
4.5.3.b Preparing data sets for training and testing 25
4.5.3.c Building the model 26
4.5.3.d Outcome evaluation 27
5 Conclusion 29
Assignment for Probability and Statistics - Academic Year 2021 - 2022 Page 1/29
University of Technology, Ho Chi Minh City
Faculty of Computer Science and Engineering
1 Introduction
Probability And Statistics are useful topics in the science, with applications in a vari-
ety of industries including engineering, medicine, biology, economics, physics, and more.
This report focuses on using probability and statistics models to analyze and investigate
how choosing the proper diet might effect weight loss after 6 weeks, as well as developing
prediction models to compare estimated and announced CPU hardware performance. In
this project of Probability and Statistics, we use R-studio as the sole software to analyze
the data provided and OverLeaf to prepare the report.
In Activity 2, the chosen data set is about Computer Hardware, a topic related to
Computer Science, our major of study. In this section, we analyze and visualize the
data, and make predictions of CPU performance based on the available features by applying
basic knowledge of data visualization, ANOVA and prediction models.
The prediction model we use in Activity 2 is the Linear Regression model, a simple
and long-standing method (about 200 years old) applied to research in many scientific
fields. Linear regression assumes a linear relationship between the input variables and a
single output variable. Details about this model are given in the Theoretical basis section.
A critical method employed throughout this project is ANOVA, a very useful and
powerful statistical technique for determining how one or more factors impact a response
variable. ANOVA is applied in a wide variety of real-life situations, most commonly in
retail, medicine and the environmental sciences. The report is organized as follows:
• Introduction
• Theoretical basis
• Activity 1
• Activity 2
• Conclusion
• References
We would like to express our thanks to Dr. Phan Thi Huong and the Faculty of Applied
Science for their continued support throughout this project.
Team members,
2022
2 Theoretical basis
2.1 T-test
• Null hypothesis (H0): the initial claim, which is assumed to be true until the evidence suggests otherwise.
• The t-test and the z-test are two common parametric tests used in testing the
null hypothesis. We use the z-test (based on the Normal distribution) for a normal
distribution with known σ or a large sample size (n ≥ 30); we use the t-test (based on
the t-distribution) for a small sample size (n < 30) that follows a normal distribution with
unknown σ, which we replace by the sample standard deviation s.
• Before diving into the most common t-test applications, we should note that the
p-value is the smallest significance level α at which the null hypothesis can be rejected.
In this project, if p-value < 0.05, we reject the null hypothesis H0, and fail
to reject it otherwise. We can also reject H0 if the test
statistic falls into the rejection region.
2.1.2 Two samples T-test - Paired samples
In a paired-samples t-test, we work with the differences between paired observations, which are
assumed to follow N(µ_D, σ²). After that, we carry out the same steps as in the one-sample t-test.
For example: A car manufacturer may want to evaluate the average parking time of two
new car models with different radii. They may invite 25 volunteers to join and record their
parking time with each type of car. Then, the car manufacturer may use a paired-samples
t-test to study which model is better (faster parking time).
• Test statistic: $t = \dfrac{\bar{D} - \mu_D}{s_D/\sqrt{n}}$, where $\bar{D}$ is the mean of the difference sample, $\mu_D$ is
the value that we wish to test, $n$ is the sample size and $s_D$ is the sample standard
deviation of the difference sample.
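As a small numeric illustration of this statistic (sketched in Python with made-up before/after values, since the report's own computations are done in R), the paired t statistic can be computed directly from the differences:

```python
import math
import statistics

# Made-up before/after weights (NOT the report's Diet data), used only to
# illustrate the paired statistic t = (D_bar - mu_D) / (s_D / sqrt(n)).
before = [72.0, 80.5, 65.2, 90.1, 77.7, 68.3]
after = [70.1, 78.0, 64.8, 86.9, 75.2, 67.0]

d = [b - a for b, a in zip(before, after)]  # difference sample D_i

d_bar = statistics.mean(d)   # mean of the differences
s_d = statistics.stdev(d)    # sample standard deviation of the differences
n = len(d)
mu_d = 0.0                   # H0: the mean difference is zero

t = (d_bar - mu_d) / (s_d / math.sqrt(n))
print(round(t, 3))
```

The resulting t would then be compared with the t-distribution with n − 1 degrees of freedom.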
2.1.3 Two samples T-test - Independent samples
Case 1: Equal variances assumed ($s_1/s_2 \in [0.5; 2]$)
• Test statistic: $t = \dfrac{(\bar{x}-\bar{y})-\Delta}{s_p\sqrt{\frac{1}{m}+\frac{1}{n}}}$ with the pooled variance $s_p^2 = \dfrac{(m-1)s_1^2+(n-1)s_2^2}{m+n-2}$
and $m+n-2$ degrees of freedom.
In the formula, $s_1$ and $s_2$ are the sample standard deviations; $m$ and $n$ are the sample
sizes; $\bar{x}$ and $\bar{y}$ are the sample means.
Case 2: Unequal variances ($s_1/s_2 \notin [0.5; 2]$)
• Test statistic: $t = \dfrac{(\bar{x}-\bar{y})-\Delta}{se}$ with $se = \sqrt{\dfrac{s_1^2}{m}+\dfrac{s_2^2}{n}}$
In the formula, $s_1$ and $s_2$ are the sample standard deviations; $m$ and $n$ are the sample
sizes; $\bar{x}$ and $\bar{y}$ are the sample means. The degrees of freedom are
$v = \dfrac{\left(\frac{s_1^2}{m}+\frac{s_2^2}{n}\right)^2}{\frac{(s_1^2/m)^2}{m-1}+\frac{(s_2^2/n)^2}{n-1}}$ (round down if $v \notin \mathbb{Z}^+$)
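To make the Case 2 formulas concrete, here is a hedged numeric sketch (Python, made-up samples with clearly unequal spread) computing the test statistic and the rounded-down degrees of freedom:

```python
import math

# Made-up independent samples (illustrative only, not the report's data)
x = [5.1, 4.8, 5.5, 5.0, 4.9, 5.2]       # m = 6 observations
y = [4.0, 6.5, 3.2, 7.1, 5.8, 2.9, 6.2]  # n = 7 observations

def mean(v):
    return sum(v) / len(v)

def svar(v):  # sample variance with n - 1 in the denominator
    mu = mean(v)
    return sum((e - mu) ** 2 for e in v) / (len(v) - 1)

m, n = len(x), len(y)
s1sq, s2sq = svar(x), svar(y)
delta = 0.0  # H0: the difference in means is zero

se = math.sqrt(s1sq / m + s2sq / n)
t = (mean(x) - mean(y) - delta) / se  # Case 2 test statistic

# Welch-Satterthwaite degrees of freedom, rounded down when not an integer
v = (s1sq / m + s2sq / n) ** 2 / (
    (s1sq / m) ** 2 / (m - 1) + (s2sq / n) ** 2 / (n - 1)
)
v = math.floor(v)
```

The statistic t is then compared with the t-distribution with v degrees of freedom.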
2.2 One-way ANOVA
One-way ANOVA is a hypothesis test used for testing the equality of three or more
population means simultaneously, using variances.
For example: In one laboratory, a team studied whether changes in CO2 concentration
affected the germination rate of soybean seeds by gradually increasing the CO2 concen-
tration and recording the height of the bean sprouts after 1 day.
• Statistical problem: Comparing the height means between groups of CO2 con-
centration.
Assumptions for using one-way ANOVA:
• The populations are normally distributed. To test normality, we use the normal
probability plot of the residuals (mentioned in Prediction model).
• The samples are random and independent.
• The populations have equal variances.
Source | Sum of squares (SS) | Degrees of freedom (df) | Mean square (MS)
Treatment | $SS_{treatment} = n\sum_{i=1}^{a}(\bar{y}_{i.}-\bar{y}_{..})^2$ | $a-1$ | $MS_{treatment} = SS_{treatment}/(a-1)$
Error | $SS_E = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij}-\bar{y}_{i.})^2$ | $a(n-1)$ | $MS_{error} = SS_E/[a(n-1)]$
Total | $SS_T = SS_{treatment}+SS_{error}$ | $an-1$ |
The test statistic is $F_0 = MS_{treatment}/MS_{error}$, compared against the $F_{a-1,\,a(n-1)}$ distribution.
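The table above can be sketched numerically (Python, made-up balanced data with a = 3 treatments and n = 4 observations each, purely for illustration):

```python
# Tiny balanced design: a = 3 treatments, n = 4 observations each (made-up)
groups = [
    [3.1, 2.8, 3.5, 3.0],  # treatment 1
    [4.2, 4.5, 3.9, 4.4],  # treatment 2
    [2.0, 2.3, 1.8, 2.1],  # treatment 3
]
a, n = len(groups), len(groups[0])

grand_mean = sum(sum(g) for g in groups) / (a * n)
group_means = [sum(g) / n for g in groups]

# Between-treatment and within-treatment (error) sums of squares
ss_treatment = n * sum((m - grand_mean) ** 2 for m in group_means)
ss_error = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
ss_total = ss_treatment + ss_error

ms_treatment = ss_treatment / (a - 1)
ms_error = ss_error / (a * (n - 1))
f0 = ms_treatment / ms_error  # compare against the F(a-1, a(n-1)) quantile
```

A large f0 (relative to the F quantile at the chosen α) leads to rejecting the hypothesis of equal means, which is exactly what aov() reports in the Pr(>F) column later in this report.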
2.3 Two-way ANOVA
Two-way ANOVA is a statistical technique used for examining the effect of two
factors on a continuous dependent variable. It also studies the interaction between
the two independent variables and how it influences the dependent one.
For example: In an arithmetic test, several male and female students of different ages
participated, and the exam results were recorded. In this case, two-way ANOVA could be used to
determine whether gender and age affected the scores.
• Statistical problem: Comparing the score means according to gender and age.
The assumptions for using two-way ANOVA are similar to those of one-way ANOVA (section 2.2).
The data table for two-way ANOVA can be generalized as follows:
Factor 1 \ Factor 2 | 1 | 2 | ... | K
1 | X11 | X21 | ... | XK1
2 | X12 | X22 | ... | XK2
... | ... | ... | ... | ...
H | X1H | X2H | ... | XKH
 | Factor 1 | Factor 2
H0 | No difference in means of group i | No difference in means of group j
H1 | At least 1 difference in means of group i | At least 1 difference in means of group j
Given α | Reject H0 if $f_1 > f_{k-1,(k-1)(h-1),\alpha}$ | Reject H0 if $f_2 > f_{h-1,(k-1)(h-1),\alpha}$
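Assuming one observation per cell, as in the table above, the two F statistics can be sketched as follows (Python, made-up numbers; H and K are the numbers of levels of the two factors):

```python
# One observation per cell (made-up data): rows = H levels of factor 1,
# columns = K levels of factor 2.
data = [
    [6.0, 7.5, 9.1],
    [5.2, 6.8, 8.4],
    [6.9, 8.1, 9.8],
    [5.5, 7.0, 8.6],
]
H, K = len(data), len(data[0])

grand = sum(sum(row) for row in data) / (H * K)
row_means = [sum(row) / K for row in data]
col_means = [sum(data[i][j] for i in range(H)) / H for j in range(K)]

ss_factor1 = K * sum((m - grand) ** 2 for m in row_means)
ss_factor2 = H * sum((m - grand) ** 2 for m in col_means)
ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_error = ss_total - ss_factor1 - ss_factor2

df_error = (H - 1) * (K - 1)
f1 = (ss_factor1 / (H - 1)) / (ss_error / df_error)  # tests factor 1
f2 = (ss_factor2 / (K - 1)) / (ss_error / df_error)  # tests factor 2
```

Each F value is compared against the appropriate F quantile from the table above; with replicated cells (as in Activity 1's data) an interaction term would be estimated as well.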
2.4 Prediction model - Multiple Linear Regression
Regression analysis is the collection of statistical tools that are used to model and explore
relationships between variables that are related in a non-deterministic manner.
Multiple linear regression is a critical technique that is deployed to study the linearity and
dependency between a group of independent variables and a dependent one. The general
formula for multiple linear regression can be expressed as:
Y = β0 + β1 x1 + ... + βk xk + ϵ
• β0, β1, ..., βk are regression coefficients. Each parameter represents the change in the
mean response, E(Y), per unit increase in the associated predictor variable when all
the other predictors are held constant.
• ϵ is called the random error and follows N(0, σ²).
The assumptions of the multiple linear regression model:
• A linear relationship between the dependent and independent variables (can be
tested using a scatter diagram).
Notice that, in some cases, the independent variables are not in compatible formats
or do not have a linear relationship; we can then use data transformation to make them
fit better.
• The independent variables are not highly correlated with each other.
• The variance of the residuals is constant.
• Independence of observations.
• Multivariate normality (the residuals are normally distributed).
Predicted Values and Residuals:
• A predicted value is calculated as $\hat{y}_i = b_0 + b_1 x_1 + ... + b_k x_k$, where the b values
come from statistical software and the x-values are specified by us.
• A residual (error) term is calculated as $e_i = y_i - \hat{y}_i$, the difference between an actual
and a predicted value of y.
Analysis of Variance for Testing Significance of Regression in Multiple Regression:
Source | Sum of squares | df | Mean square | $F_0$
Regression | $SS_R = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$ | $k$ | $MS_R = SS_R/k$ | $MS_R/MS_E$
Residual | $SS_E = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ | $n-p$ | $MS_E = SS_E/(n-p)$ |
with the hypotheses for $F_0$:
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_1: \beta_i \neq 0$ for at least one $i$
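The whole pipeline (fit, predicted values, residuals, and the F test above) can be sketched end-to-end in a few lines (Python, made-up data with k = 2 predictors; the report itself does this in R with lm()):

```python
# Fit y = b0 + b1*x1 + b2*x2 by least squares on made-up data, then compute
# the ANOVA-for-regression quantities SSR, SSE and F0 = MSR/MSE.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]
n, k = len(y), 2
p = k + 1  # number of coefficients including the intercept

# Normal equations X'X b = X'y with design matrix X = [1, x1, x2]
X = [[1.0, u, w] for u, w in zip(x1, x2)]
XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
       for i in range(p)]
Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(p)]

# Solve the small system by Gauss-Jordan elimination with partial pivoting
M = [row[:] + [rhs] for row, rhs in zip(XtX, Xty)]
for c in range(p):
    piv = max(range(c, p), key=lambda r: abs(M[r][c]))
    M[c], M[piv] = M[piv], M[c]
    for r in range(p):
        if r != c:
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
b = [M[i][p] / M[i][i] for i in range(p)]

y_bar = sum(y) / n
y_hat = [b[0] + b[1] * u + b[2] * w for u, w in zip(x1, x2)]
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression SS
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual SS
f0 = (ssr / k) / (sse / (n - p))                       # large => reject H0
```

Note that with an intercept included, SSR + SSE equals the total sum of squares, which is the identity the ANOVA table relies on.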
3 Activity 1
3.1 Import data
First of all, we use the read.csv() function to import our data - the "Diet.csv" file - into our
R working environment, setting header = TRUE to read the header of the csv file and sep
= "," since the file is comma-separated. After that, we print nrow(dataset) and str(dataset)
to determine whether we have successfully loaded the data.
# Import the Diet.csv file into R studio
dataset <- read.csv("Diet.csv", header = TRUE, sep = ",")
nrow(dataset)
str(dataset)
And here is the result. It shows that our dataset has 78 rows, and that its attributes are:
Person, gender, Age, Height (cm), Diet type, pre.weight (kg) and weight6weeks (kg).
> nrow(dataset)
[1] 78
> str(dataset)
'data.frame': 78 obs. of 7 variables:
$ Person : int 25 26 1 2 3 4 5 6 7 8 ...
$ gender : int NA NA 0 0 0 0 0 0 0 0 ...
$ Age : int 41 32 22 46 55 33 50 50 37 28 ...
$ Height : int 171 174 159 192 170 171 170 201 174 176 ...
$ pre.weight : int 60 103 58 60 64 64 65 66 67 69 ...
$ Diet : int 2 2 1 1 1 1 1 1 1 1 ...
$ weight6weeks: num 60 103 54.2 54 63.3 61.1 62.2 64 65 60.5 ...
1. Removing NA rows: After removing the rows containing NA values, we print
nrow(dataset) again and see that the result is now 76. This means that there were
2 rows containing NA values in our dataset, and we have omitted them.
2. Removing duplicated data rows: We count the unique persons in the Person
column and compare the count to nrow(dataset). If they are equal, our dataset is already
unique; otherwise, we remove the duplicates.
if (nrow(dataset) == length(unique(dataset$Person))) {
  print("No person is recorded more than 1 time.")
} else {
  dataset <- dataset[!duplicated(dataset), ]
}
Assignment for Probability and Statistics - Academic Year 2021 - 2022 Page 8/29
University of Technology, Ho Chi Minh City
Faculty of Computer Science and Engineering
3. Adding a new column: We take the difference of the pre.weight and weight6weeks
columns and use the cbind() function to append it to our dataset, naming it weightLoss.
dataset <- cbind(dataset, weightLoss = dataset$pre.weight - dataset$weight6weeks)
4. Factorizing data labels: To make further analysis easier, we change the labels
of the "gender" and "Diet" columns with the help of the factor() function.
dataset$gender <- factor(dataset$gender, levels = c(0, 1), labels = c("F", "M"))
dataset$Diet <- factor(dataset$Diet, levels = c(1, 2, 3), labels = c("Diet 1", "Diet 2", "Diet 3"))
Now, let us print the first 5 rows to see the changes we have made.
> dataset[1:5,]
Person gender Age Height pre.weight Diet weight6weeks weightLoss
3 1 F 22 159 58 Diet 1 54.2 3.8
4 2 F 46 192 60 Diet 1 54.0 6.0
5 3 F 55 170 64 Diet 1 63.3 0.7
6 4 F 33 171 64 Diet 1 61.1 2.9
7 5 F 50 170 65 Diet 1 62.2 2.8
Next, we partition our dataset into smaller datasets according to their Diet.
diet1 <- dataset[dataset$Diet == "Diet 1", ]
diet2 <- dataset[dataset$Diet == "Diet 2", ]
diet3 <- dataset[dataset$Diet == "Diet 3", ]
We are also interested in analysing each particular Diet, so we may draw the boxplot for
each Diet too. We present the code for Diet 1; the code for Diet 2 and Diet 3 is similar.
boxplot(weightLoss ~ gender, data = diet1, horizontal = TRUE,
        main = "Boxplot weight loss of Diet 1 after 6 weeks",
        xlab = "Weight loss after 6 weeks (kg)", ylab = "Gender",
        las = 1, col = c("pink", "skyblue"))
Apart from drawing boxplots, a normal Q-Q plot also gives us information about the
normality of our dataset. Our group presents the code for drawing the normal Q-Q plot of
the whole dataset; doing the same for each type of diet is similar.
qqnorm(dataset$weightLoss, main = "Normal Q-Q plot of dataset")
qqline(dataset$weightLoss)
3.3.3 Conclusion
• Our dataset and the diet1, diet2, and diet3 "sub-datasets" all follow a Normal
distribution.
• The largest weightLoss belongs to Diet 3 and the smallest weightLoss belongs to Diet
2 (which is negative).
• In our sample, no one suffers an adverse effect when following Diet 3,
while there are some adverse effects in Diet 1 and Diet 2.
• The range of Diet 2 is the largest and that of Diet 1 (excluding outliers) is the smallest.
• The median of Diet 3 is the largest and the median of Diet 1 is the smallest.
• In the sample, there is a higher proportion of females than males, and the
number joining each diet is roughly equal and less than 30.
3.4 T-test: pre.weight & weight6weeks
We can see that shapiro.test() returns a p-value of 0.7903 (> 0.05), so we can conclude
that our sample follows a Normal distribution. Alternatively, nrow() of our dataset
is 76, which is also large enough for the sample to be assumed approximately normal.
Next, we apply the t.test() function to test:
H0 : µpre.weight − µweight6weeks = 0
H1 : µpre.weight − µweight6weeks ̸= 0
Paired t-test
data: dataset$pre.weight and dataset$weight6weeks
t = 13.728, df = 75, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: 3.373452 4.518653
sample estimates: mean of the differences 3.946053
=> Since the p-value of the t-test is below 2.2e-16 (< 0.05), we have strong evi-
dence to reject the null hypothesis and confirm that the diets are effective overall.
3.5 One-way ANOVA: What is the best diet for weightLoss?
Previously, we used the t-test to establish the effectiveness of the three diets; now, we try
to determine which diet gives the best weightLoss by applying one-way ANOVA.
one_way_anova <- aov(weightLoss ~ Diet, data = dataset)
summary(one_way_anova)   # ANOVA table for the fitted model
TukeyHSD(one_way_anova)  # Compare pairwise
plot(TukeyHSD(one_way_anova, conf.level = .95), las = 1)
> summary(one_way_anova)
Df Sum Sq Mean Sq F value Pr(>F)
Diet 2 60.5 30.264 5.383 0.0066 **
Residuals 73 410.4 5.622
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TukeyHSD(one_way_anova)
Tukey multiple comparisons of means
95% family-wise confidence level
It is evident that the p-values of the Diet3-Diet1 and Diet3-Diet2 comparisons are much smaller
than 0.05 and their mean differences are positive. Also, Pr(>F) is smaller than 0.05. => We can
conclude that Diet 3 gives the largest weightLoss, i.e. Diet 3 is the most effective
one.
Finally, we must test whether our dataset satisfies the assumptions of normality and equal
variances. We use shapiro.test() and bartlett.test().
shapiro.test(x = residuals(object = one_way_anova))
bartlett.test(weightLoss ~ Diet, data = dataset)
Since all p-values are larger than 0.05, we do not have evidence of non-normality
or of unequal variances. Therefore, applying one-way ANOVA is
sensible.
3.6 Two-way ANOVA: Do Diet and gender affect weightLoss?
• We can draw a boxplot to illustrate the weightLoss for each gender
and each type of diet.
• We can also draw the interaction plot to determine whether the gender and Diet
factors interact. We observe that the lines are not parallel, which is a strong
indication of an interaction between the gender and Diet factors.
> table(dataset$gender,dataset$Diet)
Diet 1 Diet 2 Diet 3
F 14 14 15
M 10 11 12
> summary(two_way_anova)
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 0.3 0.278 0.052 0.82062
Diet 2 60.4 30.209 5.619 0.00546 **
gender:Diet 2 33.9 16.952 3.153 0.04884 *
Residuals 70 376.3 5.376
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TukeyHSD(two_way_anova)
Tukey multiple comparisons of means 95% family-wise confidence level
Fit: aov(formula = weightLoss ~ gender * Diet, data = dataset)
M:Diet 3-M:Diet 1 0.5833333 -2.3256625 3.4923292 0.9915569
M:Diet 2-F:Diet 2 1.5019481 -1.2354126 4.2393087 0.5963201
F:Diet 3-F:Diet 2 3.2728571 0.7481458 5.7975685 0.0040103
M:Diet 3-F:Diet 2 1.6261905 -1.0465354 4.2989163 0.4833188
F:Diet 3-M:Diet 2 1.7709091 -0.9260048 4.4678230 0.3965102
M:Diet 3-M:Diet 2 0.1242424 -2.7117126 2.9601974 0.9999949
M:Diet 3-F:Diet 3 -1.6466667 -4.2779524 0.9846191 0.4513580
From the given results, we can observe that Pr(>F) of Diet, Pr(>F) of gender:Diet, the p-adj
of F:Diet 3-F:Diet 2 and the p-adj of F:Diet 3-F:Diet 1 are less than 0.05. Therefore, we
conclude that:
• The gender factor alone does not affect the weightLoss. However, Diet 3 gives
followers the most weightLoss.
• For females, Diet 3 is more effective than Diet 1 and Diet 2, while all three diets
are equally effective for males.
Finally, we also test normality and the equality of variances in the same way as for
one-way ANOVA. We can use leveneTest() (from the car package) to test the equality of
variances instead of bartlett.test().
shapiro.test(x = residuals(object = two_way_anova))
leveneTest(weightLoss ~ gender * Diet, data = dataset)
4 Activity 2
Computer technology has come a long way. From giant, bulky machines with messy wires
all over the place to tiny smartphones that slip straight into our pockets, it was the
work of numerous scientists and engineers that continuously simplified our interaction
with these mysterious machines. Nowadays, even kids can pick up a phone and find their
favorite media content - music, cartoons, games - that may entertain them for days.
The accessibility work of computer scientists and engineers has proved extremely
successful: today, almost no one knows how a computer operates, yet anyone can use any
form of computer with ease. The complex underlying machinery gets abstracted away, leaving a
simple interface on top for the users to interact with. But as Computer Science students,
we have our own need to explore computers exhaustively. Dealing with computers at the
atomic level allows us to make improvements that can cascade through multiple aspects
of the users' experience. Observing how computer power has changed throughout the
years greatly fascinates us as well, as it gives us a chance to ponder the difference
between the past and future of computer evolution.
With the same mindset, we chose this Computer Hardware dataset to study the progress
of computer development. It is also an opportunity for us to appreciate the very much
unappreciated technological feat of our predecessors.
Women of the WRNS operate the Colossus, the world’s first electronic programmable computer
at Bletchley Park in Buckinghamshire. (Photo by SSPL/Getty Images)
The dataset contains several technical features of CPUs that circulated in the market
from 1981 to 1984. The data is represented with 10 columns - including MYCT for
machine cycle time, MMIN and MMAX for main memory, CACH for cache, CHMIN
and CHMAX for channels, PRP for published relative performance and ERP for estimated
relative performance - for the features, and 209 rows for the CPUs from 30 vendors.
4.1 Import data
First, we use the read.csv() function to import the "machine_data.csv" file and set header
= FALSE because the data does not have an existing header line.
dataset <- read.csv("machine_data.csv", header = FALSE)
dataset[1:5, ]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 adviser 32/60 125 256 6000 256 16 128 198 199
2 amdahl 470v/7 29 8000 32000 32 8 32 269 253
3 amdahl 470v/7a 29 8000 32000 32 8 32 220 253
4 amdahl 470v/7b 29 8000 32000 32 8 32 172 253
5 amdahl 470v/7c 29 8000 16000 32 8 16 132 132
Next, to enhance the data's readability, we change the column names based on the
information from the accompanying "machine_name.csv" file.
column_names <- c("NAME", "MODEL", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP")
names(dataset) <- column_names
dataset[1:5, ]
NAME MODEL MYCT MMIN MMAX CACH CHMIN CHMAX PRP ERP
1 adviser 32/60 125 256 6000 256 16 128 198 199
2 amdahl 470v/7 29 8000 32000 32 8 32 269 253
3 amdahl 470v/7a 29 8000 32000 32 8 32 220 253
4 amdahl 470v/7b 29 8000 32000 32 8 32 172 253
5 amdahl 470v/7c 29 8000 16000 32 8 16 132 132
The number of unique models and the number of rows in our data are equal, indi-
cating that our data contains no duplications.
4.3 Data Visualization
4.3.1 Descriptive statistics for all attributes
We will utilize the handy summary() function once again for our data in Activity 2. For
each column, we also call sd() to obtain its standard deviation. The outputs were
manually formatted for legibility.
> summary(dataset)
NAME MODEL MYCT MMIN
Length:209 Length:209 Min. : 17.0 Min. : 64
Class :character Class :character 1st Qu.: 50.0 1st Qu.: 768
Mode :character Mode :character Median : 110.0 Median : 2000
Mean : 203.8 Mean : 2868
3rd Qu.: 225.0 3rd Qu.: 4000
Max. :1500.0 Max. :32000
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Conclusion: Each line represents the range of CPU performances of a vendor's models,
and the dot depicts the median of that range. It is clear that vendors with more powerful
CPUs tend to be more varied in terms of performance. Given the median points, we
can conclude that the upper halves of the CPU performance ranges of big vendors are significantly
stronger than the lower halves. Moving to the right side of the graph, smaller vendors
usually possess less diverse CPUs with similar performance.
Unfortunately, the initial distribution of PRP is clearly not normal. Performing ANOVA
on this data would not yield a legitimate result, so data transformation is required. Given
the multiplicative nature of CPU progression throughout history, we can apply a basic
transformation by taking the logarithm of the PRPs, in the hope that the result will shift closer
to a normal distribution. We achieve this by appending a new column containing the logarithm
of the PRPs, called LOGPRP.
dataset <- cbind(dataset, LOGPRP = log(dataset$PRP))
plot(density(dataset$LOGPRP), main = "Distribution of CPU performances", xlab = "")
The transformed data better resembles a normal distribution. Further testing confirms the
normality of the distribution of our new attribute, so we can proceed to perform ANOVA
on our dataset.
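The effect of this log transformation on a right-skewed sample can be sketched numerically (Python, with synthetic lognormal data standing in for the real PRPs; the seed and parameters are arbitrary):

```python
import math
import random

# Synthetic right-skewed "performance" values standing in for the raw PRPs
random.seed(1)  # arbitrary seed so the sketch is reproducible
prp = [math.exp(random.gauss(4.0, 1.0)) for _ in range(500)]

def skewness(v):
    mu = sum(v) / len(v)
    s = math.sqrt(sum((x - mu) ** 2 for x in v) / len(v))
    return sum(((x - mu) / s) ** 3 for x in v) / len(v)

log_prp = [math.log(x) for x in prp]
# Taking logs pulls the long right tail in, so the skewness drops sharply
```

On multiplicative data like this, the raw sample is strongly right-skewed while the log-transformed sample is close to symmetric, which is exactly why LOGPRP is a better input for ANOVA.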
4.4 One-way ANOVA
With the dataset containing CPUs on the market from 1981 to 1984, we, as Computer
Science students, are curious to know whether any CPU vendor dominated at the time.
Since there are 30 vendors, and some vendors only have a few models, we will
perform ANOVA selectively, using only the vendors that account for the majority of the data's CPUs.
We start by taking a look at the number of CPUs each vendor had.
sort(table(dataset$NAME), decreasing = TRUE)
The first row contains the vendors that had more than 10 CPUs, namely "ibm", "nas",
"honeywell", "ncr", "sperry" and "siemens". The CPU performance of these vendors should be
reasonable input for the ANOVA test. We then extract the models of the aforementioned
vendors from the dataset.
chosen_vendors <- names(sort(table(dataset$NAME), decreasing = TRUE)[1:6])
anova_dataset <- dataset[dataset$NAME %in% chosen_vendors, ]
Now, let us carry out the one-way ANOVA method to determine whether or not the
vendor name has a significant effect on the PRP and, if yes, which vendor among the
chosen ones has the best performance.
one_way_anova <- aov(LOGPRP ~ NAME, data = anova_dataset)
summary(one_way_anova)
TukeyHSD(one_way_anova)
plot(TukeyHSD(one_way_anova, conf.level = .95), las = 1, cex.axis = 0.4)
Before we move on to the results of the ANOVA test, we need to make sure that no
assumption of the test is violated. Once again, we rely on shapiro.test() and
bartlett.test().
shapiro.test(x = residuals(object = one_way_anova))
bartlett.test(LOGPRP ~ NAME, data = anova_dataset)
All p-values are larger than our threshold of 0.05, so we can safely proceed with our test.
...
Df Sum Sq Mean Sq F value Pr(>F)
NAME 5 20.44 4.087 3.199 0.0103 *
Residuals 96 122.65 1.278
...
$NAME diff lwr upr p adj
ibm-honeywell -0.02240313 -1.10348356 1.05867731 0.9999999
nas-honeywell 1.08469044 -0.09841902 2.26779991 0.0917570
ncr-honeywell -0.04541042 -1.33467407 1.24385323 0.9999984
siemens-honeywell 0.46050567 -0.85534353 1.77635487 0.9109468
sperry-honeywell 0.74097437 -0.54828928 2.03023802 0.5539065
nas-ibm 1.10709357 0.15510504 2.05908209 0.0129661
ncr-ibm -0.02300729 -1.10408773 1.05807315 0.9999999
siemens-ibm 0.48290879 -0.62974267 1.59556025 0.8046088
sperry-ibm 0.76337750 -0.31770294 1.84445793 0.3205724
ncr-nas -1.13010086 -2.31321032 0.05300861 0.0698523
siemens-nas -0.62418477 -1.83621050 0.58784096 0.6665900
sperry-nas -0.34371607 -1.52682554 0.83939339 0.9582035
siemens-ncr 0.50591608 -0.80993312 1.82176529 0.8727632
sperry-ncr 0.78638479 -0.50287886 2.07564844 0.4874012
sperry-siemens 0.28046870 -1.03538050 1.59631790 0.9893234
Due to the limitations of our dataset, the results from the ANOVA are not especially con-
vincing. We can see that "nas" models were better than "ibm" models in general, along with a few
other similar observations, but apart from that, no remarkable result is yielded.
Building prediction models, on the other hand, is what the dataset was originally used
for. Using a range of features like machine cycle time, memory, cache and channels, we can
attempt to predict the power of the CPUs and compare our predictions with the actual values.
4.5 Prediction model
4.5.1 Data transformation:
Our model's ultimate aim is to predict the value of Published Relative Performance based
on other technical factors like MYCT, MMAX, MMIN, etc. However, the given data
set contains a lot of information, with 6 different components that can be used for
making predictions. Therefore, we decided to reduce the number of independent variables
to simplify our model, using the concept of data transformation. From the original data, we
declare 3 new variables based on the old ones:
• Frequency (F): The lower the cycle time is, the higher the CPU’s frequency (and performance) will be. Due to this inversely proportional relationship, frequency is calculated as the inverse of MYCT (unit: cycles/nanosecond).
• Average memory (M_average): the mean of the minimum and maximum main memory, (MMIN + MMAX)/2.
• Average channels (CH_average): the mean of the minimum and maximum number of channels, (CHMIN + CHMAX)/2.
Then, we add these new variables to the data set for later use and remove the unused columns.
dataset$CH_average <- (dataset$CHMIN + dataset$CHMAX) / 2
dataset$F <- 1 / dataset$MYCT
dataset$M_average <- (dataset$MMIN + dataset$MMAX) / 2
dataset <- dataset[-c(1:5, 7:8, 10)]
str(dataset)
After execution, ggpairs() returns a combination of various plots and numeric data.
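The call that produced this matrix is not reproduced in the text; assuming the GGally package (which supplies ggpairs()), it would look like the following sketch.

```r
library(GGally)   # assumed dependency: GGally provides ggpairs()
# Pairwise scatterplots, density plots and correlation scores for the
# columns kept after the transformation (PRP, F, M_average, CH_average,
# CACH); 'dataset' is the data frame produced by the code above.
ggpairs(dataset)
```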
It is obvious that the pair M_average and PRP suggests a linear relationship (its plot follows a straight line) and has a high correlation score of 0.887. Other pairs such as PRP:CH_average, PRP:F and PRP:CACH also show the shape of a straight line to some extent, and their correlation scores are relatively high (0.657, 0.622 and 0.663 respectively). In other words, they have a decent impact on the PRP value. From this observation, we build a multiple linear regression model to estimate the value of PRP based on M_average, CH_average, F and CACH.
[1] 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0
[39] 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0
[77] 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0
[115] 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 1 0 0 0 1 0 0 0 0 1 1
[153] 1 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 1 1 1
[191] 0 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0
The next 2 lines classify every row of the data set into a training set and a test set. Each row is assigned either the value 0 or the value 1, in order (the 1st row gets the 1st value of the sequence, the 2nd row the 2nd value, and so on).
The sample() function generates the sequence of 0s and 1s randomly, so we will likely get new training and test data every time we run it. This variety of splits helps indicate whether the model's accuracy is maintained in different situations.
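The splitting code itself is not shown above; a sketch consistent with this description follows. The 80/20 proportions are an assumption, as the report does not state the exact probabilities passed to sample(); 'dataset' is the frame from Section 4.5.1.

```r
set.seed(123)  # optional: fixes the split for reproducibility
# One label per row: 0 -> training set, 1 -> test set (proportions assumed)
split <- sample(c(0, 1), size = nrow(dataset), replace = TRUE,
                prob = c(0.8, 0.2))
data.train <- dataset[split == 0, ]
data.test  <- dataset[split == 1, ]
```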
Call:
lm(formula = PRP ~ F + M_average + CACH + CH_average, data = data.train)
Residuals:
Min 1Q Median 3Q Max
-205.887 -36.494 9.193 33.534 309.843
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.816536 8.789465 -5.895 2.60e-08 ***
F 424.632670 748.268699 0.567 0.57128
M_average 0.014919 0.001184 12.596 < 2e-16 ***
CACH 0.562222 0.167711 3.352 0.00103 **
CH_average 2.859298 0.507979 5.629 9.37e-08 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Based on the returned data, we can establish the equation for our model: y = −51.816536 +
424.632670x1 + 0.014919x2 + 0.562222x3 + 2.859298x4. Note that the coefficient of F (x1) is not statistically significant (p = 0.571), while the other three predictors are. Furthermore, we can assess the model through the adjusted R2 value; for our model it equals 0.8552, meaning the model explains about 85.5% of the variance in PRP on the training data.
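The fitting and prediction steps behind this output are a standard use of lm() and predict(); a sketch follows, with data.train and data.test coming from the split above.

```r
# Fit the multiple linear regression from the Call: line above
model <- lm(PRP ~ F + M_average + CACH + CH_average, data = data.train)
summary(model)$adj.r.squared    # adjusted R-squared (0.8552 in our run)
# Store predictions on the held-out rows for the error analysis
data.test$Predicted <- predict(model, newdata = data.test)
```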
Then, we will calculate the absolute error between the real value of PRP and the predicted one
to see if the difference is small enough for the model to give precise predictions.
plot(abs(data.test$Predicted - data.test$PRP), pch = 19,
     xlab = "i-th testcase", ylab = "Error",
     main = "Absolute error in predictions of model")
The graph illustrates a high prediction accuracy: most absolute errors are below 50. However, a few cases exceed 50, peaking at an absolute error above 150.
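The visual impression can also be backed with summary statistics of the errors, using the same vectors as the plot above (a small sketch):

```r
# Numeric summary of the absolute prediction errors plotted above
errors <- abs(data.test$Predicted - data.test$PRP)
mean(errors)    # mean absolute error across the test cases
max(errors)     # worst-case error (the peak above 150 in our run)
```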
Besides the absolute error, we can plot the predicted values against the actual ones to assess the model's fit.
ggplot(data.test, aes(x = PRP, y = Predicted)) +
    geom_point() + stat_smooth(method = "lm", col = "red")
The graph demonstrates a reasonably good fit to the data: most points lie close to the red regression line, indicating good accuracy.
For all the above reasons, we conclude that a multiple linear regression model is a good choice for building a prediction model on our initial data, with relatively low error and precise predictions in most cases.
5 Conclusion
Through the two activities of this assignment, we have learnt and improved our skills in using RStudio's tools, building on basic knowledge of Probability and Statistics, to visualize, analyze and make predictions on a given data set. Having dealt with the two data sets, we close the report with the following conclusions:
• In activity 1:
– All three diets have certain effects on weightLoss.
– Diet 3 is the most effective of the three diets.
– Gender does not have a significant impact on weightLoss.
– Diet 3 was notably more effective for females than the other diets, while for males all three diets have a similar effect.
• In activity 2:
– Vendors with higher CPU performance tend to be more varied in terms of performance, while smaller vendors have less diverse CPUs with similar performance levels.
– The performance of nas is generally better than that of ibm.
– After building and evaluating the model, our model equation for the relationship between the dependent variable PRP and the independent variables F, M_average, CACH and CH_average is:
y = −51.816536 + 424.632670x1 + 0.014919x2 + 0.562222x3 + 2.859298x4