Data Acquisition Final Report


Conestoga College

Kitchener, Ontario
Canada

A Report On

Microsoft Stock Price Prediction


Under subject of
Data Acquisition, Analysis and Visualization

Submitted by:
Group 6
Sr.No. Name

1. Urvika Patel

2. Yash Thakker

3. Revathy

4. Adagboyi

TABLE OF CONTENTS

1 Introduction
   1.1 Problem statement
   1.2 Summary
   1.3 Data set
2 Descriptive Statistics
   2.1 Descriptive Statistics Analysis
   2.2 Excel Output
   2.3 Insights
3 Data Visualization
4 Statistical Inference
5 Regression Analysis
6 Trend and Seasonality Check
7 Output Excel spreadsheet
8 Recommendation
9 Conclusion
RACI Chart
Appendix

1. Introduction
1.1 Problem Statement:

Owning stocks in various companies can enhance your investment portfolio's value,
enabling you to sustain savings, safeguard your funds against inflation and taxes, and
optimize income from your investments.

Investors are particularly interested in Microsoft stock. However, they are aware that
the stock market does not go up every year and that returns often fall below
expectations. Some drops can feel quite brutal, and this level of volatility is not for
everyone.

Therefore, they want to understand the relationship between the closing price of
Microsoft stock and the passage of time.

We would like to carry out at least four data analyses to help them identify stocks
with strong growth potential.

1.2 Summary

This report presents a comprehensive analysis of historical Microsoft stock prices and
uses it to predict future prices. The analysis covers descriptive statistics, data
visualization, linear regression, time series analysis, predictive data mining, and
evaluation methods. Each section provides insights into different aspects of the data.

1.3 Dataset
[Display of the first 5 rows of the dataset]

2. Descriptive Statistics
2.1 Descriptive Statistics Analysis
The central tendency, variability, and distribution of each column in the dataset are
numerically summarized by the descriptive statistics, which, in turn, give information
about the stock's price and trading activity throughout the given time period.

Output from Python:

              Open         High          Low        Close        Volume  Range(Closing)
count  1511.000000  1511.000000  1511.000000  1511.000000  1.511000e+03
mean    107.385976   108.437472   106.294533   107.422091  3.019863e+07
std      56.691333    57.382276    55.977155    56.702299  1.425266e+07
min      40.340000    40.740000    39.720000    40.290000  1.016120e+05
25%      57.860000    58.060000    57.420000    57.855000  2.136213e+07
50%      93.990000    95.100000    92.920000    93.860000  2.662962e+07
75%     139.440000   140.325000   137.825000   138.965000  3.431962e+07
max     245.030000   246.130000   242.920000   244.990000  1.352271e+08

Skewness: 0.8265555200891408

Kurtosis: -0.47511021911585827
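
As a minimal sketch of how a summary like the one above can be produced, the snippet below computes the same kind of statistics with pandas on a small made-up price series. The values are illustrative and are not the Microsoft data. `Series.skew()` and `Series.kurt()` return sample skewness and excess kurtosis, so a normal distribution would score close to 0 on both.

```python
import pandas as pd

# Illustrative closing prices (made-up values, not the report's dataset)
close = pd.Series(
    [40.3, 41.0, 42.5, 44.1, 47.8, 55.2, 71.9, 101.0, 130.4, 193.0],
    name="Close",
)

summary = close.describe()               # count, mean, std, min, quartiles, max
price_range = close.max() - close.min()  # range of the closing price
skewness = close.skew()                  # > 0 means a longer right tail
excess_kurtosis = close.kurt()           # excess kurtosis: about 0 for a normal distribution

print(summary)
print(f"Range: {price_range:.2f}")
print(f"Skewness: {skewness:.4f}")
print(f"Kurtosis: {excess_kurtosis:.4f}")
```

Because most of the sample values are low and a few are much higher, the skewness comes out positive, which mirrors the right-skewed shape reported for 'Close'.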

2.2 Excel Output

2.3 Insights
Close, Open, High, and Low:

The mean values of 'Open', 'High', 'Low', and 'Close' indicate the average prices over
the period. The average closing price ('Close'), for instance, is roughly $107.42.

Standard Deviation:
The standard deviations of these columns show how spread out the corresponding prices
are. The standard deviation of 'Close' is roughly $56.70.

Minimum and Maximum:
The minimum and maximum values indicate the range of observed prices. The lowest
'Close' is $40.29, and the highest is $244.99.

Volume level:

Mean Volume:
The mean volume, roughly 30,198,630, shows the typical daily trading activity.

Volume Standard Deviation:
The standard deviation of the trading volume, roughly 14,252,660, shows how much it
varies.

Volume Range:
The minimum and maximum volumes, roughly 101,612 and 135,227,100, indicate the range
of trading activity.

Daily Return:

Mean Daily Return:


The average percentage change in the 'Close' price from the previous day is shown
by the mean daily return, which is roughly 0.13%.

Standard Deviation of Daily Returns:


The standard deviation of daily returns, which is roughly 1.74%, shows how volatile
they are.

Daily Return Range:


The range of daily returns is indicated by the minimum and maximum values.
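
The daily return figures above can be sketched as follows. This assumes the return is the percentage change in 'Close' from the previous trading day, which is how the text describes it; the prices are made-up values chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Illustrative closing prices (made-up): +2%, -2%, +2% day over day
close = pd.Series([100.0, 102.0, 99.96, 101.9592])

# Percentage change from the previous day; the first day has no predecessor
daily_return = close.pct_change().dropna() * 100

mean_return = daily_return.mean()   # average daily return in percent
volatility = daily_return.std()     # standard deviation of daily returns
return_range = (daily_return.min(), daily_return.max())

print(f"Mean daily return: {mean_return:.2f}%")
print(f"Std of daily returns: {volatility:.2f}%")
print(f"Range: {return_range[0]:.2f}% to {return_range[1]:.2f}%")
```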

Skewness (0.8266):

A positive skewness means that the 'Close' price distribution has a longer right tail
and is skewed to the right. A right-skewed distribution may result from periods of
comparatively higher positive returns or price increases, as suggested by the positive
skewness.

Kurtosis (-0.4751):
Compared with a normal distribution, a negative (excess) kurtosis means that the
distribution of 'Close' prices is less peaked and has thinner tails, so extreme
prices, both high and low, occur less frequently.

3 Data Visualization - Closing Prices Over Time
The 'Close' price is shown as a line plot over time. The date (2015-2021) is on the
x-axis and the closing price is on the y-axis. This visualization illustrates the
historical fluctuations in the stock's value. There is an upward, roughly linear trend
in the graph.

4 Statistical Inference

Null hypothesis (H0): The average 'Close' price is equal to the hypothesized mean. In
this analysis the hypothesized value is the sample mean of 'Close' itself.

T-statistic: 0.0
The T-statistic measures the difference between the average 'Close' price and the
hypothesized mean, in units of standard error. Because the sample mean is compared
with itself, the T-statistic is exactly 0.0.

P-value: 1.0
Assuming that the null hypothesis is true, the p-value is the probability of obtaining
a T-statistic at least as extreme as the one observed in the data. The p-value in this
instance is 1.0, so the null hypothesis is not rejected.

Interpretation
With a T-statistic of 0.0 and a p-value of 1.0, there is no evidence of a difference
between the average 'Close' price and the hypothesized mean. This outcome follows by
construction from testing the sample against its own mean, so it says little about the
stock on its own; a more informative test would compare the average 'Close' price with
an external reference value.
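
To illustrate why the T-statistic comes out as exactly 0.0, here is a small sketch of the one-sample t formula written out by hand. The prices and the reference value of 100.0 are hypothetical and are used only for illustration; the report's own code uses `scipy.stats.ttest_1samp`.

```python
import math
import statistics


def one_sample_t(sample, hypothesized_mean):
    """t = (sample mean - hypothesized mean) / (sample std / sqrt(n))."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (mean - hypothesized_mean) / (std / math.sqrt(n))


prices = [40.3, 57.9, 93.9, 139.0, 245.0]  # illustrative values only

# Testing the sample against its own mean: the numerator is zero by construction
t_trivial = one_sample_t(prices, statistics.fmean(prices))

# Testing against an external reference value (hypothetical) gives a non-trivial result
t_benchmark = one_sample_t(prices, 100.0)

print(f"t against own mean: {t_trivial}")
print(f"t against reference 100.0: {t_benchmark:.4f}")
```

The first call always returns 0.0 regardless of the data, while the second depends on how far the sample mean lies from the chosen reference.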

5 Regression Model
Scatter Chart

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.936173661
R Square             0.876421124
Adjusted R Square    0.87633923
Standard Error       19.93960906
Observations         1511

ANOVA
             df     SS            MS            F            Significance F
Regression   1      4254917.219   4254917.219   10701.8248   0
Residual     1509   599960.3064   397.5880095
Total        1510   4854877.526

            Coefficients   Standard Error   t Stat         P-value   Lower 95%      Upper 95%      Lower 95.0%    Upper 95.0%
Intercept   -60655.42397   587.3667455      -103.2666974   0         -61807.56575   -59503.28218   -61807.56575   -59503.28218
Year        30.11425122    0.291100634      103.4496244    0         29.54324647    30.68525598    29.54324647    30.68525598

Interpretation

Multiple R: This is the correlation coefficient between the observed and predicted
values. In this case, it is approximately 0.936, indicating a strong positive
correlation.

R Square: This is the coefficient of determination, representing the proportion of the
variance in the dependent variable (Y) that is predictable from the independent
variable (X). In this case, it is approximately 0.876 (88%), which is quite high and
suggests a good fit.

Coefficients:

Intercept: The y-intercept of the regression line.

Year: The coefficient for the independent variable (Year), representing the change in
the dependent variable (Close price of the stock) for a one-unit change in the
independent variable.

For the "Year" variable:

P-value: The probability of observing a t-statistic at least as extreme as the one
obtained if the true coefficient were zero. Here it is reported as 0 (t Stat = 103.45),
so the coefficient is significantly different from zero.

95% Confidence Interval: The range within which we are 95% confident the true
coefficient lies, here 29.54 to 30.69.

In summary, the regression suggests that there is a significant relationship between
the "Year" variable and the closing price.
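
The figures in the summary output are linked by standard simple-regression identities, and the sketch below checks them using only numbers copied from the output above. The variable names are illustrative; the formulas (R Square as the square of Multiple R, the adjusted R Square correction, t as coefficient over standard error, F as the ratio of mean squares) are the usual textbook relations rather than anything specific to this report.

```python
# Values copied from the SUMMARY OUTPUT
multiple_r = 0.936173661
r_square = 0.876421124
n_obs = 1511
df_residual = 1509
coef_year = 30.11425122
se_year = 0.291100634
ss_regression = 4254917.219
ss_residual = 599960.3064

# R Square is the square of Multiple R
r_square_check = multiple_r ** 2

# Adjusted R Square penalizes for the number of predictors (one here)
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_residual

# t Stat is the coefficient divided by its standard error
t_stat = coef_year / se_year

# F is the regression mean square over the residual mean square
f_stat = (ss_regression / 1) / (ss_residual / df_residual)

print(f"R Square check:    {r_square_check:.6f}")
print(f"Adjusted R Square: {adj_r_square:.6f}")
print(f"t Stat (Year):     {t_stat:.4f}")
print(f"F:                 {f_stat:.2f}  (t squared: {t_stat ** 2:.2f})")
```

With a single predictor, F equals the square of the slope's t statistic, which is why both point to the same conclusion about significance.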

The fitted line uses the formula for a simple linear regression:

Y = β0 + β1 × X

Here:

Y is the dependent variable (Closing Price),
β0 is the intercept,
β1 is the coefficient for the independent variable (Year),
X is the independent variable (Year).

Using the coefficients from the output:

Intercept (β0): -60655.42397
Coefficient for Year (β1): 30.11425122

The linear regression equation for the model is:

Closing Price = -60655.42397 + 30.11425122 × Year

For example, for Year = 2022:

Closing Price = -60655.42397 + 30.11425122 × 2022
              = 235.59

This equation represents the line that best fits the relationship between the "Year"
variable and the closing price of the stock based on the regression analysis.
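
As a small sketch, the fitted equation can be wrapped in a function and evaluated for any year. The coefficients are the ones from the regression output; the function name `predict_close` is only illustrative.

```python
# Coefficients taken from the regression output
INTERCEPT = -60655.42397
YEAR_COEF = 30.11425122


def predict_close(year):
    """Predicted closing price from the fitted line: intercept + coefficient * year."""
    return INTERCEPT + YEAR_COEF * year


print(round(predict_close(2022), 2))  # 235.59
```

Each additional year raises the predicted closing price by the Year coefficient, about 30.11.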

6 Trend and Seasonality Check

A linear trend and seasonality are observed in the graphs above. However, let's look
at the model evaluation scores.

Model Evaluation - Trend and Seasonality


Trend Mean: 102.22020405461629

Seasonal Mean: 0.0038081077900802957

Residual Mean: -0.16714357363315036

Trend Std Dev: 44.9355159390051

Seasonal Std Dev: 2.557350090754289

Residual Std Dev: 5.64252375634727

Insights

Mean Trend and Seasonality: The central tendency of Microsoft's stock prices can be
seen in the computed mean trend of 102.22. This offers a basic picture of the overall
price level and acts as a starting point for further analysis. The seasonal effect is
minimal (seasonal mean of about 0.0038), so the seasonality factor is ignored.
Residual Analysis: The computed mean residual of -0.1671 shows that, on average,
actual data points fall slightly below the fitted trend plus seasonal component.
Although small, this gap may reflect factors that the trend and seasonal components do
not capture.
Seasonal Variation: The standard deviations show the variability in trend (44.94),
seasonality (2.56), and residuals (5.64). The seasonal standard deviation of 2.56
indicates that the seasonal component does fluctuate, though far less than the trend.
Together with the residual swings, this adds complexity to the observed patterns.
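
To make the trend, seasonal, and residual figures more concrete, the sketch below implements a classical additive decomposition with a centered moving average in NumPy and applies it to a synthetic series. It is a simplified stand-in written for illustration, not the statsmodels routine used in the appendix, and the series, the period of 12, and the function name are all assumptions.

```python
import numpy as np


def additive_decompose(values, period):
    """Classical additive decomposition: values = trend + seasonal + residual."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    # Centered moving average; an even period needs half weights at both ends
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # undefined at the edges of the series
    trend[half:n - half] = np.convolve(x, weights, mode="valid")
    detrended = x - trend
    # Average the detrended values at each position within the cycle
    cycle = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    cycle -= cycle.mean()  # center so the seasonal component averages to zero
    seasonal = np.tile(cycle, n // period + 1)[:n]
    residual = x - trend - seasonal
    return trend, seasonal, residual


# Synthetic series: linear trend plus a 12-step seasonal wave (illustrative only)
t = np.arange(48)
series = 50 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12)
trend, seasonal, residual = additive_decompose(series, period=12)

print(f"Trend Mean: {np.nanmean(trend):.4f}")
print(f"Seasonal Mean: {seasonal.mean():.4f}")
print(f"Residual Mean: {np.nanmean(residual):.4f}")
print(f"Trend Std Dev: {np.nanstd(trend, ddof=1):.4f}")
print(f"Seasonal Std Dev: {seasonal.std(ddof=1):.4f}")
print(f"Residual Std Dev: {np.nanstd(residual, ddof=1):.4f}")
```

On this noise-free series the residual is essentially zero, so any residual mean or spread seen on real prices reflects movement that the trend and seasonal components leave unexplained.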

Time Series- Regression Analysis

The average closing stock prices of various periods are calculated for doing Time Series
Analysis

[Chart: Average Closing Price For Various Periods. x-axis: Period; y-axis: Avg Closing
Price. Data labels in ascending order: 47.72, 55.25, 71.98, 101.03, 130.38, 193.026]

The graph above shows that the average closing price follows a curvilinear trend
across periods.

Time Series - Regression Analysis

A regression analysis was carried out for periods 1-6, and the regression output was
used to calculate the average closing price for period 7. The calculated value was
then compared with the actual average closing price for period 7.

Summary of Regression Analysis:

R Square (Coefficient of Determination): 0.91 - This means that the linear regression
model accounts for roughly 91% of the variation in the average closing prices across
periods.

ANOVA F-value: 40.69 - The regression model as a whole appears to be significant based
on the low p-value (Significance F = 0.0031).

Coefficients:
Period: 28.03, Intercept: 1.80

Year 7 Calculation:
Closing Price = Intercept + (Period coefficient × prediction period)

The Period coefficient is used to determine the closing price for Year 7.

With the rounded coefficients above, this gives 1.80 + (7 × 28.03) = 198.01; the value
reported from the regression output is 197.41.

Comparison with the Actual Closing Price of Year 7 (232.02):

The computed closing price (197.41) falls well below the actual value (232.02). The
regression model's linear structure is the cause of this discrepancy: the regression
assumes a constant linear relationship between period and closing price, whereas the
period averages follow a curvilinear trend.
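
The period regression can be reproduced approximately with a hand-written least-squares fit. The sketch assumes that the six data labels on the chart (47.72 to 193.026) belong to periods 1 to 6 in ascending order; since these are rounded chart labels, the forecast need not match the spreadsheet figure exactly.

```python
# Average closing price per period, read from the chart labels
# (assignment to periods 1-6 in ascending order is an assumption)
periods = [1, 2, 3, 4, 5, 6]
avg_close = [47.72, 55.25, 71.98, 101.03, 130.38, 193.026]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(periods, avg_close)
forecast_7 = intercept + slope * 7  # extrapolate the line to period 7

print(f"Period coefficient: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"Forecast for period 7: {forecast_7:.2f}")
```

The fitted slope and intercept round to the 28.03 and 1.80 reported above, and the straight-line forecast stays well under the actual period 7 average of 232.02.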

7 Recommendation
• Use more complex models such as SARIMAX, which takes into consideration both
seasonality and exogenous factors. Exogenous factors often play a key role in
stock prices. FbProphet is also a good candidate model for analyzing this dataset.
• Use rolling windows to compute adaptive moving averages, for example with the
TensorFlow or scikit-learn packages.

The above two methods have not been adopted as they go beyond the scope of the
project. However, they might produce better results.
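
As a rough illustration of the rolling-window idea, the sketch below computes a simple moving average with pandas. It uses plain pandas rather than TensorFlow or scikit-learn, it is a fixed-window average rather than an adaptive one, and the prices and the window length of 3 are made-up values.

```python
import pandas as pd

# Illustrative closing prices (made-up values)
close = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0])

window = 3  # hypothetical window length, in trading days
rolling_mean = close.rolling(window=window).mean()

# The first (window - 1) entries are NaN because the window is not yet full
print(rolling_mean)
```

A longer window smooths more but reacts more slowly to recent price changes, which is the trade-off an adaptive scheme would try to manage.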

8 Conclusion
Through the examination of subtleties in the data and the application of
sophisticated modeling methods, this study can offer investors more useful
information. Making wise investing decisions will require constant improvement and
flexibility in terms of changing market conditions.

RACI Chart

Activity                 Urvikaben           Yash                  Revathy       Adagboyi
                         Jayeshbhai Patel    Pankajkumar Thakker   Prabhakaran   Ugbabe
Descriptive statistic    R                   A                     C             I
Data Visualization       I                   A                     R             C
Statistical Inference    A, C                I                     R             A
Model evaluation         A                   R                     C             I
Predictive data mining   C                   R                     A             I
Linear Regression        A                   I                     C             R

Responsibility (R): The person who is responsible for executing the activity.

Accountability (A): The person who is ultimately accountable for the success of
the activity.

Consulted (C): People who provide input and feedback but are not directly
responsible.

Informed (I): People who need to be kept informed about the progress of the
activity.

Appendix

1. Venkitesh, V. V. (2022). Microsoft Stock Time Series Analysis. Kaggle.
https://www.kaggle.com/datasets/vijayvvenkitesh/microsoft-stock-time-series-analysis

2. Python Codes

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from docx import Document
from docx.shared import Inches
import io
import base64

# Step 1: Load the data
file_path = "C:\\Users\\revak\\OneDrive\\Desktop\\Acqproject\\Microsoft_Stock.csv"
df = pd.read_csv(file_path, parse_dates=['Date'])

# Display the head of the dataset


print("Head of the Dataset:")
print(df.head())

# Step 2: Descriptive Statistics


doc = Document()
doc.add_heading('Descriptive Statistics', level=1)
doc.add_paragraph(str(df.describe()))

# Step 3: Data Visualization
# Line plot of 'Close' prices over time
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Close'], marker='o', linestyle='-')
plt.title('Closing Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.grid(True)

# Save the plot to an in-memory PNG buffer
img_buffer = io.BytesIO()
plt.savefig(img_buffer, format='png')
img_buffer.seek(0)
plt.close()

# Insert the image into the Word document
doc.add_paragraph('Closing Prices Over Time:')
doc.add_picture(img_buffer, width=Inches(6))

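# Step 4: Statistical Inference
# One-sample t-test of 'Close' against its own sample mean.
# Note: this yields t = 0 and p = 1 by construction.
doc.add_heading('Statistical Inference', level=1)
t_stat, p_value = stats.ttest_1samp(df['Close'], df['Close'].mean())
doc.add_paragraph(f"T-statistic: {t_stat}, p-value: {p_value}")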

# Step 5: Time Series Analysis - Trend and Seasonality
# Assuming daily data with a yearly seasonality (period of 252 observations)
result = sm.tsa.seasonal_decompose(df['Close'], model='additive', period=252)
# result.plot() creates its own figure, so no separate plt.figure() call is needed
fig = result.plot()
fig.suptitle('Time Series Decomposition - Trend and Seasonality', y=1.02)
plt.show()

# Step 6: Model Evaluation - Trend and Seasonality


trend = result.trend.dropna()
seasonal = result.seasonal.dropna()
residual = result.resid.dropna()

doc.add_heading('Model Evaluation - Trend and Seasonality', level=1)


doc.add_paragraph(f"Trend Mean: {trend.mean()}")
doc.add_paragraph(f"Seasonal Mean: {seasonal.mean()}")
doc.add_paragraph(f"Residual Mean: {residual.mean()}")
doc.add_paragraph(f"Trend Std Dev: {trend.std()}")
doc.add_paragraph(f"Seasonal Std Dev: {seasonal.std()}")
doc.add_paragraph(f"Residual Std Dev: {residual.std()}")
