IBM Data Science Capstone
• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix
Executive Summary
Summary of methodologies
- Data collection
- Data wrangling
- EDA with data visualization
- EDA with SQL
- Building an interactive map with Folium
- Building a Dashboard with Plotly Dash
- Predictive analysis (Classification)
Introduction
Methodology
• This API will give us data about launches, including information about the rocket used, payload
• Another popular source of Falcon 9 launch data is web scraping Wikipedia using BeautifulSoup.
SpaceX API: Use SpaceX REST API -> API returns SpaceX data in .JSON -> Normalize data into flat data file such as .csv
Web Scraping: Get HTML response from Wikipedia -> Extract data using BeautifulSoup -> Normalize data into flat data file such as .csv
SpaceX API
1. Getting response from API
3. Apply custom functions to clean data
4. Assign list to dictionary then dataframe
Simplified flow chart: Use SpaceX REST API -> API returns SpaceX data in .JSON -> Clean data -> Filter DF for Falcon 9 only
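The API flow (JSON response, normalize to flat data, keep Falcon 9 only, save as .csv) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's code: the sample JSON, its field names, and the helper names `flatten` and `launches_to_csv` are assumptions made up for the example, and no real API call is made.

```python
import csv
import io
import json

# Hypothetical, made-up sample of an API response (NOT real SpaceX data);
# the real payload has different and many more fields.
SAMPLE_JSON = """
[
  {"flight_number": 1, "rocket": {"name": "Falcon 1"}, "payload": {"mass_kg": 10}},
  {"flight_number": 2, "rocket": {"name": "Falcon 9"}, "payload": {"mass_kg": 500}},
  {"flight_number": 3, "rocket": {"name": "Falcon 9"}, "payload": {"mass_kg": null}}
]
"""


def flatten(record, parent_key="", sep="."):
    """Turn nested dictionaries into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def launches_to_csv(json_text, rocket_name="Falcon 9"):
    """Normalize the JSON, keep one rocket type, and render the rows as CSV text."""
    rows = [flatten(record) for record in json.loads(json_text)]
    rows = [row for row in rows if row.get("rocket.name") == rocket_name]
    fieldnames = list(rows[0]) if rows else []
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # None values are written as empty cells
    return rows, buffer.getvalue()


falcon9_rows, csv_text = launches_to_csv(SAMPLE_JSON)
print(csv_text)
```

In a notebook, a dataframe library would normally do the flattening and filtering; the hand-written version only makes the individual steps visible.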
Web Scraping
Parse HTML table into a List -> Dictionary
6. Appending data to keys (refer to notebook block 12)
8. Dataframe to .CSV
GitHub URL to Notebook
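The "HTML table -> list -> dictionary" idea can be illustrated with a short sketch. The deck uses BeautifulSoup; to stay self-contained, this example swaps in Python's built-in `html.parser` instead, so it is not the notebook's code. The class `TableParser`, the sample HTML, and every value in it are hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up table for illustration only (not scraped from Wikipedia).
SAMPLE_HTML = """
<table>
  <tr><th>Flight No.</th><th>Launch site</th><th>Payload mass</th></tr>
  <tr><td>1</td><td>Site A</td><td>100 kg</td></tr>
  <tr><td>2</td><td>Site B</td><td>200 kg</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows

# Use the header cells as dictionary keys, then append each row's values to them.
launch_dict = {name: [] for name in header}
for row in body:
    for name, value in zip(header, row):
        launch_dict[name].append(value)

print(launch_dict)
```

A dictionary of equal-length lists like `launch_dict` is the shape that a dataframe constructor accepts directly, which is why the keys are filled column by column.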
Data Wrangling
Introduction
GitHub URL to Notebook
In the data set, there are several cases where the booster did not land successfully. Sometimes a landing was attempted but failed due to an accident. For example, True Ocean means the booster landed successfully in a specific region of the ocean, while False Ocean means it failed to land in a specific region of the ocean. True RTLS means the booster landed successfully on a ground pad, while False RTLS means it failed to land on a ground pad. True ASDS means the booster landed successfully on a drone ship, while False ASDS means it failed to land on a drone ship.
Each launch aims for a dedicated orbit.
We mainly convert those outcomes into training labels: 1 means the booster landed successfully and 0 means it was unsuccessful.
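A minimal sketch of the label conversion, assuming each outcome is a string of the form "True ASDS" or "False Ocean" as described on this slide. The function name `landing_class` and the sample list are hypothetical, and any outcome that does not start with "True" is treated as unsuccessful.

```python
def landing_class(outcome):
    """Return 1 if the booster landed successfully, otherwise 0."""
    # Outcomes look like "<True|False> <location>", e.g. "True ASDS".
    return 1 if outcome.startswith("True ") else 0


# Made-up sample of outcome strings, for illustration only.
outcomes = ["True ASDS", "False Ocean", "True RTLS", "False ASDS", "True Ocean", "False RTLS"]
labels = [landing_class(outcome) for outcome in outcomes]
success_rate = sum(labels) / len(labels)

print(labels)
print(success_rate)
```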
Process
EDA with SQL
• Displaying the names of the unique launch sites in the space mission
• Displaying 5 records where launch sites begin with the string 'KSC'
• Displaying the total payload mass carried by boosters launched by NASA (CRS)
• Displaying the average payload mass carried by booster version F9 v1.1
• Listing the date when the successful landing outcome on a drone ship was achieved
• Listing the names of the boosters which have success on a ground pad and have a payload mass greater than 4000 but less than 6000
• Listing the total number of successful and failure mission outcomes
• Listing the names of the booster_versions which have carried the maximum payload mass
• Listing the records which will display the month names, successful landing_outcomes on a ground pad, booster versions and launch_site for the months in year 2017
• Ranking the count of successful landing_outcomes between the dates 2010-06-04 and 2017-03-20 in descending order
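A few of these queries can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the deck's queries target SQL Server, so `TOP 5` becomes `LIMIT 5` in the SQLite dialect, and the table contents below are made-up placeholder rows rather than the real launch records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tblSpaceX (
           Date TEXT, Booster_Version TEXT, Launch_Site TEXT,
           PAYLOAD_MASS_KG_ REAL, Customer TEXT, Landing_Outcome TEXT)"""
)
# Placeholder rows for illustration only (not real launch data).
conn.executemany(
    "INSERT INTO tblSpaceX VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2017-01-10", "Booster A", "KSC LC-39A", 1000, "NASA (CRS)", "Success (drone ship)"),
        ("2017-02-15", "Booster B", "KSC LC-39A", 3000, "Other", "Success (ground pad)"),
        ("2016-05-01", "Booster A", "Site X", 2000, "NASA (CRS)", "Failure (drone ship)"),
    ],
)

# Unique launch sites.
sites = [row[0] for row in conn.execute(
    "SELECT DISTINCT Launch_Site FROM tblSpaceX ORDER BY Launch_Site")]

# Up to 5 records whose launch site starts with 'KSC' (LIMIT replaces TOP).
ksc_rows = conn.execute(
    "SELECT * FROM tblSpaceX WHERE Launch_Site LIKE 'KSC%' LIMIT 5").fetchall()

# Total and average payload mass, and the earliest successful drone ship landing.
total_mass = conn.execute(
    "SELECT SUM(PAYLOAD_MASS_KG_) FROM tblSpaceX WHERE Customer = ?",
    ("NASA (CRS)",)).fetchone()[0]
average_mass = conn.execute(
    "SELECT AVG(PAYLOAD_MASS_KG_) FROM tblSpaceX WHERE Booster_Version = ?",
    ("Booster A",)).fetchone()[0]
first_success = conn.execute(
    "SELECT MIN(Date) FROM tblSpaceX WHERE Landing_Outcome = ?",
    ("Success (drone ship)",)).fetchone()[0]

print(sites, len(ksc_rows), total_mass, average_mass, first_success)
```

`MIN(Date)` finds the earliest date here only because the placeholder dates are stored as yyyy-mm-dd text, which sorts chronologically.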
Building an interactive map with Folium
To visualize the launch data on an interactive map, we took the latitude and longitude coordinates of each launch site and added a circle marker around each launch site, labelled with the name of the launch site.
Building a Dashboard with Plotly Dash
Scatter graph showing the relationship between Outcome and Payload Mass (Kg) for the different Booster Versions
- It shows the relationship between two variables.
- It is well suited to revealing a non-linear pattern.
- The range of the data, i.e. the maximum and minimum values, can be determined.
- Observation and reading are straightforward.
Predictive analysis (Classification)
BUILDING MODEL
• Load our dataset into NumPy and Pandas
• Transform Data
• Split our data into training and test data sets
• Check how many test samples we have
• Decide which type of machine learning algorithms we want to use
• Set our parameters and algorithms to GridSearchCV
• Fit our datasets to the GridSearchCV objects and train our models.
EVALUATING MODEL
• Check accuracy for each model
• Get tuned hyperparameters for each type of algorithm
• Plot Confusion Matrix
IMPROVING MODEL
• Feature Engineering
• Algorithm Tuning
FINDING THE BEST PERFORMING CLASSIFICATION MODEL
GitHub Link to source code
• The model with the best accuracy score is the best performing model
• At the bottom of the notebook there is a dictionary of algorithms with their scores.
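Picking the winner from a dictionary of scores can be sketched in a couple of lines. The dictionary below is a stand-in: the model names and accuracy values are placeholders invented for the example, not the scores from the notebook, and `best_model` is a hypothetical helper.

```python
def best_model(scores):
    """Return the (name, score) pair with the highest accuracy."""
    name = max(scores, key=scores.get)
    return name, scores[name]


# Placeholder names and scores for illustration only (NOT the notebook's results).
scores = {"model_a": 0.80, "model_b": 0.81, "model_c": 0.83, "model_d": 0.79}

winner, winning_score = best_model(scores)
print(winner, winning_score)
```

When two models tie, `max` returns the first one encountered, so with very close scores it may be worth inspecting the whole dictionary rather than the winner alone.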
Results
EDA with Visualization
Flight Number vs. Launch Site
Payload Mass vs. Launch Site
For launch site CCAFS SLC 40, the greater the payload mass, the higher the success rate of the rocket.
However, this visualization shows no clear pattern from which to decide whether a successful launch at a given launch site depends on payload mass.
Success rate vs. Orbit type
Orbits GEO, HEO, SSO and ES-L1 have the best success rate
Flight Number vs. Orbit type
Payload vs. Orbit type
Launch success yearly trend
EDA with SQL
All launch site names
Using the word DISTINCT in the query means that it will only show unique values in the Launch_Site column from tblSpaceX
Launch site names begin with `KSC`
SQL QUERY
select TOP 5 * from tblSpaceX WHERE Launch_Site LIKE 'KSC%'
QUERY EXPLANATION
Using TOP 5 in the query means that it will only show 5 records from tblSpaceX. The LIKE keyword takes a wildcard pattern: in 'KSC%', the percent sign at the end means that the Launch_Site name must start with KSC.
Total Payload Mass by Customer NASA (CRS)
SQL QUERY
select SUM(PAYLOAD_MASS_KG_) TotalPayloadMass from tblSpaceX
where Customer = 'NASA (CRS)'
QUERY EXPLANATION
Using the function SUM adds up the values in the column PAYLOAD_MASS_KG_ for the rows where Customer is NASA (CRS)
Average Payload Mass carried by booster version F9 v1.1
SQL QUERY
select AVG(PAYLOAD_MASS_KG_) AveragePayloadMass from tblSpaceX
where Booster_Version = 'F9 v1.1'
QUERY EXPLANATION
Using the function AVG works out the average of the column PAYLOAD_MASS_KG_
The date where the successful landing outcome in drone ship was achieved
SQL QUERY
select MIN(Date) SLO from tblSpaceX where Landing_Outcome = 'Success (drone ship)'
QUERY EXPLANATION
Using the function MIN works out the minimum date in the column Date
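Successful ground pad landing with payload between 4000 and 6000
SQL QUERY
select Booster_Version from tblSpaceX where Landing_Outcome = 'Success (ground pad)'
AND PAYLOAD_MASS_KG_ > 4000 AND PAYLOAD_MASS_KG_ < 6000
QUERY EXPLANATION
The AND conditions keep only the boosters that landed successfully on a ground pad and carried a payload mass greater than 4000 but less than 6000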
Total Number of Successful and Failure Mission Outcomes
SQL QUERY
SELECT (SELECT Count(Mission_Outcome) from tblSpaceX where Mission_Outcome LIKE '%Success%') as Successful_Mission_Outcomes,
(SELECT Count(Mission_Outcome) from tblSpaceX where Mission_Outcome LIKE '%Failure%') as Failure_Mission_Outcomes
QUERY EXPLANATION
Each subquery counts the rows whose Mission_Outcome contains the word Success or Failure, and the outer SELECT shows the two counts side by side
Booster versions which have carried the maximum payload mass
SQL QUERY
SELECT DISTINCT Booster_Version, MAX(PAYLOAD_MASS_KG_) AS [Maximum Payload Mass]
FROM tblSpaceX GROUP BY Booster_Version
ORDER BY [Maximum Payload Mass] DESC
QUERY EXPLANATION
Using the word DISTINCT in the query means that it will only show unique values in the Booster_Version column from tblSpaceX.
GROUP BY collects the rows that share the same Booster_Version into one group, so MAX returns the maximum payload mass for each booster version.
DESC means the result is arranged in descending order.
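The effect of GROUP BY with MAX and a descending sort can be checked on a tiny table with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite syntax replaces the SQL Server bracketed alias, the rows are made-up placeholders, and the second query (a subquery that returns only the booster versions tied for the overall maximum) is an alternative formulation that does not appear in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblSpaceX (Booster_Version TEXT, PAYLOAD_MASS_KG_ REAL)")
# Placeholder rows for illustration only (not real launch data).
conn.executemany(
    "INSERT INTO tblSpaceX VALUES (?, ?)",
    [("Booster A", 1000), ("Booster A", 4000), ("Booster B", 2500), ("Booster C", 4000)],
)

# One row per booster version with its own maximum payload, largest first.
per_booster = conn.execute(
    """SELECT Booster_Version, MAX(PAYLOAD_MASS_KG_) AS max_payload
       FROM tblSpaceX
       GROUP BY Booster_Version
       ORDER BY max_payload DESC, Booster_Version"""
).fetchall()

# Alternative: only the booster versions that carried the overall maximum payload.
top_boosters = conn.execute(
    """SELECT DISTINCT Booster_Version
       FROM tblSpaceX
       WHERE PAYLOAD_MASS_KG_ = (SELECT MAX(PAYLOAD_MASS_KG_) FROM tblSpaceX)
       ORDER BY Booster_Version"""
).fetchall()

print(per_booster)
print(top_boosters)
```

The first query lists every booster version, so the reader still has to look at the top rows; the subquery form returns only the booster versions that reached the maximum.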
2017 Launch Records
SQL QUERY
SELECT DATENAME(month, DATEADD(month, MONTH(CONVERT(date, Date, 105)), 0) - 1) AS Month,
Booster_Version, Launch_Site, Landing_Outcome
FROM tblSpaceX
WHERE (Landing_Outcome LIKE N'%Success%')
AND (YEAR(CONVERT(date, Date, 105)) = '2017')
QUERY EXPLANATION
CONVERT with style 105 reads the Date text as dd-mm-yyyy, DATENAME returns the month name, and the WHERE clause keeps only successful landing outcomes from the year 2017
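Successful landing outcomes between 2010-06-04 and 2017-03-20
SQL QUERY
SELECT COUNT(Landing_Outcome)
FROM tblSpaceX
WHERE (Landing_Outcome LIKE '%Success%')
AND (Date > '04-06-2010')
AND (Date < '20-03-2017')
QUERY EXPLANATION
LIKE (wildcard): '%Success%' matches any landing outcome that contains the word Success.
AND (conditions): every condition must hold, so only launches between the two dates are counted.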
Interactive map with Folium
All launch sites global map markers
<Folium map screenshot 1>
We can see that the SpaceX launch sites are on the coasts of the United States of America, in Florida and California.
Colour Labelled Markers
Florida Launch Sites
California Launch Site
Distance to closest highway
Distance to coast
Dashboard with Plotly Dash
DASHBOARD – Pie chart showing the success percentage achieved by each launch site
<Dashboard screenshot 1>
DASHBOARD – Pie chart for the launch site with highest launch success ratio
KSC LC-39A achieved a 76.9% success rate and a 23.1% failure rate
DASHBOARD – Payload vs. Launch Outcome scatter plot for all sites, with different payload selected in the range slider
<Dashboard screenshot 3>
Low Weighted Payload 0kg – 4000kg
Heavy Weighted Payload 4000kg – 10000kg
We can see that the success rate for low weighted payloads is higher than for heavy weighted payloads
Predictive analysis (Classification)
Classification Accuracy using training data
As you can see, the accuracies are extremely close, but using this function we do have a winner; it comes down to decimal places!
After selecting the best hyperparameters for the decision tree classifier using the validation data, we achieved 83.33% accuracy on the test data.
Confusion Matrix for the Tree
Examining the confusion matrix, we see that the Tree can distinguish between the different classes. We see that the major problem is false positives.
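To make the terms concrete, here is a small sketch of how the four confusion matrix cells and the accuracy are counted from true and predicted labels (1 = landed, 0 = did not land). The label lists are made up for illustration and are not the notebook's test set; `confusion_counts` and `accuracy` are hypothetical helper names.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 is positive)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["TN" if truth == 0 else "FN"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that match the true label."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(accuracy(counts))
```

In this toy example every error is a false positive: the model predicts a landing for two launches that did not land, which is the kind of mistake the slide calls out.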
Conclusion
• The Tree Classifier algorithm is the best machine learning model for this dataset
• Low weighted payloads perform better than heavier payloads
• The success rate of SpaceX launches has increased over the years, and they will eventually perfect the launches
• We can see that KSC LC-39A had the most successful launches of all the sites
• Orbits GEO, HEO, SSO and ES-L1 have the best success rate
Appendix
• Haversine formula
• ADGGoogleMaps Module (not used but created)
• Module sqlserver (ADGSQLSERVER)
• PythonAnywhere 24/7 dashboard
ADGGoogleMaps Module
Introduction
Services and add Geocoding API; this function returns this Folium map in the Jupyter Notebook
Requirements
• Google API Secret (API Key)
Haversine formula
Introduction
The haversine formula determines the great-circle distance between two points on a sphere given their
longitudes and latitudes. Important in navigation, it is a special case of a more general formula in
spherical trigonometry, the law of haversines, that relates the sides and angles of spherical triangles.
Usage
Why did I use this formula? First of all, I believe the Earth is round/elliptical. I am not a Flat Earth Believer! Jokes aside, I was doing Google research on integrating my ADGGoogleMaps API with a Python function to calculate the distance between two distinct {longitude, latitude} pairs, and Haversine was the trigonometric solution that met those requirements.
Formula
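A minimal sketch of the haversine distance as a Python function, for reference. It follows the standard formula, a = sin²(Δφ/2) + cos φ1 · cos φ2 · sin²(Δλ/2) and d = 2R · asin(√a), with φ as latitude and λ as longitude in radians. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions for this example, not taken from the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
print(quarter_equator)
```

Because the formula treats the Earth as a perfect sphere, the result is an approximation; that is usually adequate for distances such as launch site to coast or highway.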
ADGSQLSERVER
Introduction
Implementation
GitHub
Python Anywhere
Introduction
I wanted my Python website running 24/7 in the cloud so anyone can view it, and then I came across PythonAnywhere.
Python Anywhere Link
URL Link to live website
Implementation
We run Flask in /www/ on the Docker Linux container
We have two files: flask_app.py and wsgi.py
Pricing
Free, but we are restricted to hitting the renew button every 3 months and we cannot link the site to our own private domain. We can only run one instance of a website per month.