
Specific Tasks:

1. Import the provided ev_cars_info.xls to the database (you may need to save it first as a CSV file)
and inspect what you have loaded via an appropriate command.
2. Using the RMySQL/DBI methodology, “ask” via R the following queries and present the
appropriate outcome of each query:
 Get the first 15 rows of the imported table
 Get those cars with 4 seats and a top speed greater than or equal to 200 km/h
 Get the average range and efficiency by body-style type
 Get the car models and acceleration for cars belonging to segment F
 Get the max Battery_pack and the min FastCharge for cars belonging to Hatchback
body-style.
 Get those car models whose brand starts with S and whose price is less than 40,000 euros.
3. Explore the dplyr/dbplyr alternative methodology and provide results for all of the above queries
as well. You need to read about and explore the role of the %>% pipe operator; this methodology is
not covered in the related tutorial. An indicative sketch of both methodologies is given after this
task list.
4. Extract all information from database to R and produce/illustrate the following visualization
schematics:
 Create a pie chart showing the proportion of cars in the dataset that have different
numbers of seats.
 Draw a scatter plot showing the relationship between Acceleration and Range. What
conclusion can we draw from this graph?
 Create a boxplot to show the distribution of FastCharge per body-style “classes”.
 Create a bar graph that shows the number of cars in each “class” of the body-style attribute
in this EV dataset.
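The following is an indicative, minimal sketch of how these tasks could be approached, assuming the spreadsheet has been exported as ev_cars_info.csv and that the table contains columns such as Brand, Model, Seats, TopSpeed, Range, Efficiency, BodyStyle, Segment, Acceleration, Battery_pack, FastCharge and PriceEuro (the connection details and column names are assumptions; adjust them to your own MySQL setup and to the actual headers of the file):

# Sketch only: connection details and column names below are assumptions.
library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(), dbname = "ev_db",
                 host = "localhost", user = "user", password = "pass")

# Task 1: import the CSV into MySQL and inspect the result
ev <- read.csv("ev_cars_info.csv", stringsAsFactors = FALSE)
dbWriteTable(con, "ev_cars", ev, overwrite = TRUE, row.names = FALSE)
dbListTables(con)
dbGetQuery(con, "DESCRIBE ev_cars")

# Task 2 (two example queries out of the six)
dbGetQuery(con, "SELECT * FROM ev_cars LIMIT 15")
dbGetQuery(con, "SELECT * FROM ev_cars WHERE Seats = 4 AND TopSpeed >= 200")

# Task 3: dplyr/dbplyr alternative using the %>% pipe operator
library(dplyr)
library(dbplyr)
ev_tbl <- tbl(con, "ev_cars")
ev_tbl %>% filter(Seats == 4, TopSpeed >= 200) %>% collect()
ev_tbl %>% group_by(BodyStyle) %>%
  summarise(avg_range = mean(Range, na.rm = TRUE),
            avg_eff   = mean(Efficiency, na.rm = TRUE)) %>%
  collect()

# Task 4 (one example plot): pie chart of the number of seats
ev_all <- dbReadTable(con, "ev_cars")
pie(table(ev_all$Seats), main = "Proportion of cars per number of seats")

dbDisconnect(con)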
In your main text, you need to include, together with your results/discussion, the related segments of
your R code. At the end of your report, you have to include all of your code as an appendix.

(Marks 32)

2nd Question (OLAP Operations in R)

At the core of the OLAP concept is an OLAP Cube. The OLAP cube is a data structure optimized for
very quick data analysis. The OLAP Cube consists of numeric facts called measures which are
categorized by dimensions. The OLAP cube is also called a hypercube. Usually, data operations and
analysis are performed using a simple spreadsheet, where data values are arranged in row and
column format. This is ideal for two-dimensional data. However, OLAP deals with multidimensional
data, usually obtained from different and unrelated sources, for which a spreadsheet is not an
optimal option. The cube can store and analyse multidimensional data in a logical and orderly manner.
There are five types of analytical operations in OLAP:
 Roll-up
 Drill-down
 Slice
 Dice
 Pivot
In this specific question, we have to create a sales fact table that records each sales transaction for an
imaginary multi-national company that produces computing-based devices. This company has
branches in five cities: Frankfurt-Germany, Seattle-USA, Hong-Kong-China, London-UK and
Johannesburg-South Africa. The products produced are laptops, tablets, monitors and printers, with
indicative prices of 800, 400, 200 and 150 euros respectively. The time frame for this sales
monitoring is from 2010 to 2015 (inclusive), i.e. six years. We also need sales information for every
month of each of these six years.
Hence, you initially need to create a function, in R, to generate the sales table. Use 500 as the
indicative number of records; the transaction data need to be generated randomly. Through R,
show the first lines of the transactions, in order to verify that you have managed to generate the
required sales table. Obviously, the random number of units (per product) needs to be of integer type. Then
you need to create, again via R, the so-called revenue cube for this company. Finally, you need to
utilise the R environment to demonstrate these five OLAP operations for this case study. For each one
of these operations, you need to provide/illustrate the related outcomes/results.
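As an indicative, non-prescriptive sketch of this workflow, the following R code generates a random sales fact table, builds a revenue cube with tapply, and illustrates the five OLAP operations (the function name, the unit range and the monthly drill-down granularity are illustrative assumptions):

# Sketch only: the generator function and the ranges used are illustrative choices.
gen_sales <- function(n = 500) {
  cities   <- c("Frankfurt", "Seattle", "Hong Kong", "London", "Johannesburg")
  products <- c("Laptop", "Tablet", "Monitor", "Printer")
  prices   <- c(Laptop = 800, Tablet = 400, Monitor = 200, Printer = 150)
  sales <- data.frame(
    year    = sample(2010:2015, n, replace = TRUE),
    month   = sample(1:12,      n, replace = TRUE),
    city    = sample(cities,    n, replace = TRUE),
    product = sample(products,  n, replace = TRUE),
    units   = sample(1:50,      n, replace = TRUE),   # integer number of units
    stringsAsFactors = FALSE
  )
  sales$revenue <- sales$units * prices[sales$product]
  sales
}

sales <- gen_sales(500)
head(sales)                                       # verify the generated table

# Revenue cube: total revenue per city x product x year
revenue_cube <- tapply(sales$revenue,
                       sales[, c("city", "product", "year")], FUN = sum)

# Roll-up: aggregate away the product dimension (revenue per city and year)
apply(revenue_cube, c("city", "year"), sum, na.rm = TRUE)
# Drill-down: rebuild the cube at a finer, monthly granularity
tapply(sales$revenue, sales[, c("city", "product", "year", "month")], FUN = sum)
# Slice: fix one dimension, e.g. keep only year 2012
revenue_cube[, , "2012"]
# Dice: select a sub-cube, e.g. two cities and two products across all years
revenue_cube[c("London", "Seattle"), c("Laptop", "Tablet"), ]
# Pivot: rotate the view, e.g. product x city totals instead of city x product
t(apply(revenue_cube, c("city", "product"), sum, na.rm = TRUE))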
In your main text, you need to include, together with your results/discussion, the related segments of
your R code. At the end of your report, you have to include all of your code as an appendix.

(Marks 18)

3rd Question (Decision Support Systems in BI)

Decision tree learners are powerful classifiers that utilize a tree structure to model the relationships
among the features and the potential outcomes. This structure earned its name due to the fact that it
mirrors the way a literal tree begins at a wide trunk and splits into narrower and narrower branches as
it is followed upward. In much the same way, a decision tree classifier uses a structure of branching
decisions that channel examples into a final predicted class value. Decision trees are built using a
heuristic called recursive partitioning. This approach is also commonly known as divide-and-conquer
because it splits the data into subsets, which are then split repeatedly into even smaller subsets, and
so on, until the algorithm determines that the data within the subsets are sufficiently homogeneous
or another stopping criterion has been met. There are numerous
implementations of decision trees, but two of the most well-known ones are the C5.0 algorithm and
the Classification and regression tree (CART). The C5.0 algorithm has become the industry standard
for producing decision trees because it does well for most types of problems directly out of the box.
There are various measurements of purity that can be used to identify the best decision tree splitting
candidate. C5.0 and CART utilise entropy and the Gini index, respectively, as impurity measures for
attribute selection; both measures, however, have their own advantages and disadvantages. Pruning a
decision tree is an important part of the process, as it involves reducing the tree's size so that it
generalizes better to unseen data.
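For illustration, the two impurity measures can be computed in R for a hypothetical node containing 60% good and 40% bad credit risks (the class proportions here are purely illustrative):

# Illustrative only: impurity of a node with assumed class proportions p
p <- c(good = 0.6, bad = 0.4)
entropy <- -sum(p * log2(p))   # about 0.971; 0 = pure node, 1 = maximally mixed (two classes)
gini    <- 1 - sum(p^2)        # 0.48;        0 = pure node, 0.5 = maximally mixed (two classes)

A split candidate is then judged by how much it reduces such an impurity value, weighted by the sizes of the resulting subsets.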
Credit risk assessment is a crucial issue faced by banks nowadays: it helps them evaluate whether a
loan applicant is likely to default at a later stage, so that they can decide whether or not to grant the
loan. This helps banks to minimize possible losses and increase the volume of credit. The
global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking
practices. As the availability of credit was limited, banks tightened their lending systems and turned to
machine learning to more accurately identify risky loans. Decision trees are widely used in the
banking industry due to their high accuracy and ability to formulate a statistical model in plain
language. Since governments in many countries carefully monitor the fairness of lending practices,
executives must be able to explain why one applicant was rejected for a loan while another was
approved. This information is also useful for customers hoping to determine why their credit rating is
unsatisfactory. It is likely that automated credit scoring models are used for credit card mailings and
instant online approval processes. R is an excellent statistical and data-mining tool that can handle
large volumes of structured as well as unstructured data and present the results quickly, in both
textual and graphical form. This enables the decision maker to make better predictions and a clearer
analysis of the findings.
The data used in this question was originally provided by Dr. Hans Hofmann of the University of
Hamburg and hosted by the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). The
dataset contains information on loans obtained from a credit agency in Germany. The original dataset
contains 1000 entries with 20 categorical/symbolic input attributes. In this dataset, each entry
represents a person who has taken a credit from a bank. Each person is classified as a good or a bad
credit risk according to the set of attributes. The idea behind this dataset is to identify factors that are predictive
of higher risk of loan default. The provided credit dataset (see attached file: loans.xls), however,
contains a truncated version of the original dataset with only 15 input attributes.
In this question, you need to develop two credit approval models, using C5.0 and CART decision trees
respectively. The aim is to decide, at the end, which one of these two models is preferable for this
specific financial task. Both models also need to be optimised (via tuning/pruning-like methods) in
order to obtain even better results. You need to perform the following tasks:
 Import the provided .csv file into your MySQL database and check that it has been imported properly.
The following tasks, however, need to be performed only via the R environment (an indicative
sketch of the main R steps is given after this task list).
 Import the contents of this file from your MySQL into R (using R-related commands) and
designate a specific name-variable to store them.
 Explore & prepare the data:
 The name-variable where you stored the information contains 1000 observations
(rows) and 16 features (columns). This variable includes information about the
applicants, such as checking and savings account details, the amount of the loan they
plan to borrow and over how many months they plan to repay it, etc. The target
feature, located in the last column, records the applicant’s default status (yes or no),
i.e. whether the loan applicant eventually went into default or managed to pay back
the amount borrowed plus all the interest. As some of these features are non-numerical
in nature, you may consider transforming them into numerical form if you find this
more convenient; otherwise, you can leave them as factors.
 Investigate the relationships and discover rough structures of the imported data. More
specifically, find and display the frequency and proportion of the observations from
checking_balance, credit_history, purpose, savings_balance, employment_duration,
percent_of_income, job and default. Find and display the average value from
months_loan_duration, amount, years_at_residence, age, existing_loans_count, and
dependents.
 The data visualization process will focus on identifying patterns in key features
which clearly distinguish an applicant’s default status. Therefore, via the R
environment, you need to:
 Plot a histogram of months_loan_duration for each default class (i.e. two
histograms). What can we derive from these plots?
 Produce a 3D plot of amount, age and months_loan_duration with colour
separation of the two default classes (yes/no).
 Produce a scatter plot of amount and age with colour separation of the two
default classes (yes/no).
 You need to shuffle and re-order the provided data, so that rows are randomly sorted.
Then, you need to split the dataset into training and testing sets. Use 800 and 200
samples for the training and testing sets respectively. In this way, each student will
have different training/testing sets. These specific training/testing sets will be used for
both decision trees models.
 You need to create a decision tree model based on C5.0 algorithm to predict whether a loan
applicant will default. You need to use the training set for the creation of that model. The
model will then be tested using the testing dataset you have already created. The evaluation of
your model will be made using all of the following tools: confusion matrix (CM), Area
Under the Curve (AUC) and F1 score.
 You need to create a decision tree model based on CART algorithm to predict whether a loan
applicant will default. You need to use exactly the same training/testing sets, you used for the
C5.0 case. The evaluation of your CART model will be made again with the same tools as before.
 You need to provide a short discussion, based on these results and decide which model is
more suitable for this specific case study.
 You need to improve your current models (C5.0 and CART) via Adaptive Boosting and
Pruning schemes respectively. Develop, in R again, the necessary models and eventually
perform another performance evaluation using the same testing dataset. Provide a short
discussion of any improvement you may have achieved compared to your previous models.
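An indicative, minimal sketch of these modelling steps is given below, assuming the MySQL table is called loans, the target column is named default with levels "yes"/"no", and that the C50, rpart, caret and pROC packages are used (the connection details, the seed and the package choices are assumptions; adapt them to your own setup):

# Sketch only: table/column names, the seed and the package choices are assumptions.
library(DBI); library(RMySQL)
library(C50); library(rpart)
library(caret); library(pROC)

con   <- dbConnect(RMySQL::MySQL(), dbname = "credit_db",
                   host = "localhost", user = "user", password = "pass")
loans <- dbReadTable(con, "loans")
dbDisconnect(con)

# dbReadTable returns strings as character; convert them (incl. the target) to factors
loans[] <- lapply(loans, function(x) if (is.character(x)) as.factor(x) else x)
str(loans)                                       # expect 1000 obs. of 16 variables

# Example exploration: frequencies/proportions and averages
table(loans$checking_balance)
prop.table(table(loans$checking_balance))
mean(loans$months_loan_duration)

# Example visualization: loan duration histograms per default class
hist(loans$months_loan_duration[loans$default == "yes"], main = "Defaulted")
hist(loans$months_loan_duration[loans$default == "no"],  main = "Did not default")

# Shuffle the rows, then split into 800 training / 200 testing samples
set.seed(123)                                    # use your own value here
loans <- loans[sample(nrow(loans)), ]
train <- loans[1:800, ]
test  <- loans[801:1000, ]

# C5.0 model and its evaluation (confusion matrix, F1 score, AUC)
c50_fit  <- C5.0(default ~ ., data = train)
c50_pred <- predict(c50_fit, test)
confusionMatrix(c50_pred, test$default, positive = "yes")
confusionMatrix(c50_pred, test$default, positive = "yes")$byClass["F1"]
c50_prob <- predict(c50_fit, test, type = "prob")[, "yes"]
auc(roc(test$default, c50_prob))

# CART model and its evaluation on exactly the same training/testing sets
cart_fit  <- rpart(default ~ ., data = train, method = "class")
cart_pred <- predict(cart_fit, test, type = "class")
confusionMatrix(cart_pred, test$default, positive = "yes")

# Improvements: adaptive boosting for C5.0, cost-complexity pruning for CART
c50_boost   <- C5.0(default ~ ., data = train, trials = 10)
best_cp     <- cart_fit$cptable[which.min(cart_fit$cptable[, "xerror"]), "CP"]
cart_pruned <- prune(cart_fit, cp = best_cp)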

In your main text, you need to include, together with your results/discussion, the related segments of
your R code. At the end of your report, you have to include all of your code as an appendix.

(Marks 50)
