Mastering Machine Learning with R
Mastering Machine Learning with R - Lesmeister Cory
Table of Contents
Mastering Machine Learning with R
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
Machine learning defined
Machine learning caveats
Failure to engineer features
Overfitting and underfitting
Causality
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
eBooks, discount offers, and more
Questions
1. A Process for Success
The process
Business understanding
Identify the business objective
Assess the situation
Determine the analytical goals
Produce a project plan
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Algorithm flowchart
Summary
2. Linear Regression – The Blocking and Tackling of Machine Learning
Univariate linear regression
Business understanding
Multivariate linear regression
Business understanding
Data understanding and preparation
Modeling and evaluation
Other linear model considerations
Qualitative feature
Interaction term
Summary
3. Logistic Regression and Discriminant Analysis
Classification methods and linear regression
Logistic regression
Business understanding
Data understanding and preparation
Modeling and evaluation
The logistic regression model
Logistic regression with cross-validation
Discriminant analysis overview
Discriminant analysis application
Model selection
Summary
4. Advanced Feature Selection in Linear Models
Regularization in a nutshell
Ridge regression
LASSO
Elastic net
Business case
Business understanding
Data understanding and preparation
Modeling and evaluation
Best subsets
Ridge regression
LASSO
Elastic net
Cross-validation with glmnet
Model selection
Summary
5. More Classification Techniques – K-Nearest Neighbors and Support Vector Machines
K-Nearest Neighbors
Support Vector Machines
Business case
Business understanding
Data understanding and preparation
Modeling and evaluation
KNN modeling
SVM modeling
Model selection
Feature selection for SVMs
Summary
6. Classification and Regression Trees
Introduction
An overview of the techniques
Regression trees
Classification trees
Random forest
Gradient boosting
Business case
Modeling and evaluation
Regression tree
Classification tree
Random forest regression
Random forest classification
Gradient boosting regression
Gradient boosting classification
Model selection
Summary
7. Neural Networks
Neural network
Deep learning, a not-so-deep overview
Business understanding
Data understanding and preparation
Modeling and evaluation
An example of deep learning
H2O background
Data preparation and uploading it to H2O
Create train and test datasets
Modeling
Summary
8. Cluster Analysis
Hierarchical clustering
Distance calculations
K-means clustering
Gower and partitioning around medoids
Gower
PAM
Business understanding
Data understanding and preparation
Modeling and evaluation
Hierarchical clustering
K-means clustering
Clustering with mixed data
Summary
9. Principal Components Analysis
An overview of the principal components
Rotation
Business understanding
Data understanding and preparation
Modeling and evaluation
Component extraction
Orthogonal rotation and interpretation
Creating factor scores from the components
Regression analysis
Summary
10. Market Basket Analysis and Recommendation Engines
An overview of a market basket analysis
Business understanding
Data understanding and preparation
Modeling and evaluation
An overview of a recommendation engine
User-based collaborative filtering
Item-based collaborative filtering
Singular value decomposition and principal components analysis
Business understanding and recommendations
Data understanding, preparation, and recommendations
Modeling, evaluation, and recommendations
Summary
11. Time Series and Causality
Univariate time series analysis
Bivariate regression
Granger causality
Business understanding
Data understanding and preparation
Modeling and evaluation
Univariate time series forecasting
Time series regression
Examining the causality
Summary
12. Text Mining
Text mining framework and methods
Topic models
Other quantitative analyses
Business understanding
Data understanding and preparation
Modeling and evaluation
Word frequency and topic models
Additional quantitative analysis
Summary
A. R Fundamentals
Introduction
Getting R up and running
Using R
Data frames and matrices
Summary stats
Installing and loading the R packages
Summary
Index
Mastering Machine Learning with R
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2015
Production reference: 1231015
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-452-7
www.packtpub.com
Credits
Author
Cory Lesmeister
Reviewers
Vikram Dhillon
Miro Kopecky
Pavan Narayanan
Doug Ortiz
Shivani Rao, PhD
Commissioning Editor
Kartikey Pandey
Acquisition Editor
Nadeem N. Bagban
Content Development Editor
Siddhesh Salvi
Technical Editor
Suwarna Rajput
Copy Editor
Tasneem Fatehi
Project Coordinator
Nidhi Joshi
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Production Coordinator
Nilesh Mohite
Cover Work
Nilesh Mohite
About the Author
Cory Lesmeister currently works as an advanced analytics consultant for Clarity Solution Group, where he applies the methods in this book to solve complex problems and provide actionable insights. Cory spent 16 years at Eli Lilly and Company in sales, market research, Lean Six Sigma, marketing analytics, and new product forecasting. A former U.S. Army Reservist, Cory was in Baghdad, Iraq, in 2009 as a strategic advisor to the 29,000-person Iraqi oil police, where he supplied equipment to help the country secure and protect its oil infrastructure. An aviation aficionado, Cory has a BBA in aviation administration from the University of North Dakota and a commercial helicopter license. Cory lives in Carmel, IN, with his wife and their two teenage daughters.
About the Reviewers
Vikram Dhillon is a software developer, bioinformatics researcher, and software coach at the Blackstone LaunchPad in the University of Central Florida. He has been working on his own start-up involving healthcare data security. He lives in Orlando and regularly attends developer meetups and hackathons. He enjoys spending his spare time reading about new technologies such as the blockchain and developing tutorials for machine learning in game design. He has been involved in open source projects for over 5 years and writes about technology and start-ups at opsbug.com.
Miro Kopecky has been a passionate JVM enthusiast since the first moment he joined Sun Microsystems in 2002. Miro truly believes in distributed system design, concurrency, and parallel computing, which means pushing the system's performance to its limits without losing reliability and stability. During his PhD studies, he has been researching new data mining techniques in neurological signal analysis. Miro's hobbies include autonomic system development and robotics.
I would like to thank my family and my girlfriend, Tanja, for their support during the reviewing of this book.
Pavan Narayanan is an applied mathematician and is experienced in mathematical programming, analytics, and web development. He has published and presented papers in algorithmic research to the Transportation Research Board, Washington DC and SUNY Research Conference, Albany, NY. An avid blogger at https://datasciencehacks.wordpress.com, his interests are exploring problem solving techniques, from industrial mathematics to machine learning. Pavan can be contacted at
He has worked on books such as Apache Mahout Essentials, Learning Apache Mahout, and Real-time applications development with Storm and Petrel.
I would like to thank my family and God Almighty for giving me strength and endurance and the folks at Packt Publishing for the opportunity to work on this book.
Doug Ortiz is an independent consultant who has been architecting, developing, and integrating enterprise solutions throughout his whole career. Organizations that leverage his skillset have been able to rediscover and reuse their underutilized data via existing and emerging technologies such as Microsoft BI Stack, Hadoop, NoSQL databases, SharePoint, and related toolsets and technologies.
Doug has experience in integrating multiple platforms and products. He has helped organizations gain a deeper understanding and value of their current investments in data and existing resources turning them into useful sources of information. He has improved, salvaged, and architected projects by utilizing unique and innovative techniques.
His hobbies include yoga and scuba diving. He is the founder of Illustris, LLC, and can be contacted at <[email protected]>.
Shivani Rao, PhD, is a machine learning engineer based in the San Francisco Bay Area working in areas of search, analytics, and machine learning. Her background and areas of interest are in the field of computer vision, image processing, applied machine learning, data mining, and information retrieval. She has also accrued industry experience in companies such as Nvidia, Google, and Box. Shivani holds a PhD from the Computer Engineering Department of Purdue University spanning areas of machine learning, information retrieval, and software engineering. Prior to that, she obtained a masters from the Computer Science and Engineering Department of the Indian Institute of Technology (IIT), Madras, majoring in Computer Vision and Image Processing.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
Machine learning is a very broad topic. The following quote sums it up nicely: "The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year." (Domingos, P., 2012.) It would therefore be irresponsible to try and cover everything in the chapters that follow because, to paraphrase Frederick the Great, we would achieve nothing.
With this constraint in mind, I hope to provide a solid foundation of algorithms and business considerations that will allow the reader to walk away and, first of all, take on any machine learning tasks with complete confidence, and secondly, be able to help themselves in figuring out other algorithms and topics. Essentially, if this book significantly helps you to help yourself, then I would consider this a victory. Don't think of this book as a destination but rather, as a path to self-discovery.
The world of R can be as bewildering as the world of machine learning! There is seemingly an endless number of R packages with a plethora of blogs, websites, discussions, and papers of various quality and complexity from the community that supports R. This is a great reservoir of information and probably R's greatest strength, but I've always believed that an entity's greatest strength can also be its greatest weakness. R's vast community of knowledge can quickly overwhelm and/or sidetrack you and your efforts. Show me a problem and give me ten different R programmers and I'll show you ten different ways the code is written to solve the problem. As I've written each chapter, I've endeavored to capture the critical elements that can assist you in using R to understand, prepare, and model the data. I am no R programming expert by any stretch of the imagination, but again, I like to think that I can provide a solid foundation herein.
Another thing that lit a fire under me to write this book was an incident that happened in the hallways of a former employer a couple of years ago. My team had an IT contractor to support the management of our databases. As we were walking and chatting about big data and the like, he mentioned that he had bought a book about machine learning with R and another about machine learning with Python. He stated that he could do all the programming, but all of the statistics made absolutely no sense to him. I have kept this conversation at the back of my mind throughout the writing process. It has been a very challenging task to balance the technical and theoretical with the practical. One could turn the theory of each chapter into its own book, and someone probably has. I used a heuristic of sorts to aid me in deciding whether a formula or technical aspect was in scope: would this help me or the readers in discussions with team members and business leaders? If I felt it might help, I would strive to provide the necessary details.
I also made a conscious effort to keep the datasets used in the practical exercises large enough to be interesting but small enough to allow you to gain insight without becoming overwhelmed. This book is not about big data, but make no mistake about it, the methods and concepts that we will discuss can be scaled to big data.
In short, this book will appeal to a broad group of individuals, from IT experts seeking to understand and interpret machine learning algorithms to statistical gurus desiring to incorporate the power of R into their analysis. However, even those that are well-versed in both IT and statistics—experts if you will—should be able to pick up quite a few tips and tricks to assist them in their efforts.
Machine learning defined
Machine learning is everywhere! It is used in web search, spam filters, recommendation engines, medical diagnostics, ad placement, fraud detection, credit scoring, and I fear in these autonomous cars that I hear so much about. The roads are dangerous enough now; the idea of cars with artificial intelligence, requiring CTRL + ALT + DEL every 100 miles, aimlessly roaming the highways and byways is just too terrifying to contemplate. But, I digress.
It is always important to properly define what one is talking about and machine learning is no different. The website, machinelearningmastery.com, has a full page dedicated to this question, which provides some excellent background material. It also offers a succinct one-liner that is worth adopting as an operational definition: machine learning is the training of a model from data that generalizes a decision against a performance measure.
With this definition in mind, we will require a few things in order to perform machine learning. The first is that we have the data. The second is that a pattern actually exists, which is to say that with known input values from our training data, we can make a prediction or decision based on data that we did not use to train the model. This is the generalization in machine learning. Third, we need some sort of performance measure to see how well we are learning/generalizing, for example, the mean squared error, accuracy, and others. We will look at a number of performance measures throughout the book.
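As a minimal sketch of these three requirements (data, a pattern, and a performance measure), the following R code simulates a small dataset, trains a model on one part of it, and measures how well the predictions generalize to the part that was held out. The simulated data, the variable names, and the 70/30 split are assumptions made purely for illustration; they are not taken from the exercises in this book.

```r
# Illustrative only: simulated data, not a dataset from the book
set.seed(123)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 2)     # a pattern exists: y depends on x
dat <- data.frame(x = x, y = y)

# Hold out 30 percent of the rows to check generalization
train_idx <- sample(n, size = round(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Train the model on the training data only
fit <- lm(y ~ x, data = train)

# Performance measure for a quantitative outcome: mean squared error
pred <- predict(fit, newdata = test)
mse <- mean((test$y - pred)^2)

# Performance measure for a class outcome: accuracy
# (the class here is simply whether y exceeds the training median)
cutoff <- median(train$y)
actual_class <- test$y > cutoff
pred_class   <- pred > cutoff
accuracy <- mean(actual_class == pred_class)

mse
accuracy
```

Both measures are computed on rows that the model never saw during training, which is what makes them estimates of generalization rather than of fit.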
One of the things that I find interesting in the world of machine learning is how the language used to describe the data and process has changed. As such, I can't help but include this snippet from the philosopher, George Carlin:
I cut my teeth on datasets that had dependent and independent variables. I would build a model with the goal of trying to find the best fit. Now, I have labeled the instances and input features that require engineering, which will become the feature space that I use to learn a model. When all was said and done, I used to look at my model parameters; now, I look at weights.
The bottom line is that I still use these terms interchangeably and probably always will. Machine learning purists may curse me, but I don't believe I have caused any harm to life or limb.
Machine learning caveats
Before we pop the cork on the champagne bottle and rest easy that machine learning will cure all of our societal ills, we need to look at a few important considerations—caveats if you will—about machine learning. As you practice your craft, always keep these at the back of your mind. It will help you steer clear of some painful traps.
Failure to engineer features
Just throwing data at the problem is not enough, no matter how much of it exists. This may seem obvious, but I have personally experienced, and I know of others who have run into, situations where business leaders assumed that providing vast amounts of raw data combined with the supposed magic of machine learning would solve all the problems. This is one of the reasons the first chapter is focused on a process that properly frames the business problem and the leaders' expectations.
Unless you have data from a designed experiment or data that has already been preprocessed, raw, observational data will probably never be in a form that lets you begin modeling right away. In any project, very little time is actually spent on building models. The most time-consuming activities will be feature engineering: gathering, integrating, cleaning, and understanding the data. In the practical exercises in this book, I would estimate that 90 percent of my time was spent on coding these activities versus modeling, and this in an environment where most of the datasets are small and easily accessed. In my current role, 99 percent of the time in SAS is spent using PROC SQL and only 1 percent with things such as PROC GENMOD, PROC LOGISTIC, or Enterprise Miner.
When it comes to feature engineering, I fall in the camp of those who say there is no substitute for domain expertise. There seems to be another camp that believes machine learning algorithms can indeed automate most of the feature selection/engineering tasks, and several start-ups are out to prove this very thing. (I have had discussions with a couple of individuals who purport that their methodology does exactly that, but the details were closely guarded secrets.) Let's say that you have several hundred candidate features (independent variables). A way to perform automated feature selection is to compute the univariate information value. However, a feature that appears totally irrelevant in isolation can become important in combination with another feature. So, to get around this, you create numerous combinations of the features. This has potential problems of its own, as you may dramatically increase the computational time and cost and/or overfit your model. Speaking of overfitting, let's pursue it as the next caveat.
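The point about features that look irrelevant in isolation can be sketched in R. The helper info_value() below is a hypothetical function written for this illustration (it is not from any package); it bins a feature, computes the weight of evidence per bin against a binary outcome, and sums the result into an information value. The data is simulated under the assumption that the outcome depends only on the combination of two features.

```r
# Illustrative only: info_value() is a hypothetical helper, not a package function
info_value <- function(feature, target, bins = 5) {
  # Bin the feature into quantile-based groups
  breaks <- unique(quantile(feature, probs = seq(0, 1, length.out = bins + 1)))
  grp <- cut(feature, breaks = breaks, include.lowest = TRUE)
  # Counts of events (target == 1) and non-events (target == 0) per bin,
  # with 0.5 added so that the log of zero is never taken
  events     <- tapply(target == 1, grp, sum) + 0.5
  non_events <- tapply(target == 0, grp, sum) + 0.5
  p_event     <- events / sum(events)
  p_non_event <- non_events / sum(non_events)
  # Weight of evidence per bin, then the information value
  woe <- log(p_non_event / p_event)
  sum((p_non_event - p_event) * woe)
}

set.seed(123)
n <- 2000
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
y  <- as.integer(x1 * x2 > 0)   # outcome depends only on the combination

iv_x1   <- info_value(x1, y)        # each feature on its own
iv_x2   <- info_value(x2, y)
iv_x1x2 <- info_value(x1 * x2, y)   # the engineered combination

round(c(x1 = iv_x1, x2 = iv_x2, x1x2 = iv_x1x2), 3)
```

In this simulation, a univariate screen would discard both x1 and x2, while their product carries essentially all of the signal. With several hundred candidate features, the number of pairwise products alone runs into the tens of thousands, which is where the computational cost and the risk of overfitting come from.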
Overfitting and underfitting
Overfitting manifests itself when you have a model that does not generalize well. Say that you achieve a classification accuracy rate on your training data of 95 percent, but when you test its accuracy on another set of data, the accuracy falls to 50 percent. This would be considered a high variance. If we had a case of 60 percent accuracy on the train data and 59 percent accuracy on the test data, we now have a low variance but a high bias. This bias-variance trade-off is fundamental to machine learning and model complexity.
Let's nail down the definitions. A bias error is the difference between the value or class that we predict and the actual value or class in our training data. A variance error is the amount by which the predicted value or class for our training set differs from the predicted value or class for other datasets. Of course, our goal is to minimize the total error (bias + variance), but how does that relate to model complexity?
For the sake of argument, let's say that we are trying to predict a value and we build a simple linear model with our train data. As this is a simple model, we could expect a high bias, while on the other hand, it would have a low variance between the train and test data. Now, let's try including polynomial terms in the linear model or building decision trees. The models are more complex and should reduce the bias. However, as the bias decreases, the variance, at some point, begins to expand and generalizability is diminished. You can see this phenomenon in the following illustration. Any machine learning effort should strive to achieve the optimal trade-off between the bias and variance, which is easier said than done.
We will look at methods to combat this problem and optimize the model complexity, including cross-validation (Chapter 2, Linear Regression - The Blocking and Tackling of Machine Learning, through Chapter 7, Neural Networks) and regularization (Chapter 4, Advanced Feature Selection in Linear Models).
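The relationship between model complexity and the two kinds of error can be sketched with a small simulation. The code below is an assumption-laden illustration rather than an example from a later chapter: it uses simulated data, a simple train/test split instead of cross-validation, and polynomial degrees chosen arbitrarily to represent a simple, a moderate, and a complex model.

```r
# Illustrative only: simulated data, not a dataset from the book
set.seed(123)
n <- 100
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)      # nonlinear signal plus noise
dat <- data.frame(x = x, y = y)

train <- dat[1:50, ]
test  <- dat[51:100, ]

mse <- function(actual, predicted) mean((actual - predicted)^2)

degrees <- c(1, 3, 15)
results <- data.frame(degree = degrees, train_mse = NA, test_mse = NA)

for (i in seq_along(degrees)) {
  # poly() adds polynomial terms; a higher degree means a more complex model
  fit <- lm(y ~ poly(x, degrees[i]), data = train)
  results$train_mse[i] <- mse(train$y, predict(fit, newdata = train))
  results$test_mse[i]  <- mse(test$y,  predict(fit, newdata = test))
}

results
```

The training error can only fall as the degree rises, because each model contains the simpler ones as special cases. The test error is the one to watch: the straight line underfits (high bias), and a very high degree will typically show a widening gap between the train and test error (high variance).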
Causality
It seems a safe assumption that the proverbial correlation does not equal causation—a dead horse has been sufficiently beaten. Or has it? It is quite apparent that correlation-to-causation leaps of faith are still an issue in the real world. As a result, we must remember and convey with conviction that these algorithms are based on observational and not experimental data. Regardless of what correlations we find via machine learning, nothing can trump a proper experimental design. As Professor Domingos states:
In Chapter 11, Time Series and Causality, we will touch on a technique borrowed from econometrics to explore causality in time series, tackling an emotionally and politically sensitive issue.
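To make the correlation-versus-causation caveat concrete, here is a minimal simulated sketch. It assumes a hypothetical hidden factor z that drives both x and y; neither x nor y has any effect on the other. All names and values are invented for illustration.

```r
# Illustrative only: simulated data with a hypothetical hidden common cause z
set.seed(123)
n <- 5000
z <- rnorm(n)              # unobserved factor that drives both x and y
x <- z + rnorm(n)          # x does not cause y
y <- z + rnorm(n)          # y does not cause x

# x and y are clearly correlated even though neither causes the other
cor_xy <- cor(x, y)

# Once z is accounted for, the association between x and y largely disappears
cor_xy_given_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

round(c(raw = cor_xy, adjusted_for_z = cor_xy_given_z), 3)
```

With observational data, we rarely know whether a factor such as z exists, let alone have it measured, which is why a proper experimental design remains the stronger tool.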
Enough of my waxing philosophical; let's get started with using R to master machine learning! If you are a complete novice to the R programming language, then I would recommend that you skip ahead and read the appendix on using R. Regardless of where you start reading, remember that this book is about the journey to master machine learning and not a destination in and of itself. As long as we are working in this field, there will always be something new and exciting to explore. As such, I look forward to receiving your comments, thoughts, suggestions, complaints, and grievances. As per the words of the Sioux warriors: Hoka-hey! (Loosely translated, it means forward together.)
What this book covers
Chapter 1, A Process for Success, shows that machine learning is more than just writing code. In order for your efforts to achieve a lasting change in the industry, a proven process will be presented that will set you up for success.
Chapter 2, Linear Regression - The Blocking and Tackling of Machine Learning, provides you with a solid foundation before learning advanced methods such as Support Vector Machines and Gradient Boosting. No more solid foundation exists than the least squares linear regression.
Chapter 3, Logistic Regression and Discriminant Analysis, presents a discussion on how logistic regression and discriminant analysis are used in order to predict a categorical outcome.
Chapter 4, Advanced Feature Selection in Linear Models, shows regularization techniques to help improve the predictive ability and interpretability as feature selection is a critical and often extremely challenging component of machine learning.
Chapter 5, More Classification Techniques – K-Nearest Neighbors and Support Vector Machines, begins the exploration of the more advanced and nonlinear techniques. The real power of machine learning will be unveiled.
Chapter 6, Classification and Regression Trees, offers some of the most powerful predictive abilities of all the machine learning techniques, especially for classification problems. Single decision trees will be discussed along with the more advanced random forests and boosted trees.
Chapter 7, Neural Networks, shows some of the most exciting machine learning methods currently used. Inspired by how the brain works, neural networks and their more recent and advanced offshoot, Deep Learning, will be put to the test.
Chapter 8, Cluster Analysis, covers unsupervised learning. Instead of trying to make a prediction, the goal will focus on uncovering the latent structure of observations. Three clustering methods will be discussed: hierarchical, k-means, and partitioning around medoids.
Chapter 9, Principal Components Analysis, continues the examination of unsupervised learning with principal components analysis, which is used to uncover the latent structure of the features. Once this is done, the new features will be used in a supervised learning exercise.
Chapter 10, Market Basket Analysis and Recommendation Engines, presents the techniques that are used to increase sales, detect fraud, and improve health. You will learn about market basket analysis of purchasing habits at a grocery store and then dig into building a recommendation engine on website reviews.
Chapter 11, Time Series and Causality, discusses univariate forecast models, bivariate regression, and Granger causality models, including an analysis of carbon emissions and climate change.
Chapter 12, Text Mining, demonstrates a framework for quantitative text mining and the building of topic models. Along with time series, the world of data contains vast volumes of data in a textual format. With so much data as text, it is critically important to understand how to manipulate, code, and analyze the data in order to provide meaningful insights.
Appendix A, R Fundamentals, shows the syntax, functions, and capabilities of R. R can have a steep learning curve, but once you learn it, you will realize just how powerful it is for data preparation and machine learning.
What you need for this book
As R is free and open source software, you will only need to download and install it from https://www.r-project.org/. Although it is not mandatory, it is highly recommended that you also download the RStudio IDE from https://www.rstudio.com/products/RStudio/.
Who this book is for
If you want to learn how to use R's machine learning capabilities in order to solve complex business problems, then this book is for you. Some experience with R and a working knowledge of basic statistics or machine learning will prove helpful.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows. Any command-line input or output is written as follows:
> cor(x1, y1) #correlation of x1 and y1
[1] 0.8164205
> cor(x2, y1) #correlation of x2 and y1
[1] 0.8164205
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Clicking the Next button moves you to the next screen.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/4527OS_ColouredImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are