Data Science Algorithms in a Week
About this ebook
Build a strong foundation in machine learning algorithms in 7 days.
About This Book
- Get to know seven algorithms for your data science needs in this concise, insightful guide
- Ensure you're confident in the basics by learning when and where to use various data science algorithms
- Learn to use machine learning algorithms in a period of just 7 days
This book is for aspiring data science professionals who are familiar with Python and have a statistics background. It is ideal for developers who are currently implementing one or two data science algorithms and want to learn more to expand their skill set.
What You Will Learn
- Find out how to classify using Naive Bayes, Decision Trees, and Random Forest to solve complex problems accurately
- Identify a data science problem correctly and devise an appropriate prediction solution using Regression and Time-series
- See how to cluster data using the k-Means algorithm
- Get to know how to implement the algorithms efficiently in the Python and R languages
Machine learning applications are highly automated and self-modifying, and they continue to improve over time with minimal human intervention as they learn from more data. To address the complex nature of various real-world data problems, specialized machine learning algorithms have been developed to solve them. Data science helps you gain new knowledge from existing data through algorithmic and statistical analysis.
This book will address the problems related to accurate and efficient data classification and prediction. Over the course of 7 days, you will be introduced to seven algorithms, along with exercises that will help you learn different aspects of machine learning. You will see how to pre-cluster your data to optimize and classify it for large datasets. You will then find out how to predict data based on the existing trends in your datasets.
This book covers algorithms such as: k-Nearest Neighbors, Naive Bayes, Decision Trees, Random Forest, k-Means, Regression, and Time-series. On completion of the book, you will understand which machine learning algorithm to pick for clustering, classification, or regression and which is best suited for your problem.
Data Science Algorithms in a Week - David Natingga
Data Science Algorithms in a Week
Data analysis, machine learning, and more
Dávid Natingga
BIRMINGHAM - MUMBAI
Data Science Algorithms in a Week
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2017
Production reference: 1080817
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78728-458-6
www.packtpub.com
Credits
About the Author
Dávid Natingga graduated in 2014 from Imperial College London in MEng Computing with a specialization in Artificial Intelligence. In 2011, he worked at Infosys Labs in Bangalore, India, researching the optimization of machine learning algorithms. In 2012 and 2013, at Palantir Technologies in Palo Alto, USA, he developed algorithms for big data. In 2014, as a data scientist at Pact Coffee, London, UK, he created an algorithm suggesting products based on the taste preferences of customers and the structure of coffees. In 2017, he works at TomTom in Amsterdam, Netherlands, processing map data for navigation platforms.
As a part of his journey to use pure mathematics to advance the field of AI, he is a PhD candidate in Computability Theory at the University of Leeds, UK. In 2016, he spent 8 months at the Japan Advanced Institute of Science and Technology, Japan, as a research visitor.
Dávid Natingga married his wife Rheslyn and their first child will soon behold the outer world.
I would like to thank Packt Publishing for providing me with this opportunity to share my knowledge and experience in data science through this book. My gratitude belongs to my wife Rheslyn, who has been patient, loving, and supportive throughout the whole process of writing this book.
About the Reviewer
Surendra Pepakayala is a seasoned technology professional and entrepreneur with over 19 years of experience in the US and India. He has broad experience in building enterprise/web software products as a developer, architect, software engineering manager, and product manager at both start-ups and multinational companies in India and the US. He is a hands-on technologist/hacker with deep interest and expertise in Enterprise/Web Applications Development, Cloud Computing, Big Data, Data Science, Deep Learning, and Artificial Intelligence.
A technologist turned entrepreneur, after 11 years in the corporate US, Surendra founded an enterprise BI/DSS product for school districts in the US. He subsequently sold the company and started a Cloud Computing, Big Data, and Data Science consulting practice to help start-ups and IT organizations streamline their development efforts and reduce the time to market of their products and solutions. Surendra also takes pride in using his considerable IT experience to revive and turn around distressed products and projects.
He serves as an advisor to eTeki, an on-demand interviewing platform, where he leads the effort to recruit and retain world-class IT professionals into eTeki’s interviewer panel. He has reviewed drafts, recommended changes and formulated questions for various IT certifications such as CGEIT, CRISC, MSP, and TOGAF. His current focus is on applying Deep Learning to various stages of the recruiting process to help HR (staffing and corporate recruiters) find the best talent and reduce friction involved in the hiring process.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at link.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Classification Using K Nearest Neighbors
Mary and her temperature preferences
Implementation of k-nearest neighbors algorithm
Map of Italy example - choosing the value of k
House ownership - data rescaling
Text classification - using non-Euclidean distances
Text classification - k-NN in higher-dimensions
Summary
Problems
Naive Bayes
Medical test - basic application of Bayes' theorem
Proof of Bayes' theorem and its extension
Extended Bayes' theorem
Playing chess - independent events
Implementation of naive Bayes classifier
Playing chess - dependent events
Gender classification - Bayes for continuous random variables
Summary
Problems
Decision Trees
Swim preference - representing data with decision tree
Information theory
Information entropy
Coin flipping
Definition of information entropy
Information gain
Swim preference - information gain calculation
ID3 algorithm - decision tree construction
Swim preference - decision tree construction by ID3 algorithm
Implementation
Classifying with a decision tree
Classifying a data sample with the swimming preference decision tree
Playing chess - analysis with decision tree
Going shopping - dealing with data inconsistency
Summary
Problems
Random Forest
Overview of random forest algorithm
Overview of random forest construction
Swim preference - analysis with random forest
Random forest construction
Construction of random decision tree number 0
Construction of random decision tree number 1
Classification with random forest
Implementation of random forest algorithm
Playing chess example
Random forest construction
Construction of a random decision tree number 0:
Construction of a random decision tree number 1, 2, 3
Going shopping - overcoming data inconsistency with randomness and measuring the level of confidence
Summary
Problems
Clustering into K Clusters
Household incomes - clustering into k clusters
K-means clustering algorithm
Picking the initial k-centroids
Computing a centroid of a given cluster
k-means clustering algorithm on household income example
Gender classification - clustering to classify
Implementation of the k-means clustering algorithm
Input data from gender classification
Program output for gender classification data
House ownership – choosing the number of clusters
Document clustering – understanding the number of clusters k in a semantic context
Summary
Problems
Regression
Fahrenheit and Celsius conversion - linear regression on perfect data
Weight prediction from height - linear regression on real-world data
Gradient descent algorithm and its implementation
Gradient descent algorithm
Visualization - comparison of models by R and gradient descent algorithm
Flight time duration prediction from distance
Ballistic flight analysis – non-linear model
Summary
Problems
Time Series Analysis
Business profit - analysis of the trend
Electronics shop's sales - analysis of seasonality
Analyzing trends using R
Analyzing seasonality
Conclusion
Summary
Problems
Statistics
Basic concepts
Bayesian Inference
Distributions
Normal distribution
Cross-validation
K-fold cross-validation
A/B Testing
R Reference
Introduction
R Hello World example
Comments
Data types
Integer
Numeric
String
List and vector
Data frame
Linear regression
Python Reference
Introduction
Python Hello World example
Comments
Data types
Int
Float
String
Tuple
List
Set
Dictionary
Flow control
For loop
For loop on range
For loop on list
Break and continue
Functions
Program arguments
Reading and writing the file
Glossary of Algorithms and Methods in Data Science
Preface
Data science is a discipline at the intersection of machine learning, statistics, and data mining, with the objective of gaining new knowledge from existing data by means of algorithmic and statistical analysis. In this book, you will learn the seven most important ways to analyze data in data science. Each chapter first explains its algorithm or analysis as a simple concept supported by a trivial example. Further examples and exercises then build on and expand your knowledge of that particular analysis.
What this book covers
Chapter 1, Classification Using K Nearest Neighbors, Classify a data item based on the k most similar items.
Chapter 2, Naive Bayes, Learn Bayes' theorem to compute the probability of a data item belonging to a certain class.
Chapter 3, Decision Trees, Organize your decision criteria into the branches of a tree and use a decision tree to classify a data item into one of the classes at the leaf node.
Chapter 4, Random Forest, Classify a data item with an ensemble of decision trees to improve the accuracy of the algorithm by reducing the negative impact of the bias.
Chapter 5, Clustering into K Clusters, Divide your data into k clusters to discover the patterns and similarities between the data items. Exploit these patterns to classify new data.
Chapter 6, Regression, Model a phenomenon in your data by a function that can predict the values for the unknown data in a simple way.
Chapter 7, Time Series Analysis, Unveil the trend and repeating patterns in time-dependent data to predict the future of the stock market, Bitcoin prices, and other time events.
Appendix A, Statistics, Provides a summary of the statistical methods and tools useful to a data scientist.
Appendix B, R Reference, Reference to the basic R language constructs, commands, and functions used throughout the book.
Appendix C, Python Reference, Reference to the basic Python language constructs.
Appendix D, Glossary of Algorithms and Methods in Data Science, Provides a glossary for some of the most important and powerful algorithms and methods from the fields of data science and machine learning.
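As a taste of the first chapter, the idea of classifying a data item by its k most similar items can be sketched in a few lines of Python. This is only a minimal illustration and not the book's implementation: the function name knn_classify, the choice of Euclidean distance, and the small temperature and wind dataset are all assumptions made for this sketch.

```python
from collections import Counter
import math


def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs; features are numeric tuples.
    (Hypothetical helper written for this sketch.)
    """
    # Sort the training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Collect the labels of the k closest points and take a majority vote.
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]


# Made-up data: (temperature in Celsius, wind speed in km/h) -> label.
samples = [
    ((10, 0), "cold"),
    ((12, 5), "cold"),
    ((15, 10), "cold"),
    ((25, 0), "warm"),
    ((28, 5), "warm"),
    ((30, 10), "warm"),
]

prediction = knn_classify(samples, (27, 3), k=3)
print(prediction)
```

Sorting the whole training set keeps the sketch short; for larger datasets, a partial selection of the k smallest distances or a spatial index would likely be a better fit.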
What you need for this book
Most importantly, you need an active attitude toward thinking through the problems: a lot of new content is presented in the exercises. You also need to be able to run Python and R programs under the operating system of your choice. The author ran the programs under the Linux operating system using the command line.
Who this book is for
This book is for aspiring data science professionals who are familiar with Python and R and have some statistics background. Developers who are currently implementing one or two data science algorithms and want to learn more to expand their skills will find this book quite useful.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: For the visualization depicted earlier in this chapter, the matplotlib library was used.
A block of code is set as follows:
import sys
sys.path.append('..')
sys.path.append('../../common')
import knn # noqa
import common # noqa
Any command-line input or output is written as follows:
$ python knn_to_data.py mary_and_temperature_preferences.data mary_and_temperature_preferences_completed.data 1 5 30 0 10
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter.
Warnings or important notes appear like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book,