Data Science Algorithms in a Week
Ebook, 361 pages, 2 hours

About this ebook

Build a strong foundation in machine learning algorithms in 7 days.

About This Book
  • Get to know seven algorithms for your data science needs in this concise, insightful guide
  • Ensure you're confident in the basics by learning when and where to use various data science algorithms
  • Learn to use machine learning algorithms in a period of just 7 days
Who This Book Is For

This book is for aspiring data science professionals who are familiar with Python and have a statistics background. It is ideal for developers who are currently implementing one or two data science algorithms and want to learn more to expand their skill set.

What You Will Learn
  • Find out how to classify data accurately using Naive Bayes, Decision Trees, and Random Forest to solve complex problems
  • Identify a data science problem correctly and devise an appropriate prediction solution using Regression and Time-series
  • See how to cluster data using the k-Means algorithm
  • Get to know how to implement the algorithms efficiently in the Python and R languages
In Detail

Machine learning applications are highly automated and self-modifying, and they continue to improve over time with minimal human intervention as they learn with more data. To address the complex nature of various real-world data problems, specialized machine learning algorithms have been developed that solve these problems effectively. Data science helps you gain new knowledge from existing data through algorithmic and statistical analysis.

This book will address the problems related to accurate and efficient data classification and prediction. Over the course of 7 days, you will be introduced to seven algorithms, along with exercises that will help you learn different aspects of machine learning. You will see how to pre-cluster your data to optimize and classify it for large datasets. You will then find out how to predict data based on the existing trends in your datasets.

This book covers algorithms such as: k-Nearest Neighbors, Naive Bayes, Decision Trees, Random Forest, k-Means, Regression, and Time-series. On completion of the book, you will understand which machine learning algorithm to pick for clustering, classification, or regression and which is best suited for your problem.

Style and approach

Machine learning applications are highly automated and self-modifying, and they continue to improve over time with minimal human intervention as they learn with more data. To address the complex nature of various real-world data problems, specialized machine learning algorithms have been developed that solve these problems effectively.

Language: English
Release date: Aug 16, 2017
ISBN: 9781787282742

    Book preview

    Data Science Algorithms in a Week - David Natingga

    Data Science Algorithms in a Week

    Data analysis, machine learning, and more

    Dávid Natingga

    BIRMINGHAM - MUMBAI

    Data Science Algorithms in a Week

    Copyright © 2017 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: August 2017

    Production reference: 1080817

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    ISBN 978-1-78728-458-6

    www.packtpub.com

    Credits

    About the Author

    Dávid Natingga graduated in 2014 from Imperial College London in MEng Computing with a specialization in Artificial Intelligence. In 2011, he worked at Infosys Labs in Bangalore, India, researching the optimization of machine learning algorithms. In 2012 and 2013, at Palantir Technologies in Palo Alto, USA, he developed algorithms for big data. In 2014, as a data scientist at Pact Coffee, London, UK, he created an algorithm suggesting products based on the taste preferences of customers and the structure of coffees. In 2017, he worked at TomTom in Amsterdam, Netherlands, processing map data for navigation platforms.

    As a part of his journey to use pure mathematics to advance the field of AI, he is a PhD candidate in Computability Theory at the University of Leeds, UK. In 2016, he spent 8 months at the Japan Advanced Institute of Science and Technology, Japan, as a research visitor.

    Dávid Natingga married his wife Rheslyn and their first child will soon behold the outer world.

    I would like to thank Packt Publishing for providing me with this opportunity to share my knowledge and experience in data science through this book. My gratitude belongs to my wife Rheslyn, who has been patient, loving, and supportive throughout the whole process of writing this book.

    About the Reviewer

    Surendra Pepakayala is a seasoned technology professional and entrepreneur with over 19 years of experience in the US and India. He has broad experience in building enterprise/web software products as a developer, architect, software engineering manager, and product manager at both start-ups and multinational companies in India and the US. He is a hands-on technologist/hacker with deep interest and expertise in Enterprise/Web Applications Development, Cloud Computing, Big Data, Data Science, Deep Learning, and Artificial Intelligence.

    A technologist turned entrepreneur, Surendra founded an enterprise BI/DSS product for school districts in the US after 11 years in corporate US. He subsequently sold the company and started a Cloud Computing, Big Data, and Data Science consulting practice to help start-ups and IT organizations streamline their development efforts and reduce the time to market of their products/solutions. Surendra also takes pride in using his considerable IT experience to revive and turn around distressed products and projects.

    He serves as an advisor to eTeki, an on-demand interviewing platform, where he leads the effort to recruit and retain world-class IT professionals into eTeki’s interviewer panel. He has reviewed drafts, recommended changes and formulated questions for various IT certifications such as CGEIT, CRISC, MSP, and TOGAF. His current focus is on applying Deep Learning to various stages of the recruiting process to help HR (staffing and corporate recruiters) find the best talent and reduce friction involved in the hiring process.

    www.PacktPub.com

    For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www.packtpub.com/mapt

    Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Customer Feedback

    Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at link.

    If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

    Table of Contents

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    Classification Using K Nearest Neighbors

    Mary and her temperature preferences

    Implementation of k-nearest neighbors algorithm

    Map of Italy example - choosing the value of k

    House ownership - data rescaling

    Text classification - using non-Euclidean distances

    Text classification - k-NN in higher-dimensions

    Summary

    Problems

    Naive Bayes

    Medical test - basic application of Bayes' theorem

    Proof of Bayes' theorem and its extension

    Extended Bayes' theorem

    Playing chess - independent events

    Implementation of naive Bayes classifier

    Playing chess - dependent events

    Gender classification - Bayes for continuous random variables

    Summary

    Problems

    Decision Trees

    Swim preference - representing data with decision tree

    Information theory

    Information entropy

    Coin flipping

    Definition of information entropy

    Information gain

    Swim preference - information gain calculation

    ID3 algorithm - decision tree construction

    Swim preference - decision tree construction by ID3 algorithm

    Implementation

    Classifying with a decision tree

    Classifying a data sample with the swimming preference decision tree

    Playing chess - analysis with decision tree

    Going shopping - dealing with data inconsistency

    Summary

    Problems

    Random Forest

    Overview of random forest algorithm

    Overview of random forest construction

    Swim preference - analysis with random forest

    Random forest construction

    Construction of random decision tree number 0

    Construction of random decision tree number 1

    Classification with random forest

    Implementation of random forest algorithm

    Playing chess example

    Random forest construction

    Construction of a random decision tree number 0

    Construction of a random decision tree number 1, 2, 3

    Going shopping - overcoming data inconsistency with randomness and measuring the level of confidence

    Summary

    Problems

    Clustering into K Clusters

    Household incomes - clustering into k clusters

    K-means clustering algorithm

    Picking the initial k-centroids

    Computing a centroid of a given cluster

    k-means clustering algorithm on household income example

    Gender classification - clustering to classify

    Implementation of the k-means clustering algorithm

    Input data from gender classification

    Program output for gender classification data

    House ownership - choosing the number of clusters

    Document clustering - understanding the number of clusters k in a semantic context

    Summary

    Problems

    Regression

    Fahrenheit and Celsius conversion - linear regression on perfect data

    Weight prediction from height - linear regression on real-world data

    Gradient descent algorithm and its implementation

    Gradient descent algorithm

    Visualization - comparison of models by R and gradient descent algorithm

    Flight time duration prediction from distance

    Ballistic flight analysis - non-linear model

    Summary

    Problems

    Time Series Analysis

    Business profit - analysis of the trend

    Electronics shop's sales - analysis of seasonality

    Analyzing trends using R

    Analyzing seasonality

    Conclusion

    Summary

    Problems

    Statistics

    Basic concepts

    Bayesian Inference

    Distributions

    Normal distribution

    Cross-validation

    K-fold cross-validation

    A/B Testing

    R Reference

    Introduction

    R Hello World example

    Comments

    Data types

    Integer

    Numeric

    String

    List and vector

    Data frame

    Linear regression

    Python Reference

    Introduction

    Python Hello World example

    Comments

    Data types

    Int

    Float

    String

    Tuple

    List

    Set

    Dictionary

    Flow control

    For loop

    For loop on range

    For loop on list

    Break and continue

    Functions

    Program arguments

    Reading and writing the file

    Glossary of Algorithms and Methods in Data Science

    Preface

    Data science is a discipline at the intersection of machine learning, statistics, and data mining, with the objective of gaining new knowledge from existing data by means of algorithmic and statistical analysis. In this book, you will learn the seven most important ways in data science to analyze data. Each chapter first explains its algorithm or analysis as a simple concept supported by a trivial example. Further examples and exercises are used to build and expand the knowledge of a particular analysis.

    What this book covers

    Chapter 1, Classification Using K Nearest Neighbors, Classify a data item based on the k most similar items.

    Chapter 2, Naive Bayes, Learn Bayes' theorem to compute the probability of a data item belonging to a certain class.

    Chapter 3, Decision Trees, Organize your decision criteria into the branches of a tree and use a decision tree to classify a data item into one of the classes at the leaf node.

    Chapter 4, Random Forest, Classify a data item with an ensemble of decision trees to improve the accuracy of the algorithm by reducing the negative impact of the bias.

    Chapter 5, Clustering into K Clusters, Divide your data into k clusters to discover the patterns and similarities between the data items. Exploit these patterns to classify new data.

    Chapter 6, Regression, Model a phenomenon in your data by a function that can predict the values for the unknown data in a simple way.

    Chapter 7, Time Series Analysis, Unveil the trend and repeating patterns in time-dependent data to predict the future of the stock market, Bitcoin prices, and other time events.

    Appendix A, Statistics, Provides a summary of the statistical methods and tools useful to a data scientist.

    Appendix B, R Reference, Reference to the basic R language constructs.

    Appendix C, Python Reference, Reference to the basic Python language constructs, commands, and functions used throughout the book.

    Appendix D, Glossary of Algorithms and Methods in Data Science, Provides a glossary for some of the most important and powerful algorithms and methods from the fields of data science and machine learning.
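
To make the idea behind Chapter 1 concrete, here is a minimal sketch of k-nearest-neighbors classification: a data item receives the majority label among the k most similar training items. This is an illustration only, not the book's `knn` module. The function name `knn_classify`, the Euclidean distance, and the toy temperature/wind data are assumptions made for this example, and `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
import math


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training items.

    `train` is a list of (features, label) pairs; similarity is measured
    with the Euclidean distance (an assumption made for this sketch).
    """
    # Sort training items by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (temperature in degrees C, wind speed in km/h) -> label.
train = [
    ((10, 0), "cold"), ((12, 5), "cold"), ((15, 10), "cold"),
    ((25, 0), "warm"), ((27, 5), "warm"), ((30, 10), "warm"),
]

print(knn_classify(train, (26, 3), k=3))
```

With this toy data, the three items closest to the query `(26, 3)` all carry the label `warm`, so that is the class assigned. In practice, the choice of k and the scaling of the features both affect the result.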

    What you need for this book

    Most importantly, you need an active attitude toward thinking through the problems, because a lot of new content is presented in the exercises. You also need to be able to run Python and R programs under the operating system of your choice. The author ran the programs under the Linux operating system using the command line.

    Who this book is for

    This book is for aspiring data science professionals who are familiar with Python and R and have some statistics background. Developers who are currently implementing one or two data science algorithms and want to learn more to expand their skill set will find this book quite useful.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: For the visualization depicted earlier in this chapter, the matplotlib library was used.

    A block of code is set as follows:

    import sys

    sys.path.append('..')
    sys.path.append('../../common')
    import knn  # noqa
    import common  # noqa

    Any command-line input or output is written as follows:

    $ python knn_to_data.py mary_and_temperature_preferences.data mary_and_temperature_preferences_completed.data 1 5 30 0 10

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter.

    Warnings or important notes appear like this.

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book,
