Frank Kane's Taming Big Data with Apache Spark and Python

Ebook, 452 pages, 3 hours


Name: Frank Kane's Taming Big Data with Apache Spark and Python
Author: Frank Kane
ISBN: 9781787288300


About this ebook

About This Book

Understand how Spark can be distributed across computing clusters
Develop and run Spark jobs efficiently using Python
A hands-on tutorial by Frank Kane with over 15 real-world examples teaching you Big Data processing with Spark

Who This Book Is For

If you are a data scientist or data analyst who wants to learn Big Data processing using Apache Spark and Python, this book is for you. If you have some programming experience in Python, and want to learn how to process large amounts of data using Apache Spark, Frank Kane’s Taming Big Data with Apache Spark and Python will also help you.


Language: English

Publisher: Packt Publishing

Release date: Jun 30, 2017

ISBN: 9781787288300

Author

Frank Kane

Frank Kane is the author of this book and the founder of Sundog Software; he previously worked at Amazon and IMDb. (He is not the mystery novelist Frank Kane, 1912–1968, author of the Johnny Liddell series.)



bySyntax - Tasty Web Development Treats
0 ratings
0% found this document useful



Frank Kane's Taming Big Data with Apache Spark and Python


Real-world examples to help you analyze large datasets with Apache Spark


Frank Kane

BIRMINGHAM - MUMBAI


Frank Kane's Taming Big Data with Apache Spark and Python

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2017

Production reference: 1290617

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-78728-794-5

www.packtpub.com

Credits

About the Author

My name is Frank Kane. I spent nine years at amazon.com and imdb.com, wrangling millions of customer ratings and customer transactions to produce things such as personalized recommendations for movies and products and "people who bought this also bought." I tell you, I wish we had Apache Spark back then, when I spent years trying to solve these problems there. I hold 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, I left to start my own successful company, Sundog Software, which focuses on virtual reality environment technology and on teaching others about big data analysis.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787287947.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Spark

Getting set up - installing Python, a JDK, and Spark and its dependencies

Installing Enthought Canopy

Installing the Java Development Kit

Installing Spark

Running Spark code

Installing the MovieLens movie rating dataset

Run your first Spark program - the ratings histogram example

Examining the ratings counter script

Running the ratings counter script

Summary

Spark Basics and Spark Examples

What is Spark?

Spark is scalable

Spark is fast

Spark is hot

Spark is not that hard

Components of Spark

Using Python with Spark

The Resilient Distributed Dataset (RDD)

What is the RDD?

The SparkContext object

Creating RDDs

Transforming RDDs

Map example

RDD actions

Ratings histogram walk-through

Understanding the code

Setting up the SparkContext object

Loading the data

Extract (MAP) the data we care about

Perform an action - count by value

Sort and display the results

Looking at the ratings-counter script in Canopy

Key/value RDDs and the average friends by age example

Key/value concepts - RDDs can hold key/value pairs

Creating a key/value RDD

What Spark can do with key/value data?

Mapping the values of a key/value RDD

The friends by age example

Parsing (mapping) the input data

Counting up the sum of friends and number of entries per age

Compute averages

Collect and display the results

Running the average friends by age example

Examining the script

Running the code

Filtering RDDs and the minimum temperature by location example

What is filter()

The source data for the minimum temperature by location example

Parse (map) the input data

Filter out all but the TMIN entries

Create (station ID, temperature) key/value pairs

Find minimum temperature by station ID

Collect and print results

Running the minimum temperature example and modifying it for maximums

Examining the min-temperatures script

Running the script

Running the maximum temperature by location example

Counting word occurrences using flatmap()

Map versus flatmap

Map ()

Flatmap ()

Code sample - count the words in a book

Improving the word-count script with regular expressions

Text normalization

Examining the use of regular expressions in the word-count script

Running the code

Sorting the word count results

Step 1 - Implement countByValue() the hard way to create a new RDD

Step 2 - Sort the new RDD

Examining the script

Running the code

Find the total amount spent by customer

Introducing the problem

Strategy for solving the problem

Useful snippets of code

Check your results and sort them by the total amount spent

Check your sorted implementation and results against mine

Summary

Advanced Examples of Spark Programs

Finding the most popular movie

Examining the popular-movies script

Getting results

Using broadcast variables to display movie names instead of ID numbers

Introducing broadcast variables

Examining the popular-movies-nicer.py script

Getting results

Finding the most popular superhero in a social graph

Superhero social networks

Input data format

Strategy

Running the script - discover who the most popular superhero is

Mapping input data to (hero ID, number of co-occurrences) per line

Adding up co-occurrence by hero ID

Flipping the (map) RDD to (number, hero ID)

Using max() and looking up the name of the winner

Getting results

Superhero degrees of separation - introducing the breadth-first search algorithm

Degrees of separation

How the breadth-first search algorithm works?

The initial condition of our social graph

First pass through the graph

Second pass through the graph

Third pass through the graph

Final pass through the graph

Accumulators and implementing BFS in Spark

Convert the input file into structured data

Writing code to convert Marvel-Graph.txt to BFS nodes

Iteratively process the RDD

Using a mapper and a reducer

How do we know when we're done?

Superhero degrees of separation - review the code and run it

Setting up an accumulator and using the convert to BFS function

Calling flatMap()

Calling an action

Calling reduceByKey

Getting results

Item-based collaborative filtering in Spark, cache(), and persist()

How does item-based collaborative filtering work?

Making item-based collaborative filtering a Spark problem

It's getting real

Caching RDDs

Running the similar-movies script using Spark's cluster manager

Examining the script

Getting results

Improving the quality of the similar movies example

Summary

Running Spark on a Cluster

Introducing Elastic MapReduce

Why use Elastic MapReduce?

Warning - Spark on EMR is not cheap

Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY

Partitioning

Using .partitionBy()

Choosing a partition size

Creating similar movies from one million ratings - part 1

Changes to the script

Creating similar movies from one million ratings - part 2

Our strategy

Specifying memory per executor

Specifying a cluster manager

Running on a cluster

Setting up to run the movie-similarities-1m.py script on a cluster

Preparing the script

Creating a cluster

Connecting to the master node using SSH

Running the code

Creating similar movies from one million ratings – part 3

Assessing the results

Terminating the cluster

Troubleshooting Spark on a cluster

More troubleshooting and managing dependencies

Troubleshooting

Managing dependencies

Summary

SparkSQL, DataFrames, and DataSets

Introducing SparkSQL

Using SparkSQL in Python

More things you can do with DataFrames

Differences between DataFrames and DataSets

Shell access in SparkSQL

User-defined functions (UDFs)

Executing SQL commands and SQL-style functions on a DataFrame

Using SQL-style functions instead of queries

Using DataFrames instead of RDDs

Summary

Other Spark Technologies and Libraries

Introducing MLlib

MLlib capabilities

Special MLlib data types

For more information on machine learning

Making movie recommendations

Using MLlib to produce movie recommendations

Examining the movie-recommendations-als.py script

Analyzing the ALS recommendations results

Why did we get bad results?

Using DataFrames with MLlib

Examining the spark-linear-regression.py script

Getting results

Spark Streaming and GraphX

What is Spark Streaming?

GraphX

Summary

Where to Go From Here? – Learning More About Spark and Data Science

Preface

We will do some really quick housekeeping here, just so you know where to put all the stuff for this book. First, I want you to go to your hard drive, create a new folder called SparkCourse, and put it somewhere you're going to remember:

I put mine on my C drive, in a folder called SparkCourse. This is where you're going to put everything for this book. As you go through the individual sections of this book, you'll see that there are resources provided for each one. These can be different kinds of resources, files, and downloads. When you download them, make sure you put them in the folder you have created. It is the ultimate destination of everything you're going to download for this book, as you can see in my SparkCourse folder, shown in the following screenshot; you'll just accumulate all this stuff over time as you work your way through it:

So, remember where you put it all; you might need to refer to these files by their path, in this case C:\SparkCourse. Just make sure you download them to a consistent place and you should be good to go. Also, be cognizant of the differences in file paths between operating systems. If you're on Mac or Linux, you're not going to have a C drive; you'll just have a slash and the full path name. Capitalization might be important there, while it's not on Windows. Those systems also use forward slashes in paths where Windows uses backslashes. So if you are using something other than Windows, remember these differences and don't let them trip you up. If you see a path to a file in a script, make sure you adjust it to match where you put these files and what your operating system is.
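The path differences described above can be sketched with Python's standard pathlib module. This is only an illustration, not code from the book: the folder locations, the file name ratings.csv, and the helper course_path are hypothetical, so substitute wherever you actually created your SparkCourse folder.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Hypothetical folder locations; adjust them to wherever you created SparkCourse.
windows_folder = PureWindowsPath("C:/SparkCourse")
posix_folder = PurePosixPath("/home/yourname/SparkCourse")

# Hypothetical file name, used only for illustration.
data_file = "ratings.csv"

# The same join produces the separator style each operating system expects.
print(windows_folder / data_file)  # C:\SparkCourse\ratings.csv
print(posix_folder / data_file)    # /home/yourname/SparkCourse/ratings.csv


def course_path(filename, base=None):
    """Build the path to a file inside the course folder on the current OS.

    If no base folder is given, assume a SparkCourse folder in the
    user's home directory (an assumption, not the book's layout).
    """
    base = Path(base) if base is not None else Path.home() / "SparkCourse"
    return base / filename
```

Building paths this way, rather than typing separators by hand, means a script only needs its base folder changed when it moves between Windows, Mac, and Linux.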

What this book covers

Chapter 1, Getting Started with Spark, covers basic installation instructions for Spark and its related software. This chapter illustrates a simple example of data analysis of real movie ratings data provided by different sets of people.

Chapter 2, Spark Basics and Spark Examples, provides a brief overview of what Spark is all about, who uses it, how it helps in analyzing big data, and why it is so popular.

Chapter 3, Advanced Examples of Spark Programs, illustrates some advanced and complicated examples with Spark.

Chapter 4, Running Spark on a Cluster, talks about Spark Core, covering the things you can do with Spark, such as running Spark in the cloud on a cluster, analyzing a real cluster in the cloud using Spark, and so on.

Chapter 5, SparkSQL, DataFrames, and DataSets, introduces SparkSQL, which is an important concept of Spark, and explains how to deal with structured data formats using this.

Chapter 6, Other Spark Technologies and Libraries, talks about MLlib (Machine Learning library), which is very helpful if you want to work on data mining or machine learning-related jobs with Spark. This chapter also covers Spark Streaming and GraphX, technologies built on top of Spark.

Chapter 7, Where to Go From Here? - Learning More About Spark and Data Science, recommends some books on Spark for readers who want to learn more about this topic.

What you need for this book

For this book you’ll need a Python development environment (Python 3.5 or newer), a Canopy installer, Java Development Kit, and of course Spark itself (Spark 2.0 and beyond).

We'll show you how to install this software in the first chapter of the book.

This book is based on the Windows operating system, so the installation steps are written for it. If you have Mac or Linux, you can follow this URL, http://media.sundog-soft.com/spark-python-install.pdf, which contains written instructions on getting everything set up on Mac OS and on Linux.
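Once everything is installed, a quick check along the following lines can confirm that the prerequisites are visible to Python. This is a minimal sketch rather than part of the book's setup instructions: the function name check_prerequisites is hypothetical, and it assumes the JDK and Spark put their usual java and spark-submit launchers on your PATH. It does not verify that Spark is version 2.0 or newer, or that Java is a full JDK.

```python
import shutil
import sys


def check_prerequisites():
    """Report which of the book's prerequisites appear to be available."""
    return {
        # The book asks for Python 3.5 or newer.
        "python": sys.version_info >= (3, 5),
        # Assumption: the JDK's `java` launcher is on the PATH.
        "java": shutil.which("java") is not None,
        # Assumption: Spark's `spark-submit` launcher is on the PATH.
        "spark": shutil.which("spark-submit") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(name, "OK" if found else "missing or too old")
```

If java or spark shows up as missing even though you installed it, the most likely cause is that its bin folder has not been added to your PATH environment variable.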

Who this book is for

I wrote this book for people who have at least some programming or scripting experience in their background. We're going to be using the Python programming language throughout this book, which is very easy to pick up, and I'm going to give you over 15 real hands-on examples of Spark Python scripts that you can run yourself, mess around with, and learn from. So, by the end of this book, you should have the skills needed to actually turn business problems into Spark problems, code up that Spark code on your own, and actually run it in the cluster on your own.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "Now, you'll need to remember the path that we installed the JDK into, which in our case was C:\jdk."

A block of code is set as follows:

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

Any command-line input or output is written as follows:

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Now, if you're on Windows, I want you to right-click on the Enthought Canopy icon, go to Properties and then to Compatibility (this is on Windows 10), and make sure Run this program as an administrator is checked."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Frank-Kanes-Taming-Big-Data-with-Apache-Spark-and-Python. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/FrankKanesTamingBigDatawithApacheSparkandPython_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any
