
Exploring Hadoop Ecosystem (Volume 1): Batch Processing
Ebook · 492 pages · 3 hours


About this ebook

The Hadoop ecosystem consists of many components, which can be daunting for anyone trying to learn or understand them. This book helps data engineers and architects understand the internals of big data technologies, from the basics of HDFS and MapReduce through Kafka, Spark, and more. There are currently two volumes: Volume 1 covers batch processing, and Volume 2 covers stream processing.
Language: English
Publisher: Lulu.com
Release date: Mar 31, 2021
ISBN: 9781667186184
Author

Wei Liu

Wei Liu is a Doctor of Engineering from Beijing University of Aeronautics and Astronautics, a professor at Beijing University of Posts and Telecommunications, a visiting scholar at Cambridge University, an expert in the Artificial Intelligence Group of the Center for Strategy and Security at Tsinghua University, and vice chairman of the cognitive branch of the China Association of Command and Control. His research interests include human-computer integration intelligence, cognitive engineering, human-machine-environment system engineering, future situation-awareness modes, and behavior analysis/prediction technology. So far, he has published more than 70 papers, 4 monographs, and 2 translations. At present, he is a distinguished expert on the Expert Committee of the China Information and Electronic Engineering Science and Technology Development Center, an appraisal expert of the National Natural Science Foundation of China, a member of the National Ergonomics Standardization Technical Committee, and a senior member of the Chinese Artificial Intelligence Society.


    Book preview

    Exploring Hadoop Ecosystem (Volume 1) - Wei Liu

    Infrastructure

    Hadoop Ecosystem

    Generally, when we refer to Hadoop, we mean an extensive software package also called the Hadoop ecosystem. It contains the core components (the core Hadoop framework) as well as various extensions that add functionality to the core framework for processing large amounts of data. The Hadoop ecosystem includes both Apache open-source projects and a wide variety of commercial tools and solutions. Each component of the Hadoop ecosystem has its own developer community and its own release cycle.

    Core Framework

    The core Hadoop framework constitutes the basis of the Hadoop ecosystem. The framework itself is mostly written in Java, with some native code in C and command-line utilities written as shell scripts. It is composed of the following modules:

    [Figure: core Hadoop framework modules (Hadoop Common, HDFS, YARN, MapReduce)]

    The Hadoop 2.x architecture adds one new component, YARN, and it is a game-changing one. In Hadoop 1.x, both application management and resource management were handled by MapReduce; in Hadoop 2.x, MapReduce handles application management while YARN manages the resources.

    Related projects

    In addition to the core components, the Hadoop ecosystem encompasses a wide range of extensions. The following projects are the ones covered in this book.

    Hive

    Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

    Tez

    Apache Tez is aimed at building an application framework which allows for a complex DAG of tasks for processing data.

    HBase

    Apache HBase is the Hadoop database, a distributed, scalable, big data store.

    ZooKeeper

    Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

    Oozie

    Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

    Sqoop

    Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

    Flume

    Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

    Ambari

    Apache Ambari is aimed at making Hadoop management simpler. Ambari provides an intuitive, easy-to-use Hadoop management web UI and RESTful APIs.

    Spark

    Apache Spark is a unified analytics engine for large-scale data processing.

    Kafka

    Apache Kafka is a distributed event streaming platform.

    Hadoop Distributions

    Hadoop is an open-source project under the Apache Software Foundation, and most components in the Hadoop ecosystem are also open source. Several vendors have stepped in to develop their own distributions on top of the Hadoop framework to make it enterprise ready. They add functionality by improving the code base and bundling it with user-friendly management tools, technical support, and continuous updates.

    There are many distributions available in the market. They pull together the enhancement projects present in the Apache repository and present them as a unified product, so that organizations do not have to spend time assembling these elements into a single functional whole. The most recognized Hadoop distributions are Cloudera, Hortonworks, and MapR. All three use the core Hadoop framework and bundle it for enterprise use, typically offering it under a support and subscription service model.

    Cloudera

    Cloudera, founded in 2008, provides the oldest Hadoop distribution available. People at Cloudera are committed to the open-source community and have contributed to Hive, Impala, Hadoop, Pig, and other popular open-source projects. The Cloudera distribution packages a good set of tools together to provide a solid Hadoop experience, including a GUI to manage and monitor clusters known as Cloudera Manager.

    Hortonworks

    Hortonworks was founded in 2011 and offers the Hortonworks Data Platform (HDP), an open-source Hadoop distribution. The Hortonworks distribution is widely used in organizations and provides the Apache Ambari GUI-based interface to manage and monitor clusters. Hortonworks contributes to many open-source projects such as Apache Tez, Hadoop, YARN, and Hive. It has also launched the Hortonworks DataFlow (HDF) platform for data ingestion and storage. The Hortonworks distribution pays particular attention to the security side of Hadoop, integrating security features such as Ranger, Kerberos, and SSL with the HDP and HDF platforms.

    MapR

    MapR was founded in 2009 and has its own filesystem, MapR-FS, which is quite similar to HDFS but adds some new features built by MapR. It boasts higher performance, ships a nice set of tools to manage and administer a cluster, and does not suffer from a single point of failure. It also offers useful features such as mirroring and snapshots.

    Cloudera and Hortonworks merged in 2019 and the new company uses the Cloudera brand.

    There are a few popular distributions available for the cloud.

    Microsoft Azure

    Microsoft offers HDInsight as a Hadoop distribution. It provides a cost-effective solution for setting up Hadoop infrastructure and for monitoring and managing cluster resources. Azure claims to provide a fully cloud-based cluster with a 99.9% Service Level Agreement (SLA).

    Amazon

    Amazon provides Elastic MapReduce (EMR) along with many other Hadoop ecosystem tools in its distribution. It offers the S3 filesystem as an alternative to HDFS. EMR provides a cost-effective setup for Hadoop in the cloud, and it is currently the most actively used cloud Hadoop distribution.

    All the examples in this book use HDP 3.1.4.

    Google File System (GFS) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. HDFS is an open-source Java product similar to GFS, based on the paper about the Google File System that Google published in 2003.

    HDFS, short for Hadoop Distributed File System, is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds and even thousands of nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN.

    Hadoop FileSystem APIs

    Hadoop has an abstract notion of filesystem. The specification of the Hadoop FileSystem APIs can be found here: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html.

    org.apache.hadoop.fs.FileSystem is an abstract base class for a fairly generic filesystem. It may be implemented as a distributed filesystem or as a local one. The local implementation is LocalFileSystem and the distributed implementation is DistributedFileSystem; the local implementation exists for small Hadoop instances and for testing. HDFS is the distributed filesystem that ships with Hadoop and implements this interface.
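    The snippet below is a minimal sketch of this API in action, assuming a Java client with the Hadoop client libraries on its classpath; the NameNode address and paths are hypothetical placeholders, not values taken from this book's cluster. It writes a small file through the abstract FileSystem interface and reads it back.

        import java.nio.charset.StandardCharsets;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class FsApiDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Placeholder NameNode address; on a real cluster this normally comes from core-site.xml.
                conf.set("fs.defaultFS", "hdfs://namenode:8020");

                FileSystem fs = FileSystem.get(conf);
                Path file = new Path("/tmp/fs-api-demo.txt");

                // Write a small file.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
                }

                // Read it back and copy the contents to stdout.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }

    The same code runs unchanged against LocalFileSystem if fs.defaultFS points at file:///, which is what makes the abstraction convenient for testing.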

    Hadoop Compatible File Systems

    There are other implementations for object stores and third-party filesystems. All such filesystems must extend the org.apache.hadoop.fs.FileSystem class, which ensures there is a single API that applications such as MapReduce, Apache HBase, Apache Giraph, and others can use. We call these filesystems Hadoop Compatible File Systems; examples include Azure Data Lake Storage, Azure Blob Storage, Amazon S3, Aliyun OSS, and OpenStack Swift.

    Example of Microsoft Azure

    The hadoop-azure module provides integration with Azure Blob Storage and Azure Data Lake Storage Gen2. The built JAR file, named hadoop-azure.jar, includes two drivers: the Windows Azure Storage Blob (WASB) driver, which supports Azure Blob Storage, and the Azure Blob File System (ABFS) driver, which supports Azure Data Lake Storage Gen2.

    The hadoop-azure-datalake module provides integration with Azure Data Lake Storage Gen1. The JAR file azure-datalake-store.jar includes one driver: the Azure Data Lake (ADL) driver.

    The WASB, ABFS, and ADL drivers, built on top of the Hadoop FileSystem APIs, are all part of Apache Hadoop and are included in many commercial Hadoop distributions. We can manually mount Azure Storage on a Hadoop cluster that lives anywhere, as long as the cluster has Internet access to Azure Storage.

    URI syntax

    For Azure Blob Storage the URI is: wasb[s]://<container>@<account_name>.blob.core.windows.net/<path>

    For ADLS Gen2 the URI is: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>

    For ADLS Gen1 the URI is: adl://<account_name>.azuredatalakestore.net/<path>
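    As an illustrative sketch of how such a store is reached through the same FileSystem API, the following Java fragment lists a directory in Azure Blob Storage over the WASB driver. It assumes the hadoop-azure module is on the classpath, and the storage account, container, and key shown are hypothetical placeholders.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class WasbListDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Hypothetical account name and key; real deployments normally put these in core-site.xml.
                conf.set("fs.azure.account.key.myaccount.blob.core.windows.net", "<storage-account-key>");

                Path dir = new Path("wasbs://mycontainer@myaccount.blob.core.windows.net/data/");
                // The WASB driver is selected automatically from the wasbs:// scheme.
                FileSystem fs = dir.getFileSystem(conf);
                for (FileStatus status : fs.listStatus(dir)) {
                    System.out.println(status.getPath() + "\t" + status.getLen());
                }
            }
        }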

    HDInsight

    Meanwhile, these drivers are all built into HDInsight. During the HDInsight cluster creation process, we can specify a container in Azure Blob Storage or Azure Data Lake Storage Gen2/Gen1 as the default file system.

    When migrating big data workloads to the cloud, one of the most commonly asked questions is how to evaluate HDFS versus the storage systems provided by cloud providers, such as Amazon S3, Microsoft Azure Blob, Microsoft Azure Data Lake Storage.

    HDFS Architecture

    [Figure: HDFS architecture]

    HDFS has a master-slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of DataNodes, which manage storage attached to the nodes that they run on.

    Components of HDFS Architecture

    NameNode (master daemon)

    Maintains and manages the DNs.

    Records metadata.

    Receives Heartbeat and Blockreport from the DNs.

    Secondary NameNode

    It is not a NN backup. It merges the fsimage and the edits files periodically and keeps the edits size within a limit. It usually runs on a different machine than the NN.

    DataNode (slave daemon)

    Stores the actual blocks and block metadata (length, checksum, timestamp, etc.). Although DNs do not contain metadata about the directories and files stored in an HDFS cluster, they do contain a small amount of metadata about the DN itself and its relationship to the cluster.

    Serves read and write requests from the HDFS Client.

    HDFS Client

    HDFS currently provides 3 client interfaces: DistributedFileSystem, FsShell, and DFSAdmin.

    DistributedFileSystem provides APIs for users to develop HDFS-based applications.

    FsShell allows users to perform common filesystem operations such as create, delete, etc. through the HDFS shell commands hdfs dfs or hadoop fs.

    DFSAdmin provides system administrators with the management shell command hdfs dfsadmin, used for tasks such as performing upgrades, managing safe mode, and more.

    DistributedFileSystem, FsShell, and DFSAdmin all manage and manipulate HDFS by directly or indirectly holding a reference to DFSClient and calling the interface it provides. DFSClient encapsulates the complicated interaction logic and exposes a simple interface to the outside.

    [Figure: class diagram of the HDFS client classes]
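    As a small sketch of this relationship (the NameNode URI is a placeholder), asking the FileSystem factory for an hdfs:// URI yields a DistributedFileSystem, which in turn delegates its operations to an internal DFSClient:

        import java.net.URI;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        public class WhichFileSystem {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Placeholder NameNode URI; replace with a real cluster address.
                FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
                // Prints org.apache.hadoop.hdfs.DistributedFileSystem
                System.out.println(fs.getClass().getName());
            }
        }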

    HDFS Communication Protocol

    All HDFS communication protocols are layered on top of the TCP/IP protocol.

    The HDFS Client talks to the NN using ClientProtocol.

    The HDFS Client talks to the DNs using ClientDatanodeProtocol.

    The Secondary NN talks to the NN using NamenodeProtocol.

    DNs talk to the NN using the DatanodeProtocol.

    DNs talk to each other using the InterDatanodeProtocol.

    Main structure of NameNode

    namespace

    Namespace is a hierarchy of files and directories.

    Every file, directory, and block in HDFS is represented as an object in the NN's memory. Files and directories are represented on the NN by INodes. We call INodes and Blocks namespace objects. Each namespace object consumes approximately 150 bytes of memory.
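    As a rough, illustrative calculation (the file counts here are assumptions, not figures from the book): a namespace holding 100 million files that average one block each yields about 200 million namespace objects (one INode plus one Block per file), which at roughly 150 bytes each comes to about 200,000,000 × 150 bytes ≈ 30 GB of NameNode heap for namespace objects alone. This is why the number of files, rather than their total size, is usually what limits how far a single NN can scale.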

    metadata

    The namespace is essentially the filesystem tree, while metadata describes the directories and files in that tree. The NN maintains the filesystem tree and the metadata for all the files and directories in the tree.

    Metadata includes information about directories and files such as ownership, permissions, quotas, the replication factor, the mapping of blocks to files, and the mapping of blocks to DNs.

    There are two files associated with metadata: fsimage (an image of the file system) and edits (a series of modifications made to the file system).

    When the NN crashes, we can use this metadata to reconstruct the entire filesystem, so it is important that the metadata is safely persisted to stable storage.

    Class Diagram

    [Figure: class diagram of the NameNode's main structures]

    FSNamesystem is a container of both transient and persisted namespace state, and does all the book-keeping work on a NameNode. Both FSDirectory and FSNamesystem manage the state of the namespace. FSDirectory is a pure in-memory data structure, all of whose operations happen entirely in memory. In contrast, FSNamesystem persists the operations to the disk.

    FSImage handles checkpointing and logging of the namespace edits. FSEditLog maintains a log of the namespace modifications.

    FSDirectory is responsible for maintaining the tree structure of the entire filesystem, but looking data up by walking the tree is inefficient. BlocksMap makes block lookup an O(1) operation: it maintains the mapping of blocks to files and the mapping of blocks to DNs.

    fsimage and edits

    When the NN is formatted, it creates a directory structure that contains the fsimage, edits, and VERSION files. The NN uses files in its local host OS filesystem to store fsimage and edits. We can think of HDFS metadata as consisting of two parts: fsimage and edits.

    fsimage

    An fsimage file contains the complete state of the file system (except the mapping of blocks to DNs) at a point in time. The entire filesystem namespace, including the mapping of blocks to files and the filesystem properties, is stored in the fsimage file.

    edits

    An edits file is a log that lists each file system change (file creation, deletion or modification) that was made after the most recent fsimage.

    NameNode Metadata directory

    This screenshot is an example of an HDFS metadata directory taken from a NameNode. In this example, the same directory has been used for both fsimage and edits. Alternative configuration options are available that allow separating fsimage and edits into different directories.

    [Figure: listing of a NameNode metadata directory]

    VERSION

    Text file that contains the following elements:

    layoutVersion: the version of the HDFS metadata format. When we add new features that require a change to the metadata format, we change this number.

    namespaceID / clusterID / blockpoolID: unique identifiers of an HDFS cluster. These identifiers are used to prevent DataNodes from registering accidentally with an incorrect NameNode that is part of a different cluster. They are also particularly important in a federated deployment, where multiple NameNodes work independently. Each NameNode serves a unique portion of the namespace (namespaceID) and manages a unique set of blocks (blockpoolID). The clusterID ties the whole cluster together as a single logical unit and is the same across all nodes in the cluster.

    storageType: always NAME_NODE for the NameNode, and never JOURNAL_NODE.

    cTime: creation time of the file system state.

    edits_<start transaction ID>-<end transaction ID>

    Finalized and unmodifiable edit log segments. Each of these files contains all of the edit log transactions in the range defined by the file name. In a High Availability deployment, the standby can only read up through the finalized log segments; it is not up to date with the current edit log in progress. When an HA failover happens, the failover first finalizes the current log segment so that the standby is completely caught up before switching to active.

    edits_inprogress_<start transaction ID>

    This is the current edit log in progress. All transactions starting from the specified transaction are in this file, and all new incoming transactions will get appended to this file.

    fsimage_<end transaction ID>

    Contains the complete metadata image up through that transaction ID. Each fsimage file also has a corresponding md5 file containing an MD5 checksum, which HDFS uses to guard against disk corruption.

    seen_txid

    Contains the last transaction ID of the last checkpoint (merge of edits into an fsimage) or edit log roll (finalization of the current edits_inprogress and creation of a new one). Note that this is not the last transaction ID accepted by the NameNode; the file is not updated on every transaction, only on a checkpoint or an edit log roll. The purpose of this file is to help detect missing edits during startup. It is possible to configure the NameNode to use separate directories for fsimage and edits files; if the edits directory were accidentally deleted, all transactions since the last checkpoint would be lost and the NameNode would start up with the filesystem in an old state. To guard against this, NameNode startup checks seen_txid to verify that it can load transactions at least up through that number, and it aborts startup if it cannot.

    When restarting, the NN reconstructs the complete namespace from fsimage, edits, and Blockreports. After the NN loads the fsimage (and edits), the mapping of blocks to DNs is still missing; it has to be collected dynamically from the DNs' Blockreports.

    Checkpointing

    Checkpointing is a process that takes an fsimage and edits and compacts them into a new fsimage.

    The NN maintains the namespace of HDFS. All metadata is stored in the fsimage and edits files on the NN. The Secondary NN performs checkpointing for the NN.

    Because checkpointing merges edits into the fsimage, the process is resource intensive and can impact ongoing requests at the NameNode.

    When the NN starts up, or a checkpoint is triggered by a configurable threshold, it reads the fsimage and edits from disk, applies all the transactions from the edits to the in-memory representation of the fsimage, and flushes out this new version into a new fsimage on disk. It can then truncate the old edits because its transactions have been applied to the persistent fsimage. 

    A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period) expressed in seconds, or after a given number of filesystem transactions have accumulated (dfs.namenode.checkpoint.txns, dfs.namenode.checkpoint.check.period). If both of these properties are set, the first threshold to be reached triggers a checkpoint.
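    To make this concrete, assuming the stock defaults rather than values tuned for a particular cluster: dfs.namenode.checkpoint.period defaults to 3600 seconds, dfs.namenode.checkpoint.txns defaults to 1,000,000 transactions, and dfs.namenode.checkpoint.check.period (default 60 seconds) controls how often the accumulated transaction count is polled. Under those defaults, a checkpoint runs at least once an hour, or sooner if a million edit transactions accumulate first.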

    Checkpointing Process

    [Figure: checkpointing process performed by the Secondary NameNode]

    The Secondary NN checks whether either of the two preconditions is met.

    When it’s time to perform the checkpoint, the NN creates a new file to accept the file system changes. It names the new file edits.new.

    The edits file, along with the fsimage file, is copied to the Secondary NN.

    The Secondary NN merges these two files, creating a file named fsimage.ckpt.

    The Secondary NN copies the fsimage.ckpt file to the NN.

    The NN overwrites the file fsimage with fsimage.ckpt and renames the edits.new to edits.

    Heartbeat vs Blockreport

    The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

    Heartbeat

    DNs send heartbeats to the NN to confirm that the DataNode is operating and the block replicas it hosts are available.

    The default heartbeat interval is 3 seconds; it is determined by the configuration parameter dfs.heartbeat.interval in hdfs-site.xml. If the NN does not receive a heartbeat from a DN for about 10 minutes, the NN considers the DN to be out of service and the block replicas hosted by that DN to be unavailable. The NN then schedules the creation of new replicas of those blocks on other DNs.
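    The 10-minute figure comes from the DN liveness timeout. Assuming the stock defaults (this is the standard formula, not a value specific to this book's cluster), the timeout is 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval = 2 × 300 s + 10 × 3 s = 630 s, i.e. about 10.5 minutes.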

    Heartbeats from a DN also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NN's block allocation and load balancing decisions.

    The NN does not directly send requests to DNs. It uses replies to heartbeats to send instructions to the DNs. The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node.

    Blockreport

    When a DN starts up, it scans through its local file system, generates a list of all data blocks that correspond to each of these local files and sends this report to the NN.

    The first block report is sent immediately after the DN registration. Subsequent block reports are sent periodically and provide the NN with an up-to-date view of where block replicas are located on the cluster. The Blockreport interval is determined by the configuration parameter dfs.blockreport.intervalMsec in hdfs-site.xml; by default it is set to 21600000 milliseconds (6 hours).

    Rack Awareness

    In a large cluster, in order to reduce network traffic when reading or writing HDFS files, the NN chooses DNs on the same rack as, or on a rack near, the node issuing the read/write request. The NN maintains the rack ID of each DN to keep track of this rack information. This process of choosing nearby DNs based on rack ID is called Rack Awareness.

    By default, the NN has no idea which node is in which rack; it assumes that all nodes are in the same rack, which is likely true for small clusters. It calls this rack /default-rack.

    The NN obtains the rack IDs of the cluster nodes by invoking either an external script or a Java class, as specified by one of the following configuration parameters:

    net.topology.script.file.name parameter in the configuration file

    net.topology.node.switch.mapping.impl parameter in the configuration file

    Network Topology

    distance(/d1/r1/n1, /d1/r1/n1) = 0 (same node)

    distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)

    distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)

    distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)

    [Figure: network topology tree of data centers, racks, and nodes]
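    These distances can be reproduced with Hadoop's own topology classes. The following is a minimal sketch using org.apache.hadoop.net.NetworkTopology and NodeBase from the Hadoop common module; the /d1/r1/n1-style node paths are the same hypothetical locations used above.

        import org.apache.hadoop.net.NetworkTopology;
        import org.apache.hadoop.net.Node;
        import org.apache.hadoop.net.NodeBase;

        public class TopologyDistanceDemo {
            public static void main(String[] args) {
                NetworkTopology topology = new NetworkTopology();
                // Each path is /<data center>/<rack>/<node>.
                Node n1 = new NodeBase("/d1/r1/n1");
                Node n2 = new NodeBase("/d1/r1/n2");
                Node n3 = new NodeBase("/d1/r2/n3");
                Node n4 = new NodeBase("/d2/r3/n4");
                topology.add(n1);
                topology.add(n2);
                topology.add(n3);
                topology.add(n4);

                System.out.println(topology.getDistance(n1, n1)); // 0: same node
                System.out.println(topology.getDistance(n1, n2)); // 2: same rack
                System.out.println(topology.getDistance(n1, n3)); // 4: same data center, different rack
                System.out.println(topology.getDistance(n1, n4)); // 6: different data centers
            }
        }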

    Data Block Replicas

    HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack.

    HDFS stores each file as a sequence of blocks. All blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance.

    The block size and replication factor are configurable per file. The default block size in HDFS is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x, and the default replication factor is 3. The maximum number of replicas is the total number of DNs at that time, because the NN does not allow one DN to hold multiple replicas of the same block.
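    As an illustrative sketch of the per-file settings (the NameNode address, path, and values are placeholders), a client can override the block size and replication factor when creating a file, and can change the replication factor afterwards:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PerFileBlockSettings {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address

                FileSystem fs = FileSystem.get(conf);
                Path file = new Path("/tmp/custom-blocks.dat");

                int bufferSize = 4096;
                short replication = 2;                 // instead of the default 3
                long blockSize = 256L * 1024 * 1024;   // 256 MB instead of the default 128 MB

                // This create() overload takes the replication factor and block size per file.
                try (FSDataOutputStream out =
                         fs.create(file, true, bufferSize, replication, blockSize)) {
                    out.writeUTF("data written with a custom block size and replication factor");
                }

                // The replication factor can also be changed after the file exists.
                fs.setReplication(file, (short) 3);
            }
        }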

    Hadoop divides a file into blocks based on bytes, without taking into account the logical records within the file. That means the start of an HDFS block typically contains a remainder of the
