Notes on Big Data and Apache Flume
The Apache Flume tool is designed mainly for ingesting a high volume of
event-based data, especially unstructured data, into Hadoop. Flume moves
this data to the Hadoop Distributed File System (HDFS) for further
processing and is flexible enough to write to other storage systems such as
HBase or Solr. This makes Apache Flume an excellent data-ingestion tool for
aggregating, storing, and analyzing data with Hadoop. This article expects
you to have some basic knowledge of the Hadoop ecosystem and its
components.
There are a few clear reasons why a system like Apache Flume is needed.
These can be summarized as –
1. Reducing load and latency issues: When big data needs to be handled on a
Hadoop cluster, it is usually produced by a large number of servers, i.e.,
hundreds or possibly thousands of servers. Thus, multiple servers trying to
write data to an HDFS/HBase cluster can cause significant problems. Since
HDFS allows only one client to write to a given file at a time, a large
number of files may be written to HDFS in parallel, resulting in many
complex sets of operations occurring concurrently on the NameNode. This
increases the load on that machine. Similarly, when thousands of machines are writing a
vast amount of data to fewer machines, the connecting network might get
overloaded and cause severe latency issues. To address this issue, a Flume
system is required.
2. Add flexibility and scalability to the entire system: Due to the flexible
structure of Flume agents, it is possible to have control over the data flows
with different configurations using multiple sources/channels/sinks
(explained later in the article). So, to add flexibility and scalability, a Flume
system might be needed.
3. Data Flow: It supports complex data flows such as multi-hop flows, fan-in
flows, fan-out flows, contextual routing, etc.
4. Ease of use: With Flume, we can ingest stream data from multiple web
servers and store it in any centralized store such as HBase, Hadoop
HDFS, etc.
5. Imported Data Size: It can efficiently ingest log data into a centralized
repository from various servers. It allows importing the large volumes of
data generated by social networking and e-commerce sites into HDFS.
6. Streaming: Using Flume, we can collect streaming data from different
sources (email messages, network traffic, log files, social media, etc.) in
real time as well as in batch mode and transport it to HDFS.
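To make the streaming-to-HDFS scenario above concrete, here is a minimal sketch of an HDFS sink configuration. The agent name (`agent1`), sink name (`s1`), channel name (`ch1`), and the NameNode address are hypothetical; the property keys are standard Flume HDFS sink settings.

```properties
# Hypothetical sink "s1" on agent "agent1" writing events to HDFS,
# rolling files every 5 minutes and bucketing paths by date.
agent1.sinks = s1
agent1.sinks.s1.type = hdfs
agent1.sinks.s1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
agent1.sinks.s1.hdfs.fileType = DataStream
agent1.sinks.s1.hdfs.rollInterval = 300
# Use local time to resolve the %Y-%m-%d escapes instead of an event header
agent1.sinks.s1.hdfs.useLocalTimeStamp = true
agent1.sinks.s1.channel = ch1
```

Tuning the roll settings (interval, size, or event count) is what keeps Flume from creating many small files on HDFS.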
Now, we will understand what the entities in the Flume architecture are.
1. Flume Event: A Flume event is the fundamental unit of data that must be
moved from the source to the destination (sink).
2. Flume Agent: A Flume agent is an independent JVM process that hosts the
components (source, channel, and sink) through which events flow from an
external source toward the destination.
3. Source: A source receives data sent by the data generators. It pushes the
received data, in the form of events, to one or more channels. Apache Flume
supports various sources like the Exec source, Thrift source, Avro source, etc.
4. Channel: A channel is a transient store that buffers events received from
the source until a sink consumes them. Examples include the memory channel
and the file channel.
5. Sink: A sink consumes events from the channel and delivers them to the
final destination, for example the HDFS sink.
Other than the ones mentioned above, a few more components are involved
in transferring the events. They are-
Interceptors: They are used to examine or modify Flume events between the
Flume source and the channel.
Sink Processors: A sink processor selects a particular sink from a configured
group of sinks, enabling failover and load balancing across them.
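Putting these components together, the following is a minimal sketch of a single-agent configuration; the names `agent1`, `src1`, `ch1`, and `sink1` are arbitrary. It wires a netcat source to a logger sink through a memory channel, with a timestamp interceptor attached to the source:

```properties
# Name the components of this agent
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Netcat source: listens for newline-terminated text on a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

# Timestamp interceptor: adds a "timestamp" header to each event
agent1.sources.src1.interceptors = ts
agent1.sources.src1.interceptors.ts.type = timestamp

# Memory channel: buffers up to 1000 events in memory
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Logger sink: writes events to the agent's log (useful for testing)
agent1.sinks.sink1.type = logger

# Wire source -> channel -> sink
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1

# Sink processors operate on sink groups; with a second sink defined,
# a failover processor could be configured like this:
# agent1.sinkgroups = g1
# agent1.sinkgroups.g1.sinks = sink1 sink2
# agent1.sinkgroups.g1.processor.type = failover
```

Such an agent is typically started with `bin/flume-ng agent --conf conf --conf-file example.conf --name agent1`.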
Apache Flume helps in moving the log data into HDFS and supports complex
data flows. The three main types of data flow in Apache Flume are –
1. Multi-hop Flow: In a multi-hop flow, an event travels through more than
one agent (hop) before reaching the final destination.
2. Fan-out Flow: In a fan-out flow, the data moves from one Flume source
into multiple channels and on to multiple sinks. This type of flow has two
variants − replicating and multiplexing.
3. Fan-in Flow: In a fan-in flow, data flows from more than one source into a
single channel.
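These flow patterns are expressed purely in configuration. The sketch below shows a fragment for each pattern; the agent names, hostnames, ports, and channel names are hypothetical, while the property keys themselves are standard Flume settings.

```properties
# Multi-hop: agent "hop1" forwards events to agent "hop2" over Avro RPC
hop1.sinks.fwd.type = avro
hop1.sinks.fwd.hostname = collector.example.com
hop1.sinks.fwd.port = 4141
hop2.sources.in.type = avro
hop2.sources.in.bind = 0.0.0.0
hop2.sources.in.port = 4141

# Fan-out (replicating): one source copies every event to two channels
agent1.sources.src1.selector.type = replicating
agent1.sources.src1.channels = ch1 ch2

# Fan-out (multiplexing): route by the value of a "type" event header
agent2.sources.src1.selector.type = multiplexing
agent2.sources.src1.selector.header = type
agent2.sources.src1.selector.mapping.error = chErrors
agent2.sources.src1.selector.default = chOther

# Fan-in: two sources on the same agent feed a single channel
agent3.sources = web1 web2
agent3.sources.web1.channels = ch1
agent3.sources.web2.channels = ch1
```

Chaining Avro sink/source pairs in this way is also how agents on many web servers are typically fanned in to a smaller tier of collector agents.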
Pros and Cons of Apache Flume
Apache Flume is a reliable, highly available, and scalable ingestion tool.
Though it offers several benefits, it still has some drawbacks: for example,
it cannot manipulate the data it transports, and planning, configuring, and
deploying it takes some effort.
The Apache Flume tool is a good fit for the requirements described above:
continuously ingesting high volumes of event data from many servers into a
centralized store. In cases where Flume is not suitable for an application,
there are alternatives like WebHDFS or the HBase REST API that can be used
to write data.
If there are only a few production servers and the data does not need to be
written in real time, Apache Flume is also not a good option. In such cases,
it might be better to simply move the data to HDFS via WebHDFS or NFS.
Similarly, if the volume of data is relatively small (a few files of a few
GB every few hours), it can be moved to HDFS directly, as planning,
configuring, and deploying Flume would require more effort than it is worth.
Conclusion
In this article, we learned what Apache Flume is and why we need it. Further,
we explored the features, the architecture, the pros and cons of this tool,
and its real-world applications.
Apache Flume is a part of the Hadoop ecosystem and is mainly used for
real-time data ingestion from different web applications into storage
like HDFS, HBase, etc.
Apache Flume reliably collects, aggregates, and transports big data
generated from external sources to the central store. These data
streams could be log files, Twitter data, network traffic, etc.
Apache Flume is a highly available and scalable tool. However, it cannot
manipulate the data.
Apache Flume has a variety of applications in different domains like
e-commerce, banking, finance, IoT applications, etc.
That’s it! I hope you found this interesting and informative. You can explore
Apache Flume in detail through the official documentation.