The official documentation defines Flume as follows:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
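In short, Flume is a distributed tool for collecting massive volumes of log data, and because its data sources are customizable, it is not limited to collecting logs.
Flume has the following features:
Built-in support for many source and destination types
Horizontal scalability
Multiple flow topologies, for example multi-hop flows, fan-in and fan-out flows, and others
Contextual routing
Interceptors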
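Reliable delivery. In Flume, handing off an event involves two transactions, one on the sending side and one on the receiving side. The sender sends the event to the receiver. Once the data has arrived, the receiver commits its own transaction and sends a success signal back to the sender. The sender commits its own transaction only after it receives that signal.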
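The two-transaction hand-off described above can be sketched in a few lines of Python. This is only a toy simulation under stated assumptions, not Flume's actual API: the names Sender, Receiver and deliver_all are hypothetical, events are plain strings, and a failure is modeled as the receiver simply not returning a success signal.

```python
from collections import deque


class Receiver:
    """Hypothetical next hop: stores events, then acknowledges them."""

    def __init__(self, fail_times=0):
        self.committed = []           # events whose receive transaction committed
        self.fail_times = fail_times  # number of deliveries to fail (simulation only)

    def receive(self, batch):
        # Receive-side transaction: commit the whole batch or none of it.
        if self.fail_times > 0:
            self.fail_times -= 1
            return False              # rolled back, no success signal
        self.committed.extend(batch)
        return True                   # success signal back to the sender


class Sender:
    """Hypothetical previous hop: keeps events until the receiver confirms them."""

    def __init__(self, events):
        self.pending = deque(events)

    def send_batch(self, receiver, size):
        # Send-side transaction: take events tentatively.
        count = min(size, len(self.pending))
        batch = [self.pending.popleft() for _ in range(count)]
        if receiver.receive(batch):
            return True               # commit: the receiver now owns the events
        # Rollback: put the events back in their original order for a retry.
        self.pending.extendleft(reversed(batch))
        return False


def deliver_all(sender, receiver, size=2, max_attempts=10):
    """Retry until nothing is pending; return the number of attempts made."""
    attempts = 0
    while sender.pending and attempts < max_attempts:
        sender.send_batch(receiver, size)
        attempts += 1
    return attempts


if __name__ == "__main__":
    sender = Sender(["e1", "e2", "e3"])
    receiver = Receiver(fail_times=1)  # the first delivery fails
    print(deliver_all(sender, receiver), receiver.committed)
```

The point of the sketch is the ordering: the sender never discards an event until the receiver has committed it, so a failed delivery leaves the events on the sending side and they are retried rather than lost.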
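Flume was originally designed to copy data streams from multiple web services into HDFS. So why not simply use put to load the data into HDFS? If we need to analyze rapidly growing data in real time, data loaded with put is no longer real-time by the time it arrives.
Likewise, rsync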
