Flume: A Log Collection Framework

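Traditional data collection:
(1) A shell script uses cp to copy the logs onto a machine in the cluster, then hadoop fs -put ........
        Drawbacks: hard to monitor; poor timeliness; high I/O overhead; weak fault tolerance and load balancing
Collection with Flume:
           Flume grew out of the traditional approach. Collection is driven mainly by configuration files, so very little code has to be written and it is simple to operate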

Flume website: flume.apache.org

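Flume in brief: a distributed, efficient, fault-tolerant service for collecting, aggregating, and moving data to a destination, for example: webserver === flume === hdfs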

Simple Flume architecture:
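An Agent can be thought of as one Flume instance (three components: Source, Channel, Sink). The web server is where the data comes from, and HDFS is the destination where the data is stored.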
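
As a rough illustration of how the three components are wired together inside one Agent, here is a minimal configuration sketch. The agent name a1, the component names r1/c1/k1, and the port 44444 are arbitrary choices for this example, not values from the notes above. It uses a Netcat source and a Logger sink instead of a web server and HDFS so that it can be tried on a single machine.

```properties
# Name the components of the agent "a1" (all names are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: write each event to the agent's log
a1.sinks.k1.type = logger

# Wiring: a source can feed several channels, a sink reads from exactly one
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Assuming the file is saved as example.conf, the agent could be started with something like: bin/flume-ng agent --conf conf --conf-file example.conf --name a1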

Flume design goals: reliability, scalability, fault tolerance, manageability

Comparison with similar tools:
Flume: Java
Scribe: Facebook, C/C++, no longer maintained
Chukwa: Yahoo/Apache, Java, no longer maintained
Fluentd: Ruby
Logstash: part of the ELK stack (Elasticsearch, Logstash, Kibana)


Key point: learn to read the official documentation. This matters a lot.


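Lesson 2
The three Flume components:
1. Source: collects data. For the list of available sources, see flume.apache.org -> Documentation -> Flume User Guide. Commonly used: Avro, Thrift, Spooling Directory (watches a directory), Netcat (listens on an IP and port), Kafka, Exec (runs a Linux command). Custom sources can also be written.

2. Channel: buffers and aggregates data. Types: Memory, File, JDBC, Kafka. A Channel acts as a buffer pool where data is held temporarily, much like an operating system buffering writes in memory before flushing them to disk. Events collected by the Source are stored in the Channel and stay there until a Sink takes them out in batches and delivers them.

3. Sink: outputs data. It takes the events read from the Channel and stores them at the configured destination. Commonly used: Hive, HDFS, Logger, File, HBase (synchronous and asynchronous), Kafka, ElasticSearch, Avro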
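
To make the Source / Channel / Sink choices above more concrete, the following sketch combines an Exec source, a File channel, and an HDFS sink, roughly matching the webserver-to-HDFS flow mentioned earlier. The agent name, the log path, the local directories, and the NameNode address are all hypothetical placeholders and would need to be adapted to a real environment.

```properties
# Agent "web" with one source, one channel, one sink (names are illustrative)
web.sources = tail
web.channels = fc
web.sinks = hdfsOut

# Exec source: run a Linux command and turn each output line into an event
# (the log path is a made-up example)
web.sources.tail.type = exec
web.sources.tail.command = tail -F /var/log/nginx/access.log
web.sources.tail.channels = fc

# File channel: events are persisted on local disk until a sink takes them
# (both directories are made-up examples)
web.channels.fc.type = file
web.channels.fc.checkpointDir = /var/flume/checkpoint
web.channels.fc.dataDirs = /var/flume/data

# HDFS sink: drain the channel in batches and write plain text files
# (the NameNode host and target directory are made-up examples)
web.sinks.hdfsOut.type = hdfs
web.sinks.hdfsOut.channel = fc
web.sinks.hdfsOut.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y-%m-%d
web.sinks.hdfsOut.hdfs.fileType = DataStream
web.sinks.hdfsOut.hdfs.useLocalTimeStamp = true
web.sinks.hdfsOut.hdfs.batchSize = 100
web.sinks.hdfsOut.hdfs.rollInterval = 300
```

The File channel is chosen here because buffered events survive an agent restart. A Memory channel would be faster, but whatever it holds is lost if the process dies.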


More complex topologies with multiple Flume agents:

Setting multi-agent flow

In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.
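
A minimal sketch of that sink-to-source pairing might look as follows. Only the two stanzas that have to match are shown; the agent names, the host collector.example.com, and the port 4545 are assumptions made for this example.

```properties
# --- On the upstream host: agent1 forwards its events over Avro ---
# (hostname and port are made-up values; they must match the source below)
agent1.sinks.toNext.type = avro
agent1.sinks.toNext.hostname = collector.example.com
agent1.sinks.toNext.port = 4545
agent1.sinks.toNext.channel = c1

# --- On collector.example.com: agent2 receives them with an Avro source ---
agent2.sources.fromPrev.type = avro
agent2.sources.fromPrev.bind = 0.0.0.0
agent2.sources.fromPrev.port = 4545
agent2.sources.fromPrev.channels = c1
```

Each agent still needs its own channel definition (c1 here), plus a source on agent1 and a sink on agent2, which are omitted to keep the fragment short.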

Consolidation

A very common scenario in log collection is a large number of log producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers sent to a dozen of agents that write to HDFS cluster.

This can be achieved in Flume by configuring a number of first tier agents with an avro sink, all pointing to an avro source of single agent (Again you could use the thrift sources/sinks/clients in such a scenario). This source on the second tier agent consolidates the received events into a single channel which is consumed by a sink to its final destination.

Multiplexing the flow

Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.

[Figure: a source on agent “foo” fanning out to three channels]

The above example shows a source from agent “foo” fanning out the flow to three different channels. This fan out can be replicating or multiplexing. In case of replicating flow, each event is sent to all three channels. For the multiplexing case, an event is delivered to a subset of available channels when an event’s attribute matches a preconfigured value. For example, if an event attribute called “txnType” is set to “customer”, then it should go to channel1 and channel3, if it’s “vendor” then it should go to channel2, otherwise channel3. The mapping can be set in the agent’s configuration file.
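
The txnType routing described above could be expressed roughly like this. The agent name foo and the channel names come from the text; the source name r1, its Avro type and port, and the use of Memory channels are assumptions added so the fragment is easier to read. Sinks are omitted.

```properties
# Agent "foo": one source fanning out to three channels
foo.sources = r1
foo.channels = channel1 channel2 channel3

# Source type and port are illustrative assumptions
foo.sources.r1.type = avro
foo.sources.r1.bind = 0.0.0.0
foo.sources.r1.port = 4141
foo.sources.r1.channels = channel1 channel2 channel3

# Multiplexing selector: route on the value of the "txnType" event header
foo.sources.r1.selector.type = multiplexing
foo.sources.r1.selector.header = txnType
foo.sources.r1.selector.mapping.customer = channel1 channel3
foo.sources.r1.selector.mapping.vendor = channel2
foo.sources.r1.selector.default = channel3

# Channel type is an illustrative assumption
foo.channels.channel1.type = memory
foo.channels.channel2.type = memory
foo.channels.channel3.type = memory
```

For the replicating case, the selector lines would be replaced by foo.sources.r1.selector.type = replicating (which is also the default when no selector is configured), so that every event is copied to all three channels.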