Flume架构及应用

零、目录

flume 架构介绍
1. flume概念
2. flume特点
3. flume可靠性
4. flume核心概念
5. flume架构介绍
6. flume运行机制
7. flume广义用法
flume 应用 – 日志采集
1. flume 配置启动过程
2. 具体案例
总结

一、 flume 架构介绍

flume 概念
1. 在具体介绍flume 之前，先给大家看一下Hadoop业务的整体开发流程：从Hadoop 的业务流程中可以看出，数据采集是十分重要的一步，也是不可或缺的一步。
2. flume 作为 cloudera 开发的实时日志收集系统，受到了业界的认可与广泛应用。Flume 初始的发行版本目前被统称为 Flume OG（original generation），属于 cloudera。但随着 FLume 功能的扩展，Flume OG 代码工程臃肿、核心组件设计不合理、核心配置不标准等缺点暴露出来，尤其是在 Flume OG 的最后一个发行版本 0.94.0 中，日志传输不稳定的现象尤为严重，为了解决这些问题，2011 年 10 月 22 号，cloudera 完成了 Flume-728，对 Flume 进行了里程碑式的改动：重构核心组件、核心配置以及代码架构，重构后的版本统称为 Flume NG（next generation）；改动的另一原因是将 Flume 纳入 apache 旗下，cloudera Flume 改名为 Apache Flume。
flume 特点
1. flume是一个分布式、可靠地、高可用的海量日志采集、聚合和传输的系统。支持在日志系统中定制各类数据发送方，用于手机数据。同时Flume 提供对数据进行简单的处理，并写道各种数据接收方（HDFS、HBase…）的能力
2. flume的核心是把数据从数据源(source)收集过来，在将收集到的数据送到指定的目的地(sink)。为了保证输送的过程一定成功，在送到目的地(sink)之前，会先缓存数据(channel),待数据真正到达目的地(sink)后，flume在删除自己缓存的数据。
3. flume的数据流由事件(Event)贯穿始终。事件是Flume的基本数据单位，它携带日志数据(字节数组形式)并且携带有头信息，这些Event由Agent外部的Source生成，当Source捕获事件后会进行特定的格式化，然后Source会把事件推入(单个或多个)Channel中。你可以把Channel看作是一个缓冲区，它将保存事件直到Sink处理完该事件。Sink负责持久化日志或者把事件推向另一个Source。
flume 可靠性
1. 当节点出现故障时，日志能够被传送到其他节点上而不会丢失。Flume提供了三种级别的可靠性保障，从强到弱依次分别为：end-to-end（收到数据agent首先将event写到磁盘上，当数据传送成功后，再删除；如果数据发送失败，可以重新发送。），Store on failure（这也是scribe采用的策略，当数据接收方crash时，将数据写到本地，待恢复后，继续发送），Besteffort（数据发送到接收方后，不会进行确认）。
flume 核心概念
1. Agent：使用JVM 运行Flume。每台机器运行一个agent，但是可以在一个agent中包含多个sources和sinks。
2. Source：从Client收集数据，传递给Channel。
3. Sink：从Channel收集数据，运行在一个独立线程，并将数据发送到指定的地方去。
4. Channel：连接 sources 和 sinks ，这个有点像一个队列。
5. Events：在整个数据的传输的过程中，流动的是event，即事务保证是在event级别进行的。event将传输的数据进行封装，是flume传输数据的基本单位，如果是文本文件，通常是一行记录，event也是事务的基本单位。event从source，流向channel，再到sink，本身为一个字节数组，并可携带headers(头信息)信息。event代表着一个数据的最小完整单元，从外部数据源来，向外部的目的地去。 event数据流向图： . 一个完整的event包括：event headers、event body、event信息(即文本文件中的单行记录)，如下所示：
6. Client：生产数据，运行在一个独立的线程。
flume 架构介绍
1. flume之所以这么神奇，是源于它自身的一个设计，这个设计就是agent，agent本身是一个java进程，运行在日志收集节点—所谓日志收集节点就是服务器节点。
2. agent里面包含3个核心的组件：source—->channel—–>sink,类似生产者、仓库、消费者的架构。
3. source：source组件是专门用来收集数据的，可以处理各种类型、各种格式的日志数据,包括avro、thrift、exec、jms、spooling directory、netcat、sequence generator、syslog、http、legacy、自定义。 Source 类型：
  1. Avro Source: 支持Avro协议（实际上是Avro RPC），内置支持
  2. Thrift Source: 支持Thrift协议，内置支持
  3. Exec Source: 基于Unix的command在标准输出上生产数据
  4. JMS Source: 从JMS系统（消息、主题）中读取数据
  5. Spooling Directory Source: 监控指定目录内数据变更
  6. Twitter 1% firehose Source: 通过API持续下载Twitter数据，试验性质
  7. Netcat Source: 监控某个端口，将流经端口的每一个文本行数据作为Event输入
  8. Sequence Generator Source: 序列生成器数据源，生产序列数据
  9. Syslog Sources: 读取syslog数据，产生Event，支持UDP和TCP两种协议
  10. HTTP Source: 基于HTTP POST或GET方式的数据源，支持JSON、BLOB表示形式
  11. Legacy Sources: 兼容老的Flume OG中Source（0.9.x版本）
4. channel：source组件把数据收集来以后，临时存放在channel中，即channel组件在agent中是专门用来存放临时数据的——对采集到的数据进行简单的缓存，可以存放在memory、jdbc、file等等。 Channel类型：
  1. Memory Channel：Event数据存储在内存中
  2. JDBC Channel：Event数据存储在持久化存储中，当前Flume Channel内置支持Derby
  3. File Channel：Event数据存储在磁盘文件中
  4. Spillable Memory Channel：Event数据存储在内存中和磁盘上，当内存队列满了，会持久化到磁盘文件
  5. Pseudo Transaction Channel：测试用途
  6. Custom Channel：自定义Channel实现
5. sink：sink组件是用于把数据发送到目的地的组件，目的地包括hdfs、logger、avro、thrift、ipc、file、null、hbase、solr、自定义。 Sink类型：
  1. HDFS Sink：数据写入HDFS
  2. Logger Sink：数据写入日志文件
  3. Avro Sink：数据被转换成Avro Event，然后发送到配置的RPC端口上
  4. Thrift Sink：数据被转换成Thrift Event，然后发送到配置的RPC端口上
  5. IRC Sink：数据在IRC上进行回放
  6. File Roll Sink：存储数据到本地文件系统
  7. Null Sink：丢弃到所有数据
  8. HBase Sink：数据写入HBase数据库
  9. Morphline Solr Sink：数据发送到Solr搜索服务器（集群）
  10. ElasticSearch Sink：数据发送到Elastic Search搜索服务器（集群）
  11. Kite Dataset Sink：写数据到Kite Dataset，试验性质的
  12. Custom Sink：自定义Sink实现
flume 运行机制
1. flume的核心就是一个agent，这个agent对外有两个进行交互的地方，一个是接受数据的输入——source，一个是数据的输出sink，sink负责将数据发送到外部指定的目的地。source接收到数据之后，将数据发送给channel，chanel作为一个数据缓冲区会临时存放这些数据，随后sink会将channel中的数据发送到指定的地方—-例如HDFS等，注意：只有在sink将channel中的数据成功发送出去之后，channel才会将临时数据进行删除，这种机制保证了数据传输的可靠性与安全性。
flume 广义用法
1. flume之所以这么神奇—-其原因也在于flume可以支持多级flume的agent，即flume可以前后相继，例如sink可以将数据写到下一个agent的source中，这样的话就可以连成串了，可以整体处理了。flume还支持扇入(fan-in)、扇出(fan-out)。所谓扇入就是source可以接受多个输入，所谓扇出就是sink可以将数据输出多个目的地destination中。

二、 flume 应用 – 日志采集

flume 配置启动过程
1. 对于flume的原理其实很容易理解，我们更应该掌握flume的具体使用方法，flume提供了大量内置的Source、Channel和Sink类型。而且不同类型的Source、Channel和Sink可以*组合—–组合方式基于用户设置的配置文件，非常灵活。比如：Channel可以把事件暂存在内存里，也可以持久化到本地硬盘上。Sink可以把日志写入HDFS, HBase，甚至是另外一个Source等等。下面我将用具体的案例详述flume的具体用法。
2. 其实flume的用法很简单—-书写一个配置文件，在配置文件当中描述source、channel与sink的具体实现，而后运行一个agent实例，在运行agent实例的过程中会读取配置文件的内容，这样flume就会采集到数据。
3. 配置文件编写原则：
  1. 从整体上描述代理agent中sources、sinks、channels所涉及到的组件
```
 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
```
  2. 详细描述agent中每一个source、sink与channel的具体实现：即在描述source的时候，需要指定source到底是什么类型的，即这个source是接受文件的、还是接受http的、还是接受thrift的；对于sink也是同理，需要指定结果是输出到HDFS中，还是Hbase中啊等等；对于channel需要指定是内存啊，还是数据库啊，还是文件啊等等。
```
  # Describe/configure the source 
 a1.sources.r1.type = netcat 
 a1.sources.r1.bind = localhost 
 a1.sources.r1.port = 44444 
 # Describe the sink 
 a1.sinks.k1.type = logger 
 # Use a channel which buffers events in memory 
 a1.channels.c1.type = memory 
 a1.channels.c1.capacity = 1000 
 a1.channels.c1.transactionCapacity = 100
```
  3. 通过channel将source与sink连接起来
```
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1
```
  4. 启动agent cd 到flume bin 目录
```
 ./flume-ng agent -c ../conf/ -f ../conf/你的配置文件名称  -n 你的agent名称 -Dflume.root.logger=INFO,console

 参数说明： -n 指定agent名称(与配置文件中代理的名字相同)
 -c 指定flume中配置文件的目录
 -f 指定配置文件
 -Dflume.root.logger=DEBUG,console 设置日志等级
```

具体案例：

案例1： NetCat Source：监听一个指定的网络端口，即只要应用程序向这个端口里面写数据，这个source组件就可以获取到信息。其中 Sink为logger Channel为memory

flume官网中NetCat Source描述：

 Property Name Default Description 
 channels – 
 type – The component type name, needs to be netcat 
 bind – 日志需要发送到的主机名或者Ip地址，该主机运行着netcat类型的source在监听 
 port – 日志需要发送到的端口号，该端口号要有netcat类型的source在监听

配置文件：

 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
 a1.sources.r1.type = netcat
 a1.sources.r1.bind = 192.168.80.80
 a1.sources.r1.port = 44444
 
 # Describe the sink
 a1.sinks.k1.type = logger
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

启动flume agent a1

 flume-ng  agent -n a1  -c ../conf  -f ../conf/netcat.conf   -Dflume.root.logger=DEBUG,console

使用telnet发送数据

 telnet  192.168.80.80  44444  big data world！（windows中运行的）

在控制台上查看flume收集到的日志数据：

案例2：NetCat Source：监听一个指定的网络端口，即只要应用程序向这个端口里面写数据，这个source组件就可以获取到信息。其中 Sink为hdfs Channel为file (相比于案例1的两个变化)

flume官网中HDFS Sink的描述：

编写配置文件：

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /usr/flume/checkpoint
a1.channels.c1.dataDirs = /usr/flume/data

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

启动flume agent a1 服务端：

 flume-ng  agent -n a1  -c ../conf  -f ../conf/netcat.conf   -Dflume.root.logger=DEBUG,console

使用telnet发送数据

 telnet  192.168.80.80  44444  big data world！（windows中运行的）

在HDFS中查看flume收集到的日志数据：

案例3：Spooling Directory Source：监听一个指定的目录，即只要应用程序向这个指定的目录中添加新的文件，source组件就可以获取到该信息，并解析该文件的内容，然后写入到channle。写入完成后，标记该文件已完成或者删除该文件。其中 Sink为logger, Channel为memory

flume官网中Spooling Directory Source描述：

 Property Name       Default      Description
 channels              –  
 type                  –          The component type name, needs to be spooldir.
 spoolDir              –          Spooling Directory Source监听的目录
 fileSuffix         .COMPLETED    文件内容写入到channel之后，标记该文件
 deletePolicy       never         文件内容写入到channel之后的删除策略: never or immediate
 fileHeader         false         Whether to add a header storing the absolute path filename.
 ignorePattern      ^$           Regular expression specifying which files to ignore (skip)
 interceptors          –          指定传输中event的head(头信息)，常用timestamp

Spooling Directory Source的两个注意事项：

 ①If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing. 
 即：拷贝到spool目录下的文件不可以再打开编辑
 ②If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
 即：不能将具有相同文件名字的文件拷贝到这个目录下

编写配置文件：

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/datainput
a1.sources.r1.fileHeader = true
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

启动flume agent a1 服务端:

 flume-ng  agent -n a1  -c ../conf  -f ../conf/spool.conf   -Dflume.root.logger=DEBUG,console

使用cp命令向Spooling Directory 中发送数据

 cp datafile  /usr/local/datainput   (注：datafile中的内容为：big data world！)

在控制台上查看flume收集到的日志数据：从控制台显示的结果可以看出event的头信息中包含了时间戳信息。同时我们查看一下Spooling Directory中的datafile信息—-文件内容写入到channel之后，该文件被标记了：
```
[[email protected] datainput]# ls
datafile.COMPLETED
```

案例4:Spooling Directory Source：监听一个指定的目录，即只要应用程序向这个指定的目录中添加新的文件，source组件就可以获取到该信息，并解析该文件的内容，然后写入到channle。写入完成后，标记该文件已完成或者删除该文件。其中 Sink：hdfs Channel：file (相比于案例3的两个变化)

编写配置文件：

 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
 a1.sources.r1.type = spooldir
 a1.sources.r1.spoolDir = /usr/local/datainput
 a1.sources.r1.fileHeader = true
 a1.sources.r1.interceptors = i1
 a1.sources.r1.interceptors.i1.type = timestamp
 
 # Describe the sink
 a1.sinks.k1.type = hdfs
 a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
 a1.sinks.k1.hdfs.writeFormat = Text
 a1.sinks.k1.hdfs.fileType = DataStream
 a1.sinks.k1.hdfs.rollInterval = 10
 a1.sinks.k1.hdfs.rollSize = 0
 a1.sinks.k1.hdfs.rollCount = 0
 a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
 a1.sinks.k1.hdfs.useLocalTimeStamp = true
 
 # Use a channel which buffers events in file
 a1.channels.c1.type = file
 a1.channels.c1.checkpointDir = /usr/flume/checkpoint
 a1.channels.c1.dataDirs = /usr/flume/data
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

启动flume agent a1 服务端

flume-ng  agent -n a1  -c ../conf  -f ../conf/spool.conf   -Dflume.root.logger=DEBUG,console

使用cp命令向Spooling Directory 中发送数据

 cp datafile  /usr/local/datainput   (注：datafile中的内容为：big data world！)

在控制台上查看flume收集到的日志数据：
在HDFS中查看flume收集到的日志数据：

案例5：Exec Source：监听一个指定的命令，获取一条命令的结果作为它的数据源常用的是tail -F file指令，即只要应用程序向日志(文件)里面写数据，source组件就可以获取到日志(文件)中最新的内容。其中 Sink：hdfs Channel：file .这个案列为了方便显示Exec Source的运行效果，结合Hive中的external table进行来说明。

编写配置文件：

 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
 a1.sources.r1.type = exec
 a1.sources.r1.command = tail -F /usr/local/log.file
 
 # Describe the sink
 a1.sinks.k1.type = hdfs
 a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
 1.sinks.k1.hdfs.writeFormat = Text
 a1.sinks.k1.hdfs.fileType = DataStream
 a1.sinks.k1.hdfs.rollInterval = 10
 a1.sinks.k1.hdfs.rollSize = 0
 a1.sinks.k1.hdfs.rollCount = 0
 a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
 a1.sinks.k1.hdfs.useLocalTimeStamp = true
 
 # Use a channel which buffers events in file
 a1.channels.c1.type = file
 a1.channels.c1.checkpointDir = /usr/flume/checkpoint
 a1.channels.c1.dataDirs = /usr/flume/data
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

在hive中建立外部表—–hdfs://hadoop80:9000/dataoutput的目录，方便查看日志捕获内容

 hive> create external table t1(infor  string)
     > row format delimited
     > fields terminated by '\t'
     > location '/dataoutput/';
 OK
 Time taken: 0.284 seconds

启动flume agent a1 服务端：

 flume-ng  agent -n a1  -c ../conf  -f ../conf/exec.conf   -Dflume.root.logger=DEBUG,console

使用echo命令向/usr/local/datainput 中发送数据
```
 echo  big data > log.file
```

在HDFS和Hive分别中查看flume收集到的日志数据： Flume架构及应用

 hive> select * from t1;
 OK
 big data
 Time taken: 0.086 seconds

使用echo命令向/usr/local/datainput 中在追加一条数据
```
 echo big data world! >> log.file
```

在HDFS和Hive再次分别中查看flume收集到的日志数据： Flume架构及应用

 hive> select * from t1;
 OK
 big data
 big data world!
 Time taken: 0.511 seconds

Exec source：Exec source和Spooling Directory Source是两种常用的日志采集的方式，其中Exec source可以实现对日志的实时采集，Spooling Directory Source在对日志的实时采集上稍有欠缺，尽管Exec source可以实现对日志的实时采集，但是当Flume不运行或者指令执行出错时，Exec source将无法收集到日志数据，日志会出现丢失，从而无法保证收集日志的完整性。

案例6：Avro Source：监听一个指定的Avro 端口，通过Avro 端口可以获取到Avro client发送过来的文件。即只要应用程序通过Avro 端口发送文件，source组件就可以获取到该文件中的内容。其中 Sink：hdfs Channel：file (注：Avro和Thrift都是一些序列化的网络端口–通过这些网络端口可以接受或者发送信息，Avro可以发送一个给定的文件给Flume，Avro 源使用AVRO RPC机制)

Avro Source运行原理如下图：

flume官网中Avro Source的描述：

 Property     Name   Default Description
 channels      –  
 type          –     The component type name, needs to be avro
 bind          –     日志需要发送到的主机名或者ip，该主机运行着ARVO类型的source
 port          –     日志需要发送到的端口号，该端口要有ARVO类型的source在监听

编写配置文件

 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
 a1.sources.r1.type = avro
 a1.sources.r1.bind = 192.168.80.80
 a1.sources.r1.port = 4141
 
 # Describe the sink
 a1.sinks.k1.type = hdfs
 a1.sinks.k1.hdfs.path = hdfs://hadoop80:9000/dataoutput
 a1.sinks.k1.hdfs.writeFormat = Text
 a1.sinks.k1.hdfs.fileType = DataStream
 a1.sinks.k1.hdfs.rollInterval = 10
 a1.sinks.k1.hdfs.rollSize = 0
 a1.sinks.k1.hdfs.rollCount = 0
 a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
 a1.sinks.k1.hdfs.useLocalTimeStamp = true
 
 # Use a channel which buffers events in file
 a1.channels.c1.type = file
 a1.channels.c1.checkpointDir = /usr/flume/checkpoint
 a1.channels.c1.dataDirs = /usr/flume/data
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

启动flume agent a1 服务端

 flume-ng  agent -n a1  -c ../conf  -f ../conf/avro.conf   -Dflume.root.logger=DEBUG,console

使用avro-client发送文件

 flume-ng avro-client -c  ../conf  -H 192.168.80.80  -p 4141 -F /usr/local/log.file

 注：log.file文件中的内容为：
 
 [[email protected] local]# more log.file
 big data
 big data world!

在HDFS中查看flume收集到的日志数据：

三、总结

NetCat Source：监听一个指定的网络端口，即只要应用程序向这个端口里面写数据，这个source组件就可以获取到信息。
Spooling Directory Source：监听一个指定的目录，即只要应用程序向这个指定的目录中添加新的文件，source组件就可以获取到该信息，并解析该文件的内容，然后写入到channle。写入完成后，标记该文件已完成或者删除该文件。
Exec Source：监听一个指定的命令，获取一条命令的结果作为它的数据源。常用的是tail -F file指令，即只要应用程序向日志(文件)里面写数据，source组件就可以获取到日志(文件)中最新的内容。
Avro Source：监听一个指定的Avro 端口，通过Avro 端口可以获取到Avro client发送过来的文件。即只要应用程序通过Avro 端口发送文件，source组件就可以获取到该文件中的内容。

Flume架构及应用