Getting Started With Kudu - Chapter 1: Why Kudu?

Getting Started With Kudu
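1.1 IoT (the Internet of Things)
An IoT system needs to support:
1. Querying the current, real-time state of a device
2. Data analytics, which requires the storage layer to provide:
(1) Row-by-row inserts
(2) Low-latency random reads
(3) Fast analytical scans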

1.2 Current Approaches to Real-time analytics
This section looks at implementing a real-time streaming analytics use case without Kudu.
Iteration 1: Hadoop Distributed File System
In our first iteration, we’re going to try to keep things simple and save our
data in Hadoop Distributed File System (HDFS) as Avro files.
Drawback: as the system runs, it produces a large number of small files, which severely degrades the performance of Spark jobs and Impala queries.
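To get a rough feel for how quickly small files pile up, the arithmetic can be sketched as follows. The batch interval and the number of writer tasks are made-up illustrative values, not figures from the book, and the model assumes that every micro-batch writes exactly one file per writer.

```python
# Back-of-the-envelope estimate of how many files a streaming job leaves
# in HDFS when every micro-batch writes one file per writer task.
# All numbers are illustrative assumptions.

SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(batch_interval_s: int, writers: int) -> int:
    """Files created per day if each batch produces one file per writer."""
    batches = SECONDS_PER_DAY // batch_interval_s
    return batches * writers


# Hypothetical job: a 10-second micro-batch with 4 writer tasks.
print(files_per_day(batch_interval_s=10, writers=4))  # 34560
```

Even this modest hypothetical job creates tens of thousands of files per day, each of which a later scan has to open.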

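Iteration 2: HDFS + compactions
Add an HDFS compaction process that merges the small files into larger ones.
Drawbacks:
1. It is only pseudo real-time;
2. HDFS has no concept of a primary key, so duplicate records can appear;
3. Late-arriving data is not compacted into the existing data files, which again adds more small files.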
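Drawbacks 2 and 3 can be illustrated with a toy model in which each "file" is just a list of records. The record layout, the `compact` function, and the sample data are hypothetical and only stand in for a real compaction job.

```python
# Toy model of an HDFS compaction job. Each "file" is a list of
# (device_id, timestamp, value) records. Names and data are illustrative.

from typing import List, Tuple

Record = Tuple[str, int, float]


def compact(small_files: List[List[Record]]) -> List[Record]:
    """Concatenate small files into one large file, sorted by timestamp.

    There is no primary key, so a record delivered twice is kept twice.
    """
    merged = [rec for f in small_files for rec in f]
    return sorted(merged, key=lambda rec: rec[1])


small_files = [
    [("dev-1", 100, 20.5), ("dev-2", 101, 18.0)],
    [("dev-1", 100, 20.5)],  # the same reading, delivered again on a retry
]
compacted = compact(small_files)

# A late record that arrives after compaction lands in yet another small file.
late_files = [[("dev-2", 99, 17.5)]]

print(len(compacted))       # 3: the duplicate reading survives compaction
print(1 + len(late_files))  # 2: one compacted file plus one new small file
```

Removing the duplicate or folding the late record into the compacted file would need extra logic that the storage layer itself does not provide.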
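Iteration 3: HBase + HDFS
This architecture provides updates, random reads and writes, and fast scans.
HBase: provides low-latency reads and writes (the speed layer).
HDFS: provides fast scans for analytics (the batch layer).
Data is first written to HBase. As data accumulates in HBase, a flusher process periodically copies it from HBase to Parquet files on HDFS.
Drawback: this architecture is complex to develop, use, and operate. Two storage systems have to be maintained, and the data in them has to be kept consistent.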
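A minimal sketch of this two-layer design, using a dict in place of HBase and a list in place of Parquet files on HDFS. `TwoLayerStore` and its methods are hypothetical names chosen for illustration; they do not correspond to any HBase or HDFS API.

```python
# Toy model of the HBase + HDFS design: a mutable "speed layer" keyed by
# row key, an append-only "batch layer", and a flusher that moves rows
# from one to the other. Class and method names are illustrative.

from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (device_id, timestamp)


class TwoLayerStore:
    def __init__(self) -> None:
        self.speed: Dict[Key, float] = {}         # stands in for HBase
        self.batch: List[Tuple[Key, float]] = []  # stands in for Parquet on HDFS

    def write(self, key: Key, value: float) -> None:
        self.speed[key] = value  # new data always goes to the speed layer

    def flush(self, older_than: int) -> None:
        """Move rows with timestamp < older_than to the batch layer."""
        old = [k for k in self.speed if k[1] < older_than]
        for k in sorted(old):
            self.batch.append((k, self.speed.pop(k)))

    def scan(self) -> List[Tuple[Key, float]]:
        """A complete read has to merge both layers."""
        return sorted(self.batch + list(self.speed.items()))


store = TwoLayerStore()
store.write(("dev-1", 100), 20.5)
store.write(("dev-1", 200), 21.0)
store.flush(older_than=150)
print(len(store.batch), len(store.speed))  # 1 1
print(len(store.scan()))                   # 2
```

Even in this toy version, every reader must know about both layers, and a failure in the middle of `flush` could leave a row missing or duplicated, which is the consistency burden described above.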
1.3 Real-Time Processing
Batch and stream processing components: Apache Flume, Storm, Spark Streaming, Flink
Features that the storage layer of this architecture needs to support:

Row-by-row inserts
Low-latency random reads
Fast analytical scans
Updates

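Drawback: these features can be provided by combining different storage systems, but doing so makes the overall architecture very complex.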

1.4 Hardware Landscape
Disk I/O and CPU.

1.5 Kudu’s Unique Place in the Big Data Ecosystem
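Kudu is a scalable, fault-tolerant, standalone columnar storage system. It does not depend on other Hadoop ecosystem components and does not perform computation itself; computation is handled by external components such as MapReduce, Spark, and Impala.
Kudu itself is not a SQL engine. A Kudu table consists of a fixed set of typed columns, and a subset of those columns forms the table's primary key.
The primary key enforces a uniqueness constraint on rows.
Because storage is columnar, Kudu can apply efficient per-column encodings and scan data quickly.
Kudu scales and tolerates faults by scaling out horizontally: a table is split into tablets, the tablets are replicated, and the replicas of a tablet are kept consistent through the Raft protocol.
Kudu can provide all of the features required in 1.3 at once: low-latency random reads and writes, row-by-row inserts, updates, and fast analytical scans. In this sense it covers capabilities of both HBase and HDFS.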
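However, Kudu does not claim to be more efficient than HBase or HDFS for any one specific workload.

For the real-time analytics system described in 1.2, using Kudu as the storage layer removes the need to maintain several storage systems at once. Both real-time and historical data can be read from a single place with fast scans, which simplifies the system architecture.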
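The properties described in this section (a typed row addressed by a primary key, tables split into tablets, and one store serving inserts, updates, random reads, and scans) can be illustrated with a small in-memory model. This is not the Kudu client API: `ToyTable` and its methods are hypothetical, hash partitioning on the device id is an assumed partitioning scheme, and replication through Raft is not modeled at all.

```python
# In-memory model of a table with a primary key, hash-partitioned into
# tablets. This is NOT the Kudu client API; all names are hypothetical,
# and replication is deliberately left out.

import zlib
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # primary key: (device_id, timestamp)
Row = Dict[str, object]


class ToyTable:
    def __init__(self, num_tablets: int) -> None:
        self.tablets: List[Dict[Key, Row]] = [{} for _ in range(num_tablets)]

    def _tablet_for(self, key: Key) -> Dict[Key, Row]:
        # Assumed scheme: hash the device_id to pick a tablet.
        idx = zlib.crc32(key[0].encode("utf-8")) % len(self.tablets)
        return self.tablets[idx]

    def insert(self, key: Key, row: Row) -> None:
        tablet = self._tablet_for(key)
        if key in tablet:  # the primary key must be unique
            raise KeyError(f"duplicate primary key: {key}")
        tablet[key] = dict(row)

    def update(self, key: Key, changes: Row) -> None:
        self._tablet_for(key)[key].update(changes)

    def get(self, key: Key) -> Row:
        return self._tablet_for(key)[key]  # random read by primary key

    def scan(self, column: str) -> List[object]:
        # An analytical scan reads one column across every tablet.
        return [row[column] for t in self.tablets for row in t.values()]


table = ToyTable(num_tablets=3)
table.insert(("dev-1", 100), {"temp": 20.5})
table.insert(("dev-2", 100), {"temp": 18.0})
table.update(("dev-1", 100), {"temp": 21.0})
print(table.get(("dev-1", 100)))   # {'temp': 21.0}
print(sorted(table.scan("temp")))  # [18.0, 21.0]
```

Unlike the HBase + HDFS iteration, the same object answers both the point lookup and the scan, so there is no second copy of the data to keep in sync.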

1.6 Comparing Kudu with Other Ecosystem Components
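OLTP databases typically use row-oriented storage, which suits fetching entire rows and performing updates.
OLAP databases typically use column-oriented storage, which suits scanning a subset of columns over large volumes of data.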
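The difference between the two layouts can be shown with the same three records stored both ways. The field names and values are made up for illustration.

```python
# The same three readings stored two ways. Data is illustrative.

rows = [
    {"device": "dev-1", "ts": 100, "temp": 20.5},
    {"device": "dev-2", "ts": 100, "temp": 18.0},
    {"device": "dev-1", "ts": 200, "temp": 21.0},
]

# Row-oriented: all fields of a record are stored together (OLTP-friendly).
row_store = rows

# Column-oriented: all values of a column are stored together (OLAP-friendly).
column_store = {name: [r[name] for r in rows] for name in rows[0]}

# Fetching one whole record touches a single entry in the row store...
print(row_store[1])

# ...whereas averaging one column touches a single list in the column store
# and never reads the "device" or "ts" values.
temps = column_store["temp"]
print(sum(temps) / len(temps))
```

With the row layout, the average would have to read every field of every record just to pick out `temp`; with the column layout, fetching one complete record means visiting every column list.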

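Kudu combines characteristics of both OLTP and OLAP systems.
Kudu shares features with traditional relational databases: unlike HBase or Cassandra, the rows in a Kudu table have a unique primary key. However, Kudu does not yet support transactions, foreign keys, or non-primary-key indexes.
1.7 Big Data - HDFS, HBase, Cassandra
Kudu's design goal is to offer scan performance comparable to HDFS/Parquet together with random-read performance close to that of HBase or Cassandra.
(HBase and Cassandra use column-family-based storage, which is not fully columnar.)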
