Bigtable: A Distributed Storage System for Structured Data : part7 Performance Evaluation

7 Performance Evaluation
We set up a Bigtable cluster with N tablet servers to measure the performance and scalability of Bigtable as N is varied. 
The tablet servers were configured to use 1 GB of memory and to write to a GFS cell consisting of 1786 machines with two 400 GB IDE hard drives each.
N client machines generated the Bigtable load used for these tests. 
(We used the same number of clients as tablet servers to ensure that clients were never a bottleneck.)
Each machine had two dual-core Opteron 2 GHz chips, enough physical memory to hold the working set of all running processes, and a single gigabit Ethernet link.
The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. 
All of the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.
The tablet servers and master, test clients, and GFS servers all ran on the same set of machines. 

Every machine ran a GFS server. 
Some of the machines also ran either a tablet server, or a client process, or processes from other jobs that were using the pool at the same time as these experiments.

R is the number of distinct Bigtable row keys involved in the test.
R was chosen so that each benchmark read or wrote approximately 1 GB of data per tablet server.
The sequential write benchmark used row keys with names 0 to R − 1. 
This space of row keys was partitioned into 10N equal-sized ranges. 
These ranges were assigned to the N clients by a central scheduler that assigned the next available range to a client as soon as the client finished processing the previous range assigned to it. 
This dynamic assignment helped mitigate the effects of performance variations caused by other processes running on the client machines. 
We wrote a single string under each row key. 
Each string was generated randomly and was therefore uncompressible. 
In addition, strings under different row keys were distinct, so no cross-row compression was possible.
The random write benchmark was similar except that the row key was hashed modulo R immediately before writing so that the write load was spread roughly uniformly across the entire row space for the entire duration of the benchmark.
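To make the key-generation scheme concrete, here is a small sketch in C++ (the zero-padded key format, the use of std::hash, and the example values of N and R are assumptions for illustration; the paper only specifies the 0 to R-1 key space, the 10N ranges, and the hash-modulo-R step):

#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Sketch of how the write benchmarks pick row keys.
// R is chosen so that each tablet server sees roughly 1 GB of 1000-byte
// values, i.e. R is on the order of N * 1,000,000 for N tablet servers.
std::string SequentialKey(uint64_t i) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%016llu", (unsigned long long)i);  // assumed key format
  return buf;
}

std::string RandomKey(uint64_t i, uint64_t R) {
  // Random write benchmark: hash the index modulo R immediately before
  // writing, spreading load roughly uniformly over the whole row space.
  return SequentialKey(std::hash<uint64_t>{}(i) % R);
}

int main() {
  const uint64_t N = 4;               // tablet servers (example value)
  const uint64_t R = N * 1000000ULL;  // ~1 GB of 1000-byte values per server
  // The key space [0, R) is split into 10N equal-sized ranges; a central
  // scheduler hands the next unclaimed range to whichever client is idle.
  const uint64_t kRanges = 10 * N;
  const uint64_t kRangeSize = R / kRanges;
  for (uint64_t r = 0; r < kRanges; ++r) {
    std::printf("range %llu starts at key %s\n", (unsigned long long)r,
                SequentialKey(r * kRangeSize).c_str());
  }
  std::printf("random-write key for index 42: %s\n", RandomKey(42, R).c_str());
  return 0;
}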

The sequential read benchmark generated row keys in exactly the same way as the sequential write benchmark, but instead of writing under the row key, it read the string stored under the row key (which was written by an earlier invocation of the sequential write benchmark). 
Similarly, the random read benchmark shadowed the operation of the random write benchmark.

The scan benchmark is similar to the sequential read benchmark, but uses support provided by the Bigtable API for scanning over all values in a row range.
Using a scan reduces the number of RPCs executed by the benchmark since a single RPC fetches a large sequence of values from a tablet server.
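The following rough accounting illustrates the saving for one tablet server's worth of data; the number of values returned per scan RPC is not given in the paper, so the batch size below is only an assumed example:

#include <cstdint>
#include <cstdio>

// Rough RPC accounting: reading ~1 GB of 1000-byte values from one tablet
// server with one Lookup RPC per value versus a streaming scan.
int main() {
  const uint64_t kValues = 1000000000ULL / 1000;  // ~1e6 values per server
  const uint64_t kValuesPerScanRPC = 10000;       // assumed batch size
  const uint64_t per_key_rpcs = kValues;          // one RPC per value
  const uint64_t scan_rpcs = (kValues + kValuesPerScanRPC - 1) / kValuesPerScanRPC;
  std::printf("per-key reads: %llu RPCs, scan: %llu RPCs (%.0fx fewer)\n",
              (unsigned long long)per_key_rpcs, (unsigned long long)scan_rpcs,
              (double)per_key_rpcs / (double)scan_rpcs);
  return 0;
}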
The random reads (mem) benchmark is similar to the random read benchmark, but the locality group that contains the benchmark data is marked as in-memory, and therefore the reads are satisfied from the tablet server’s memory instead of requiring a GFS read. 
For just this benchmark, we reduced the amount of data per tablet server from 1 GB to 100 MB so that it would fit comfortably in the memory available to the tablet server.

Figure 6 shows two views on the performance of our benchmarks when reading and writing 1000-byte values to Bigtable.
The table shows the number of operations per second per tablet server; the graph shows the aggregate number of operations per second.

Single tablet-server performance
Let us first consider performance with just one tablet server. 
Random reads are slower than all other operations by an order of magnitude or more. 
Each random read involves the transfer of a 64 KB SSTable block over the network from GFS to a tablet server, out of which only a single 1000-byte value is used. 
The tablet server executes approximately 1200 reads per second, which translates into approximately 75 MB/s of data read from GFS. 
This bandwidth is enough to saturate the tablet server CPUs because of overheads in our networking stack, SSTable parsing, and Bigtable code, and is also almost enough to saturate the network links used in our system. 
Most Bigtable applications with this type of an access pattern reduce the block size to a smaller value, typically 8KB.
Random reads from memory are much faster since each 1000-byte read is satisfied from the tablet server’s local memory without fetching a large 64 KB block from GFS.
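The arithmetic behind those figures: 1200 reads/s × 64 KB per read = 76,800 KB/s, or approximately 75 MB/s pulled from GFS, even though only 1000 of the 65,536 bytes in each transferred block (roughly 1.5%) are returned to the client.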

Random and sequential writes perform better than random reads since each tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes efficiently to GFS. 
There is no significant difference between the performance of random writes and sequential writes; in both cases, all writes to the tablet server are recorded in the same commit log.
Sequential reads perform better than random reads since every 64 KB SSTable block that is fetched from GFS is stored into our block cache, where it is used to serve the next 64 read requests.
Scans are even faster since the tablet server can return a large number of values in response to a single client RPC, and therefore RPC overhead is amortized over a large number of values.
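The sketch below illustrates the group-commit idea in isolation (it is not Bigtable's implementation, and the log-append call is a stub standing in for a streaming write to a commit log file in GFS): mutations from many concurrent writers are buffered and written to the log in a single append, so the cost of the append is amortized over every write in the batch.

#include <cstdio>
#include <mutex>
#include <string>
#include <vector>

// Sketch of group commit: buffer mutations from concurrent writers and
// flush them to the single commit log in one append.
class GroupCommitLog {
 public:
  void Add(const std::string& mutation) {
    std::lock_guard<std::mutex> l(mu_);
    pending_.push_back(mutation);
  }

  // Called periodically or when enough writes have accumulated: one log
  // append (one GFS write in Bigtable's case) covers every buffered write.
  void Flush() {
    std::vector<std::string> batch;
    {
      std::lock_guard<std::mutex> l(mu_);
      batch.swap(pending_);
    }
    if (batch.empty()) return;
    std::string record;
    for (const auto& m : batch) record += m;  // concatenate into one log record
    AppendToLogFile(record);                  // stub for the actual log write
    std::printf("flushed %zu writes in one append\n", batch.size());
  }

 private:
  static void AppendToLogFile(const std::string& /*record*/) {
    // In the real system this would be a streaming append to a log in GFS.
  }
  std::mutex mu_;
  std::vector<std::string> pending_;
};

int main() {
  GroupCommitLog log;
  for (int i = 0; i < 1000; ++i) log.Add("row" + std::to_string(i) + "=value");
  log.Flush();  // 1000 writes, one log append
  return 0;
}

The 64-request figure for the block cache follows from the sizes involved: a 64 KB block holds roughly 64 of the 1000-byte values, so a single GFS fetch satisfies the next ~64 sequential read requests out of the cache.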

Scaling
Aggregate throughput increases dramatically, by over a factor of a hundred, as we increase the number of tablet servers in the system from 1 to 500. 

Table 1: Distribution of number of tablet servers in Bigtable clusters.


For example, the performance of random reads from memory increases by almost a factor of 300 as the number of tablet servers increases by a factor of 500.
This behavior occurs because the bottleneck on performance for this benchmark is the individual tablet server CPU.
However, performance does not increase linearly. 
For most benchmarks, there is a significant drop in per-server throughput when going from 1 to 50 tablet servers. 
This drop is caused by imbalance in load in multiple server configurations, often due to other processes contending for CPU and network. 
Our load balancing algorithm attempts to deal with this imbalance, but cannot do a perfect job for two main reasons: rebalancing is throttled to reduce the number of tablet movements (a tablet is unavailable for a short time, typically less than one second, when it is moved), and the load generated by our benchmarks shifts around as the benchmark progresses.
The random read benchmark shows the worst scaling (an increase in aggregate throughput by only a factor of 100 for a 500-fold increase in number of servers).
This behavior occurs because (as explained above) we transfer one large 64 KB block over the network for every 1000-byte read.
This transfer saturates various shared 1 Gigabit links in our network and as a result, the per-server throughput drops significantly as we increase the number of machines.
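To see why the shared links become the bottleneck, note that each 1000-byte random read moves a full 64 KB block across the network. With roughly 100-200 Gbps (about 12.5-25 GB/s) of aggregate bandwidth at the root of the switching tree, the cluster as a whole can sustain on the order of 190,000-380,000 such reads per second (12.5-25 GB/s divided by 64 KB) no matter how many tablet servers are added, and the shared links mentioned above saturate well before that aggregate limit is reached.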
