《数据库理论与技术》期终考试复习题
一、填空题:
- 数据仓库是面向(主题的)、(集成的) 、(相对稳定的)、(反映历史变化)的数据集合.
- 多维数据模型由(维表)和(事实表)定义,其常见的形式有(星型)、(雪花型) 及(事实星座型);在多维立方体的某一维上选定一个维成员值的操作称为(切片)。
- 在ORDB中,从已知的属性值找未知的属性值时沿途经过的属性名构成的式子称为(路径表达式);将一个嵌套关系转换为1NF的过程称为(解除嵌套);在ODMG对象模型中,由类中所有持久对象构成的集合称为(类外延);
- 描述时间数据的最小时间单位称为(时间粒度);而(系统时间量子(Chronon))是计算机系统所支持的最小的、不可分割的时间间隔。时态数据模型中的三种时间分别是(用户自定义时间(user-defined time))、(有效时间(valid time))、(事务时间(transaction time))。
- 分布式数据库的六级模式结构分别是(全局外模式)、(全局概念模式)、(分片模式)、(分配模式)、(局部概念模式)、(局部内模式);分布式数据库系统中,查询代价主要由(CPU代价)、(I/O代价)、(通信代价)三部分构成。
- 两关系间的连接运算可通过(嵌套循环联接)、(块嵌套循环联接)、索引嵌套循环联接
- (归并联接)及(散列联接)、等不同方法实现,当采用块嵌套循环方法时,通常将较大的关系作为嵌套循环计算过程中的(内)关系。
- 线性散列是为了解决可扩展散列面临的桶的(桶的数目过快增长,但其利用率不足)及( )问题而引入的。
- 乐观的并发控制协议中,事务的执行可划分为(读与执行阶段)、(有效性检查阶段)、(写阶段)等三个阶段。
- 影响并行数据库系统性能的3个因素分别是(启动代价)、(干扰)和(倾斜)。
- 在并行数据库中,将互相独立的多个操作或者一个操作内互相独立的多个子操作分配给不同的处理机并行执行称为(流水线(水平)并行),将存在流水线方式依赖关系的多个操作分配给不同处理机并行执行称为(独立(垂直)并行)。
- CAP定理中的C、A、P分别是指(一致性)、(可用性)、(分区容忍性)。
- 区块链是一个(去中心化)、(去信任)的公开透明分布式账本。
- 按照开放程度,区块链可分为(公有链)、(私有链)和(联盟链)等三种类型;区块链的基础架构模型包括(应用层)、(合约层)、(激励层)、(共识层)、(网络层)、(数据层)等六层。
二、选择题:
-
The separation of the data definition from the program is known as: ______________b
a) Data dictionary
b) Data independence
c) Data integrity
d) Referential integrity -
In order to reduce the overhead in retrieving the records from the storage space we use: ____________b
a) Logs
b) DB buffer
c) Medieval space
d) Lower records
Answer: b
Explanation: The output to stable storage is in units of blocks. -
The main reason why data is retrieved in “blocks” is to: a
a) minimize the number of actual disk accesses
b) save space
c) make programming easier
d) increase the lifetime of hardware -
The order of log records in the stable storage ____________ as the order in which they were written to the log buffer.a
a) Must be exactly the same b) Can be different c) Is opposite
d) Can be partially same
Answer: a
Explanation: As a result of log buffering, a log record may reside in only main memory (volatile storage) for a considerable time before it is output to stable storage. -
The difference between files storing spanned(跨块的)versus unspanned (非跨块的)records is that:______________
a) a file with spanned records will use less disk space for storing records than with unspanned records, if an integral number of records do not fit in a block.
b) a file with spanned records can have records that are stored on more than one disk block.
c) a file with spanned records must be used when the size of a record is larger than the block size.
d) all of the above. e) NOA. -
An index is used in relational database systems to:______________b
a) improve the efficiency of normalizing relations.
b) improve the efficiency of retrieving records from a relation.
c) improve the efficiency of the Create Table statement.
d) b and c only. e) NOA. -
Which of the following would not be considered a benefit of indexing? ____________b
a) To improve performance during data sorting.
b) To insure uniqueness of key values.
c) To improve performance during large sequential table scans.
d) To avoid reading the records when processing queries that retrieve only indexed columns.
e) To help query optimizer in cost estimation. -
The difference between a dense(稠密) index and a sparse(稀疏) index is that: __________c
a) a dense index contains keys and pointers for a subset of the records whereas a sparse index contains keys and pointers for every record.
b) a dense index can only be a primary index whereas a sparse index can only be a secondary index.
c) a dense index contains keys and pointers for each record whereas a sparse index contains keys and pointers for a subset of the records.
d) the size of dense index is always smaller than the size of sparse index. -
In ordered indices the file containing the records is sequentially ordered, a ___________ is an index whose search key also defines the sequential order of the file.
a) Clustered index b) Structured index c) Unstructured index d) Nonclustered index
Answer: a
Explanation: Clustering index are also called primary indices; the term primary index may appear to denote an index on a primary key, but such indices can in fact be built on any search key. -
Indices whose search key specifies an order different from the sequential order of the file are called ___________ indices.
a) Nonclustered b) Secondary c) All of the mentioned d) None of the mentioned
Answer: c
Explanation: Nonclustering index is also called secondary indices. -
Consider the table: Rented(cust_no, vid_no, start_date, return_date).Suppose you have created an index with the command:
create index indx1 on rented (start_date, returned_date);
start_date is the main ordering criterion for the index entries; returned_date becomes only important if two entries have the same start_date. Which of the following conditions can be evaluated using the index? (Circle ALL that apply)
a) start_date between ‘15-10-99’ and ‘20-10-99’ b) returned_date >= ‘20-10-99’
c) start_date=‘15-10-99’ and returned_date>'16-10-99’
d) start_date=‘15-10-99’ or returned_date>‘16-10-99’ -
When searching a B±tree for a range of key values: ______________
a) the search always starts at the root node. b) the search always ends at a leaf node.
c) multiple leaf nodes may be accessed. d) all of the above.
e) a and b only. -
The insertion of a record in a B±tree will always cause the height of the tree to increase by one when: ______________
a) the tree consists of only a root node. b)the record is to be inserted into a full leaf node.
c) all the nodes in the path from the root to the desired leaf node are full before insertion.
d) all the nodes in the B±tree are half full. e) NOA. -
In a B±tree index ______ for each value, we would normally maintain a list of all records with that value for the indexed attribute.
a) Leaf b) Node c) Root d) Link
Answer: a
Explanation: Bitmaps are combined and stored in a B+ tree. -
Bitmap indices are a specialized type of index designed for easy querying on: ________
a) Bit values b) Binary digits c) Multiple keys d) Single keys
Answer: c
Explanation: Each bitmap index is built on a single key. -
SELECT *
FROM r
WHERE gender = ’f’ AND income level = ’L2’;
In this selection, we fetch the bitmaps for gender value f and the bitmap for income level value L2, and perform an ______________of the two bitmaps.
a) Union 并 b) Addition c) Combination d) Intersection 交
Answer: d
Explanation: We compute a new bitmap where bit i has value 1 if the ith bit of the two bitmaps are both 1, and has a value 0 otherwise. -
To identify the deleted records we use the _________
a) Existence bitmap b) Current bitmap c) Final bitmap d) Deleted bitmap
Answer: a
Explanation: The bitmaps which are deleted are denoted by 0. -
Bitmaps can be used as a compressed storage mechanism at the leaf nodes of ________ for those values that occur very frequently.
a) B-trees b) B±trees c) Bit trees d) Both B-trees and B±trees
Answer: b
Explanation: Bitmaps are combined and stored in a B+ tree. -
A(n) _________ can be used to preserve the integrity of a document or a message.
a) Message digest b) Message summary c) Encrypted message
d) None of the mentioned
Answer: c
Explanation: Encryption algorithms are used to keep the contents safe. -
In the client / server model, the database: _________
a) Is downloaded to the client upon request b) Is shared by both the client and server
c) Resides on the client side d) Resides on the server side
Answer: d
Explanation: The server has all the database information and the client access it. -
Before a block of data in main memory can be output to the database, all log records pertaining to data in that block must have been output to stable storage. This is: _________
a) Read-write logging b) Read-ahead logging
c) Write-ahead logging d) None of the mentioned
Answer: c
Explanation: The WAL rule requires only that the undo information in the log has been output to stable storage, and it permits the redo information to be written later. -
Writing the buffered log to __________ is sometimes referred to as a log force.
a) Memory b) Backup c) Redo memory d) Disk
Answer: d
Explanation: If there are insufficient log records to fill the block, all log records in main memory are combined into a partially full block and are output to stable storage. -
The ___________ policy, allows a transaction to commit even if it has modified some blocks that have not yet been written back to disk.
a) Force b) No-force c) Steal d) No-steal
Answer: b
Explanation: No-force policy allows faster commit of transactions. -
The __________ policy allows multiple updates to accumulate on a block before it is output to stable storage, which can reduce the number of output operations greatly for frequently updated blocks.
a) Force b) No-force c) Steal d) No-steal
Answer: b
Explanation: No-force policy allows faster commit of transactions. -
The ___________ policy, allows the system to write modified blocks to disk even if the transactions that made those modifications have not all committed.
a) Force b) No-force c) Steal d) No-steal
Answer: c
Explanation: The no-steal policy does not work with transactions that perform a large number of updates. -
The operating system reserves space on disk for storing virtual-memory pages that are not currently in main memory; this space is called
a) Latches b) Swap Space c) Dirty Block d) None of the mentioned
Answer: b
Explanation: Almost all current-generation operating systems retain complete control of virtual memory. -
Locks on buffer blocks are unrelated to locks used for concurrency-control of transactions, and releasing them in a non-two-phase manner does not have any implications on transaction serializability. This is:
a) Latches b) Swap Space c) Dirty Block d) None of the mentioned
Answer: a
Explanation: These locks, and other similar locks that are held for a short duration. -
ARIES uses a ___________ to identify log records, and stores it in database pages.
a) Log sequence number b) Log number c) Lock number d) Sequence number
Answer: a
Explanation: LSN is used to identify which operations have been applied to a database page. -
ARIES supports ___________ operations, which are physical in that the affected page is physically identified, but can be logical within the page.
a) Physiological redo b) Physiological undo c) Logical redo d) Logical undo
Answer: a
Explanation: The deletion of a record from a page may result in many other records in the page being shifted, if a slotted page structure is used. -
The ______________ is used to minimize unnecessary redos during recovery.
a) Dirty page table b) Page table c) Dirty redo d) All of the mentioned
Answer: a
Explanation: Dirty pages are those that have been updated in memory, and the disk version is not up-to-date. -
If a piece of data is stored in two places in the database, then __________
a) Storage space is wasted
b) Changing the data in one spot will cause data inconsistency
c) It can be more easily accessed
d) Storage space is wasted & Changing the data in one spot will cause data inconsistency -
When a database is implemented on a single server regardless of the number of physical locations that may require access to it is known as:
a) centralization b)horizontal distribution 水平 c) vertical distribution 垂直
d) replication e) none of these -
When a table or entire rows in a table are assigned to different database servers and locations, it is known as:
a) centralization b) horizontal distribution c) vertical distribution
d) replication e) none of these -
By sending messages through Paxos, processes can ensure that concurrent messages become:
a) Global time ordered multicasts. b) Totally ordered multicasts.
c) Causally ordered multicasts. d) Unordered multicasts.
Paxos ensures that each message gets a unique sequence number. Any proposal for a lower sequence number than one that has already been used or promised will be rejected. -
Based on their vector timestamps, which event causally precedes (4, 2, 8, 5)?
a) (3, 1, 7, 7) b) (5, 1, 6, 2) c) (4, 2, 8, 4) d) (4, 3, 8, 5) -
With Paxos, message sequence numbers are assigned by the:
a) Acceptor b) Client c) Learner d) Proposer -
Paxos reaches agreement when:
a) All proposers agree on a value to send to the acceptors.
b) All acceptors agree to a proposed value.
c) The majority proposers agree on a value to send to the acceptors.
d) The majority of acceptors agree to a proposed value. -
A write-ahead log does NOT enable a process to:
a) Undo changes in case of an abort. b) Record that a transaction has committed.
c) Record the response that was sent to a vote in a commit protocol.
d) Achieve higher performance by prefetching data. -
To abort a distributed transaction:
a) At least one participant must vote to abort.
b) The majority of participants must vote to abort.
c) All of the participants must vote to abort.
d) All live participants must vote to abort. -
The motivation for an eventual consistency model was:
a) It is impossible to have highly available replicated data that is fully consistent in a system that can survive network partitioning.
b) Data inconsistencies are highly undesirable, so the primary focus should be on ensuring that all replicas are consistent.
c) Distributed commit protocols can never work reliably, so data is bound to become inconsistent on some participants.
d) Data might be in an inconsistent state during the execution of transactions but will be made consistent when they commit. -
Which of these operations is most efficiently implemented on a large-scale GFS (Google File System) system?
a) Read one 1 TB file. b) Read 1 million 1 MB files.
c) Write one 1 TB file. d) Write 1 million 1 MB files. -
The partitioning function in MapReduce is
a) Determines which shard will be assigned to a specific map worker.
b) Filters out unnecessary input data prior to being processed by the map worker.
c) Determines the division of available servers into map workers and reduce workers.
d) Determines which reduce worker will process data associated with a particular key. -
Which of the following is NOT a responsibility of a map worker in the MapReduce framework:
a) Generate (key, value) pairs. Y
b) Target (key, value) data for one of Rreduce workers. N
c) Partition original data into shards. Y d) Discard data of no interest. -
A key design principle of the Google cluster architecture is
a) Merging is fast; break up databases into lots of pieces and use lots of processes, each working on a piece of the data.
b) Minimize the number of phases in a task; create many replicas of a database and allow each system full access to it without the need for sharing or locking.
c) Minimize context switch overhead; machines are cheap and it’s more efficient to devote a node to one task exclusively.
d) Machines and software fail; run the same task redundantly in parallel to ensure successful execution.
三、简答题:
-
[Buffer Management]Suppose you have a Buffer Pool that can hold 3 pages. There are 26 blocks on your disk; for simplicity we will refer to each block with a letter of the English alphabet between A and Z. An “access pattern” is a string of letters that log the requests to the Buffer Pool for pages. For each access pattern below, write down next to it the number of I/Os that would occur starting with an empty buffer pool each time, for both the LRU and MRU replacement policies. (Spaces in the access patterns are there just for legibility.)
LRU算法执行过程网上资料很多,这里不过多叙述,知道一点即可(发生中断的次数和页面置换次数不同,I/O次数和中断次数保持一致)。MRU算法的执行过程如下: -
[Disk Manager.]Consider a disk with a sector size of 512 bytes, 2000 tracks per surface, 50 sectors per track, five double-sided platters, and average seek time of 10 msec. Suppose a block size of 1024 bytes is chosen. A file containing 100k records of 100 bytes each is to be stored on such a disk and that no record is allowed to span two blocks.
(1) If the disk platters rotate 5400 rpm, what is the maximum rotational delay? (1)如果磁盘盘旋转5400 rpm 转/分,最大旋转延迟是多少?
影响机械磁盘速度的因素:寻道时间、旋转延迟、数据传输时间;
1.寻道时间:磁头从开始移动到数据所在磁道所需要的时间
2.旋转延迟:盘片旋转将请求数据所在扇区移至读写磁头下方所需要的时间,旋转延迟取决于磁盘转速
3.数据传输时间:完成传输所请求的数据所需要的时间
所谓旋转延迟时间就是从当前扇区到目标扇区的时间,最长自然是一个旋转一个磁道的时间:5400转/分=90转/秒;1/90 =0.01…秒
(2) If one track of data can be transferred per revolution, what is the transfer rate? (2)如果每转可以传输一个磁道数据,传输速率是多少?
数据/时间=50*512B/(1/90 s)=2250 KB/S
(3) What is the time required to read the file sequentially? What if the disk were capable of reading/writing from all heads in parallel?
(3)顺序读取文件需要多少时间? 如果磁盘能够并行读取/写入所有磁盘,该怎么办?
分三个时间:寻道,旋转延迟,数据传输
寻道时间:取平均寻道时间10ms,相邻磁道的寻道时间不计
旋转延迟:访问第一个块的时候需要,取平均旋转延迟:(1/90)/2 s=0.005555…s
数据传输时间:4.444…S
总时间:0.01+0.0055…+4.444…=4.4599…S
而关于并行的问题,太复杂,不予考虑了。。。
(4) What is the time required to read each block in the file in a random order? (assuming that each block request incurs the average seek time and rotational delay) (4)以随机顺序读取文件中的每个块需要多少时间? (假设每个块请求都会产生平均搜索时间和轮换延迟)
寻道时间:取平均寻道时间10ms*10000(块数)=100s
旋转延迟:(1/90)/2 s *10000=55.5555…s
数据传输时间:4.444…S
总时间:159.99…s
3. [B+ Trees.]Consider a B+ tree containing the elements 2,4,8,9,11,12,13,14,16,17,18,20. The tree has order d = 2. This means that internal nodes have at least 2 keys and 3 pointers, and at most 4 keys and 5 pointers; leaf nodes have at least 2 data entries and at most 4 data entries.
(1) In the picture below, draw what the tree looks like if the leaves are as densely packed as possible. Draw directly on the picture.
(2) Based on your original tree from 1, what is the minimum number of inserts that could cause a root split?
(2)根据您从1开始的原始树,可导致根分裂的最小插入数是多少?
因为树的根节点至多有4个key,目前已经有2个key了,只需要再引入两个key,就4个key了,再引入第三个key时,根节点一定分裂。而引入的这三个key一定是通过较根节点次一级的节点分裂引入的。观察1中的图,根节点下有三个叶子节点,如果这三个叶子节点都分裂的话,就可以通过最少的插入为根节点引入三个key.既然如此,给出一种可能的插入情形:6,15,21
(3) Draw the tree just after the root has split with a minimum of insertions as you described in 2. Use any value you like that would cause the split. We have started you off by drawing the root and internal nodes; you need to fill them in and draw the leaves.
(4) Given the B+ tree shown in Figure 1, draw the tree after deleting keys 19, 20, 24 in the given order.
https://www.jianshu.com/p/71700a464e97
- 下图为插入关键字0000、0101、1111、1010后的线性散列索引结果,其中:i-当前使用的散列函数的位数,n-当前的桶,r-当前散列表中的记录总数,要求 r<=1.7n。
试分别给出插入以下关键字后的线性散列索引结果。
a) 0001
b) 0111
c) 1011
5. Consider the following log where the DPT represents the Dirty Page Table and TT represents the Active Transaction Table
Answer the following questions (using the ARIES-like algorithm we studied in class):
- What is the smallest LSN accessed during the Analysis phase?
- Fill in the contents of the Dirty Page Table and the Transaction Table at the end of the Analysis phase. (you may not need all the space we give you)
- At which LSN does the Redo phase begin?
- What entries (specify only LSNs) do get undone as part of the Undo phase?
- 假设一个事务应用是用SQL嵌入到C中编写的,其中80%的时间花费在运行SQL语句上,其余20%的时间花费在运行C代码上,如果只对SQL进行了并行化,那么可以期望得到多大的加速比? 为什么?
c语言中嵌入sql语句,假设总的执行时间100s,并行化操之前:则SQL执行占了80s,c执行占了20s;并行化之后,SQL执行的具体时间无法确定(因条件不足,比如说没有给出有多少台处理器,它们处理速度如何(处理速度应该是按每台执行SQL80s),本题中的SQL语句多大程度上可以并行化,这与具体的SQL语句有关),但我们可以给出并行化后SQL语句的执行时间范围。0 s<t< 80s,则根据加速比公式:
100/(80+20)=1<加速比<5=100/(0+20)