InnoDB 索引页物理结构和逻辑结构

一、The physical structure of InnoDB index pages

Every index page has an overall structure as follows:

InnoDB 索引页物理结构和逻辑结构
InnoDB’s data storage model uses “spaces”, often called “tablespaces” in the context of MySQL, and sometimes called “file spaces” in InnoDB itself. A space may consist of multiple actual files at the operating system level (e.g. ibdata1, ibdata2, etc.) but it is just a single logical file — multiple physical files are just treated as though they were physically concatenated together.
InnoDB的数据存储模型使用“spaces”,在MySQL上下文中通常称为“tablespaces”,或"file spaces"。一个space可能在操作系统级别由多个实际文件组成(例如ibdata1,ibdata2等),但它只是一个逻辑文件。

Each space in InnoDB is assigned a 32-bit integer space ID, which is used in many different places to refer to the space. InnoDB always has a “system space”, which is always assigned the space ID of 0. The system space is used for various special bookkeeping that InnoDB requires. Through MySQL, InnoDB currently only supports additional spaces in the form of “file per table” spaces, which create an .ibd file for each MySQL table. Internally, this .ibd file is actually a fully functional space which could contain multiple tables, but in the implementation with MySQL, they will only contain a single table.
InnoDB中的每个space都分配有一个32位integer空间ID,该ID在许多不同的地方用于引用该空间。 InnoDB始终有一个“系统空间”,该空间始终分配有空间ID0。该系统空间用于InnoDB所需的各种特殊簿记。通过MySQL,InnoDB当前仅支持“file per table”空间形式的空间,这些空间为每个MySQL表创建一个.ibd文件。在内部,此.ibd文件实际上是一个功能齐全的空间,可以包含多个表,但是在使用MySQL的实现中,它们将仅包含一个表。

Each space is divided into pages, normally 16 KiB each (this can differ for two reasons: if the compile-time define UNIV_PAGE_SIZE is changed, or if InnoDB compression is used). Each page within a space is assigned a 32-bit integer page number, often called “offset”, which is actually just the page’s offset from the beginning of the space (not necessarily the file, for multi-file spaces). So, page 0 is located at file offset 0, page 1 at file offset 16384, and so on. (The astute may remember that InnoDB has a limit of 64TiB of data; this is actually a limit per space, and is due primarily to the page number being a 32-bit integer combined with the default page size: 2^32 x 16 KiB = 64 TiB.)
每个空间都分为几页,通常每页16 KiB。 空间中的每个页面都分配有一个32bit的Integer页码,通常称为“偏移量”,它实际上只是页面与空间开头的偏移量(对于多文件空间,不一定是文件的偏移量)。 因此,页面0位于文件偏移量0,页面1位于文件偏移量16384,依此类推。 (InnoDB的数据限制为64TiB;这实际上是每个空间的限制,这主要是由于页码是32位整数与默认页大小的组合:2^32 x 16 KiB = 64 TiB。)

总结:
The index is physically stored in pages, which are containers of sizeof 16 KiB. The pages are stored in tablespace that resides in ibdataX files (global tablespace) or in *.ibd files if the file-per-table feature is active.

1.The FIL header and trailer

This is typical and common to all page types. One aspect somewhat unique to INDEX pages is that the “previous page” and “next page” pointers in the FIL header point to the previous and next pages at the same level of the index, and in ascending order based on the index’s key. This forms a doubly-linked list of all pages at each level.

FIL header指向同级别的privious page,FIL trailer指向同级别的next page.,从而在同级别的所有page构成双向链表结构,如下图。
InnoDB 索引页物理结构和逻辑结构

2.System records

Every INDEX page contains two system records, called infimum and supremum, at fixed locations (offset 99 and offset 112 respectively) within the page.

  • The infimum record:The infimum record represents a value lower than any possible key in the page. Its “next record” pointer points to the user record with the lowest key in the page. Infimum serves as a fixed entry point for sequentially scanning user records.
    infimum record代表其值小于此page的任何值,它的"next record"指针指向此page的最小key.Infimum用作扫描记录的入口。

  • The supremum record:The supremum record represents a key higher than any possible key in the page. Its “next record” pointer is always zero (which represents NULL, and is always an invalid position for an actual record, due to the page headers). The “next record” pointer of the user record with the highest key on the page always points to supremum.
    supremum record代表其值大于此page的任何值,它的"next record"指针指向null。此page最大记录指向"next record".

3.User records

The actual data.Every record has a variable-width header and the actual column data itself. The header contains a “next record” pointer which stores the offset to the next record within the page in ascending order, forming a singly-linked list.
真实数据。其中"next record"指针存储next record的偏移地址,从而构成单向链表。在此page中,record升序排序,

Further, using the “next page” pointer in the FIL header, it’s easy to scan from page to page through the entire index in ascending order. This means an ascending-order table scan is also trivial to implement:
使用page的"next page"和record的"next record",就能以递增的顺序遍历整个索引结构.

  1. Start at the first (lowest key) page in the index. (This page is found through B+tree traversal)
  2. Read infimum, and follow its “next record” pointer.
  3. If the record is supremum, proceed to step 5. If not, read and process the record contents.
  4. Follow the “next record” pointer and return to step 3.
  5. If the “next page” pointer points to NULL, exit. If not, follow the “next page” pointer and return to step 2.

把User Record的组织形式和若干Page组合起来,Record节点遍历:
InnoDB 索引页物理结构和逻辑结构

4.The INDEX header

Contains many fields related to INDEX pages and record management.
InnoDB 索引页物理结构和逻辑结构
(Garbage,结合二、Data Removal看)

  • Number of Heap Records: The total number of records in the page, including the infimum and supremum system records, and garbage (deleted) records.
  • Number of Records: The number of non-deleted user records in the page.
  • Heap Top Position: The byte offset of the “end” of the currently used space. All space between the heap top and the end of the page directory is free space.
  • First Garbage Record Offset: A pointer to the first entry in the list of garbage (deleted) records. The list is singly-linked together using the “next record” pointers in each record header. (This is called “free” within InnoDB, but this name is somewhat confusing.)
  • Garbage Space: The total number of bytes consumed by the deleted records in the garbage record list.
  • Last Insert Position: The byte offset of the record that was last inserted into the page. Number of Inserts in Page Direction: Once the page direction is set, any following inserts that don’t reset the direction (due to their direction differing) will instead increment this value.
  • Page Level: The level of this page in the index. Leaf pages are at level 0, and the level increments up the B+tree from there. In a typical 3-level B+tree, the root will be level 2, some number of internal non-leaf pages will be level 1, and leaf pages will be level 0.

二、Data Removal

For performance reasons, InnoDB does not physically delete records. In fact, data records persist after deletion as a result of delete flags. These garbage records are overwritten in the future if the space is needed. As mentioned above, InnoDB uses a singly-linked list for navigation within a page. Specifically, InnoDB uses two INDEX header fields: (i) a pointer to the start of the free record list of a page; and (ii) a field that stores the number of bytes of deleted records. Figures 1 and 2 illustrate the deletion process of a data record (note that “@x” denotes an x-byte page offset, i.e., the physical data address in the filesystem). The garbage offset points to the first deleted record on the current page. As in the case of stored data records, InnoDB uses a singly-linked list for deleted records; the last deleted record points to itself, which signals the end of the list.
出于性能考虑,InnoDB不会物理删除records。会用"delete flag"标记为删除。若此page空间不够,则会覆写garbage records。在page内部,InnoDB会用单链表导航。InnoDB在"INDEX Header"中两个字段(i)指向页面空闲记录的开始的指针(First Garbage Record Offset);(ii)存储删除记录的字节数的字段(Garbage Space)。

图1和图2示出了数据记录的删除过程(“ @ x”表示x字节的页偏移,即文件系统中的物理数据地址)。 "First Garbage Record Offset"指向当前page的第一个删除记录。 与存储record一样,InnoDB对删除的记录使用单链接列表。 最后删除的记录指向自身,表示列表结束。
InnoDB 索引页物理结构和逻辑结构
When a record is deleted, InnoDB changes the deleted flag to one. Also, it updates the next pointer of the previous record in ascending order and points to the next record or the supremum if the last record on the page is deleted. Additionally, the INDEX header is updated, i.e.,the garbage size field is increased by the size of the deleted record and the pointer to the last inserted record is overwritten with 0x00000 (Offset:0x0A). Internally, the last record of the deleted record list now points to the currently deleted record instead of to itself.
删除记录(假设为B)后,InnoDB会将已"delete flag"更改。 同样,它会更新A的"next record"指针指向C或者supremum record。"the INDEX header"也会更改,“Garbage Space"增加,指向” last inserted record"的指针用0x00000(Offset:0x0A)覆盖。 上一个被删除的记录的next record指针指向当前的记录。

三、B+Tree index structures in InnoDB

InnoDB uses a B+Tree structure for its indexes. A B+Tree is particularly efficient when data doesn’t fit in memory and must be read from the disk, as it ensures that a fixed maximum number of reads would be required to access any data requested, based only on the depth of the tree, which scales nicely.
InnoDB用B + Tree结构做索引。当数据不能容纳在内存中且必须从磁盘读取时,B + Tree尤其有效,因为访问任何数据时读取次数(树的深度)固定,可以很好地扩展。

An index tree starts at a “root” page, whose location is fixed (and permanently stored in the InnoDB’s data dictionary) as a starting point for accessing the tree. The tree may be as small as the single root page, or as large as many millions of pages in a multi-level tree.
索引树从“根”页面开始,该页面的位置是固定的(并永久存储在InnoDB的数据字典中),作为访问树的起点。该树可能小到只有root page,也可能大到多级树中的数百万page。

Pages are referred to as being “leaf” pages or “non-leaf” pages (also called “internal” or “node” pages in some contexts). Leaf pages contain actual row data. Non-leaf pages contain only pointers to other non-leaf pages, or to leaf pages. The tree is balanced, so all branches of the tree have the same depth.
页面被称为"leaf page"或"non-leaf page"。"leaf page"包含实际的行数据。"non-leaf page"仅包含指向其他"non-leaf page"或"leaf page"的指针。树是平衡的,因此树的所有分支具有相同的深度。

InnoDB assigns each page in the tree a “level”: leaf pages are assigned level 0, and the level increments going up the tree. The root page level is based on the depth of the tree. All pages that are neither leaf pages nor the root page can also be called “internal” pages, if a distinction is important.
InnoDB为树中的每个page分配一个“级别”:"leaf page"级别为0,级别自底向上递增。根页面级别基于树的深度。

1.Leaf and non-leaf pages

For both leaf and non-leaf pages, each record (including the infimum and supremum system records) contain a “next record” pointer, which stores an offset (within the page) to the next record. The linked list starts at infimum and links all records in ascending order by key, terminating at supremum. The records are not physically ordered within the page (they take whatever space is available at the time of insertion); their only order comes from their position in the linked list.
对于叶子页面和非叶子页面,每个record (包括infimum and supremum)都包含"next record"指针,该指针存储下一个record 的偏移量。链接列表从无数开始,并按键升序链接所有记录,在最高处终止。记录在页面内没有实际排序(插入时会占用任何可用空间);它们的唯一顺序来自它们在链接列表中的位置。

Leaf pages contain the non-key values as part of the “data” contained in each record:
"leaf page"用非键值(也就是除键外的其他字段),作为每个record中包含的"数据"的一部分:
InnoDB 索引页物理结构和逻辑结构
Non-leaf pages have an identical structure, but instead of non-key fields, their “data” is the page number of the child page, and instead of an exact key, they represent the minimum key on the child page they point to:
非叶子页面具有相同的结构,但是它们的"数据"是子页面的页码。他的key是"child page"中最小key:
InnoDB 索引页物理结构和逻辑结构
InnoDB 索引页物理结构和逻辑结构

2.Pages at the same level

Most indexes contain more than one page, so multiple pages are linked together in ascending and descending order:
多数索引会包含多个page,以递增和递减的顺序链接:
InnoDB 索引页物理结构和逻辑结构
Each page contains pointers (in the FIL header) for “previous page” and “next page”, which for INDEX pages are used to form a doubly-linked list of pages at the same level (e.g. leaf pages, at level 0 form one list, level 1 pages form a separate list, etc.).
每个页面都包含“previous page”和“next page”的指针(在FIL标头中),对于INDEX页面,它们用于形成同一级别的page的双向链接列表(例如,叶子页面,级别为0的页面) 列表,第1级页面构成一个单独的列表等)。

3.A detailed look at a single-page table

InnoDB 索引页物理结构和逻辑结构

4.Innodb为主键创建索引,是强制的

In InnoDB, data is stored in the form of an index tree based on the primary index, which is mandatory for every table. The data and the primary index are thus closely intertwined and directly affect each other; secondary indices are quite different because they solely exist for the purpose of speeding up specific searches. When creating a table, InnoDB generates an index for the primary key (an auto-incremented id-column is generated if no column is specifically selected). The actual data records are then stored directly inside the B±tree structure of the index. An additional index tree is generated for each secondary key, this index tree holds pointers to the respective pages in the primary key. InnoDB uses a B+tree to locate pages.
在InnoDB中,数据以基于主索引的索引树的形式存储,这对于每个表都是必需的。因此,数据和主索引紧密地交织在一起,并且直接相互影响;二级索引有很大的不同,因为它们仅出于加速特定搜索的目的而存在。创建表时,InnoDB会为主键生成索引(如果未特别选择任何列,则会生成一个自动递增的id列)。然后,实际的数据记录将直接存储在索引的B +树结构内部。为每个辅助键生成一个附加的索引树,该索引树保存指向主键中各个页面的指针。
(若表没有主键,也会创建索引,但不考虑这种情况,实际应用中,每个表都有主键~)

为什么是会创建递增的id
自增主键是连续的,在插入过程中尽量减少页分裂,即使要进行页分裂,也只会分裂很少一部分。并且能减少数据的移动,每次插入都是插入到最后。总之就是减少分裂和移动的频率。

  • 连续
    InnoDB 索引页物理结构和逻辑结构
  • 非连续
    InnoDB 索引页物理结构和逻辑结构
    参考:
    https://link.springer.com/content/pdf/10.1007%2F978-3-319-24123-4_11.pdf
    https://blog.jcole.us/2013/01/07/the-physical-structure-of-innodb-index-pages/