Linux Kernel Source Study (1): From Real Mode to Protected Mode


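While looking for material I came across a set of course slides on the Linux kernel boot process, which I attach here. (These notes draw on many sources; my thanks to the original authors.)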


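Since I already have a basic understanding of operating systems, I start the analysis of the Linux kernel source at the entry point of system startup.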

1. linux-2.6.34.13/arch/x86/boot/setup.ld

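In the Linux tree, the x86-specific code lives under linux/arch/x86/. The linker script boot/setup.ld specifies the entry point of the setup code; its contents are as follows.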

/*

 * setup.ld

 *

 * Linker script for the i386 setup code

 */

OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")

OUTPUT_ARCH(i386)

ENTRY(_start)

ENTRY(_start) in the script sets the entry point to _start. The script also specifies the section layout of the output file.

SECTIONS

{

. = 0;

.bstext : { *(.bstext) } /* boot sector code section */

.bsdata : { *(.bsdata) } /* data section used by the boot sector code */

The next statement tells the linker to place .header at offset 0x1f1 (497 decimal). The boot protocol fixes this offset, so it must not change.

. = 497;

.header : { *(.header) }

.entrytext : { *(.entrytext) }

.inittext : { *(.inittext) }

.initdata : { *(.initdata) }

__end_init = .;

.text : { *(.text) }

.text32 : { *(.text32) }

. = ALIGN(16);

.rodata : { *(.rodata*) }

.videocards : {

video_cards = .;

*(.videocards)

video_cards_end = .;

}

. = ALIGN(16);

.data : { *(.data*) }

.signature : {

setup_sig = .;

LONG(0x5a5aaa55)

}

A signature is placed here; the startup code checks it to verify that the setup image is intact.

. = ALIGN(16);

.bss :

{

__bss_start = .;

*(.bss)

__bss_end = .;

}

. = ALIGN(16);

_end = .;

/DISCARD/ : { *(.note*) }

/* The ASSERT() sink to . is intentional, for binutils 2.14 compatibility: */

. = ASSERT(_end <= 0x8000, "Setup too big!");

. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");

/* Necessary for the very-old-loader check to work... */

. = ASSERT(__end_init <= 5*512, "init sections too big!");

}

2. linux-2.6.34.13/arch/x86/boot/header.S

After the bootloader has run, memory is laid out as shown in the figure below.

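I originally assumed the kernel starts running from its boot sector, but the boot sector code shows otherwise: a Linux kernel image cannot be booted directly, and its boot sector only prints an error message and prompts for a reboot. Below is the kernel image's "boot sector". (It is not the system's real boot sector, just part of the kernel image; it prints the error when someone tries to boot the image directly.)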

.code16

.section ".bstext", "ax"

.global bootsect_start

bootsect_start:

# Normalize the start address

ljmp $BOOTSEG, $start2 # in effect a jump to start2; the syntax is ljmp segment, offset

The far jump here loads the code segment register (%cs). The next few instructions set up the data, extra, and stack segment registers.

BOOTSEG = 0x07C0 /* original address of boot-sector */

BOOTSEG is defined at the top of the file.

start2:

movw %cs, %ax

movw %ax, %ds

movw %ax, %es

movw %ax, %ss

xorw %sp, %sp

sti

cld

movw $bugger_off_msg, %si

bugger_off_msg is defined further down.

msg_loop:

lodsb

andb %al, %al

jz bs_die

movb $0xe, %ah

movw $7, %bx

int $0x10 # print the error message, one character per call

jmp msg_loop

bs_die:

# Allow the user to press a key, then reboot

xorw %ax, %ax

int $0x16 # with AX = 0 this waits for the user to press a key

int $0x19 # this interrupt reboots the machine

# int 0x19 should never return. In case it does anyway,

# invoke the BIOS reset code...

ljmp $0xf000,$0xfff0

.section ".bsdata", "a"

bugger_off_msg:

.ascii "Direct booting from floppy is no longer supported.\r\n"

.ascii "Please use a boot loader program instead.\r\n"

.ascii "\n"

.ascii "Remove disk and press any key to reboot . . .\r\n"

.byte 0

hdr sits at offset 497 (0x1F1); this offset is fixed.

# Kernel attributes; used by setup. This is part 1 of the

# header, from the old boot sector.

.section ".header", "a"

.globl hdr

hdr:

setup_sects: .byte 0 /* Filled in by build.c */

root_flags: .word ROOT_RDONLY

syssize: .long 0 /* Filled in by build.c */

ram_size: .word 0 /* Obsolete */

vid_mode: .word SVGA_MODE

root_dev: .word 0 /* Filled in by build.c */

boot_flag: .word 0xAA55

# offset 512, entry point

ROOT_RDONLY and SVGA_MODE are defined at the top of the file, as shown below.

#define ROOT_RDONLY 1

#define ASK_VGA 0xfffd

#define SVGA_MODE ASK_VGA
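The comment "# offset 512, entry point" can be checked with a little arithmetic: the header starts at 0x1f1 and its first part is 15 bytes long. The sketch below is only an illustration; struct hdr_part1 is a hypothetical mirror of the fields listed above (the kernel's own C definition is struct setup_header), and the packed attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of part 1 of the setup header, for illustration only. */
struct hdr_part1 {
    uint8_t  setup_sects;
    uint16_t root_flags;
    uint32_t syssize;
    uint16_t ram_size;
    uint16_t vid_mode;
    uint16_t root_dev;
    uint16_t boot_flag;
} __attribute__((packed));

enum { HDR_OFFSET = 0x1f1 };  /* 497, fixed by the linker script */

/* Offset of the first byte after boot_flag, which is where _start lands. */
static inline size_t entry_offset(void)
{
    return HDR_OFFSET + sizeof(struct hdr_part1);
}

/* Offset of the 0xAA55 boot signature inside the first sector. */
static inline size_t boot_flag_offset(void)
{
    return HDR_OFFSET + offsetof(struct hdr_part1, boot_flag);
}
```

With these definitions entry_offset() is 512 and boot_flag_offset() is 510, so boot_flag occupies the last two bytes of the first 512-byte sector.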

This is the entry point, sandwiched between data. It is simply a two-byte instruction that jumps to start_of_setup.

.globl _start

_start:

# Explicitly enter this as bytes, or the assembler

# tries to generate a 3-byte jump here, which causes

# everything else to push off to the wrong offset.

.byte 0xeb # short (2-byte) jump

.byte start_of_setup-1f

1:

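The jump is not written as a jmp statement; its bytes are emitted by hand. This puzzled me at first, but the comment in the source gives the reason: the assembler would otherwise try to generate a 3-byte jump, which would push everything after it to the wrong offset. Emitting the bytes directly guarantees a two-byte instruction.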

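The GNU assembler allows numbers as label names, but only for local labels. So start_of_setup-1f is the distance from local label 1 to start_of_setup. The f suffix makes the assembler search forward (toward later code) for the nearest label 1; a b suffix would search backward.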
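To make the hand-built jump concrete, here is a minimal C sketch of how a short jump is encoded: opcode 0xeb followed by a signed 8-bit displacement measured from the end of the two-byte instruction. The function name encode_short_jmp and the addresses in the usage note are made up for illustration.

```c
#include <stdint.h>

/* Encode "jmp short" located at address `from` and targeting `to`.
 * The displacement is relative to the byte after the instruction (from + 2).
 * Returns 0 on success, -1 if the target does not fit in a signed byte. */
static inline int encode_short_jmp(uint32_t from, uint32_t to, uint8_t out[2])
{
    int64_t rel = (int64_t)to - ((int64_t)from + 2);

    if (rel < -128 || rel > 127)
        return -1;
    out[0] = 0xeb;                  /* short jump opcode */
    out[1] = (uint8_t)(int8_t)rel;  /* rel8 displacement */
    return 0;
}
```

For example, a jump at offset 0x200 to a hypothetical target at 0x268 encodes as the bytes 0xeb 0x66. A target more than 127 bytes ahead cannot be reached with this form.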

Here is the second part of the header.

# Part 2 of the header, from the old setup.S

.ascii "HdrS" # header signature

.word 0x020a # header version number (>= 0x0105)

# or else old loadlin-1.5 will fail)

realmode_swtch holds the address of a hook function; 0 here means no hook is used.

.globl realmode_swtch

realmode_swtch: .word 0, 0 # default_switch, SETUPSEG

start_sys_seg: .word SYSSEG # obsolete and meaningless, but just

# in case something decided to "use" it

SYSSEG is defined at the top of header.S.

SYSSEG = 0x1000 /* historical load address >> 4 */

kernel_version is defined in linux/arch/x86/boot/version.c.

const char kernel_version[] =

UTS_RELEASE " (" LINUX_COMPILE_BY "@" LINUX_COMPILE_HOST ") "

UTS_VERSION;

The header stores the address of kernel_version here, expressed as its offset minus 512.

.word kernel_version-512 # pointing to kernel version string

# above section of header is compatible

# with loadlin-1.5 (header v1.5). Don't

# change it.

type_of_loader: .byte 0 # 0 means ancient bootloader, newer

# bootloaders know to change this.

# See Documentation/i386/boot.txt for

# assigned ids

# flags, unused bits must be zero (RFU) bit within loadflags

loadflags:

LOADED_HIGH = 1 # If set, the kernel is loaded high

CAN_USE_HEAP = 0x80 # If set, the loader also has set

# heap_end_ptr to tell how much

# space behind setup.S can be used for

# heap purposes.

# Only the loader knows what is free

.byte LOADED_HIGH

setup_move_size: .word 0x8000 # size to move, when setup is not

# loaded at 0x90000. We will move setup

# to 0x90000 then just before jumping

# into the kernel. However, only the

# loader knows how much data behind

# us also needs to be loaded.

code32_start is crucial: it is the protected-mode entry point, and a bootloader can hook in by changing this field.

code32_start: # here loaders can put a different

# start address for 32-bit code.

.long 0x100000 # 0x100000 = default for big kernel

ramdisk_image: .long 0 # address of loaded ramdisk image

# Here the loader puts the 32-bit

# address where it loaded the image.

# This only will be read by the kernel.

ramdisk_size: .long 0 # its size in bytes

bootsect_kludge:

.long 0 # obsolete

STACK_SIZE is defined in linux/arch/x86/boot/boot.h.

#define STACK_SIZE 512 /* Minimum number of bytes for stack */

heap_end_ptr: .word _end+STACK_SIZE-512

# (Header version 0x0201 or later)

# space from here (exclusive) down to

# end of setup code can be used by setup

# for local heap purposes.

ext_loader_ver:

.byte 0 # Extended boot loader version

ext_loader_type:

.byte 0 # Extended boot loader type

cmd_line_ptr: .long 0 # (Header version 0x0202 or later)

# If nonzero, a 32-bit pointer

# to the kernel command line.

# The command line should be

# located between the start of

# setup and the end of low

# memory (0xa0000), or it may

# get overwritten before it

# gets read. If this field is

# used, there is no longer

# anything magical about the

# 0x90000 segment; the setup

# can be located anywhere in

# low memory 0x10000 or higher.

ramdisk_max: .long 0x7fffffff

# (Header version 0x0203 or later)

# The highest safe address for

# the contents of an initrd

# The current kernel allows up to 4 GB,

# but leave it at 2 GB to avoid

# possible bootloader bugs.

kernel_alignment: .long CONFIG_PHYSICAL_ALIGN #physical addr alignment

#required for protected mode

#kernel

CONFIG_PHYSICAL_ALIGN is set in linux/arch/x86/configs/i386_defconfig.

CONFIG_PHYSICAL_ALIGN=0x1000000

CONFIG_RELOCATABLE is set in linux/arch/x86/configs/i386_defconfig.

CONFIG_RELOCATABLE=y

#ifdef CONFIG_RELOCATABLE

relocatable_kernel: .byte 1

#else

relocatable_kernel: .byte 0

#endif

min_alignment: .byte MIN_KERNEL_ALIGN_LG2 # minimum alignment

MIN_KERNEL_ALIGN_LG2 is defined in linux/arch/x86/include/asm/boot.h.

/* Minimum kernel alignment, as a power of two */

#ifdef CONFIG_X86_64

#define MIN_KERNEL_ALIGN_LG2 PMD_SHIFT

#else

#define MIN_KERNEL_ALIGN_LG2 (PAGE_SHIFT + THREAD_ORDER)

#endif

Here PAGE_SHIFT is defined in linux/arch/x86/include/asm/page_types.h.

#define PAGE_SHIFT 12

THREAD_ORDER is defined in page_32_types.h and page_64_types.h under linux/arch/x86/include/asm/. The 32-bit definition is shown below.

#ifdef CONFIG_4KSTACKS

#define THREAD_ORDER 0

#else

#define THREAD_ORDER 1

#endif

On 64-bit systems THREAD_ORDER is defined as 1.

/*

 * PMD_SHIFT determines the size of the area a middle-level

 * page table can map

 */

#define PMD_SHIFT 21

On 64-bit systems PMD_SHIFT is defined as 21.

So on a 32-bit system MIN_KERNEL_ALIGN_LG2 is 12 when 4 KB stacks are configured and 13 otherwise; on a 64-bit system it is PMD_SHIFT, i.e. 21.
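Since MIN_KERNEL_ALIGN_LG2 is a base-2 logarithm, the alignment in bytes is 1 shifted left by that value. The following sketch restates the preprocessor logic as plain C; the function names and the two boolean parameters are illustrative stand-ins for CONFIG_X86_64 and CONFIG_4KSTACKS, not kernel code.

```c
#include <stdint.h>

/* Illustrative stand-in for MIN_KERNEL_ALIGN_LG2, using the values quoted
 * above: PAGE_SHIFT = 12, PMD_SHIFT = 21, THREAD_ORDER = 0 or 1. */
static inline unsigned min_align_lg2(int is_64bit, int four_k_stacks)
{
    const unsigned page_shift = 12;
    const unsigned pmd_shift = 21;

    if (is_64bit)
        return pmd_shift;
    return page_shift + (four_k_stacks ? 0u : 1u);
}

/* Minimum kernel alignment in bytes. */
static inline uint32_t min_align_bytes(int is_64bit, int four_k_stacks)
{
    return UINT32_C(1) << min_align_lg2(is_64bit, four_k_stacks);
}
```

This gives 4 KB for a 32-bit kernel with 4 KB stacks, 8 KB for a 32-bit kernel otherwise, and 2 MB for a 64-bit kernel.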

pad3: .word 0

cmdline_size: .long COMMAND_LINE_SIZE-1 #length of the command line,

#added with boot protocol

#version 2.06

hardware_subarch: .long 0 # subarchitecture, added with 2.07

# default to 0 for normal x86 PC

hardware_subarch_data: .quad 0

payload_offset: .long ZO_input_data

payload_length: .long ZO_z_input_len

setup_data: .quad 0 # 64-bit physical pointer to

# single linked list of

# struct setup_data

pref_address: .quad LOAD_PHYSICAL_ADDR # preferred load addr

#define ZO_INIT_SIZE (ZO__end - ZO_startup_32 + ZO_z_extract_offset)

#define VO_INIT_SIZE (VO__end - VO__text)

#if ZO_INIT_SIZE > VO_INIT_SIZE

#define INIT_SIZE ZO_INIT_SIZE

#else

#define INIT_SIZE VO_INIT_SIZE

#endif

init_size: .long INIT_SIZE # kernel initialization size

Now let's look at start_of_setup.

.section ".entrytext", "ax"

start_of_setup:

#ifdef SAFE_RESET_DISK_CONTROLLER

# Reset the disk controller.

movw $0x0000, %ax # Reset disk controller

movb $0x80, %dl # All disks

int $0x13

#endif

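If SAFE_RESET_DISK_CONTROLLER is defined, the first thing the code does is reset the controller for all disks.

Next, start_of_setup makes the extra segment (%es) equal to the data segment (%ds).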

# Force %es = %ds

movw %ds, %ax

movw %ax, %es

cld

# Apparently some ancient versions of LILO invoked the kernel with %ss != %ds,

# which happened to work by accident for the old code. Recalculate the stack

# pointer if %ss is invalid. Otherwise leave it alone, LOADLIN sets up the

# stack behind its own code, so we can't blindly put it directly past the heap.

movw %ss, %dx

cmpw %ax, %dx # %ds == %ss?

movw %sp, %dx

je 2f # -> assume %sp is reasonably set

# Invalid %ss, make up a new stack

movw $_end, %dx

testb $CAN_USE_HEAP, loadflags

jz 1f

movw heap_end_ptr, %dx

1: addw $STACK_SIZE, %dx

jnc 2f

xorw %dx, %dx # Prevent wraparound

2: # Now %dx should point to the end of our stack space

andw $~3, %dx # dword align (might as well...)

jnz 3f

movw $0xfffc, %dx # Make sure we're not zero

3: movw %ax, %ss

movzwl %dx, %esp # Clear upper half of %esp

sti # Now we should have a working stack

The code above sets up the stack. Once a stack exists, C code can run.
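The stack selection logic above is compact but branchy, so here is a C restatement of what ends up in %dx. It is a sketch of my reading of the assembly, with hypothetical parameter names; the carry check and the zero check correspond to the jnc and jnz instructions.

```c
#include <stdint.h>

enum { SETUP_STACK_SIZE = 512 };  /* STACK_SIZE from boot.h */

/* Returns the 16-bit stack pointer the setup code would choose. */
static inline uint16_t pick_sp(int ss_equals_ds, uint16_t sp, uint16_t end,
                               int can_use_heap, uint16_t heap_end_ptr)
{
    uint16_t dx = sp;  /* %ss == %ds: assume %sp is already reasonable */

    if (!ss_equals_ds) {
        /* Invalid %ss: build a new stack past _end or past the heap. */
        uint32_t sum = (uint32_t)(can_use_heap ? heap_end_ptr : end)
                       + SETUP_STACK_SIZE;
        dx = (sum > 0xffff) ? 0 : (uint16_t)sum;  /* prevent wraparound */
    }
    dx &= 0xfffc;      /* dword align */
    if (dx == 0)
        dx = 0xfffc;   /* make sure we're not zero */
    return dx;
}
```

For instance, with a valid %ss and %sp = 0xfff7 the result is 0xfff4, and a heap ending at 0xfe00 would overflow 16 bits, so the code falls back to 0xfffc.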

# We will have entered with %cs = %ds+0x20, normalize %cs so

# it is on par with the other segments.

pushw %ds

pushw $6f

lretw

6:

# Check signature at end of setup

cmpl $0x5a5aaa55, setup_sig

jne setup_bad

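The pushw/lretw pair above loads the code segment register. The cmpl that follows checks the signature at the end of setup; if it is not 0x5a5aaa55, the setup image is corrupt.

Next the bss section, which holds uninitialized data, is zeroed.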

# Zero the bss

movw $__bss_start, %di

movw $_end+3, %cx

xorl %eax, %eax

subw %di, %cx

shrw $2, %cx

rep; stosl

Each stosl clears four bytes, so the byte count in %cx is shifted right by two. The 3 added to _end rounds the dword count up.
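The rounding trick is the usual "add divisor minus one, then divide". A small C sketch of the count that ends up in %cx, with 16-bit arithmetic as in the assembly (the function name and the addresses in the usage note are illustrative):

```c
#include <stdint.h>

/* Number of 4-byte stores needed to clear [bss_start, end). */
static inline uint16_t bss_dwords(uint16_t bss_start, uint16_t end)
{
    uint16_t cx = (uint16_t)(end + 3);  /* movw $_end+3, %cx */

    cx = (uint16_t)(cx - bss_start);    /* subw %di, %cx */
    return (uint16_t)(cx >> 2);         /* shrw $2, %cx */
}
```

A 5-byte region starting at 0x3000 therefore needs 2 stores, a 4-byte region needs 1, and an empty region needs 0.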

# Jump to C code (should not return)

calll main

This calls main in the C code; main does not return.

# Setup corrupt somehow...

setup_bad:

movl $setup_corrupt, %eax

calll puts

# Fall through...

.globl die

.type die, @function

die:

hlt

jmp die

.size die, .-die

.section ".initdata", "a"

setup_corrupt:

.byte 7

.string "No setup signature found...\n"

The last part of the setup code handles errors.

3. linux-2.6.34.13/arch/x86/boot/main.c

main.c does the work that has to happen in real mode and finally switches to protected mode.

void main(void)

{

/* First, copy the boot header into the "zeropage" */

copy_boot_params();

/* End of heap check */

init_heap();

/* Make sure we have all the proper CPU support */

if (validate_cpu()) {

puts("Unable to boot - please use a kernel appropriate "

"for your CPU.\n");

die();

}

/* Tell the BIOS what CPU mode we intend to run in. */

set_bios_mode();

/* Detect memory layout */

detect_memory();

/* Set keyboard repeat rate (why?) */

keyboard_set_repeat();

/* Query MCA information */

query_mca();

/* Query Intel SpeedStep (IST) information */

query_ist();

/* Query APM information */

#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)

query_apm_bios();

#endif

/* Query EDD information */

#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)

query_edd();

#endif

/* Set the video mode */

set_video();

/* Parse command line for 'quiet' and pass it to decompressor. */

if (cmdline_find_option_bool("quiet"))

boot_params.hdr.loadflags |= QUIET_FLAG;

/* Do the last things and invoke protected mode */

go_to_protected_mode();

}

The first thing main does is initialize boot_params, in copy_boot_params(); the code is as follows.

static void copy_boot_params(void)

{

struct old_cmdline {

u16 cl_magic;

u16 cl_offset;

};

const struct old_cmdline * const oldcmd =

(const struct old_cmdline *)OLD_CL_ADDRESS;

BUILD_BUG_ON(sizeof boot_params != 4096);

memcpy(&boot_params.hdr, &hdr, sizeof hdr);

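This copies hdr, the data defined in header.S, into boot_params.hdr. Note that boot_params is an uninitialized global variable, so it lives in the bss section. The kernel comment calls it the "zeropage": "zero page" is the traditional name for this 4096-byte block of boot parameters, which is handed on to the later boot stages. That alone shows how important boot_params is.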

if (!boot_params.hdr.cmd_line_ptr &&

oldcmd->cl_magic == OLD_CL_MAGIC) {

/* Old-style command line protocol. */

u16 cmdline_seg;

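If the bootloader left cmd_line_ptr at zero but the old-style command-line magic is present, cmd_line_ptr is pointed at the old-style command line.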

/* Figure out if the command line falls in the region

of memory that an old kernel would have copied up

to 0x90000... */

if (oldcmd->cl_offset < boot_params.hdr.setup_move_size)

cmdline_seg = ds();

else

cmdline_seg = 0x9000;

boot_params.hdr.cmd_line_ptr =

(cmdline_seg << 4) + oldcmd->cl_offset;

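}

}

boot_params is an uninitialized global, so the compiler places it in bss. Since bss was zeroed before main was entered, boot_params is all zeros until copy_boot_params() runs. The code above initializes it.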
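The old-style command line branch boils down to picking a segment and converting segment:offset to a linear address. A minimal sketch, with the inputs passed as plain parameters (hypothetical names) instead of being read from boot_params and %ds:

```c
#include <stdint.h>

/* Linear address of an old-style command line. Offsets inside the region
 * an old kernel would have moved up stay relative to the current data
 * segment; anything else is assumed to sit in the 0x9000 segment. */
static inline uint32_t old_cmdline_ptr(uint16_t cl_offset,
                                       uint16_t setup_move_size,
                                       uint16_t ds)
{
    uint16_t seg = (cl_offset < setup_move_size) ? ds : 0x9000;

    return ((uint32_t)seg << 4) + cl_offset;
}
```

With ds = 0x1000 and setup_move_size = 0x8000, an offset of 0x0100 yields 0x10100, while an offset of 0x9000 yields 0x99000.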

The next step is to initialize the heap by calling init_heap(); the code is as follows.

static void init_heap(void)

{

char *stack_end;

/* If the bootloader wants the kernel to use the heap, it must set the CAN_USE_HEAP bit in hdr.loadflags. */

if (boot_params.hdr.loadflags & CAN_USE_HEAP) {

/* %esp marks the current stack position and the stack needs STACK_SIZE bytes below it, so stack_end = esp - STACK_SIZE */

asm("leal %P1(%%esp),%0"

: "=r" (stack_end) : "i" (-STACK_SIZE));

/* The bootloader fills in boot_params.hdr.heap_end_ptr. The boot protocol defines it as the heap end minus 0x200, so heap_end = heap_end_ptr + 0x200 */

heap_end = (char *)

((size_t)boot_params.hdr.heap_end_ptr + 0x200);

/* If the heap would overlap the stack, shrink the heap */

if (heap_end > stack_end)

heap_end = stack_end;

} else {

/* Boot protocol 2.00 only, no heap available */

puts("WARNING: Ancient bootloader, some functionality "

"may be limited!\n");

}

}
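The heap limit computed by init_heap() can be restated without inline assembly. This is a sketch under the assumption that the stack pointer is passed in as an ordinary integer; compute_heap_end is a hypothetical name, and INIT_STACK_SIZE stands in for STACK_SIZE.

```c
#include <stdint.h>

enum { INIT_STACK_SIZE = 512 };  /* STACK_SIZE from boot.h */

/* End of the usable heap: heap_end_ptr + 0x200, clamped so that the heap
 * never reaches into the STACK_SIZE bytes reserved below esp. */
static inline uint32_t compute_heap_end(uint16_t heap_end_ptr, uint32_t esp)
{
    uint32_t stack_end = esp - INIT_STACK_SIZE;
    uint32_t heap_end = (uint32_t)heap_end_ptr + 0x200;

    return (heap_end > stack_end) ? stack_end : heap_end;
}
```

With esp = 0xfff0, a heap_end_ptr of 0x7e00 gives a heap end of 0x8000, whereas a heap_end_ptr of 0xfe00 is clamped to 0xfdf0.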

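After the heap is set up, the CPU is checked. If the kernel requires a higher CPU level than the one detected, booting stops. validate_cpu() is shown below; the detection itself is done by check_cpu() in cpucheck.c.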

int validate_cpu(void)

{

u32 *err_flags;

int cpu_level, req_level;

const unsigned char *msg_strs;

check_cpu(&cpu_level, &req_level, &err_flags);

if (cpu_level < req_level) {

printf("This kernel requires an %s CPU, ",

cpu_name(req_level));

printf("but only detected an %s CPU.\n",

cpu_name(cpu_level));

return -1;

}

if (err_flags) {

int i, j;

puts("This kernel requires the following features "

"not present on the CPU:\n");

msg_strs = (const unsigned char *)x86_cap_strs;

for (i = 0; i < NCAPINTS; i++) {

u32 e = err_flags[i];

for (j = 0; j < 32; j++) {

if (msg_strs[0] < i ||

(msg_strs[0] == i && msg_strs[1] < j)) {

/* Skip to the next string */

msg_strs += 2;

while (*msg_strs++)

;

}

if (e & 1) {

if (msg_strs[0] == i &&

msg_strs[1] == j &&

msg_strs[2])

printf("%s ", msg_strs+2);

else

printf("%d:%d ", i, j);

}

e >>= 1;

}

}

putchar('\n');

return -1;

} else {

return 0;

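}

}

Next comes the BIOS mode setting, which tells the BIOS which CPU mode we intend to run in. As the code shows, a 32-bit kernel does nothing here, while a 64-bit kernel informs the BIOS through an int 0x15 call.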

static void set_bios_mode(void)

{

#ifdef CONFIG_X86_64

struct biosregs ireg;

initregs(&ireg);

ireg.ax = 0xec00;

ireg.bx = 2;

intcall(0x15, &ireg, NULL);

#endif

}

Then memory is detected. detect_memory() is very simple: it tries detect_memory_e820(), detect_memory_e801(), and detect_memory_88() in turn to obtain the physical memory layout.

int detect_memory(void)

{

int err = -1;

if (detect_memory_e820() > 0)

err = 0;

if (!detect_memory_e801())

err = 0;

if (!detect_memory_88())

err = 0;

return err;

}

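All three functions call a BIOS interrupt through intcall() to obtain memory information. The interrupt is int 0x15, called with AX = 0xe820, AX = 0xe801, and AH = 0x88 respectively. e820 is used as the example here.

For historical reasons some I/O devices also occupy part of the physical address space, so the physical memory available to the system is not contiguous: it is split into many ranges, each with its own attributes. Each int 0x15 query returns information about one range, so the full map has to be collected iteratively. detect_memory_e820() puts the int 0x15 call in a do-while loop and stores each range in a struct e820entry, whose layout matches the format e820 returns. Like the other results gathered at boot, the map ends up in boot_params, in boot_params.e820_map.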

static int detect_memory_e820(void)

{

int count = 0;

struct biosregs ireg, oreg;

struct e820entry *desc = boot_params.e820_map;

static struct e820entry buf; /* static so it is zeroed */

initregs(&ireg);

ireg.ax = 0xe820;

ireg.cx = sizeof buf;

ireg.edx = SMAP;

ireg.di = (size_t)&buf;

/*

 * Note: at least one BIOS is known which assumes that the

 * buffer pointed to by one e820 call is the same one as

 * the previous call, and only changes modified fields. Therefore,

 * we use a temporary buffer and copy the results entry by entry.

 *

 * This routine deliberately does not try to account for

 * ACPI 3+ extended attributes. This is because there are

 * BIOSes in the field which report zero for the valid bit for

 * all ranges, and we don't currently make any use of the

 * other attribute bits. Revisit this if we see the extended

 * attribute bits deployed in a meaningful way in the future.

 */

do {

intcall(0x15, &ireg, &oreg);

ireg.ebx = oreg.ebx; /* for next iteration... */

/* BIOSes which terminate the chain with CF = 1 as opposed

to %ebx = 0 don't always report the SMAP signature on

the final, failing, probe. */

if (oreg.eflags & X86_EFLAGS_CF)

break;

/* Some BIOSes stop returning SMAP in the middle of

the search loop. We don't know exactly how the BIOS

screwed up the map at that point, we might have a

partial map, the full map, or complete garbage, so

just return failure. */

if (oreg.eax != SMAP) {

count = 0;

break;

}

*desc++ = buf;

count++;

} while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_map));

return boot_params.e820_entries = count;

}
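The continuation-value loop in detect_memory_e820() can be tried out on a normal machine by replacing the BIOS with a mock. Everything below is an assumption made for illustration: the table contents are invented, fake_e820_call stands in for intcall(0x15, ...), and a real BIOS treats the continuation value in %ebx as opaque instead of as a simple index.

```c
#include <stddef.h>
#include <stdint.h>

struct e820entry {
    uint64_t addr;  /* start of the range */
    uint64_t size;  /* length in bytes */
    uint32_t type;  /* 1 = usable RAM, 2 = reserved */
};

/* Invented firmware table standing in for the BIOS. */
static const struct e820entry fake_bios_map[] = {
    { 0x00000000, 0x0009fc00, 1 },
    { 0x0009fc00, 0x00000400, 2 },
    { 0x00100000, 0x07ef0000, 1 },
};
#define FAKE_BIOS_ENTRIES (sizeof fake_bios_map / sizeof fake_bios_map[0])

/* Mock of one e820 call: copies the entry selected by *cont into buf and
 * stores the next continuation value in *cont (0 means "last entry").
 * Returns -1, like CF = 1, for an invalid continuation value. */
static int fake_e820_call(uint32_t *cont, struct e820entry *buf)
{
    if (*cont >= FAKE_BIOS_ENTRIES)
        return -1;
    *buf = fake_bios_map[*cont];
    *cont = (*cont + 1 < FAKE_BIOS_ENTRIES) ? *cont + 1 : 0;
    return 0;
}

/* Same loop shape as the kernel code: copy entry by entry through a
 * temporary buffer until the continuation value is 0 or the map is full. */
static int collect_e820(struct e820entry *map, size_t max)
{
    uint32_t cont = 0;
    size_t count = 0;
    struct e820entry buf;

    if (max == 0)
        return 0;
    do {
        if (fake_e820_call(&cont, &buf) != 0)
            break;
        map[count++] = buf;
    } while (cont != 0 && count < max);
    return (int)count;
}
```

Calling collect_e820() with room for eight entries returns all three ranges of the mock table; with room for two it stops after two, which mirrors the ARRAY_SIZE bound in the kernel loop.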

detect_memory_e801() also queries memory, through the older e801 interface, which reports sizes and not a full map.

static int detect_memory_e801(void)

{

struct biosregs ireg, oreg;

initregs(&ireg);

ireg.ax = 0xe801;

intcall(0x15, &ireg, &oreg);

if (oreg.eflags & X86_EFLAGS_CF)

return -1;

/* Do we really need to do this? */

if (oreg.cx || oreg.dx) {

oreg.ax = oreg.cx;

oreg.bx = oreg.dx;

}

if (oreg.ax > 15*1024) {

return -1; /* Bogus! */

} else if (oreg.ax == 15*1024) {

boot_params.alt_mem_k = (oreg.dx << 6) + oreg.ax;

} else {

/*

 * This ignores memory above 16MB if we have a memory

 * hole there. If someone actually finds a machine

 * with a memory hole at 16MB and no support for

 * 0E820h they should probably generate a fake e820

 * map.

 */

boot_params.alt_mem_k = oreg.ax;

}

return 0;

}

detect_memory_88() likewise queries memory; it records the amount of extended memory in screen_info.ext_mem_k.

static int detect_memory_88(void)

{

struct biosregs ireg, oreg;

initregs(&ireg);

ireg.ah = 0x88;

intcall(0x15, &ireg, &oreg);

boot_params.screen_info.ext_mem_k = oreg.ax;

return -(oreg.eflags & X86_EFLAGS_CF); /* 0 or -1 */

}

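Next the keyboard repeat rate is set, although this step seems dispensable. The comment on keyboard_set_repeat reads: "Set the keyboard repeat rate to maximum. Unclear why this is done here; this might be possible to kill off as stale code." So even the source is unsure why it is done.

The query_mca() call that follows uses int 0x15 with AH = 0xc0 to fetch the MCA (Micro Channel Architecture) system description table; see the documentation of that interrupt for details.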

int query_mca(void)

{

struct biosregs ireg, oreg;

u16 len;

initregs(&ireg);

ireg.ah = 0xc0;

intcall(0x15, &ireg, &oreg);

if (oreg.eflags & X86_EFLAGS_CF)

return -1; /* No MCA present */

set_fs(oreg.es);

len = rdfs16(oreg.bx);

if (len > sizeof(boot_params.sys_desc_table))

len = sizeof(boot_params.sys_desc_table);

copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);

return 0;

}

After that, query_ist() uses int 0x15 with AX = 0xe980 to fetch Intel SpeedStep (IST) information.

static void query_ist(void)

{

struct biosregs ireg, oreg;

/* Some older BIOSes apparently crash on this call, so filter

it from machines too old to have SpeedStep at all. */

if (cpu.level < 6)

return;

initregs(&ireg);

ireg.ax = 0xe980; /* IST Support */

ireg.edx = 0x47534943; /* Request value */

intcall(0x15, &ireg, &oreg);

boot_params.ist_info.signature = oreg.eax;

boot_params.ist_info.command = oreg.ebx;

boot_params.ist_info.event = oreg.ecx;

boot_params.ist_info.perf_level = oreg.edx;

}

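Depending on the configuration, APM and EDD information is also queried, in much the same way as IST.

Finally, before entering protected mode, the video mode is set. set_video() is defined in video.c.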

void set_video(void)

{

u16 mode = boot_params.hdr.vid_mode;

RESET_HEAP();

store_mode_params();

save_screen();

probe_cards(0);

for (;;) {

if (mode == ASK_VGA)

mode = mode_menu();

if (!set_mode(mode))

break;

printf("Undefined video mode number: %x\n", mode);

mode = ASK_VGA;

}

boot_params.hdr.vid_mode = mode;

vesa_store_edid();

store_mode_params();

if (do_restore)

restore_screen();

}

The video mode is read from hdr into the local variable mode. header.S sets vid_mode to SVGA_MODE. store_mode_params() is then called to fill in the screen_info field of boot_params.

/*

 * Store the video mode parameters for later usage by the kernel.

 * This is done by asking the BIOS except for the rows/columns

 * parameters in the default 80x25 mode -- these are set directly,

 * because some very obscure BIOSes supply insane values.

 */

static void store_mode_params(void)

{

u16 font_size;

int x, y;

/* For graphics mode, it is up to the mode-setting driver

(currently only video-vesa.c) to store the parameters */

if (graphic_mode)

return;

store_cursor_position();

store_video_mode();

if (boot_params.screen_info.orig_video_mode == 0x07) {

/* MDA, HGC, or VGA in monochrome mode */

video_segment = 0xb000;

} else {

/* CGA, EGA, VGA and so forth */

video_segment = 0xb800;

}

set_fs(0);

font_size = rdfs16(0x485); /* Font size, BIOS area */

boot_params.screen_info.orig_video_points = font_size;

x = rdfs16(0x44a);

y = (adapter == ADAPTER_CGA) ? 25 : rdfs8(0x484)+1;

if (force_x)

x = force_x;

if (force_y)

y = force_y;

boot_params.screen_info.orig_video_cols = x;

boot_params.screen_info.orig_video_lines = y;

}

store_mode_params() calls store_cursor_position() and store_video_mode() to get the cursor position and the video mode.

static void store_cursor_position(void)

{

struct biosregs ireg, oreg;

initregs(&ireg);

ireg.ah = 0x03;

intcall(0x10, &ireg, &oreg);

boot_params.screen_info.orig_x = oreg.dl;

boot_params.screen_info.orig_y = oreg.dh;

if (oreg.ch & 0x20)

boot_params.screen_info.flags |= VIDEO_FLAGS_NOCURSOR;

if ((oreg.ch & 0x1f) > (oreg.cl & 0x1f))

boot_params.screen_info.flags |= VIDEO_FLAGS_NOCURSOR;

}

static void store_video_mode(void)

{

struct biosregs ireg, oreg;

/* N.B.: the saving of the video page here is a bit silly,

since we pretty much assume page 0 everywhere. */

initregs(&ireg);

ireg.ah = 0x0f;

intcall(0x10, &ireg, &oreg);

/* Not all BIOSes are clean with respect to the top bit */

boot_params.screen_info.orig_video_mode = oreg.al & 0x7f;

boot_params.screen_info.orig_video_page = oreg.bh;

}

Those are store_cursor_position() and store_video_mode(); both obtain their information through BIOS interrupts.

Next, save_screen() is called to save the screen contents.

/* Save screen content to the heap */

static struct saved_screen {

int x, y;

int curx, cury;

u16 *data;

} saved;

static void save_screen(void)

{

/* Should be called after store_mode_params() */

saved.x = boot_params.screen_info.orig_video_cols;

saved.y = boot_params.screen_info.orig_video_lines;

saved.curx = boot_params.screen_info.orig_x;

saved.cury = boot_params.screen_info.orig_y;

if (!heap_free(saved.x*saved.y*sizeof(u16)+512))

return; /* Not enough heap to save the screen */

saved.data = GET_HEAP(u16, saved.x*saved.y);

set_fs(video_segment);

copy_from_fs(saved.data, 0, saved.x*saved.y*sizeof(u16));

}

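The code above reads the data at video_segment and saves it to the heap.

Then the whole list of video card drivers is scanned. video_cards and video_cards_end are not passed in by the bootloader: they are the symbols that setup.ld places at the start and end of the .videocards section.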
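To get a feeling for the sizes involved in save_screen(): each text-mode cell is a 16-bit value (character plus attribute), and the function insists on 512 spare bytes beyond the copy. A sketch with hypothetical helper names, taking the free heap size as a parameter:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold a text screen of cols x rows cells. */
static inline size_t screen_bytes(int cols, int rows)
{
    return (size_t)cols * (size_t)rows * sizeof(uint16_t);
}

/* Mirrors the heap_free() test: the copy must fit with 512 bytes to spare. */
static inline int can_save_screen(size_t heap_free_bytes, int cols, int rows)
{
    return heap_free_bytes >= screen_bytes(cols, rows) + 512;
}
```

A standard 80x25 screen takes 4000 bytes, so at least 4512 bytes of free heap are required before the contents are saved.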

/* Probe the video drivers and have them generate their mode lists. */

void probe_cards(int unsafe)

{

struct card_info *card;

static u8 probed[2];

if (probed[unsafe])

return;

probed[unsafe] = 1;

for (card = video_cards; card < video_cards_end; card++) {

if (card->unsafe == unsafe) {

if (card->probe)

card->nmodes = card->probe();

else

card->nmodes = 0;

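}

}

}

If the bootloader set hdr's vid_mode to ASK_VGA, the user is prompted interactively with a mode menu. The default vid_mode in header.S is SVGA_MODE, which is ASK_VGA.

Then vesa_store_edid() is called to store the EDID data. EDID is a VESA standard data format that describes a monitor and its capabilities, including vendor information, maximum image size, color characteristics, factory preset timings, frequency range limits, and strings for the monitor name and serial number.

store_mode_params() is then run again to save the data, and finally restore_screen() restores the screen contents if do_restore is set.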

static void restore_screen(void)

{

/* Should be called after store_mode_params() */

int xs = boot_params.screen_info.orig_video_cols;

int ys = boot_params.screen_info.orig_video_lines;

int y;

addr_t dst = 0;

u16 *src = saved.data;

struct biosregs ireg;

if (graphic_mode)

return; /* Can't restore onto a graphic mode */

if (!src)

return; /* No saved screen contents */

/* Restore screen contents */

set_fs(video_segment);

for (y = 0; y < ys; y++) {

int npad;

if (y < saved.y) {

int copy = (xs < saved.x) ? xs : saved.x;

copy_to_fs(dst, src, copy*sizeof(u16));

dst += copy*sizeof(u16);

src += saved.x;

npad = (xs < saved.x) ? 0 : xs-saved.x;

} else {

npad = xs;

}

/* Writes "npad" blank characters to

video_segment:dst and advances dst */

asm volatile("pushw %%es ; "

"movw %2,%%es ; "

"shrw %%cx ; "

"jnc 1f ; "

"stosw \n\t"

"1: rep;stosl ; "

"popw %%es"

: "+D" (dst), "+c" (npad)

: "bdS" (video_segment),

"a" (0x07200720));

}

/* Restore cursor position */

if (saved.curx >= xs)

saved.curx = xs-1;

if (saved.cury >= ys)

saved.cury = ys-1;

initregs(&ireg);

ireg.ah = 0x02; /* Set cursor position */

ireg.dh = saved.cury;

ireg.dl = saved.curx;

intcall(0x10, &ireg, NULL);

store_cursor_position();

}

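That completes the video setup, and with it the preparation for entering protected mode.

4. linux-2.6.34.13/arch/x86/boot/pm.c

The code that enters protected mode is in boot/pm.c. The last thing main does is call go_to_protected_mode(), a function that never returns.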

void go_to_protected_mode(void)

{

/* Hook before leaving real mode, also disables interrupts */

realmode_switch_hook();

/* Enable the A20 gate */

if (enable_a20()) {

puts("A20 gate not responding, unable to boot...\n");

die();

}

/* Reset coprocessor (IGNNE#) */

reset_coprocessor();

/* Mask all interrupts in the PIC */

mask_all_interrupts();

/* Actual transition to protected mode... */

setup_idt();

setup_gdt();

protected_mode_jump(boot_params.hdr.code32_start,

(u32)&boot_params + (ds() << 4));

}

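Before entering protected mode, the code checks for a hook. If one is present it is called; otherwise interrupts are disabled and the NMI (non-maskable interrupt) is turned off.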

static void realmode_switch_hook(void)

{

if (boot_params.hdr.realmode_swtch) {

asm volatile("lcallw *%0"

: : "m" (boot_params.hdr.realmode_swtch)

: "eax", "ebx", "ecx", "edx");

} else {

asm volatile("cli");

outb(0x80, 0x70); /* Disable NMI */

io_delay();

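}

}

Then the A20 address line is enabled; if that fails, the code simply calls die(). So what is the A20 line? The 8086 forms addresses as SEG:OFFSET, so the highest address that can be expressed is FFFF:FFFF, i.e. 10FFEFh. But the 8086 has only a 20-bit address bus and can address just 1 MB. What happens on an access above 1 MB? No exception is raised; the address wraps around and addressing starts again from zero. The 80286 really can address memory above 1 MB, so the same access no longer wraps, which broke backward compatibility. To preserve compatibility, IBM used the 8042 keyboard controller to gate address bit 20 (counting from 0). This is the A20 line: unless it is enabled, address bit 20 is always zero.

The figure below contrasts real-mode addressing with A20 disabled and enabled.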
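The wraparound is easy to reproduce numerically. The sketch below computes a real-mode linear address and, when A20 is disabled, forces address bit 20 to zero; real_mode_linear is a made-up helper name.

```c
#include <stdint.h>

/* Linear address of seg:off in real mode. With the A20 gate closed,
 * address bit 20 is forced to zero, which reproduces the 8086 wraparound. */
static inline uint32_t real_mode_linear(uint16_t seg, uint16_t off,
                                        int a20_enabled)
{
    uint32_t linear = ((uint32_t)seg << 4) + off;

    if (!a20_enabled)
        linear &= ~(UINT32_C(1) << 20);
    return linear;
}
```

FFFF:FFFF resolves to 0x10FFEF with A20 enabled and to 0xFFEF with it disabled; FFFF:0010 is exactly 1 MB, which wraps to address 0 when A20 is off.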

static void enable_a20_bios(void)

{

struct biosregs ireg;

initregs(&ireg);

ireg.ax = 0x2401;

intcall(0x15, &ireg, NULL);

}

static void enable_a20_kbc(void)

{

empty_8042();

outb(0xd1, 0x64); /* Command write */

empty_8042();

outb(0xdf, 0x60); /* A20 on */

empty_8042();

outb(0xff, 0x64); /* Null command, but UHCI wants it */

empty_8042();

}

static void enable_a20_fast(void)

{

u8 port_a;

port_a = inb(0x92); /* Configuration port A */

port_a |= 0x02; /* Enable A20 */

port_a &= ~0x01; /* Do not reset machine */

outb(port_a, 0x92);

}

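There is more than one way to enable the A20 line. This kernel version tries three methods to make failure as unlikely as possible.

Next the math coprocessor is reset, which just means writing 0 to ports 0xf0 and 0xf1.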

static void reset_coprocessor(void)

{

outb(0, 0xf0);

io_delay();

outb(0, 0xf1);

io_delay();

}

All interrupts on the PICs are also masked, again by writing to ports, here 0xa1 and 0x21.

static void mask_all_interrupts(void)

{

outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */

io_delay();

outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */

io_delay();

}
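In a PIC mask register a set bit disables the corresponding IRQ line, so 0xfb on the primary PIC leaves only bit 2, the cascade input from the secondary PIC, enabled. A small sketch that turns the two mask bytes into a bitmap of enabled lines (the helper name is invented):

```c
#include <stdint.h>

/* Bitmap of IRQ lines left enabled: bit n set means IRQ n is unmasked.
 * The primary PIC covers IRQ 0-7, the secondary PIC covers IRQ 8-15. */
static inline uint16_t unmasked_irqs(uint8_t primary_mask,
                                     uint8_t secondary_mask)
{
    uint16_t masked = (uint16_t)(((uint16_t)secondary_mask << 8) | primary_mask);

    return (uint16_t)(~masked & 0xffff);
}
```

For the values written by mask_all_interrupts(), unmasked_irqs(0xfb, 0xff) is 0x0004: only IRQ 2 stays open.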

The most important step before entering protected mode is setting up the GDT and IDT.

struct gdt_ptr {

u16 len;

u32 ptr;

} __attribute__((packed));

static void setup_gdt(void)

{

/* There are machines which are known to not boot with the GDT

being 8-byte unaligned. Intel recommends 16 byte alignment. */

static const u64 boot_gdt[] __attribute__((aligned(16))) = {

/* CS: code, read/execute, 4 GB, base 0 */

[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),

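GDT_ENTRY(flags, base, limit) is defined in asm/segment.h: flags holds the flag bits, base is the base address, and limit is the segment limit.

The bits of flags mean the following:

Bits 0-3: TYPE, the descriptor type.

Bit 4: S (1 = code or data segment descriptor, 0 = system segment or gate descriptor).

Bits 5-6: DPL, the privilege level of the segment.

Bit 7: P (1 = segment present in memory, 0 = not present).

Bits 8-11: the position of limit bits 16-19 in the descriptor; GDT_ENTRY takes these from the limit argument, not from flags.

Bit 12: AVL, reserved for use by the operating system.

Bit 13: reserved (always 0).

Bit 14: D/B.

Bit 15: G (0 = limit counted in bytes, 1 = limit counted in 4 KB units).

So the CS descriptor uses flags 0xC09B: with G set the limit is counted in 4 KB units, and a limit of 0xFFFFF gives 0x100000 units of 4 KB, 4 GB of addressable space in total.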

/* DS: data, read/write, 4 GB, base 0 */

[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),

The DS descriptor uses flags 0xC093: G is set, so the limit is counted in 4 KB units, and with a limit of 0xFFFFF the segment again covers 4 GB.

/* TSS: 32-bit tss, 104 bytes, base 4096 */

/* We only have a TSS here to keep Intel VT happy;

we don't actually use it for anything. */

[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),

A TSS is defined here, but according to the comment it is not actually used.

};

/* Xen HVM incorrectly stores a pointer to the gdt_ptr, instead

of the gdt_ptr contents. Thus, make it static so it will

stay in memory, at least long enough that we switch to the

proper kernel GDT. */

static struct gdt_ptr gdt;

Here the length and address of the GDT are stored in a gdt_ptr structure, and the lgdt instruction then loads it.

gdt.len = sizeof(boot_gdt)-1;

gdt.ptr = (u32)&boot_gdt + (ds() << 4);

asm volatile("lgdtl %0" : : "m" (gdt));

}
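To see what the three boot_gdt entries look like as 64-bit descriptors, the packing done by GDT_ENTRY can be written as a function. This is a sketch of the macro as I understand it from asm/segment.h (flags masked with 0xf0ff, limit bits 16-19 taken from limit); please check it against your source tree. gdt_entry and segment_size are hypothetical names.

```c
#include <stdint.h>

/* Pack flags, base and limit into an 8-byte segment descriptor. */
static inline uint64_t gdt_entry(uint64_t flags, uint64_t base, uint64_t limit)
{
    return ((base  & UINT64_C(0xff000000)) << (56 - 24)) |
           ((flags & UINT64_C(0x0000f0ff)) << 40) |
           ((limit & UINT64_C(0x000f0000)) << (48 - 16)) |
           ((base  & UINT64_C(0x00ffffff)) << 16) |
            (limit & UINT64_C(0x0000ffff));
}

/* Bytes covered by a segment: limit + 1 units, where a unit is 4 KB if the
 * G bit (flags bit 15) is set and 1 byte otherwise. */
static inline uint64_t segment_size(uint64_t flags, uint64_t limit)
{
    uint64_t units = (limit & UINT64_C(0xfffff)) + 1;

    return (flags & UINT64_C(0x8000)) ? units << 12 : units;
}
```

Under these assumptions the CS entry packs to 0x00cf9b000000ffff, the DS entry to 0x00cf93000000ffff, and the TSS entry to 0x0000890010000067. segment_size confirms the 4 GB and 104-byte sizes mentioned in the comments.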

Compared with the GDT, setting up the IDT at this stage is simple.

static void setup_idt(void)

{

static const struct gdt_ptr null_idt = {0, 0};

asm volatile("lidtl %0" : : "m" (null_idt));

}

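In fact it just uses the lidt instruction to load an empty table.

With the GDT and IDT set, protected_mode_jump() is called to jump to code32_start. header.S gives code32_start the default value 0x100000, and the bootloader can override it.

5. linux-2.6.34.13/arch/x86/boot/pmjump.S

protected_mode_jump is defined in pmjump.S and written in assembly. Its job is to enter protected mode and jump to code32_start.

It starts with 16-bit code.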

.text

.code16

/* void protected_mode_jump(u32 entrypoint, u32 bootparams); */

GLOBAL(protected_mode_jump)

movl %edx, %esi # Pointer to boot_params table

%edx holds bootparams and %eax holds entrypoint, because this code uses a fastcall-style convention that passes the first arguments in registers.

xorl %ebx, %ebx

movw %cs, %bx

shll $4, %ebx

addl %ebx, 2f

These instructions compute the linear address that the 32-bit jmp below will jump to and patch it into the instruction at label 2.

jmp 1f # Short jump to serialize on 386/486

1:

movw $__BOOT_DS, %cx

movw $__BOOT_TSS, %di

movl %cr0, %edx

orb $X86_CR0_PE, %dl # Protected mode

movl %edx, %cr0

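The three instructions above set the PE bit in cr0, which puts the CPU in protected mode.

Next comes a far jump with a 32-bit operand, emitted as raw bytes, which transfers control to in_pm32.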

# Transition to 32-bit mode

.byte 0x66, 0xea # ljmpl opcode

2: .long in_pm32 # offset

.word __BOOT_CS # segment

ENDPROC(protected_mode_jump)
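The hand-assembled far jump consists of the operand-size prefix 0x66, the opcode 0xea, a 32-bit offset and a 16-bit selector, all little-endian. The sketch below builds those eight bytes and also shows the fix-up performed by "addl %ebx, 2f". The function names, the segment 0x9020, the offset 0x1234 and the selector 0x10 are illustrative assumptions, not values taken from a real build.

```c
#include <stdint.h>

/* What "addl %ebx, 2f" does: turn the link-time offset of in_pm32 into a
 * linear address by adding the real-mode code segment base (cs << 4). */
static inline uint32_t fixup_offset(uint16_t cs, uint32_t link_offset)
{
    return ((uint32_t)cs << 4) + link_offset;
}

/* Encode "ljmpl $selector, $offset" as emitted in 16-bit code. */
static inline void encode_ljmpl(uint32_t offset, uint16_t selector,
                                uint8_t out[8])
{
    out[0] = 0x66;                        /* operand-size prefix */
    out[1] = 0xea;                        /* far jump opcode */
    out[2] = (uint8_t)(offset & 0xff);    /* 32-bit offset, little-endian */
    out[3] = (uint8_t)((offset >> 8) & 0xff);
    out[4] = (uint8_t)((offset >> 16) & 0xff);
    out[5] = (uint8_t)((offset >> 24) & 0xff);
    out[6] = (uint8_t)(selector & 0xff);  /* 16-bit segment selector */
    out[7] = (uint8_t)(selector >> 8);
}
```

With cs = 0x9020 and a link-time offset of 0x1234, the patched offset is 0x91434, and the encoded instruction is 66 ea 34 14 09 00 10 00.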

in_pm32 mainly sets up the segment registers and jumps to entrypoint.

.code32

.section ".text32","ax"

GLOBAL(in_pm32)

# Set up data segments for flat 32-bit mode

movl %ecx, %ds

movl %ecx, %es

movl %ecx, %fs

movl %ecx, %gs

movl %ecx, %ss

# The 32-bit code sets up its own stack, but this way we do have

# a valid stack if some debugging hack wants to use it.

addl %ebx, %esp

# Set up TR to make Intel VT happy

ltr %di

# Clear registers to allow for future extensions to the

# 32-bit boot protocol

xorl %ecx, %ecx

xorl %edx, %edx

xorl %ebx, %ebx

xorl %ebp, %ebp

xorl %edi, %edi

# Set up LDTR to make Intel VT happy

lldt %cx

jmpl *%eax # Jump to the 32-bit entrypoint

ENDPROC(in_pm32)

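With this last jmp, control leaves the setup code for the 32-bit entry point, and we have finally arrived in protected mode.

Reposted from: https://my.oschina.net/u/135465/blog/99590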
