How do I prevent functions from being aligned to 16-byte boundaries when compiling for x86?

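Problem description:

I'm working in an embedded-like environment where every byte is extremely precious, much more so than the extra cycles for unaligned accesses. I have some simple Rust code from an OS development example: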

#![feature(lang_items)] 
#![no_std] 
extern crate rlibc; 
#[no_mangle] 
pub extern fn rust_main() { 

    // ATTENTION: we have a very small stack and no guard page 

    let hello = b"Hello World!"; 
    let color_byte = 0x1f; // white foreground, blue background 

    let mut hello_colored = [color_byte; 24]; 
    for (i, char_byte) in hello.into_iter().enumerate() { 
        hello_colored[i*2] = *char_byte; 
    } 

    // write `Hello World!` to the center of the VGA text buffer 
    let buffer_ptr = (0xb8000 + 1988) as *mut _; 
    unsafe { *buffer_ptr = hello_colored }; 

    loop{} 

} 

#[lang = "eh_personality"] extern fn eh_personality() {} 
#[lang = "panic_fmt"] #[no_mangle] pub extern fn panic_fmt() -> ! {loop{}}

I also use this linker script:

OUTPUT_FORMAT("binary") 
ENTRY(rust_main) 
phys = 0x0000; 
SECTIONS 
{ 
    .text phys : AT(phys) { 
    code = .; 
    *(.text.start); 
    *(.text*) 
    *(.rodata) 
    . = ALIGN(4); 
    } 
    __text_end=.; 
    .data : AT(phys + (data - code)) 
    { 
    data = .; 
    *(.data) 
    . = ALIGN(4); 
    } 
    __data_end=.; 
    .bss : AT(phys + (bss - code)) 
    { 
    bss = .; 
    *(.bss) 
    . = ALIGN(4); 
    } 
    __binary_end = .; 
}

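I compile with opt-level: 3 and LTO, using a compiler targeting i586 and the GNU ld linker, with -O3 included in the linker command. I also tried opt-level: z together with -Os at the linker, but that produced larger code (it didn't unroll the loop). As it stands, the size with opt-level: 3 seems quite reasonable.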

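Quite a few bytes seem to be wasted on aligning functions to some boundary. After the unrolled loop, 8 single-byte nop instructions are inserted (a8 through af in the listing), followed by the infinite loop as expected. After that there appears to be another infinite loop, preceded by 7 nops with the 16-bit operand-size override (i.e. xchg ax,ax rather than xchg eax,eax). This adds up to about 26 bytes wasted in a 196-byte flat binary.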

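What exactly is the optimizer doing?
What options do I have to disable this?
Why is unreachable code being included in the binary?

The full assembly listing is below: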

    0: c6 05 c4 87 0b 00 48 movb $0x48,0xb87c4 
    7: c6 05 c5 87 0b 00 1f movb $0x1f,0xb87c5 
    e: c6 05 c6 87 0b 00 65 movb $0x65,0xb87c6 
    15: c6 05 c7 87 0b 00 1f movb $0x1f,0xb87c7 
    1c: c6 05 c8 87 0b 00 6c movb $0x6c,0xb87c8 
    23: c6 05 c9 87 0b 00 1f movb $0x1f,0xb87c9 
    2a: c6 05 ca 87 0b 00 6c movb $0x6c,0xb87ca 
    31: c6 05 cb 87 0b 00 1f movb $0x1f,0xb87cb 
    38: c6 05 cc 87 0b 00 6f movb $0x6f,0xb87cc 
    3f: c6 05 cd 87 0b 00 1f movb $0x1f,0xb87cd 
    46: c6 05 ce 87 0b 00 20 movb $0x20,0xb87ce 
    4d: c6 05 cf 87 0b 00 1f movb $0x1f,0xb87cf 
    54: c6 05 d0 87 0b 00 57 movb $0x57,0xb87d0 
    5b: c6 05 d1 87 0b 00 1f movb $0x1f,0xb87d1 
    62: c6 05 d2 87 0b 00 6f movb $0x6f,0xb87d2 
    69: c6 05 d3 87 0b 00 1f movb $0x1f,0xb87d3 
    70: c6 05 d4 87 0b 00 72 movb $0x72,0xb87d4 
    77: c6 05 d5 87 0b 00 1f movb $0x1f,0xb87d5 
    7e: c6 05 d6 87 0b 00 6c movb $0x6c,0xb87d6 
    85: c6 05 d7 87 0b 00 1f movb $0x1f,0xb87d7 
    8c: c6 05 d8 87 0b 00 64 movb $0x64,0xb87d8 
    93: c6 05 d9 87 0b 00 1f movb $0x1f,0xb87d9 
    9a: c6 05 da 87 0b 00 21 movb $0x21,0xb87da 
    a1: c6 05 db 87 0b 00 1f movb $0x1f,0xb87db 
    a8: 90      nop 
    a9: 90      nop 
    aa: 90      nop 
    ab: 90      nop 
    ac: 90      nop 
    ad: 90      nop 
    ae: 90      nop 
    af: 90      nop 
    b0: eb fe     jmp 0xb0 
    b2: 66 90     xchg %ax,%ax 
    b4: 66 90     xchg %ax,%ax 
    b6: 66 90     xchg %ax,%ax 
    b8: 66 90     xchg %ax,%ax 
    ba: 66 90     xchg %ax,%ax 
    bc: 66 90     xchg %ax,%ax 
    be: 66 90     xchg %ax,%ax 
    c0: eb fe     jmp 0xc0 
    c2: 66 90     xchg %ax,%ax

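I don't know Rust, but the second infinite loop in the disassembly is probably the second infinite loop in your source code. Aligning loop branch targets to 16 bytes is a very common performance optimization, though obviously the performance of an infinite loop probably doesn't matter.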

Try adding '-C llvm-args=-align-all-blocks=1' to the 'rustc' options. – red75prime

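The code for 'pub extern panic_fmt()' is included in the binary, probably because you declared it as an exported public function or because [you didn't declare 'panic_fmt' correctly](https://doc.rust-lang.org/core/#how-to-use-the-core-library). I can't build your code at the moment, so I can't verify this. – red75prime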

Answer

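As Ross states, aligning functions and branch points to 16 bytes is a common x86 optimization recommended by Intel, although it can occasionally be less efficient, as in your case. Deciding optimally whether or not to align is a hard problem for a compiler, and I believe LLVM simply opts to always align. See more info on Performance optimisations of x86-64 assembly - Alignment and branch prediction.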
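To make the cost of that alignment concrete, here is a minimal sketch of the padding arithmetic, assuming each loop header is placed on the next 16-byte boundary. The helper name `padding_to` is hypothetical; the offsets are taken from the disassembly in the question.

```rust
/// Bytes of padding needed so that `offset` is rounded up to the next
/// multiple of `align`. `align` must be a power of two.
fn padding_to(offset: usize, align: usize) -> usize {
    assert!(align.is_power_of_two());
    (align - (offset % align)) % align
}

fn main() {
    // The unrolled stores end at 0xa8; the first `jmp` sits at 0xb0.
    let before_first_loop = padding_to(0xa8, 16);
    // The first `jmp` is 2 bytes long and ends at 0xb2; the second sits at 0xc0.
    let before_second_loop = padding_to(0xb2, 16);

    println!("padding before first loop:  {}", before_first_loop);
    println!("padding before second loop: {}", before_second_loop);
    // With an alignment of 1 no padding is ever required.
    println!("padding with align 1:       {}", padding_to(0xb2, 1));
}
```

The two values correspond to the run of `nop` bytes at a8 and the run of `xchg %ax,%ax` pairs at b2 in the listing.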

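As red75prime's comment hints (but doesn't explain), LLVM uses the value of align-all-blocks as the alignment for branch points, so setting it to 1 effectively disables the 16-byte alignment. (LLVM's own description of the option gives the value in log2 form, e.g. 4 means 16-byte boundaries, so it is worth checking the resulting layout in a disassembly.) Note that this applies globally, and comparison benchmarks are recommended.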
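Since the setting is global, it helps to measure its effect on size before and after. The following is a rough sketch, not a disassembler: it counts bytes that look like alignment padding in a flat binary. The function names are hypothetical, and the sample bytes are the tail of the listing in the question; for a real build you could load the file with `std::fs::read` instead.

```rust
/// Counts bytes that look like x86 alignment padding: the single-byte `nop`
/// (0x90) and the two-byte `xchg %ax,%ax` form (0x66 0x90).
/// Heuristic only: the same byte values can also occur in operands or data.
fn count_nop_padding(bytes: &[u8]) -> usize {
    let mut count = 0;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == 0x66 && bytes.get(i + 1) == Some(&0x90) {
            count += 2;
            i += 2;
        } else if bytes[i] == 0x90 {
            count += 1;
            i += 1;
        } else {
            i += 1;
        }
    }
    count
}

/// Bytes from offset 0xa1 to the end of the listing in the question.
fn listing_tail() -> Vec<u8> {
    let mut tail: Vec<u8> = vec![0xc6, 0x05, 0xdb, 0x87, 0x0b, 0x00, 0x1f]; // last movb
    tail.extend_from_slice(&[0x90; 8]); // a8..af: nop
    tail.extend_from_slice(&[0xeb, 0xfe]); // b0: jmp 0xb0
    for _ in 0..7 {
        tail.extend_from_slice(&[0x66, 0x90]); // b2..bf: xchg %ax,%ax
    }
    tail.extend_from_slice(&[0xeb, 0xfe]); // c0: jmp 0xc0
    tail.extend_from_slice(&[0x66, 0x90]); // c2: xchg %ax,%ax
    tail
}

fn main() {
    let tail = listing_tail();
    println!(
        "{} of {} bytes look like padding",
        count_nop_padding(&tail),
        tail.len()
    );
}
```

Running the same count on binaries built with and without the flag gives a quick size comparison; timing still has to be measured separately.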
