Analyzing a kernel optimization for a performance drop in NVMe testing

The optimization commit is shown below. It adds __set_current_state(TASK_RUNNING); at the end of the blk_poll function.

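The goal: if the thread that is currently polling is about to be preempted (need_resched() returns true), set its state to TASK_RUNNING first. A thread in that state is not removed from the CPU's runqueue when it is switched out, and once its vruntime becomes the smallest on the queue the CPU runs it again. The thread therefore no longer depends on any other thread or interrupt to wake it up.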
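
The shape of the polling loop described above can be sketched in user-space C. This is only a toy model, not the kernel source: toy_task, toy_io, the spin counters, and the TOY_ state names are all hypothetical stand-ins, and the loop condition merely imitates while (!need_resched()).

```c
#include <stdbool.h>

/* Toy stand-ins for kernel concepts; none of these are real kernel APIs. */
enum toy_state { TOY_TASK_RUNNING, TOY_TASK_UNINTERRUPTIBLE };

struct toy_task {
    enum toy_state state;
    int spins_until_resched;   /* models need_resched() turning true */
};

struct toy_io {
    int spins_until_done;      /* completion becomes visible after this many spins */
};

/* Returns true if the completion was found while polling. */
bool toy_blk_poll(struct toy_task *t, const struct toy_io *io)
{
    int spins = 0;

    while (spins < t->spins_until_resched) {   /* while (!need_resched()) */
        if (spins >= io->spins_until_done) {
            t->state = TOY_TASK_RUNNING;       /* completion found */
            return true;
        }
        spins++;                               /* cpu_relax() in the kernel */
    }

    /* The step the commit adds: about to give up the CPU, so stay runnable. */
    t->state = TOY_TASK_RUNNING;
    return false;
}
```

In this model, a task that enters toy_blk_poll in TOY_TASK_UNINTERRUPTIBLE and runs out of spins before the I/O finishes comes back with its state reset to TOY_TASK_RUNNING, which is the property the commit relies on.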

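What happens without __set_current_state(TASK_RUNNING)? In the caller's for loop, set_current_state(TASK_UNINTERRUPTIBLE) is called before blk_poll, so the thread's state is TASK_UNINTERRUPTIBLE. When blk_poll decides that the current thread needs to be preempted, the state is still TASK_UNINTERRUPTIBLE. After blk_poll returns, the for loop calls blk_io_schedule->schedule->__schedule(false), which removes the thread from the CPU's runqueue. The thread has not been placed on any wait queue (by contrast, a thread waiting for a mutex is put on that lock's wait queue by mutex_lock and is woken when another thread calls mutex_unlock), so no other thread will wake it. It can only wait until the I/O completes, when the I/O completion callback (such as the bio end handler) wakes it.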
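
The difference between the two cases can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: sim_task, sim_schedule, sim_wake_up, sim_blk_poll, and sim_needs_wakeup are hypothetical names, and the scheduler is reduced to one rule (a voluntary switch dequeues any task whose state is not running).

```c
#include <stdbool.h>

/* Toy model: names echo kernel concepts but are not kernel APIs. */
enum sim_state { SIM_TASK_RUNNING, SIM_TASK_UNINTERRUPTIBLE };

struct sim_task {
    enum sim_state state;
    bool on_rq;                /* still on the CPU's runqueue? */
};

/* Models schedule() -> __schedule(false): a voluntary switch
 * dequeues a task that is not in the running state. */
void sim_schedule(struct sim_task *t)
{
    if (t->state != SIM_TASK_RUNNING)
        t->on_rq = false;
}

/* Models the wakeup issued from the I/O completion callback. */
void sim_wake_up(struct sim_task *t)
{
    t->state = SIM_TASK_RUNNING;
    t->on_rq = true;
}

/* Models blk_poll() returning early because need_resched() is true. */
bool sim_blk_poll(struct sim_task *t, bool with_commit)
{
    if (with_commit)
        t->state = SIM_TASK_RUNNING;   /* the step added by the commit */
    return false;                      /* no completion found yet */
}

/* One pass of the caller's loop. Returns true if the task ends up
 * off the runqueue and therefore needs an external wakeup. */
bool sim_needs_wakeup(bool with_commit)
{
    struct sim_task t = { SIM_TASK_RUNNING, true };

    t.state = SIM_TASK_UNINTERRUPTIBLE;   /* set before polling */
    if (!sim_blk_poll(&t, with_commit))
        sim_schedule(&t);
    return !t.on_rq;
}
```

In this model sim_needs_wakeup(false) is true (the task is dequeued and must wait for sim_wake_up from the completion path), while sim_needs_wakeup(true) is false (the task stays on the runqueue and can resume polling when it is next picked).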

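The advantage of I/O polling is that the polling thread is already in a runnable state just before the I/O completes, so as soon as the I/O finishes the polling thread can return to user space. Before this commit, the thread was woken only after the I/O had completed and only then returned to user space, which clearly adds latency and lowers IOPS.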

git show 67b4110f8c8d16e588d7730db8e8b01b32c1bd8b

blk: optimization for classic polling

This removes the dependency on interrupts to wake up task. Set task
state as TASK_RUNNING, if need_resched() returns true,
while polling for IO completion.
Earlier, polling task used to sleep, relying on interrupt to wake it up.
This made some IO take very long when interrupt-coalescing is enabled in
NVMe.

Reference:

http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html

Enabling interrupt coalescing in nvme device, kernel 4.9
significantly reduces performance when using polling mode.
When I enable coalescing, IOPS drops from 100K to 35K and
latency jumps from 7 usec to 25 usec.

Shouldn't we expect performance boost in polling mode when
interrupt coalescing enabled?

Device is Intel DC P3600
Coalescing enabled: nvme set-feature /dev/nvme0 -f 8 -v 0x00ffff
fio-2.16 file:
[global]
iodepth=1
direct=1
ioengine=pvsync2
hipri
group_reporting
time_based
blocksize=4k
norandommap=1

[job1]
rw=read
filename=/dev/nvme0n1
name=raw=sequential-read
numjobs=1
runtime=60
