Analysis of a kernel optimization for an NVMe polling performance regression

The optimization commit is shown below. It adds __set_current_state(TASK_RUNNING); at the end of the blk_poll function.

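The purpose: if the thread that is currently polling is about to be preempted (need_resched() returns true), its state is first set to TASK_RUNNING. A thread preempted in that state is not removed from the CPU's runqueue, and once its vruntime becomes the smallest on the queue the CPU can run it again. The thread therefore no longer depends on any other thread or interrupt to wake it up.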

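What happens without __set_current_state(TASK_RUNNING)? In the caller's for loop, set_current_state(TASK_UNINTERRUPTIBLE) is called before blk_poll, so the thread's state is TASK_UNINTERRUPTIBLE. When blk_poll finds that the thread needs to be rescheduled and returns, the state is still TASK_UNINTERRUPTIBLE. The for loop then calls blk_io_schedule -> schedule -> __schedule(false), which removes the thread from the CPU's runqueue. The thread has not been placed on any wait queue (by contrast, a thread waiting for a mutex is put on that lock's wait queue by mutex_lock and is woken when another thread calls mutex_unlock), so no other thread will wake it. It can only wait until the IO completes and the IO completion callback (such as the bio's end handler) wakes it.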
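The two paths described above can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions, not kernel code: the `toy_` functions, the `struct task` layout, and the `on_rq` flag are hypothetical stand-ins for blk_poll, schedule, and the completion-side wakeup, and the enum values are illustrative rather than the kernel's.

```c
#include <stdbool.h>

/* Simplified task states; names mirror the kernel's, values are illustrative. */
enum task_state { TASK_RUNNING, TASK_UNINTERRUPTIBLE };

struct task {
    enum task_state state;
    bool on_rq;                 /* is the task still on the CPU's runqueue? */
};

/*
 * Toy stand-in for blk_poll() giving up because need_resched() is true.
 * With the fix, the state is reset to TASK_RUNNING before returning.
 */
void toy_blk_poll_resched(struct task *t, bool with_fix)
{
    if (with_fix)
        t->state = TASK_RUNNING;
}

/*
 * Toy stand-in for a voluntary schedule(): a task whose state is not
 * TASK_RUNNING is taken off the runqueue and needs an explicit wakeup.
 */
void toy_schedule(struct task *t)
{
    if (t->state != TASK_RUNNING)
        t->on_rq = false;
}

/* Toy stand-in for the wakeup issued from the IO completion callback. */
void toy_wake_up(struct task *t)
{
    t->state = TASK_RUNNING;
    t->on_rq = true;
}

/* One pass through the caller's loop; returns true if the task stays runnable. */
bool toy_loop_iteration(bool with_fix)
{
    struct task t = { .state = TASK_RUNNING, .on_rq = true };

    t.state = TASK_UNINTERRUPTIBLE;       /* set before polling, as in the caller */
    toy_blk_poll_resched(&t, with_fix);   /* poll returns early: resched needed */
    toy_schedule(&t);                     /* blk_io_schedule -> schedule */
    return t.on_rq;
}
```

In this model, toy_loop_iteration(true) leaves the task on the runqueue, while toy_loop_iteration(false) leaves it dequeued until something calls toy_wake_up, which corresponds to the dependency on the IO completion path described above.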

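The advantage of IO polling is that the polling thread is already runnable or running just before the IO completes, so as soon as the IO completes the thread can return to user space. Before this commit, the thread was woken only after the IO had completed and only then returned to user space, which clearly adds latency and lowers IOPS.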
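The latency argument can be written down as a rough timeline model. This is a sketch under stated assumptions, not a measurement: both function names and all parameters are hypothetical, and every time is in microseconds since the IO was submitted.

```c
/*
 * Before the commit: the task sleeps, so it can only notice the completion
 * after the (possibly coalesced) interrupt fires and the wakeup takes effect.
 */
double toy_latency_sleeping(double irq_us, double wakeup_cost_us)
{
    return irq_us + wakeup_cost_us;
}

/*
 * After the commit: the task stays runnable, so it notices the completion
 * the next time it polls, i.e. once the IO is done and the scheduler has
 * put the task back on the CPU, whichever happens later.
 */
double toy_latency_runnable(double io_done_us, double back_on_cpu_us)
{
    return io_done_us > back_on_cpu_us ? io_done_us : back_on_cpu_us;
}
```

With interrupt coalescing enabled, the interrupt can arrive well after the IO is actually done, so the first function grows with the coalescing delay while the second does not depend on it.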

git show 67b4110f8c8d16e588d7730db8e8b01b32c1bd8b

    blk: optimization for classic polling

    This removes the dependency on interrupts to wake up task. Set task
    state as TASK_RUNNING, if need_resched() returns true, while polling for IO completion.
    Earlier, polling task used to sleep, relying on interrupt to wake it up.
    This made some IO take very long when interrupt-coalescing is enabled in NVMe.

    Reference:
    http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html

Enabling interrupt coalescing in nvme device, kernel 4.9
significantly reduces performance when using polling mode.
When I enable coalescing, IOPS drops from 100K to 35K and
latency jumps from 7 usec to 25 usec.

Shouldn't we expect performance boost in polling mode when
interrupt coalescing enabled?

Device is Intel DC P3600
Coalescing enabled:  nvme set-feature /dev/nvme0 -f 8 -v 0x00ffff
fio-2.16 file:
[global]
iodepth=1
direct=1
ioengine=pvsync2
hipri
group_reporting
time_based
blocksize=4k
norandommap=1

[job1]
rw=read
filename=/dev/nvme0n1
name=raw=sequential-read
numjobs=1
runtime=60
