I/O scheduler (aka dsched)

Alex Hornung ahornung at gmail.com
Thu Apr 1 06:17:59 PDT 2010

: > I think the principle of least surprise would suggest that it should
: > work exactly like nice, rather than flip it around

: nice is actually intuitive - the higher the number, the 'nicer' the
: processes are to the rest of the system..
: [...]
: it's just *system* relative - not process relative..

Right now this is the least of my concerns. If someone wants to use dsched
right now, he should know that he's using something experimental and hence
should read up on whatever documentation is available, including the nice
level inversion.

Eventually I might change it to be similar to the normal nice levels, but it
would still only go in one direction (i.e. -10 to 0, not -20 to 20).


On another note, I really want to emphasize that if you want to try dsched,
you shouldn't use the patch that was attached to the first email. Pull from
my iosched-current branch on leaf instead, which is up to date.

I've made quite a few improvements since the original patch, mainly:

- changing the algorithm that estimates the disk usage percentage. It's now
done properly, by measuring the time the disk spends idle in one balancing
period. (thanks to corecode for the idea)

- due to the previous change, I have also been able to add a feedback
mechanism that tries to dispatch more requests if the disk becomes idle,
even if all processes have already reached their rate limit, by increasing
the limit if needed.

- moving the heavier balancing calculations out of the fq_balance thread and
into the context of the processes/threads that do I/O, as far as this is
possible. Some of the heavy balancing calculations will still occur in the
dispatch thread instead of the issuing context. (thanks to Aggelos for the

- ironing out a few bugs related to int32_t overflow.

- general cleanup & refactoring
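
To illustrate the first two items, here is a minimal sketch of an idle-time
based busy estimate and the feedback on the rate limit. It is not the
dsched_fq code: the function names, the 90% target and the 25% step are all
assumptions made up for this example. The 64-bit arithmetic is used so the
multiplication can't overflow the way an int32_t could.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical target: below this busy-% the disk is considered to have idle capacity. */
#define BUSY_TARGET_PCT	90

/*
 * Estimate how busy the disk was during one balancing period, given the
 * time it spent idle in that period. Both values are in microseconds.
 * (Hypothetical helper, not taken from dsched_fq.)
 */
int
disk_busy_pct(int64_t idle_us, int64_t period_us)
{
	assert(period_us > 0);

	/* Clamp the idle time into [0, period_us]. */
	if (idle_us < 0)
		idle_us = 0;
	if (idle_us > period_us)
		idle_us = period_us;

	/* int64_t keeps (busy time * 100) from overflowing. */
	return (int)(((period_us - idle_us) * 100) / period_us);
}

/*
 * Feedback step: if at least one process/thread was rate limited while the
 * disk still had idle time, raise the limit (here by 25%, plus 1 so that
 * very small limits still grow). Otherwise leave it unchanged.
 */
int64_t
adjust_rate_limit(int64_t limit, int busy_pct, int rate_limited)
{
	if (rate_limited && busy_pct < BUSY_TARGET_PCT)
		return limit + limit / 4 + 1;
	return limit;
}
```

For example, a disk that was idle for 250 ms of a 1 s balancing period comes
out as 75% busy, so a limit of 100 would be raised to 126 if some process had
hit it.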


I also forgot to mention in my original email that there are some other
interesting tools/settings, mainly:

sysctl kern.dsched_debug: the higher the level, the more debug output you'll
get. By default no debug output is printed. At level 4, only the disk busy-%
is printed, and at level 7 all details about the balancing are shown.
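
The way the levels stack can be pictured with a small level-gated print
helper. This is only a sketch of the general technique; debug_level and
dbg_print are hypothetical names standing in for the sysctl value and
whatever the kernel code really uses.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the value of kern.dsched_debug (0 = no output). */
int debug_level = 0;

/*
 * Print the message only if the configured level is at least 'level'.
 * Returns the number of characters printed, or 0 if it was suppressed.
 */
int
dbg_print(int level, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (debug_level < level)
		return 0;

	va_start(ap, fmt);
	n = vprintf(fmt, ap);
	va_end(ap);
	return n;
}
```

With debug_level set to 4, a call such as dbg_print(4, "busy: %d%%\n", 75)
prints its message, while anything tagged with level 7 stays silent until the
level is raised to 7.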

test/dsched_fq: If you build fqstats (just run 'make' in this directory),
you'll be able to read some of the statistics that dsched_fq keeps track of,
such as the number of allocated structures of each type, the number of
processes/threads that were rate limited, and the number of issued,
completed and cancelled transactions.

Alex Hornung
