HEADS UP: I/O scheduler (dsched) now in master

Thu Apr 15 13:47:19 PDT 2010

Hi all,

I've now committed the current state of my work to master. The 'fq' policy
should still be considered experimental and hence isn't active by default.
The rest of this mail is basically some notes put together from my previous
2 or 3 mails, so that they are all in one place. If there's something
missing, let me know.

Just to emphasize this: the noop policy is the default, and it should have no
visible effect on performance compared to before these commits. If you feel
more adventurous, you can enable the *EXPERIMENTAL* fq (fair queuing) policy
as outlined below. It should give you major improvements under concurrent
I/O, especially in read latency.

Let me know how it works out for you, and send me any suggestions on how to
improve it. Over the next few weeks I'll have somewhat more limited time
again due to exams coming up, but I'll try to be responsive.

The fq policy has a bunch of XXX in the code that could/would improve
performance and fairness, and I hope to address them over time; however, most
of them are not a priority. You could also take this opportunity and write your
own scheduling policy; either from scratch or by using dsched_fq as a base,
as it quite nicely tracks all processes/threads/bufs in the system as is
required for most modern I/O scheduling policies.

--

The work basically consists of 4 parts:
- General system interfacing
- I/O scheduler framework (dsched)
- I/O scheduler fair queuing policy (dsched_fq or fq)
- userland tools (dschedctl and ionice)

--

By default you still won't notice any difference, as the default scheduler
is the so-called noop (no operation) scheduler; it emulates our previous
behaviour. This can be confirmed by dschedctl -l:
# dschedctl -l
cd0     =>      noop
da0     =>      noop
acd0    =>      noop
ad0     =>      noop
fd0     =>      noop
md0     =>      noop

--

To enable the fq policy on a disk you have two options:

1) set dsched_pol_{diskname}="fq" in /boot/loader.conf; e.g. if it should be
enabled for da0, then dsched_pol_da0="fq". You could also apply the fq
policy to all disks of a certain type (dsched_pol_da) or to all disks
(dsched_pol). Note that sernos are not supported (yet).

2) use dschedctl:
# dschedctl -s fq -d da0
Switched scheduler policy of da0 successfully to fq.

After this, dschedctl -l should list the scheduler of da0 as fq.
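
To illustrate how the three loader.conf tunables from option 1 relate to each
other, here is a minimal Python sketch. It is not dsched code: resolve_policy
is a hypothetical helper, and the lookup order (specific disk first, then disk
type, then the global setting) is my assumption; the text above only states
which tunables exist.

```python
# Sketch only: the lookup order (disk, then disk type, then global) is an
# assumption, and resolve_policy is a hypothetical helper, not part of dsched.

def resolve_policy(disk, tunables, default="noop"):
    """Return the policy name that would apply to `disk` under the assumed order."""
    disk_type = disk.rstrip("0123456789")   # "da0" -> "da"
    for key in ("dsched_pol_" + disk, "dsched_pol_" + disk_type, "dsched_pol"):
        if key in tunables:
            return tunables[key]
    return default                          # nothing set: noop stays active


# Tunables as they might appear in /boot/loader.conf.
loader_conf = {
    "dsched_pol_da": "fq",    # all disks of type da
    "dsched_pol_ad0": "fq",   # only ad0
}

for name in ("da0", "da1", "ad0", "ad1", "cd0"):
    print(name, "=>", resolve_policy(name, loader_conf))
```

With these two example tunables, every da disk and ad0 would end up on fq,
while everything else stays on noop.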

Another use of dschedctl is to list the available scheduling policies, which
is of limited use right now, but I'll show its use anyway:
# dschedctl -p
        >       noop
        >       fq

--

The ionice priority is similar to nice, but the levels range from 0 to 10
and, unlike the usual nice, 10 is the highest priority and 0 the lowest.
Usage is exactly the same as nice:
# ionice -n 10 sh read_big

--

A brief description of the inner workings of the FQ policy follows:
- all requests (bios) are let through by default without any queuing.

- for each process/thread in the system, the average latency of its I/O and
the tps is calculated.

- A thread that runs every few hundred ms checks if the disk bandwidth is
fully used, and if so, it allocates a fair share of the maximum transactions
to each process/thread in the system that is doing I/O. This is done taking
into account the latency and the tps. Processes/threads exceeding their share
get rate limited to a number of tps.

- Processes/threads sharing an ioprio each get an equal share of the pie.

- Once a process/thread is rate limited, only the given amount of bios go
through. All bios exceeding the fair share of the thread/process in the
scheduling time quantum are queued in a per process/thread queue. Reads are
queued at the front while writes are added to the back.

- Before dispatching a bio for a process/thread, it is checked if its queue
is non-empty. If this is the case, these bios are dispatched first before
the new bio is dispatched.

- A dispatcher thread runs every ~20ms dispatching bios for all
processes/threads that have queued bios, up to the maximum number allowed.
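
The queuing rules in the list above can be sketched roughly as follows. This
is an illustration in Python, not the kernel code: ThreadQueue, max_tp,
new_quantum and the dict-based bios are hypothetical names, and the point at
which the per-quantum budget is reset is an assumption.

```python
from collections import deque


class ThreadQueue:
    """Hypothetical per-process/thread state; all names are illustrative."""

    def __init__(self, max_tp=None):
        self.max_tp = max_tp    # bios allowed per quantum; None = not rate limited
        self.issued = 0         # bios let through in the current quantum
        self.queue = deque()    # bios held back because the share was exceeded

    def _has_budget(self):
        return self.max_tp is None or self.issued < self.max_tp

    def _issue(self, bio, dispatch):
        self.issued += 1
        dispatch(bio)

    def submit(self, bio, dispatch):
        # Bios queued earlier are dispatched before the new one.
        self.dispatcher_tick(dispatch)
        if self._has_budget():
            self._issue(bio, dispatch)
        elif bio["read"]:
            self.queue.appendleft(bio)   # reads are queued at the front
        else:
            self.queue.append(bio)       # writes are added to the back

    def dispatcher_tick(self, dispatch):
        # Called periodically: dispatch queued bios up to the allowed maximum.
        while self.queue and self._has_budget():
            self._issue(self.queue.popleft(), dispatch)

    def new_quantum(self):
        # Assumption: the budget is reset at the start of each quantum.
        self.issued = 0


sent = []
tq = ThreadQueue(max_tp=2)
for name, is_read in [("w1", False), ("w2", False), ("w3", False), ("r1", True)]:
    tq.submit({"name": name, "read": is_read}, sent.append)

tq.new_quantum()
tq.dispatcher_tick(sent.append)
print([bio["name"] for bio in sent])   # r1 overtakes w3, which was queued first
```

With max_tp left at None the queue is never used, which corresponds to the
default case where all bios are let through without any queuing.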

--

Recent changes:
- changing the algorithm to estimate the disk usage percent. Now it's done
right, by measuring the time the disk spends idle in one balancing period.

- due to the previous change, I have also been able to add a feedback
mechanism that tries to dispatch more requests if the disk becomes idle,
even if all processes have already reached their rate limit, by increasing
the limit if needed.

- moving the heavier balancing calculations out of the fq_balance thread and
into the context of the processes/threads that do I/O, as far as this is
possible. Some of the heavy balancing calculations will still occur in the
dispatch thread instead of the issuing context. (thanks to Aggelos for the
idea)
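
As a rough illustration of the first two changes, the busy percentage and the
feedback step could look like the Python sketch below. The function names,
the 90% threshold and the step size are assumptions made up for the example;
the real code may use different values and a different way of raising the
limit.

```python
def disk_busy_percent(idle_time_us, period_us):
    """Share of one balancing period during which the disk was not idle."""
    if period_us <= 0:
        raise ValueError("period must be positive")
    idle = min(max(idle_time_us, 0), period_us)   # clamp to a sane range
    return 100 * (period_us - idle) // period_us


def adjust_limit(max_tp, busy_pct, full_threshold=90, step=1):
    """Hypothetical feedback step: raise the limit while the disk has idle time."""
    if busy_pct < full_threshold:
        return max_tp + step
    return max_tp


# The disk was idle for 100 ms of a 400 ms balancing period.
busy = disk_busy_percent(100_000, 400_000)
print(busy, adjust_limit(10, busy))
```

Measuring idle time directly means the estimate does not depend on guessing
the maximum throughput of the disk, which is what makes the feedback step
possible.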

--

Recent bugfixes:
- The issue that existed before with a panic after a few live policy
switches has been fixed.

- The only-write performance has also been improved since the previous
version; when only writes are occurring the full disk bandwidth is now used.

- Several other panics, mostly related to int64 overflows, have been fixed :)

--

There are some other interesting tools/settings, mainly:

sysctl kern.dsched_debug: the higher the level, the more debug output you'll
get. By default no debug output is printed. At level 4, only the disk busy-%
is printed, and at 7 all details about the balancing are shown. If you hit a
bug, it would almost certainly be helpful if you could provide the relevant
dsched debug output at level 7.

test/dsched_fq: If you build fqstats (just using 'make' in this directory),
you'll be able to read some of the statistics that dsched_fq keeps track of,
such as the number of allocated structures of each type, the number of
processes/threads that were rate limited, and the number of issued, completed
and cancelled transactions.

Cheers,
Alex Hornung