Qestions about dsched and fq policy

Alex Hornung ahornung at gmail.com
Sun Mar 20 00:14:13 PDT 2011

On 20 March 2011 05:57, Brills Peng <brillsp at gmail.com> wrote:
> 1. The fq scheduler seems to create one thread for each disk, but I found a
> piece of comment at fq_core.c:114 indicating that there may be more than one
> tdio assigned to a diskctx. I could not find a chance to create such threads
> when thinking through the whole procedure of scheduling:
>        - The fq_prepare() function creates a dispatcher thread when attaching to
>        a disk
>        - An io request comes from higher level, and invokes dsched_queue() to
>          find a proper thread to handle this and then calls fq_queue() which may
>          dispatch the request or insert it into the thread's queue. Note that
>          there is only one thread related is that dispatcher.
>        - The dispatcher thread checks the queue when the disk is idle, and tries
>          to dispatch the request in the queue.

dsched_fq creates two threads per disk, the so-called dispatcher
thread (fq_dispatch()) which handles old requests that haven't been
taken care of, and the balancer thread fq_balance_thread(), which does
around  half of the calculations to determine the new transaction
limit and so on of each thread/process.

BUT: each I/O that comes from a higher level, for example a user
program, has its own THREAD CONTEXT and gets its own tdio. Unlike in
some other operating systems such as FreeBSD, the thread context of
who starts the I/O remains the same throughout the whole process of
doing I/O. The tdio passed to the fq_queue() function is unique for
that particular process/thread on that disk. In other words, ssh will
have one tdio on disk ad0, and so will httpd on that disk; and both of
those are other threads than the fq_dispatch() thread.

 When a thread/process hits its hard transaction limit, its BIOs (I/O
requests) get queued for later dispatch in a queue in the tdio
structure. There is one such structure for each thread/process and for
each disk; and the diskctx structure contains a list of all the tdios
for that particular disk.

The fq_dispatcher thread only takes care of (some of) the queued
requests, others are handled by the original processes/threads
themselves during the next scheduling interval when the quota is empty

> 2. The 'fair share' should relate to processes sending io requests, but I
> have no idea on how the implementation in fq_core.c works. Each process
> should have its own queue and a budgets to consume but the queue and budgets
> seem to be per-thread. If there is only one such thread for each disk, then
> the scheduler acts like a FIFO queue.

Each process HAS its own queue and budget. As I mentioned before, this
is in the tdio (queue is tdio->head.queue and the budget is
tdio->max_tp [- tdio->transactions]).

The way it works is simply as a feedback controller: After a
scheduling interval during which the disk was 100% (or close) busy, a
disk-wide budget is calculated. This is done by accumulating the
product of average latency * number of transactions of each
thread/process. This product is then split up into fair shares for
each process. For simplicity's sake, let's assume that all
threads/processes have the same priority and that there are 4 threads.
So if the final product for the disk was Q_{disk}, Each thread/process
gets a piece of cake of size Q_{tdio} = Q_{disk}/4. The actual quota
for each thread/process is in number of transactions per second. This
is simply Q_{tdio}/average_latency, where average_latency is the
average latency of that thread.

> 3. Is it possible and proper to implement disk's arm scheduling in dsched?
> e.g, a queue sorted by arm position to increase the throughput of FQ
> scheduler.

Maybe not by disk arm position, but you can access the LBA on the
underlying device. Then again I'm not particularly fond of schedulers
doing only this; it doesn't offer much improvement and especially not
on SSDs. And as Venkatesh mentioned, there is such an implementation
already in place a few levels below dsched.

> And another questions about the 'Implement further dsched disk scheduling
> policies' on GSoC idea list:
> DragonFlyBSD has a FQ scheduler right now, but why another policy to
> improve interactivity, since the 'fair' guarantees a not bad interactivity?

Fact is I am an electronic engineer and not a computer scientist. My
approach to the FQ scheduler was an "engineery" one: a simple feedback
controller. It was mainly a proof of concept to show how the dsched
framework makes scheduling easy and painless.

 FQ tends to improve the interactivity of the system noticeably,
especially when heavy background writer tasks are running such as the
regular HAMMER maintenance. But: there is room for a lot of
improvement, interactivity is nowhere near as good as it could be.
Venkatesh has done some benchmarks with interbench, and the results
are not all that promising. It would have been a miracle that a
scheduler as simple as FQ performs on par with complex schedulers like
Linux'  BFQ or CFQ, and reality is that it's nowhere near.

> And the final question:
> Justin mentioned an application form on GSoC site, but according to Google's
> plan, students' handing in their applications starts at March 29. Should I
> send an application to the community in advance and then another to Google
> after March 29?

You only need to send in one application form, the one to Google. The
"form" outlined by Justin just should be included in your proposal,
which you submit to Google, as it answers some of the questions we
need to deal with. What you should definitely do is discuss your ideas
with the community first, but there are no formalities involved in

Hope that helps,
Alex Hornung

PS: In this discussion I use thread and process interchangeably. This
is because dsched assigns a new tdio for each kernel thread and each
userland process, so it is a bit of a mix. The execution unit is of
course a thread even for userland processes, but all the threads in a
userland process share the same tdio.

More information about the Kernel mailing list