<div dir="ltr">Matt, Sephe,<div><br></div><div>Thank you for the prompt and detailed feedback. I will take a look at the ipitest script, and the dedicated IPI vector path. I am interested in the overall latencies of IPIs themselves, and whether (at some point) they'd be useful in my research to trigger from user-level for certain purposes.</div><div><br></div><div>-Alex</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 3, 2016 at 1:56 AM, Matthew Dillon <span dir="ltr"><<a href="mailto:dillon@backplane.com" target="_blank">dillon@backplane.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">The main improvement was to the MMU page-invalidation routines. In 4.4 these routines used the kernel's generic IPI messaging API. But page invalidations require all cpus to synchronize their changes and since the generic IPI messaging API honors critical sections that could cause most of the cpu's to stall for longer than intended if just one of the cpus happened to be in a long critical section. In 4.6 the page-invalidation code runs through a dedicated IPI vector which ignores critical sections and does not have this stall issue.<div><br></div><div>A second optimization as also implemented, but could not be tested well enough to enable for the release. This optimization can be turned on with a sysctl in the 4.6 release (sysctl machdep.optimized_invltlb=1). This optimization is meant to avoid sending IPIs to CPUs which are idle and thus might be in a low-power mode. Such cpus will respond much more slowly to the IPI vector and not only increase latency for the cpus running at full power, but also have a tendancy to kick the cpus in low-power mode out of low-power mode. This second optimization is most useful on systems with a lot of cores (multi-socket systems and systems with > 4 cores). This one will eventually be turned on by default once sufficient testing of the feature has been done.</div><div><br></div><div>There were numerous other optimizations to reduce the amount of IPI signalling needed. Probably the biggest one is that the buffer cache no longer synchronizes the MMU when throwing away a buffer cache buffer. It only synchronizes the MMU when allocating one. This cut out roughly half of the system-wide IPIs in a nominally loaded system. We have an additional optimization, also disabled by default, which nearly eliminates buffer-cache-related IPIs in situations where filesystem disk bandwidth is very high (sysctl vfs.repurpose_enable=1). It's only really applicable under heavy loads when consolidated disk I/O bandwidth exceeds 200 MBytes/sec.</div><div><br></div><div>-Matt</div><div><br></div></div>