Large pages; page coloring

Matthew Dillon dillon at backplane.com
Mon Jun 1 18:31:38 PDT 2015


I don't know the exact hardware algorithm used for the complex indexing.
The page coloring has less and less of an effect as cpu caches get better
at distributing the data.  The overall effectiveness of the combination can
only get better, even if it is only a slight improvement.

-Matt

On Mon, Jun 1, 2015 at 9:32 AM, Alex Merritt <merritt.alex at gmail.com> wrote:

> Correction ---
>
> A direct-mapping function to index a cache, to me, means the use of
> specific well-known index bits within the physical address to determine the
> set into which it will be placed when cached.
>
> becomes
>
> A direct-mapping function to index a cache, to me, means the use of
> specific well-known index bits within the address (virtual or physical) to
> determine the set into which it will be placed when cached.
>
> Apologies.
>
> On Mon, Jun 1, 2015 at 9:30 AM, Alex Merritt <merritt.alex at gmail.com>
> wrote:
>
>> Matt,
>>
>> Thank you for the insight into the L1 operation. As L1 uses a
>> direct-mapped indexing function, we can apply known techniques such as use
>> of offsets to ensure specific placement within the cache, as you mention.
>>
>> My question is not in regard to whether virtual addresses or physical
>> addresses are used for indexing, but rather the function itself that the
>> hardware uses to perform indexing. Below is sample code which parses CPUID
>> to extract this information. On an Intel Haswell, it reports the L3 as
>> using "complex indexing" whereas the L1/L2 use direct indexing. On an
>> Intel Westmere, all caches use direct mapping. I noticed that processors
>> since Sandy Bridge have had complex indexing in the LLC.
>>
>> A direct-mapping function to index a cache, to me, means the use of
>> specific well-known index bits within the physical address to determine the
>> set into which it will be placed when cached. Complex indexing suggests
>> this is no longer true. If true, how can we be sure the coloring strategy
>> used by the kernel to sort pages, based on specific index bits, will
>> continue to have the same effect on modern processors?
>>
>> -Alex
>>
>> /* Intel Programmer Manual instruction set reference
>>  * CPUID, table 3-17.
>>  */
>> #include <stdio.h>
>> static const char *CACHE_STR[4] = {
>>     "NULL",
>>     "Data cache",
>>     "Instruction cache",
>>     "Unified cache",
>> };
>> int main(void)
>> {
>>     unsigned int eax, ebx, ecx, edx, subleaf = 0, val;
>>     while (1) {
>>         /* CPUID leaf 04H enumerates one cache per subleaf in ECX */
>>         ecx = subleaf++;
>>         __asm__("cpuid"
>>                 : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
>>                 : "0"(4), "2"(ecx));
>>         /* EAX[4:0] == 0 (null type) terminates the enumeration */
>>         if (!(val = (eax & 0x1f)))
>>             break;
>>         printf("\ntype:               %s\n", CACHE_STR[val]);
>>
>>         val = ((eax >> 5) & 0x7);       /* EAX[7:5]: cache level */
>>         printf("level:              %d\n", val);
>>
>>         val = ((ebx >> 22) & 0x3ff);    /* EBX[31:22]: ways - 1 */
>>         printf("ways:               %d\n", val + 1);
>>
>>         /* ECX: number of sets - 1 */
>>         printf("number of sets:     %d\n", ecx + 1);
>>
>>         val = ((edx >> 2) & 0x1);       /* EDX[2]: complex indexing */
>>         printf("complex index:      %d (%s)\n",
>>                 val, (val ? "complex indexing" : "direct-mapped"));
>>
>>         printf("\n");
>>     }
>>     return 0;
>> }
>>
>>
>> On Sat, May 30, 2015 at 7:21 PM, Matthew Dillon <dillon at backplane.com>
>> wrote:
>>
>>> I think what you are describing is Intel's virtually indexed physical
>>> cache.  It is designed to allow the L1 cache access to occur concurrent
>>> with the PTE (page table entry) lookup, which is much more efficient than
>>> having to wait for the page table lookup first and then start the memory
>>> access on the L1 cache.
>>>
>>> The downside is that, because the cache is virtually indexed, many programs tend
>>> to load at the same virtual memory address and memory map operations also
>>> tend to map at the same virtual memory address.  When these represent
>>> private data rather than shared data, the cpu caches can wind up not being
>>> fully utilized.  They are still N-way set associative so all is not lost,
>>> but they aren't optimally used.
>>>
>>> The general solution is to implement an offset in the userland memory
>>> allocator (not so much in the kernel) which is what we do for larger memory
>>> allocations.
>>>
>>> -Matt
>>>
>>>
>>> On Fri, May 29, 2015 at 8:52 AM, Alex Merritt <merritt.alex at gmail.com>
>>> wrote:
>>>
>>>> I learned this recently, having gained access to newer Intel
>>>> processors: these CPUs (Sandy Bridge, Haswell) use a form of indexing
>>>> into the LLC which is no longer direct (i.e., taking specific bits from
>>>> a physical address to determine which LLC set a cache line goes into),
>>>> but rather what they call "complex indexing" [1]. Presumably this is
>>>> some proprietary hashing.
>>>>
>>>> I wanted to ask -- does page coloring, using direct indexing logic by
>>>> the kernel, have an advantage if such hashing is used, also if we are
>>>> unaware of the specific algorithm used to index the LLC? If we are unable
>>>> to determine which pages will conflict in the cache without careful study,
>>>> and assuming this algorithm may change between microarchitectures, it seems
>>>> there may be less benefit to applying the technique.
>>>>
>>>> [1]  Intel Manual Vol.2A Table 3-17, cpuid command 04H
>>>>
>>>> -Alex
>>>>
>>>> On Tue, Apr 14, 2015 at 10:47 AM, Matthew Dillon <dillon at backplane.com>
>>>> wrote:
>>>>>
>>>>> If I recall, FreeBSD mostly removed page coloring from their VM page
>>>>> allocation subsystem.  DragonFly kept it and integrated it into the
>>>>> fine-grained-locked VM page allocator.  There's no advantage to
>>>>> manipulating the parameters for two reasons.
>>>>>
>>>>> First, all page coloring really does is try to avoid degenerate
>>>>> situations in the cpu caches.  The cpu caches are already 4-way or 8-way
>>>>> set-associative.  Page coloring improves on this, but frankly even the
>>>>> set associativity of the base cpu caches gets us most of the way there.
>>>>> So adjusting the page coloring algorithms will not yield any improvements.
>>>>>
>>>>> Secondly, the L1 cache is a physical memory cache but it is also
>>>>> virtually indexed.  This is a cpu hardware optimization that allows the
>>>>> cache lookup to be initiated concurrently with the TLB lookup.  Because
>>>>> of this, physical set associativity does not actually solve all the
>>>>> problems which can occur with a virtually indexed cache.
>>>>>
>>>>> So the userland memory allocator implements an offsetting feature for
>>>>> allocations which attempts to address the virtually indexed cache issues.
>>>>> This feature is just as important as the physical page coloring feature for
>>>>> performance purposes.
>>>>>
>>>>> -Matt
>>>>>
>>>>>
>>>>> On Tue, Apr 14, 2015 at 10:10 AM, Alex Merritt <merritt.alex at gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I am interested in learning whether DragonFly supports large pages
>>>>>> (2M and 1G), and secondly, what mechanisms exist for applications to
>>>>>> influence the colors used to assign the physical pages backing their
>>>>>> memory, specifically for private anonymous mmap'd regions. Regarding
>>>>>> coloring, I'd like to evaluate applications given a small number of
>>>>>> colors (restricting their access to the last-level cache) and compare
>>>>>> their performance against runs with more or all colors available. I am
>>>>>> initially looking to hack this in for some preliminary experiments,
>>>>>> perhaps by way of a kernel module or something.
>>>>>>
>>>>>> A cursory search of the code showed no hints of support for large
>>>>>> pages, but I did find there are more internal functions governing the
>>>>>> allocation of pages based on colors than in FreeBSD (10.1). In FreeBSD
>>>>>> it seems colors are only considered for regions backed by a file, but
>>>>>> I am not 100% certain.
>>>>>>
>>>>>> I appreciate any help!
>>>>>>
>>>>>> Thanks,
>>>>>> Alex
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>