Large pages; page coloring

Mon Jun 1 09:32:44 PDT 2015

Correction ---

A direct-mapping function to index a cache, to me, means the use of
specific well-known index bits within the physical address to determine the
set into which it will be placed when cached.

becomes

A direct-mapping function to index a cache, to me, means the use of
specific well-known index bits within the address (virtual or physical) to
determine the set into which it will be placed when cached.

Apologies.

On Mon, Jun 1, 2015 at 9:30 AM, Alex Merritt <merritt.alex at gmail.com> wrote:

> Matt,
>
> Thank you for the insight into the L1 operation. As L1 uses a
> direct-mapped indexing function, we can apply known techniques such as use
> of offsets to ensure specific placement within the cache, as you mention.
>
> My question is not in regard to whether virtual addresses or physical
> addresses are used for indexing, but rather the function itself that the
> hardware uses to perform indexing. Below is sample code which parses CPUID
> to extract this information. On an Intel Haswell, it shows the L3 to have
> "complex indexing" whereas L1/L2 use direct indexing. On an Intel Westmere,
> all caches use direct mapping. I noticed processors since Sandy Bridge have
> complex indexing in the LLC.
>
> A direct-mapping function to index a cache, to me, means the use of
> specific well-known index bits within the physical address to determine the
> set into which it will be placed when cached. Complex indexing suggests
> this is no longer true. If true, how can we be sure the coloring strategy
> used by the kernel to sort pages, based on specific index bits, will
> continue to have the same effect on modern processors?
>
> -Alex
>
> /* Intel Programmer Manual instruction set reference
>  * CPUID, table 3-17.
>  */
> #include <stdio.h>
> static const char *CACHE_STR[4] = {
>     "NULL",
>     "Data cache",
>     "Instruction cache",
>     "Unified cache",
> };
> int main(void)
> {
>     unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
>     unsigned int func = 4, val;
>     while (1) {
>         /* leaf 04H; ECX selects the cache (subleaf), one per pass */
>         unsigned int _ecx = ecx++;
>         __asm__ volatile("cpuid \n\t"
>                 : "=a"(eax), "=b"(ebx), "=c"(_ecx), "=d"(edx)
>                 : "a"(func), "c"(_ecx)
>                 :);
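>         /* check if a cache type is specified (0 means no more caches) */
>         if (!(val = (eax & 0x1f)))
>             break;
>         /* types above 3 are reserved; avoid indexing past the table */
>         printf("\ntype:               %s\n",
>                 val < 4 ? CACHE_STR[val] : "Reserved");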
>
>         val = ((eax >> 5) & 0x7);        /* EAX[7:5]: cache level */
>         printf("level:              %u\n", val);
>
>         val = ((ebx >> 22) & 0x3ff);     /* EBX[31:22]: ways - 1 */
>         printf("ways:               %u\n", val+1);
>
>         val = _ecx;                      /* ECX: number of sets - 1 */
>         printf("number of sets:     %u\n", val+1);
>
>         val = ((edx >> 2) & 0x1);        /* EDX[2]: complex indexing flag */
>         printf("complex index:      %u (%s)\n",
>                 val, (val ? "complex indexing" : "direct-mapped"));
>
>         printf("\n");
>     }
> }
>
>
> On Sat, May 30, 2015 at 7:21 PM, Matthew Dillon <dillon at backplane.com>
> wrote:
>
>> I think what you are describing is Intel's virtually indexed physical
>> cache.  It is designed to allow the L1 cache access to occur concurrent
>> with the PTE (page table entry) lookup, which is much more efficient than
>> having to wait for the page table lookup first and then start the memory
>> access on the L1 cache.
>>
>> The downside of this is that being virtually indexed, many programs tend
>> to load at the same virtual memory address and memory map operations also
>> tend to map at the same virtual memory address.  When these represent
>> private data rather than shared data, the cpu caches can wind up not being
>> fully utilized.  They are still N-way set associative so all is not lost,
>> but they aren't optimally used.
>>
>> The general solution is to implement an offset in the userland memory
>> allocator (not so much in the kernel) which is what we do for larger memory
>> allocations.
>>
>> -Matt
>>
>>
>> On Fri, May 29, 2015 at 8:52 AM, Alex Merritt <merritt.alex at gmail.com>
>> wrote:
>>
>>> I learned this recently, having gained access to newer Intel processors:
>>> these CPUs (Sandy Bridge, Haswell) use a form of indexing into the LLC which
>>> is no longer direct (i.e. taking specific bits from a physical address to
>>> determine which set in the LLC the cache line goes into), but rather
>>> what they call "complex indexing"[1]. Presumably this is some proprietary
>>> hashing.
>>>
>>> I wanted to ask -- does page coloring, which relies on direct indexing
>>> logic in the kernel, still have an advantage if such hashing is used,
>>> given that we do not know the specific algorithm used to index the LLC? If we are unable
>>> to determine which pages will conflict in the cache without careful study,
>>> and assuming this algorithm may change between microarchitectures, it seems
>>> there may be less benefit to applying the technique.
>>>
>>> [1]  Intel Manual Vol.2A Table 3-17, cpuid command 04H
>>>
>>> -Alex
>>>
>>> On Tue, Apr 14, 2015 at 10:47 AM, Matthew Dillon <dillon at backplane.com>
>>> wrote:
>>>>
>>>> --
>>>>
>>>> If I recall, FreeBSD mostly removed page coloring from their VM page
>>>> allocation subsystem.  DragonFly kept it and integrated it into the
>>>> fine-grained-locked VM page allocator.  There's no advantage to
>>>> manipulating the parameters for two reasons.
>>>>
>>>> First, all page coloring really does is try to avoid degenerate
>>>> situations in the cpu caches.  The cpu caches are already 4-way or 8-way
>>>> set-associative.  The page coloring improves this but frankly even the set
>>>> associativity in the base cpu caches gets us most of the way there.  So
>>>> adjusting the page coloring algorithms will not yield any improvements.
>>>>
>>>> Secondly, the L1 cache is a physical memory cache but it is also
>>>> virtually indexed.  This is a cpu hardware optimization that allows the
>>>> cache lookup to be initiated concurrent with the TLB lookup.  Because of
>>>> this, physical set associativity does not actually solve all the problems
>>>> which can occur with a virtually indexed cache.
>>>>
>>>> So the userland memory allocator implements an offsetting feature for
>>>> allocations which attempts to address the virtually indexed cache issues.
>>>> This feature is just as important as the physical page coloring feature for
>>>> performance purposes.
>>>>
>>>> -Matt
>>>>
>>>>
>>>> On Tue, Apr 14, 2015 at 10:10 AM, Alex Merritt <merritt.alex at gmail.com>
>>>> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I am interested in learning whether DragonFly supports large pages (2M
>>>>> and 1G), and secondly, what mechanisms exist for applications to have
>>>>> influence over the colors used to assign the physical pages backing their
>>>>> memory, specifically for private anonymous mmap'd regions. Regarding
>>>>> coloring, I'd like to be able to evaluate applications with a small number
>>>>> of colors (restricting their access to the last-level cache) and compare
>>>>> their performance to more/all colors available. Initially I am looking
>>>>> to hack this in so I can run some preliminary experiments,
>>>>> perhaps by way of a kernel module or something.
>>>>>
>>>>> A cursory search of the code showed no hints at support for large
>>>>> pages, but I did find there are more internal functions governing the
>>>>> allocation of pages based on colors, compared to FreeBSD (10.1). In FreeBSD
>>>>> it seems colors are only considered for regions which are added that are
>>>>> backed by a file, but I am not 100% certain.
>>>>>
>>>>> I appreciate any help!
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>
>>>>
>>>
>>
>