    [PATCH] speedup heuristic for get_unmapped_area · 631709da
    Andrew Morton authored
    [I was going to send shared pagetables today, but it failed in
     my testing under X :( ]
    
    the first one is an mmap inefficiency that was reported by Saurabh Desai.
    The test_str02 NPTL test-utility does the following: it tests the maximum
    number of threads by creating a new thread, which thread creates a new
    thread itself, etc. It basically creates thousands of parallel threads,
    which means thousands of thread stacks.
    
    NPTL uses mmap() to allocate new default thread stacks - and POSIX
    requires us to install a 'guard page' as well, which is done via
    mprotect(PROT_NONE) on the first page of the stack. This means that tons
    of NPTL threads mean twice as many vmas per MM, all allocated in a forward
    fashion starting at the virtual address of 1 GB (TASK_UNMAPPED_BASE).
    
    Saurabh reported a slowdown after the first couple of thousands of
    threads, which I can reproduce as well. The reason for this slowdown is
    the get_unmapped_area() implementation, which tries to achieve the most
    compact virtual memory allocation, by searching for the vma at
    TASK_UNMAPPED_BASE, and then linearly searching for a hole. With thousands
    of linearly allocated vmas this is an increasingly painful thing to do ...
    
    obviously, high-performance threaded applications will create stacks
    without the guard page, which triggers the anon-vma merging code so we end
    up with one large vma, not tons of small vmas.
    
    it's also possible for userspace to be smarter: set aside a stack area,
    keep a bitmap of allocated stacks, and allocate them with MAP_FIXED.
    (This also lets it provide the guard page not via mprotect() but by
    keeping the stacks 1 page apart - ie. half the number of vmas.) But
    this approach also decreases flexibility.
    
    So I think that the default behavior nevertheless makes sense as well,
    and IMO we should optimize it in the kernel.
    
    there are various solutions to this problem, none of which solves the
    problem in a 100% sufficient way, so I went for the simplest approach: I
    added code to cache the 'last known hole' address in mm->free_area_cache,
    which is used as a hint to get_unmapped_area().
    
    this fixed the test_str02 testcase wonderfully: thread creation
    performance for this testcase is O(1) again. This simpler solution
    obviously has a number of weak spots, and the (unlikely but possible)
    worst case is quite close to the current situation. In any case, the
    approach does not sacrifice the perfect VM compactness our mmap()
    implementation achieves, so it's a performance optimization with no
    externally visible consequences.
    
    The most generic and still perfectly-compact VM allocation solution would
    be to have a vma tree for the 'inverse virtual memory space', ie. a tree
    of free virtual memory ranges, which could be searched and iterated like
    the space of allocated vmas. I think we could do this by extending vmas,
    but the drawback is larger vmas. Even then the size constraint would
    still force us to scan free ranges linearly, but at least most
    anon-mmap activity is constant-sized. (Both malloc() and the
    thread-stack allocator mostly use fixed sizes.)
    
    This patch contains some fixes from Dave Miller - on some architectures
    it is not possible to evaluate TASK_UNMAPPED_BASE at compile-time.