1. 07 Nov, 2015 37 commits
    • mm: make compound_head() robust · 1d798ca3
      Kirill A. Shutemov authored
      Hugh has pointed out that a compound_head() call can be unsafe in some
      contexts. Here is one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is purely theoretical. I don't think it's possible to trigger it
      in practice. But who knows.
      
      We can fix the race by changing how we encode PageTail() and
      compound_head() within struct page so that both can be updated in one shot.
      
      The patch introduces page->compound_head into the third double-word block,
      in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and,
      if bit 0 is set, the remaining bits are a pointer to the head page.
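
The encoding described above can be modelled in userspace as a tagged pointer. This is a minimal sketch, not the kernel code: `struct toy_page` and every `toy_*` name are hypothetical, and the real helper would read the word with READ_ONCE() rather than a plain load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct page: only the word this sketch cares about. */
struct toy_page {
	uintptr_t compound_head;	/* bit 0: tail flag, other bits: head pointer */
};

/* Mark 'tail' as a tail page of 'head' with a single store. */
static void toy_set_compound_head(struct toy_page *tail, struct toy_page *head)
{
	/* The head must be at least 2-byte aligned so that bit 0 is free. */
	assert(((uintptr_t)head & 1) == 0);
	tail->compound_head = (uintptr_t)head | 1;
}

/* Clear both the tail flag and the head pointer with a single store. */
static void toy_clear_compound_head(struct toy_page *tail)
{
	tail->compound_head = 0;
}

static bool toy_page_tail(const struct toy_page *page)
{
	return page->compound_head & 1;
}

/* Read the word once; flag and pointer come from the same snapshot. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	uintptr_t word = page->compound_head;

	if (word & 1)
		return (struct toy_page *)(word - 1);
	return page;
}
```

Because the flag and the pointer live in one word, a reader can never observe "is a tail" together with a NULL head, which is the window shown in the CPU0/CPU1 diagram.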
      
      The patch moves page->pmd_huge_pte out of the word, just in case an
      architecture defines pgtable_t as something that can have bit 0 set.
      
      hugetlb_cgroup uses page->lru.next in the second tail page to store a
      pointer to struct hugetlb_cgroup. The patch switches it to use
      page->private in the second tail page instead. The space is free since
      ->first_page is removed from the union.
      
      The patch also opens the possibility of removing the
      HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first
      tail page to store the struct hugetlb_cgroup pointer. But that's out of
      scope for this patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long a list to be absolutely sure, but it looks like nobody
      uses bit 0 of the word.
      
      page->rcu_head.next is guaranteed [1] to have bit 0 clear as long as we use
      call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
      call_rcu_lazy() is not allowed, as it makes use of the bit and we could
      get a false positive PageTail().
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: pack compound_dtor and compound_order into one word in struct page · f1e61557
      Kirill A. Shutemov authored
      The patch halves the space occupied by compound_dtor and compound_order in
      struct page.
      
      For compound_order, it's a trivial long -> short conversion.
      
      For get_compound_page_dtor(), we now use a hardcoded table for destructor
      lookup and store its index in struct page instead of a direct pointer
      to the destructor. It shouldn't be much trouble to maintain the table: we
      have only two destructors and NULL currently.
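
A rough userspace model of the index-based lookup might look like the following. All `toy_*` names and the `freed_by` field are hypothetical and exist only for this demonstration; the entries in the real table are the kernel's own destructors.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
	unsigned short compound_dtor;	/* index into toy_dtors[], not a pointer */
	unsigned short compound_order;
	int freed_by;			/* records which destructor ran (demo only) */
};

typedef void toy_dtor_fn(struct toy_page *);

enum toy_dtor_id {
	TOY_NULL_DTOR,
	TOY_COMPOUND_DTOR,
	TOY_HUGETLB_DTOR,
	TOY_NR_DTORS,
};

static void toy_free_compound(struct toy_page *page)
{
	page->freed_by = TOY_COMPOUND_DTOR;
}

static void toy_free_huge(struct toy_page *page)
{
	page->freed_by = TOY_HUGETLB_DTOR;
}

/* Hardcoded table: two destructors plus NULL. */
static toy_dtor_fn * const toy_dtors[TOY_NR_DTORS] = {
	[TOY_NULL_DTOR] = NULL,
	[TOY_COMPOUND_DTOR] = toy_free_compound,
	[TOY_HUGETLB_DTOR] = toy_free_huge,
};

static void toy_set_dtor(struct toy_page *page, enum toy_dtor_id id)
{
	assert(id < TOY_NR_DTORS);
	page->compound_dtor = id;
}

static toy_dtor_fn *toy_get_dtor(const struct toy_page *page)
{
	assert(page->compound_dtor < TOY_NR_DTORS);
	return toy_dtors[page->compound_dtor];
}
```

Two `unsigned short` fields fit in a single word, whereas a function pointer plus a `long` order needed two.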
      
      This patch frees up one word in tail pages for reuse. This is preparation
      for the next patch.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: use page->private instead of page->first_page · 32e7ba1e
      Kirill A. Shutemov authored
      We are going to rework how compound_head() works. It will not use
      page->first_page as it does now.
      
      The only other user of page->first_page beyond compound pages is
      zsmalloc.
      
      Let's use page->private instead of page->first_page here. It occupies
      the same storage space.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab, slub: use page->rcu_head instead of page->lru plus cast · bc4f610d
      Kirill A. Shutemov authored
      We have properly typed page->rcu_head, no need to cast page->lru.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: drop page->slab_page · 474e4eea
      Kirill A. Shutemov authored
      Since 8456a648 ("slab: use struct page for slab management") nobody
      uses slab_page field in struct page.
      
      Let's drop it.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: reduce size_class memory usage · 6fe5186f
      Sergey Senozhatsky authored
      Each `struct size_class' contains `struct zs_size_stat': an array of
      NR_ZS_STAT_TYPE `unsigned long'.  For zsmalloc built without
      CONFIG_ZSMALLOC_STAT this results in a waste of `2 * sizeof(unsigned
      long)' per class.
      
      The patch removes unneeded `struct zs_size_stat' members by redefining
      NR_ZS_STAT_TYPE (max stat idx in array).
      
      Since both NR_ZS_STAT_TYPE and zs_stat_type are compile time constants,
      GCC can eliminate zs_stat_inc()/zs_stat_dec() calls that use zs_stat_type
      larger than NR_ZS_STAT_TYPE: CLASS_ALMOST_EMPTY and CLASS_ALMOST_FULL at
      the moment.
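
The compile-time trick can be sketched in plain C as follows. The `TOY_*` names and the `TOY_ZSMALLOC_STAT` switch are hypothetical stand-ins for the zsmalloc identifiers, chosen so the snippet builds outside the kernel.

```c
#include <assert.h>

enum toy_stat_type {
	TOY_OBJ_ALLOCATED,
	TOY_OBJ_USED,
	TOY_CLASS_ALMOST_FULL,
	TOY_CLASS_ALMOST_EMPTY,
};

/* Without the stat option, only the first two counters get storage. */
#ifdef TOY_ZSMALLOC_STAT
#define TOY_NR_STAT_TYPE (TOY_CLASS_ALMOST_EMPTY + 1)
#else
#define TOY_NR_STAT_TYPE (TOY_OBJ_USED + 1)
#endif

struct toy_size_class {
	unsigned long objs[TOY_NR_STAT_TYPE];
};

/*
 * When 'type' is a compile-time constant at the call site, the compiler
 * can fold the comparison and drop the whole call for types that have
 * no slot in the array.
 */
static inline void toy_stat_inc(struct toy_size_class *class,
				enum toy_stat_type type, unsigned long cnt)
{
	if ((int)type < TOY_NR_STAT_TYPE)
		class->objs[type] += cnt;
}
```

Built without `TOY_ZSMALLOC_STAT`, a call such as `toy_stat_inc(&c, TOY_CLASS_ALMOST_EMPTY, 1)` is a no-op and the array holds two counters instead of four.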
      
      ./scripts/bloat-o-meter mm/zsmalloc.o.old mm/zsmalloc.o.new
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-39 (-39)
      function                                     old     new   delta
      fix_fullness_group                            97      94      -3
      insert_zspage                                100      86     -14
      remove_zspage                                141     119     -22
      
      To summarize:
      a) each class now uses less memory
      b) we avoid a number of dec/inc stats (a minor optimization,
         but still).
      
      The gain will increase once we introduce additional stats.
      
      A simple IO test.
      
      iozone -t 4 -R -r 32K -s 60M -I +Z
                              patched                 base
      "  Initial write "       4145599.06              4127509.75
      "        Rewrite "       4146225.94              4223618.50
      "           Read "      17157606.00             17211329.50
      "        Re-read "      17380428.00             17267650.50
      "   Reverse Read "      16742768.00             16162732.75
      "    Stride read "      16586245.75             16073934.25
      "    Random read "      16349587.50             15799401.75
      " Mixed workload "      10344230.62              9775551.50
      "   Random write "       4277700.62              4260019.69
      "         Pwrite "       4302049.12              4313703.88
      "          Pread "       6164463.16              6126536.72
      "         Fwrite "       7131195.00              6952586.00
      "          Fread "      12682602.25             12619207.50
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc.c: remove useless line in obj_free() · 6f0b2276
      Hui Zhu authored
      Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: don't test shrinker_enabled in zs_shrinker_count() · 2c351695
      Sergey Senozhatsky authored
      We don't let the user disable the shrinker in zsmalloc (once it's been
      enabled), so there is no need to check ->shrinker_enabled in
      zs_shrinker_count(), at the moment at least.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: use preempt.h for in_interrupt() · 759b26b2
      Sergey Senozhatsky authored
      A cosmetic change.
      
      Commit c60369f0 ("staging: zsmalloc: prevent mappping in interrupt
      context") added in_interrupt() check to zs_map_object() and 'hardirq.h'
      include; but in_interrupt() macro is defined in 'preempt.h' not in
      'hardirq.h', so include it instead.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: fix obj_to_head use page_private(page) as value but not pointer · 12a7bfad
      Hui Zhu authored
      In obj_malloc():
      
      	if (!class->huge)
      		/* record handle in the header of allocated chunk */
      		link->handle = handle;
      	else
      		/* record handle in first_page->private */
      		set_page_private(first_page, handle);
      
      In the huge-class case the handle is saved in private directly.
      
      But in obj_to_head():
      
      	if (class->huge) {
      		VM_BUG_ON(!is_first_page(page));
      		return *(unsigned long *)page_private(page);
      	} else
      		return *(unsigned long *)obj;
      
      It is used as a pointer.
      
      This has caused no problem until now because a huge-class page is born
      with ZS_FULL, so it can't be migrated.  However, we need this patch for
      future work: "VM-aware zsmalloced page migration" to reduce external
      fragmentation.
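
The store/load mismatch described above can be shown with a small userspace model. This is a sketch under assumptions: `struct toy_page` and the `toy_*` helpers are hypothetical, and the huge-class branch simply reads back the stored value instead of dereferencing it.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	unsigned long private;
};

/* Mirrors the store side: huge classes keep the handle in ->private. */
static void toy_record_handle(struct toy_page *first_page,
			      unsigned long *chunk_header,
			      bool huge, unsigned long handle)
{
	if (!huge)
		*chunk_header = handle;		/* handle lives in the chunk header */
	else
		first_page->private = handle;	/* handle stored as a plain value */
}

/* Load side: the huge case must return the value, not dereference it. */
static unsigned long toy_obj_to_head(const struct toy_page *page,
				     const unsigned long *chunk_header,
				     bool huge)
{
	if (huge)
		return page->private;
	return *chunk_header;
}
```

With the buggy variant, the huge branch would treat the handle itself as an address and read whatever happened to be there.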
      Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: add comments for ->inuse to zspage · 8f958c98
      Hui Zhu authored
      [akpm@linux-foundation.org: fix grammar]
      Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: zsmalloc: constify struct zs_pool name · 6f3526d6
      Sergey SENOZHATSKY authored
      Constify `struct zs_pool' ->name.
      
      [akpm@linux-foundation.org: constify zpool_create_pool()'s `type' arg also]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zpool: remove redundant zpool->type string, const-ify zpool_get_type · 69e18f4d
      Dan Streetman authored
      Make the return type of zpool_get_type const; the string belongs to the
      zpool driver and should not be modified.  Remove the redundant type field
      in the struct zpool; it is private to zpool.c and isn't needed since
      ->driver->type can be used directly.  Add comments indicating strings must
      be null-terminated.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zswap: use charp for zswap param strings · c99b42c3
      Dan Streetman authored
      Instead of using a fixed-length string for the zswap params, use charp.
      This simplifies the code and uses less memory, as most zswap param strings
      will be less than the current maximum length.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • module: export param_free_charp() · 3d9c637f
      Dan Streetman authored
      Change the param_free_charp() function from static to exported.
      
      It is used by zswap in the next patch ("zswap: use charp for zswap param
      strings").
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: remove unneeded initialization to NULL in zswap_entry_find_get() · b0c9865f
      Alexey Klimov authored
      On the next line entry variable will be re-initialized so no need to init
      it with NULL.
      Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: make is_partial_io/valid_io_request/page_zero_filled return boolean · 1c53e0d2
      Geliang Tang authored
      Make is_partial_io()/valid_io_request()/page_zero_filled() return boolean,
      since each function only uses either one or zero as its return value.
      Signed-off-by: Geliang Tang <geliangtang@163.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: keep the exact overcommited value in mem_used_max · 12372755
      Sergey SENOZHATSKY authored
      `mem_used_max' is designed to store the max amount of memory zram consumed
      to store the data.  However, it does not represent the actual
      'overcommitted' (max) value.  The existing code goes to the -ENOMEM
      overcommitted case before it updates `->stats.max_used_pages', which hides
      the reason we went to -ENOMEM in the first place -- we actually used more
      memory than `->limit_pages':
      
              alloced_pages = zs_get_total_pages(meta->mem_pool);
              if (zram->limit_pages && alloced_pages > zram->limit_pages) {
                      zs_free(meta->mem_pool, handle);
                      ret = -ENOMEM;
                      goto out;
              }
      
              update_used_max(zram, alloced_pages);
      
      This is misleading.  The user will see -ENOMEM, check `->limit_pages', then
      check `->stats.max_used_pages', which still holds the value from BEFORE
      zram exceeded `->limit_pages', and see:
      	`->stats.max_used_pages' < `->limit_pages'
      
      Move update_used_max() before the `->limit_pages' check, so that the
      user will see:
      	`->stats.max_used_pages' > `->limit_pages'
      should the overcommit and -ENOMEM happen.
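
The reordering can be illustrated with a small userspace model. It is only a sketch: `struct toy_zram` and the `toy_*` functions are hypothetical and keep just the two fields that matter here.

```c
#include <assert.h>
#include <errno.h>

struct toy_zram {
	unsigned long limit_pages;	/* 0 means no limit */
	unsigned long max_used_pages;	/* peak usage seen so far */
};

static void toy_update_used_max(struct toy_zram *zram, unsigned long pages)
{
	if (pages > zram->max_used_pages)
		zram->max_used_pages = pages;
}

/* Returns 0 on success or -ENOMEM when the limit is exceeded. */
static int toy_account_alloc(struct toy_zram *zram, unsigned long alloced_pages)
{
	/* Record the peak first, so an overcommit stays visible afterwards. */
	toy_update_used_max(zram, alloced_pages);

	if (zram->limit_pages && alloced_pages > zram->limit_pages)
		return -ENOMEM;
	return 0;
}
```

After a failing call, `max_used_pages` is larger than `limit_pages`, which tells the user why -ENOMEM was returned.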
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: introduce comp algorithm fallback functionality · 1d5b43bf
      Luis Henriques authored
      When the user supplies an unsupported compression algorithm, keep the
      previously selected one (knowingly supported) or the default one (if the
      compression algorithm hasn't been changed yet).
      
      Note that previously this operation (i.e. setting an invalid algorithm)
      would result in no algorithm being selected, which means that this
      represents a small change in the default behaviour.
      
      Minchan said:
      
      For initializing zram, we need to set up 3 optional parameters in advance.
      
      1. the number of compression streams
      2. memory limitation
      3. compression algorithm
      
      Even if the user passes a completely wrong value for parameters 1 and 2,
      it's okay because they have default values, so zram will be initialized
      with the defaults (of course, when the user passes a wrong value via
      *echo*, sysfs returns -EINVAL so the user can notice it).
      
      But 3 is not consistent with the other optional parameters.  IOW, if the
      user passes a wrong value for parameter 3, zram's initialization would
      fail, unlike the other optional parameters.
      
      So this patch makes them consistent.
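
A minimal sketch of the fallback behaviour, assuming the setter rejects unknown names with -EINVAL (as the quoted text says sysfs does for the other parameters). The backend list and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of supported backends, for illustration only. */
static const char * const toy_backends[] = { "lzo", "lz4" };

struct toy_zram {
	char compressor[16];	/* currently selected algorithm */
};

static bool toy_algorithm_available(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(toy_backends) / sizeof(toy_backends[0]); i++)
		if (strcmp(toy_backends[i], name) == 0)
			return true;
	return false;
}

/* Returns 0 on success; on -EINVAL the previous selection is untouched. */
static int toy_set_comp_algorithm(struct toy_zram *zram, const char *name)
{
	if (!toy_algorithm_available(name))
		return -EINVAL;
	/* Every supported name is shorter than the buffer, so this fits. */
	strcpy(zram->compressor, name);
	return 0;
}
```

An unsupported name leaves the device with its previous (or default) algorithm, so later initialization can still succeed.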
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: uninline mem_cgroup_usage · 6f646156
      Andrew Morton authored
      gcc version 5.2.1 20151010 (Debian 5.2.1-22)
      $ size mm/memcontrol.o mm/memcontrol.o.before
         text    data     bss     dec     hex filename
        35535    7908      64   43507    a9f3 mm/memcontrol.o
        35762    7908      64   43734    aad6 mm/memcontrol.o.before
      
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback · 23d01270
      Jan Kara authored
      sync_file_range(2) is documented to issue writeback only for pages that
      are not currently being written.  After all, the system call was created
      so that userspace can issue background writeout, so waiting for
      in-flight IO is undesirable there.  However commit
      ee53a891 ("mm: do_sync_mapping_range integrity fix") switched
      do_sync_mapping_range() and thus sync_file_range() to issue writeback in
      WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code
      relying on WB_SYNC_ALL semantics.
      
      These days do_sync_mapping_range() has gone away and we can switch
      sync_file_range(2) back to issuing WB_SYNC_NONE writeback.  That should
      help PostgreSQL avoid large latency spikes when flushing data in the
      background.
      
      Andres measured a 20% increase in transactions per second on an SSD disk.
      Signed-off-by: Jan Kara <jack@suse.com>
      Reported-by: Andres Freund <andres@anarazel.de>
      Tested-By: Andres Freund <andres@anarazel.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: remove unused vma parameter from khugepaged_alloc_page · d6669d68
      Aaron Tomlin authored
      The "vma" parameter to khugepaged_alloc_page() is unused.  It has to
      remain unused or the drop-read-lock 'mmap_sem' optimisation introduced by
      commit 8b164568 ("mm, THP: don't hold mmap_sem in khugepaged when
      allocating THP") wouldn't be safe.  So let's remove it.
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: introduce mapping_gfp_constraint() · c62d2555
      Michal Hocko authored
      There are many places which use mapping_gfp_mask to restrict a more
      generic gfp mask that would be used for allocations which are not
      directly related to the page cache but are performed in the same
      context.
      
      Let's introduce a helper function which makes the restriction explicit and
      easier to track.  This patch doesn't introduce any functional changes.
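
The restriction amounts to masking a generic gfp mask with the mapping's own mask. The sketch below models that in userspace; the flag values and all `toy_*` names are hypothetical and do not match the kernel's real bit layout.

```c
#include <assert.h>

typedef unsigned int toy_gfp_t;

/* Hypothetical flag values for the demo; not the kernel's real bits. */
#define TOY_GFP_IO	0x1u
#define TOY_GFP_FS	0x2u
#define TOY_GFP_WAIT	0x4u
#define TOY_GFP_KERNEL	(TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS)

struct toy_address_space {
	toy_gfp_t gfp_mask;	/* what allocations for this mapping may do */
};

static toy_gfp_t toy_mapping_gfp_mask(const struct toy_address_space *mapping)
{
	return mapping->gfp_mask;
}

/* Restrict a generic mask to what the mapping allows. */
static toy_gfp_t toy_mapping_gfp_constraint(const struct toy_address_space *mapping,
					    toy_gfp_t gfp_mask)
{
	return toy_mapping_gfp_mask(mapping) & gfp_mask;
}
```

A caller that would otherwise use a full "kernel" mask gets the filesystem bit stripped when the mapping forbids it, and the intent is visible at the call site.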
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/mmzone.h: reflow comment · 89903327
      Andrew Morton authored
      Someone has an 86 column display.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: hide some GFP internals and document the bits and flag combinations · dd56b046
      Mel Gorman authored
      Andrew stated the following
      
      	We have quite a history of remote parts of the kernel using
      	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
      	to think that this is because we didn't adequately explain the
      	interface.
      
      	And I don't think that gfp.h really improved much in this area as
      	a result of this patchset.  Could you go through it some time and
      	decide if we've adequately documented all this stuff?
      
      This patch first moves some GFP flag combinations that are part of the MM
      internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
      bits under various headings and then documents the flag combinations. It
      will not help callers that are brain damaged but the clarity might motivate
      some fixes and avoid future mistakes.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: only enforce watermarks for order-0 allocations · 97a16fc8
      Mel Gorman authored
      The primary purpose of watermarks is to ensure that reclaim can always
      make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
      These assume that order-0 allocations are all that is necessary for
      forward progress.
      
      High-order watermarks serve a different purpose.  Kswapd had no high-order
      awareness before they were introduced
      (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au).  This was
      particularly important when there were high-order atomic requests.  The
      watermarks both gave kswapd awareness and made a reserve for those atomic
      requests.
      
      There are two important side-effects of this.  The most important is that
      a non-atomic high-order request can fail even though free pages are
      available and the order-0 watermarks are ok.  The second is that
      high-order watermark checks are expensive as the free list counts up to
      the requested order must be examined.
      
      With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
      have high-order watermarks.  Kswapd and compaction still need high-order
      awareness which is handled by checking that at least one suitable
      high-order page is free.
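
A simplified model of the resulting check is sketched below. It is an approximation under assumptions: `struct toy_zone`, `TOY_MAX_ORDER` and `toy_watermark_ok()` are hypothetical, and details such as lowmem reserves and migrate types are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 11

struct toy_zone {
	unsigned long free_pages;		/* total free base pages */
	unsigned long nr_free[TOY_MAX_ORDER];	/* free blocks per order */
};

static bool toy_watermark_ok(const struct toy_zone *zone, unsigned int order,
			     unsigned long mark)
{
	unsigned int o;

	/* The order-0 watermark is the only page-count check. */
	if (zone->free_pages <= mark)
		return false;
	if (order == 0)
		return true;

	/* High-order request: one free block of sufficient order is enough. */
	for (o = order; o < TOY_MAX_ORDER; o++)
		if (zone->nr_free[o])
			return true;
	return false;
}
```

Compared with per-order watermarks, the high-order case no longer sums free-list counts; it only asks whether a suitable block exists.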
      
      With the patch applied, there was little difference in the allocation
      failure rates as the atomic reserves are small relative to the number of
      allocation attempts.  The expected impact is that there will never be an
      allocation failure report that shows suitable pages on the free lists.
      
      The one potential side-effect of this is that in a vanilla kernel, the
      watermark checks may have kept a free page for an atomic allocation.  Now,
      we are 100% relying on the HighAtomic reserves and an early allocation to
      have allocated them.  If the first high-order atomic allocation is after
      the system is already heavily fragmented then it'll fail.
      
      [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand · 0aaa29a5
      Mel Gorman authored
      High-order watermark checking exists for two reasons -- kswapd high-order
      awareness and protection for high-order atomic requests.  Historically the
      kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
      high-order free pages for as long as possible.  This patch introduces
      MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
      allocations on demand and avoids using those blocks for order-0
      allocations.  This is more flexible and reliable than MIGRATE_RESERVE was.
      
      A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order
      allocation request steals a pageblock, but the total number of such
      pageblocks is limited to 1% of the zone.  Callers that speculatively
      abuse atomic allocations for long-lived high-order allocations to access
      the reserve will quickly fail.  Note that SLUB is currently not such an
      abuser as it reclaims at least once.  It is possible that the pageblock
      stolen has few suitable high-order pages and will need to steal again in
      the near future but there would need to be strong justification to search
      all pageblocks for an ideal candidate.
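
      The 1% cap described above can be sketched as a simple check.  This is
      a minimal model, not the code from the patch: the function name, its
      parameters and the rounding behaviour are all assumptions made for
      illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: may one more pageblock be reserved for high-order
 * atomic allocations without exceeding 1% of the zone?  All quantities
 * are counted in pages.
 */
static bool may_reserve_pageblock(unsigned long zone_pages,
                                  unsigned long reserved_pages,
                                  unsigned long pageblock_pages)
{
    unsigned long limit = zone_pages / 100;   /* 1% of the zone */

    return reserved_pages + pageblock_pages <= limit;
}
```

      With a zone of 1,000,000 pages and 512-page pageblocks, the model
      allows reservations until the next one would push the total past
      10,000 pages.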
      
      The pageblocks are unreserved if an allocation fails after a direct
      reclaim attempt.
      
      The watermark checks account for the reserved pageblocks when the
      allocation request is not a high-order atomic allocation.
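
      As a rough sketch of that accounting, assuming a simplified zone with
      only two counters (the struct, field and function names below are
      hypothetical and do not come from the patch):

```c
#include <stdbool.h>

/* Illustrative model of a zone; not the kernel's struct zone. */
struct zone_model {
    long free_pages;           /* all free pages in the zone */
    long reserved_highatomic;  /* free pages held in reserved pageblocks */
};

/*
 * Requests that are not high-order atomic must not count the reserved
 * pageblocks as usable memory when checked against the watermark.
 */
static bool watermark_ok(const struct zone_model *z, long mark,
                         bool high_order_atomic)
{
    long usable = z->free_pages;

    if (!high_order_atomic)
        usable -= z->reserved_highatomic;

    return usable > mark;
}
```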
      
      The reserved pageblocks cannot be used for order-0 allocations.  This may
      allow temporary wastage until a failed reclaim reassigns the pageblock.
      This is deliberate as the intent of the reservation is to satisfy a
      limited number of atomic high-order short-lived requests if the system
      requires them.
      
      The stutter benchmark was used to evaluate this but while it was running
      there was a systemtap script that randomly allocated between 1 high-order
      page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC.  This
      is much larger than the potential reserve and it does not attempt to be
      realistic.  It is intended to stress random high-order allocations from an
      unknown source and show that there is a reduction in failures without
      introducing an anomaly where atomic allocations are more reliable than
      regular allocations.  The amount of memory reserved varied throughout the
      workload as reserves were created and reclaimed under memory pressure.
      The allocation failures once the workload warmed up were as follows:
      
      4.2-rc5-vanilla		70%
      4.2-rc5-atomic-reserve	56%
      
      The failure rate was also measured while building multiple kernels.  The
      failure rate was 14% but is 6% with this patch applied.
      
      Overall, this is a small reduction but the reserves are small relative to
      the number of allocation requests.  In early versions of the patch, the
      failure rate reduced by a much larger amount but that required much larger
      reserves and perversely made atomic allocations seem more reliable than
      regular allocations.
      
      [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: remove MIGRATE_RESERVE · 974a786e
      Mel Gorman authored
      MIGRATE_RESERVE preserves an old property of the buddy allocator that
      existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
      tended to remain contiguous until the only alternative was to fail the
      allocation.  At the time it was discovered that high-order atomic
      allocations relied on this property so MIGRATE_RESERVE was introduced.  A
      later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
      deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
      Note that this patch in isolation may look like a false regression if
      someone was bisecting high-order atomic allocation failures.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: delete the zonelist_cache · f77cf4e4
      Mel Gorman authored
      The zonelist cache (zlc) was introduced to skip over zones that were
      recently known to be full.  This avoided expensive operations such as the
      cpuset checks, watermark calculations and zone_reclaim.  The situation
      today is different and the complexity of zlc is harder to justify.
      
      1) The cpuset checks are no-ops unless a cpuset is active and in general
         are a lot cheaper.
      
      2) zone_reclaim is now disabled by default and I suspect that was a large
         source of the cost that zlc wanted to avoid. When it is enabled, it's
         known to be a major source of stalling when nodes fill up and it's
         unwise to hit every other user with the overhead.
      
      3) Watermark checks are expensive to calculate for high-order
         allocation requests. Later patches in this series will reduce the cost
         of the watermark checking.
      
      4) The most important issue is that in the current implementation it
         is possible for a failed THP allocation to mark a zone full for order-0
         allocations and cause a fallback to remote nodes.
      
      The last issue could be addressed with additional complexity but as the
      benefit of zlc is questionable, it is better to remove it.  If stalls due
      to zone_reclaim are ever reported then an alternative would be to
      introduce deferring logic based on a timeout inside zone_reclaim itself
      and leave the page allocator fast paths alone.
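
      The THP problem in point 4 can be illustrated with a toy model.  This
      is a deliberately simplified sketch and not the zlc implementation:
      the struct, the single "full" bit and both functions are assumptions
      chosen only to show why a cache that ignores the allocation order
      penalises order-0 requests.

```c
#include <stdbool.h>

/* Toy zone: separate counts for order-0 pages and huge (THP-sized) pages. */
struct zone_model {
    unsigned long free_small;
    unsigned long free_huge;
    bool cached_full;          /* stands in for the zlc "zone is full" bit */
};

/* Allocation with a zlc-like cache that does not record the order. */
static bool alloc_with_cache(struct zone_model *z, bool huge)
{
    unsigned long *avail = huge ? &z->free_huge : &z->free_small;

    if (z->cached_full)
        return false;          /* zone skipped regardless of order */
    if (*avail == 0) {
        z->cached_full = true; /* a failed huge request marks it full */
        return false;
    }
    (*avail)--;
    return true;
}

/* Allocation without the cache: every request checks the zone itself. */
static bool alloc_without_cache(struct zone_model *z, bool huge)
{
    unsigned long *avail = huge ? &z->free_huge : &z->free_small;

    if (*avail == 0)
        return false;
    (*avail)--;
    return true;
}
```

      In the cached variant, one failed huge request makes a following
      order-0 request fail on this zone even though small pages are free,
      which in the real allocator would mean falling back to another zone
      or node.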
      
      The impact on page-allocator microbenchmarks is negligible as they don't
      hit the paths where the zlc comes into play.  Most page-reclaim related
      workloads showed no noticeable difference as a result of the removal.
      
      The impact was noticeable in a workload called "stutter".  One part uses a
      lot of anonymous memory, a second measures mmap latency and a third copies
      a large file.  In an ideal world the latency application would not notice
      the mmap latency.  On a 2-node machine the results of this patch are
      
      stutter
                                   4.3.0-rc1             4.3.0-rc1
                                    baseline              nozlc-v4
      Min         mmap     20.9243 (  0.00%)     20.7716 (  0.73%)
      1st-qrtle   mmap     22.0612 (  0.00%)     22.0680 ( -0.03%)
      2nd-qrtle   mmap     22.3291 (  0.00%)     22.3809 ( -0.23%)
      3rd-qrtle   mmap     25.2244 (  0.00%)     25.2396 ( -0.06%)
      Max-90%     mmap     48.0995 (  0.00%)     28.3713 ( 41.02%)
      Max-93%     mmap     52.5557 (  0.00%)     36.0170 ( 31.47%)
      Max-95%     mmap     55.8173 (  0.00%)     47.3163 ( 15.23%)
      Max-99%     mmap     67.3781 (  0.00%)     70.1140 ( -4.06%)
      Max         mmap  24447.6375 (  0.00%)  12915.1356 ( 47.17%)
      Mean        mmap     33.7883 (  0.00%)     27.7944 ( 17.74%)
      Best99%Mean mmap     27.7825 (  0.00%)     25.2767 (  9.02%)
      Best95%Mean mmap     26.3912 (  0.00%)     23.7994 (  9.82%)
      Best90%Mean mmap     24.9886 (  0.00%)     23.2251 (  7.06%)
      Best50%Mean mmap     22.0157 (  0.00%)     22.0261 ( -0.05%)
      Best10%Mean mmap     21.6705 (  0.00%)     21.6083 (  0.29%)
      Best5%Mean  mmap     21.5581 (  0.00%)     21.4611 (  0.45%)
      Best1%Mean  mmap     21.3079 (  0.00%)     21.1631 (  0.68%)
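
      The bracketed percentages appear to be the relative change against the
      baseline column, with positive values meaning lower (better) latency.
      That reading is an inference from the numbers rather than something the
      commit states; a small helper (name hypothetical) reproduces them:

```c
/*
 * Relative improvement in percent: positive when the patched value is
 * lower than the baseline, negative when it is higher.
 */
static double improvement_pct(double baseline, double patched)
{
    return (baseline - patched) / baseline * 100.0;
}
```

      For example, the Max-90% row (48.0995 against 28.3713) and the Max row
      (24447.6375 against 12915.1356) come out at roughly 41.02 and 47.17.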
      
      Note that the maximum stall latency went from 24 seconds to 12 which is
      still bad but an improvement.  The mileage varies considerably: a 2-node
      machine in an earlier test went from 494 seconds to 47 seconds and a
      4-node machine that tested an earlier version of this patch went from a
      worst case stall time of 6 seconds to 67ms.  The nature of the benchmark
      is inherently unpredictable as it is hammering the system and the mileage
      will vary between machines.
      
      There is a secondary impact with potentially more direct reclaim because
      zones are now being considered instead of being skipped by zlc.  In this
      particular test run it did not occur so will not be described.  However,
      in at least one test the following was observed
      
      1. Direct reclaim rates were higher. This was likely due to direct reclaim
        being entered instead of the zlc disabling a zone and busy looping.
        Busy looping may have the effect of allowing kswapd to make more
        progress and in some cases may be better overall. If this is found then
        the correct action is to put direct reclaimers to sleep on a waitqueue
        and allow kswapd to make forward progress. Busy looping on the zlc is
        even worse than when the allocator used to blindly call congestion_wait().
      
      2. There was higher swap activity as direct reclaim was active.
      
      3. Direct reclaim efficiency was lower. This is related to 1 as more
        scanning activity also encountered more pages that could not be
        immediately reclaimed.
      
      In that case, the direct page scan and reclaim rates are noticeable but
      it is not considered a problem for a few reasons
      
      1. The test is primarily concerned with latency. The mmap attempts are also
         faulted which means there are THP allocation requests. The ZLC could
         cause zones to be disabled causing the process to busy loop instead
         of reclaiming.  This looks like elevated direct reclaim activity but
         it's the correct action to take based on what processes requested.
      
      2. The test hammers reclaim and compaction heavily. The number of successful
         THP faults is highly variable but affects the reclaim stats. It's not a
         realistic or reasonable measure of page reclaim activity.
      
      3. No other page-reclaim intensive workload that was tested showed a problem.
      
      4. If a workload is identified that benefitted from the busy looping then it
         should be fixed by having direct reclaimers sleep on a wait queue until
         woken by kswapd instead of busy looping. We had this class of problem before
         when congestion_wait() with a fixed timeout was a brain damaged decision
         but happened to benefit some workloads.
      
      If a workload is identified that relied on the zlc to busy loop then it
      should be fixed correctly and have a direct reclaimer sleep on a waitqueue
      until woken by kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Mel Gorman authored
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      it prevents it.
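
      The difference between the two ways of clearing flags can be sketched
      as follows.  The DEMO_ macros stand in for __GFP_DIRECT_RECLAIM,
      __GFP_KSWAPD_RECLAIM and __GFP_RECLAIM; their bit values and the helper
      names are assumptions for illustration, not the kernel's definitions.

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Illustrative bit values only. */
#define DEMO_GFP_DIRECT_RECLAIM 0x1u
#define DEMO_GFP_KSWAPD_RECLAIM 0x2u
/* Setting both allows all reclaim activity. */
#define DEMO_GFP_RECLAIM (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* Caller is unwilling to sleep: drop direct reclaim, keep waking kswapd. */
static gfp_t demo_no_sleep(gfp_t flags)
{
    return flags & ~DEMO_GFP_DIRECT_RECLAIM;
}

/* Clearing the combined mask prevents all reclaim activity. */
static gfp_t demo_no_reclaim(gfp_t flags)
{
    return flags & ~DEMO_GFP_RECLAIM;
}

/* Does an allocation with these flags still wake kswapd? */
static bool demo_wakes_kswapd(gfp_t flags)
{
    return (flags & DEMO_GFP_KSWAPD_RECLAIM) != 0;
}
```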
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: remove GFP_IOFS · 40113370
      Mel Gorman authored
      GFP_IOFS was intended to be shorthand for clearing two flags, not a set of
      allocation flags.  There is only one user of this flag combination now and
      there appears to be no reason why Lustre had to be protected from reclaim
      stalls.  As none of the sites appear to be atomic, this patch simply
      deletes GFP_IOFS and converts Lustre to using GFP_KERNEL, GFP_NOFS or
      GFP_NOIO as appropriate.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd · d0164adc
      Mel Gorman authored
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
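
      The false positive mentioned for the historical __GFP_WAIT check can be
      sketched like this.  The DEMO_ macros stand in for the kernel flags with
      made-up bit values, and demo_allow_blocking() is only modelled on the
      gfpflags_allow_blocking() helper named above; neither is the patch's
      actual code.

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Stand-ins for the kernel flags; the bit values are illustrative only. */
#define DEMO_GFP_DIRECT_RECLAIM 0x1u
#define DEMO_GFP_KSWAPD_RECLAIM 0x2u
/* __GFP_WAIT as redefined by this patch: both kinds of reclaim. */
#define DEMO_GFP_WAIT (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* Historical style of check: true if any __GFP_WAIT bit is set. */
static bool demo_legacy_may_block(gfp_t flags)
{
    return (flags & DEMO_GFP_WAIT) != 0;
}

/* Helper-style check: only direct reclaim implies the caller may sleep. */
static bool demo_allow_blocking(gfp_t flags)
{
    return (flags & DEMO_GFP_DIRECT_RECLAIM) != 0;
}
```

      A mempool-backed caller that clears only the direct reclaim bit is
      reported as blocking by the legacy check but not by the helper-style
      check.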
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: use masks and shifts when converting GFP flags to migrate types · 016c13da
      Mel Gorman authored
      This patch redefines which GFP bits are used for specifying mobility and
      the order of the migrate types.  Once redefined it's possible to convert
      GFP flags to a migrate type with a simple mask and shift.  The only
      downside is that readers of OOM kill messages and allocation failures may
      be used to the existing values but scripts/gfp-translate will help.
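
      A minimal sketch of such a mask-and-shift conversion is shown below.
      The bit values, the enum ordering and every DEMO_/demo_ name are
      assumptions chosen so that the two mobility bits sit next to each other
      and line up with the enum; they are not taken from the patch.

```c
typedef unsigned int gfp_t;

/* Illustrative mobility bits, placed in adjacent bit positions. */
#define DEMO_GFP_MOVABLE       0x08u
#define DEMO_GFP_RECLAIMABLE   0x10u
#define DEMO_GFP_MOVABLE_MASK  (DEMO_GFP_RECLAIMABLE | DEMO_GFP_MOVABLE)
#define DEMO_GFP_MOVABLE_SHIFT 3

/* Migrate types ordered so the shifted bits index them directly. */
enum demo_migratetype {
    DEMO_MIGRATE_UNMOVABLE   = 0,
    DEMO_MIGRATE_MOVABLE     = 1,
    DEMO_MIGRATE_RECLAIMABLE = 2,
};

/* Convert GFP flags to a migrate type with one mask and one shift. */
static int demo_gfp_to_migratetype(gfp_t flags)
{
    return (int)((flags & DEMO_GFP_MOVABLE_MASK) >> DEMO_GFP_MOVABLE_SHIFT);
}
```
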
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled · 46e700ab
      Mel Gorman authored
      There is a seqcounter that protects against spurious allocation failures
      when a task is changing the allowed nodes in a cpuset.  There is no need
      to check the seqcounter until a cpuset exists.
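
      The idea can be modelled in a few lines.  This is a single-threaded
      sketch with hypothetical names and no memory barriers, so it only shows
      the control flow of skipping the counter, not a usable seqcount.

```c
#include <stdbool.h>

static bool demo_cpusets_enabled;   /* stays false until a cpuset exists */
static unsigned int demo_mems_seq;  /* bumped when allowed nodes change */

/* Begin a read section; skip the counter while no cpuset exists. */
static unsigned int demo_read_begin(void)
{
    if (!demo_cpusets_enabled)
        return 0;
    return demo_mems_seq;
}

/* Report whether the allocation should be retried. */
static bool demo_read_retry(unsigned int seq)
{
    if (!demo_cpusets_enabled)
        return false;
    return seq != demo_mems_seq;
}
```
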
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: remove unnecessary recalculations for dirty zone balancing · c9ab0c4f
      Mel Gorman authored
      File-backed pages that will be immediately written are balanced between
      zones.  This heuristic tries to avoid having a single zone filled with
      recently dirtied pages but the checks are unnecessarily expensive.  Move
      consider_zone_balanced into the alloc_context instead of checking bitmaps
      multiple times.  The patch also gives the parameter a more meaningful
      name.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: remove unnecessary parameter from zone_watermark_ok_safe · e2b19197
      Mel Gorman authored
      Overall, the intent of this series is to remove the zonelist cache which
      was introduced to avoid high overhead in the page allocator.  Once this is
      done, it is necessary to reduce the cost of watermark checks.
      
      The series starts with minor micro-optimisations.
      
      Next it notes that GFP flags that affect watermark checks are abused.
      __GFP_WAIT historically identified callers that could not sleep and could
      access reserves.  This was later abused to identify callers that simply
      prefer to avoid sleeping and have other options.  A patch distinguishes
      between atomic callers, high-priority callers and those that simply wish
      to avoid sleep.
      
      The zonelist cache has been around for a long time but it is of dubious
      merit with a lot of complexity and some issues that are explained.  The
      most important issue is that a failed THP allocation can cause a zone to
      be treated as "full".  This potentially causes unnecessary stalls, reclaim
      activity or remote fallbacks.  The issues could be fixed but it's not
      worth it.  The series places a small number of other micro-optimisations
      on top before examining GFP flags watermarks.
      
      High-order watermarks enforcement can cause high-order allocations to fail
      even though pages are free.  The watermark checks both protect high-order
      atomic allocations and make kswapd aware of high-order pages but there is
      a much better way that can be handled using migrate types.  This series
      uses page grouping by mobility to reserve pageblocks for high-order
      allocations with the size of the reservation depending on demand.  kswapd
      awareness is maintained by examining the free lists.  By patch 12 in this
      series, there are no high-order watermark checks while preserving the
      properties that motivated the introduction of the watermark checks.
      
      This patch (of 10):
      
      No user of zone_watermark_ok_safe() specifies alloc_flags.  This patch
      removes the unnecessary parameter.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/oom_kill.c: introduce is_sysrq_oom helper · db2a0dd7
      Yaowei Bai authored
      Introduce is_sysrq_oom helper function indicating oom kill triggered
      by sysrq to improve readability.
      
      No functional changes.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Nov, 2015 3 commits
    • Merge tag 'backlight-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight · 5bc23a0c
      Linus Torvalds authored
      Pull backlight updates from Lee Jones:
       "New Device Support:
         - None
      
        New Functionality:
         - None
      
        Core Frameworks:
         - Reject legacy PWM request for device defined in DT
      
        Fix-ups:
         - Remove unnecessary MODULE_ALIAS(); adp8860_bl, adp8870_bl
         - Simplify code: pm8941-wled
         - Supply default-brightness logic; pm8941-wled
      
        Bug Fixes:
         - Clean up OF node; 88pm860x_bl
         - Ensure struct is zeroed; lp855x_bl"
      
      * tag 'backlight-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
        backlight: pm8941-wled: Add default-brightness property
        backlight: pm8941-wled: Fix ptr_ret.cocci warnings
        backlight: pwm: Reject legacy PWM request for device defined in DT
        backlight: 88pm860x_bl: Add missing of_node_put
        backlight: adp8870: Remove unnecessary MODULE_ALIAS()
        backlight: adp8860: Remove unnecessary MODULE_ALIAS()
        backlight: lp855x: Make sure props struct is zeroed
    • mfd: avoid newly introduced compiler warning · 4dcee4d8
      Linus Torvalds authored
      Commit b158b69a ("mfd: rtsx: Simplify function return logic")
      removed the use of the 'err' variable, but left the variable itself
      around, resulting in gcc quite reasonably warning:
      
          drivers/mfd/rtsx_pcr.c: In function ‘rtsx_pci_set_pull_ctl’:
          drivers/mfd/rtsx_pcr.c:565:6: warning: unused variable ‘err’ [-Wunused-variable]
            int err;
                ^
      
      Get rid of the unused variable, and avoid the new warning.
      
      Cc: Javier Martinez Canillas <javier@osg.samsung.com>
      Cc: Lee Jones <lee.jones@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'mfd-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · bc914532
      Linus Torvalds authored
      Pull MFD updates from Lee Jones:
       "New Device Support:
         - Add support for 88pm860; 88pm80x
         - Add support for 24c08 EEPROM; at24
         - Add support for Broxton Whiskey Cove; intel*
         - Add support for RTS522A; rts5227
         - Add support for I2C devices; intel_quark_i2c_gpio
      
        New Functionality:
         - Add microphone support; arizona
         - Add general purpose switch support; arizona
         - Add fuel-gauge support; da9150-core
         - Add shutdown support; sec-core
         - Add charger support; tps65217
         - Add flexible serial communication unit support; atmel-flexcom
         - Add power button support; axp20x
         - Add led-flash support; rt5033
      
        Core Frameworks:
         - Supply a generic macro for defining Regmap IRQs
         - Rework ACPI child device matching
      
        Fix-ups:
         - Use Regmap to access registers; tps6105x
         - Use DEFINE_RES_IRQ_NAMED() macro; da9150
         - Re-arrange device registration order; intel_quark_i2c_gpio
         - Allow OF matching; cros_ec_i2c, atmel-hlcdc, hi6421-pmic, max8997, sm501
         - Handle deferred probe; twl6040
         - Improve accuracy of headphone detect; arizona
         - Unnecessary MODULE_ALIAS() removal; bcm590xx, rt5033
         - Remove unused code; htc-i2cpld, arizona, pcf50633-irq, sec-core
         - Simplify code; kempld, rts5209, da903x, lm3533, da9052, arizona
         - Remove #iffery; arizona
         - DT binding adaptions; many
      
        Bug Fixes:
         - Fix possible NULL pointer dereference; wm831x, tps6105x
         - Fix 64bit bug; intel_soc_pmic_bxtwc
         - Fix signedness issue; arizona"
      
      * tag 'mfd-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits)
        bindings: mfd: s2mps11: Add documentation for s2mps15 PMIC
        mfd: sec-core: Remove unused s2mpu02-rtc and s2mpu02-clk children
        extcon: arizona: Add extcon specific device tree binding document
        MAINTAINERS: Add binding docs for Cirrus Logic/Wolfson Arizona devices
        mfd: arizona: Remove bindings covered in new subsystem specific docs
        mfd: rt5033: Add RT5033 Flash led sub device
        mfd: lpss: Add Intel Broxton PCI IDs
        mfd: lpss: Add Broxton ACPI IDs
        mfd: arizona: Signedness bug in arizona_runtime_suspend()
        mfd: axp20x: Add a cell for the power button part of the, axp288 PMICs
        mfd: dt-bindings: Document pulled down WRSTBI pin on S2MPS1X
        mfd: sec-core: Disable buck voltage reset on watchdog falling edge
        mfd: sec-core: Dump PMIC revision to find out the HW
        mfd: arizona: Use correct type ID for device tree config
        mfd: arizona: Remove use of codec build config #ifdefs
        mfd: arizona: Simplify adding subdevices
        mfd: arizona: Downgrade type mismatch messages to dev_warn
        mfd: arizona: Factor out checking of jack detection state
        mfd: arizona: Factor out DCVDD isolation control
        mfd: Make TPS6105X select REGMAP_I2C
        ...