1. 12 Dec, 2015 6 commits
    • Hugh Dickins's avatar
      osd fs: __r4w_get_page rely on PageUptodate for uptodate · 3066a967
      Hugh Dickins authored
      Commit 42cb14b1 ("mm: migrate dirty page without
      clear_page_dirty_for_io etc") simplified the migration of a PageDirty
      pagecache page: one stat needs moving from zone to zone and that's about
      all.
      
      It's convenient and safest for it to shift the PageDirty bit from old
      page to new, just before updating the zone stats: before copying data
      and marking the new PageUptodate.  This is all done while both pages are
      isolated and locked, just as before; and just as before, there's a
      moment when the new page is visible in the radix_tree, but not yet
      PageUptodate.  What's new is that it may now be briefly visible as
      PageDirty before it is PageUptodate.
      
      When I scoured the tree to see if this could cause a problem anywhere,
      the only places I found were in two similar functions __r4w_get_page():
      which look up a page with find_get_page() (not using page lock), then
      claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.
      
      I'm not sure whether that was right before, but now it might be wrong
      (on rare occasions): only claim the page is uptodate if PageUptodate.
      Or perhaps the page in question could never be migratable anyway?
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarBoaz Harrosh <ooo@electrozaur.com>
      Cc: Benny Halevy <bhalevy@panasas.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3066a967
    • Johannes Weiner's avatar
      MAINTAINERS: make Vladimir co-maintainer of the memory controller · ed0f1e21
      Johannes Weiner authored
      Vladimir architected and authored much of the current state of the
      memcg's slab memory accounting and tracking.  Make sure he gets CC'd on
      bug reports ;-)
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed0f1e21
    • Michal Hocko's avatar
      mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress · 373ccbe5
      Michal Hocko authored
      Tetsuo Handa has reported that the system might basically livelock in
      OOM condition without triggering the OOM killer.
      
      The issue is caused by internal dependency of the direct reclaim on
      vmstat counter updates (via zone_reclaimable) which are performed from
      the workqueue context.  If all the current workers get assigned to an
      allocation request, though, they will be looping inside the allocator
      trying to reclaim memory but zone_reclaimable can see stalled numbers so
      it will consider a zone reclaimable even though it has been scanned way
      too much.  WQ concurrency logic will not consider this situation as a
      congested workqueue because it relies that worker would have to sleep in
      such a situation.  This also means that it doesn't try to spawn new
      workers or invoke the rescuer thread if the one is assigned to the
      queue.
      
      In order to fix this issue we need to do two things.  First we have to
      let wq concurrency code know that we are in trouble so we have to do a
      short sleep.  In order to prevent from issues handled by 0e093d99
      ("writeback: do not sleep on the congestion queue if there are no
      congested BDIs or if significant congestion is not being encountered in
      the current zone") we limit the sleep only to worker threads which are
      the ones of the interest anyway.
      
      The second thing to do is to create a dedicated workqueue for vmstat and
      mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
      have a spare worker thread for it.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      373ccbe5
    • Vlastimil Babka's avatar
      mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo · 475a2f90
      Vlastimil Babka authored
      Commit 016c13da ("mm, page_alloc: use masks and shifts when
      converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
      MIGRATE_RECLAIMABLE in the enum definition.  However, migratetype_names
      wasn't updated to reflect that.
      
      As a result, the file /proc/pagetypeinfo shows the counts for Movable as
      Reclaimable and vice versa.
      
      Additionally, commit 0aaa29a5 ("mm, page_alloc: reserve pageblocks
      for high-order atomic allocations on demand") introduced
      MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
      show_migration_types(), so it doesn't appear in the listing of free
      areas during page alloc failures or oom kills.
      
      This patch fixes both problems.  The atomic reserves will show with a
      letter 'H' in the free areas listings.
      
      Fixes: 016c13da ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
      Fixes: 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      475a2f90
    • Vladimir Davydov's avatar
      memcg: fix memory.high target · 9516a18a
      Vladimir Davydov authored
      When the memory.high threshold is exceeded, try_charge() schedules a
      task_work to reclaim the excess.  The reclaim target is set to the
      number of pages requested by try_charge().
      
      This is wrong, because try_charge() usually charges more pages than
      requested (batch > nr_pages) in order to refill per cpu stocks.  As a
      result, a process in a cgroup can easily exceed memory.high
      significantly when doing a lot of charges w/o returning to userspace
      (e.g.  reading a file in big chunks).
      
      Fix this issue by assuring that when exceeding memory.high a process
      reclaims as many pages as were actually charged (i.e.  batch).
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9516a18a
    • Naoya Horiguchi's avatar
      mm: hugetlb: fix hugepage memory leak caused by wrong reserve count · a88c7695
      Naoya Horiguchi authored
      When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
      alloc_buddy_huge_page() to directly create a hugepage from the buddy
      allocator.
      
      In that case, however, if alloc_buddy_huge_page() succeeds we don't
      decrement h->resv_huge_pages, which means that successful
      hugetlb_fault() returns without releasing the reserve count.  As a
      result, subsequent hugetlb_fault() might fail despite that there are
      still free hugepages.
      
      This patch simply adds decrementing code on that code path.
      
      I reproduced this problem when testing v4.3 kernel in the following situation:
       - the test machine/VM is a NUMA system,
       - hugepage overcommiting is enabled,
       - most of hugepages are allocated and there's only one free hugepage
         which is on node 0 (for example),
       - another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
         node 1, tries to allocate a hugepage,
       - the allocation should fail but the reserve count is still hold.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org> [3.16+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a88c7695
  2. 11 Dec, 2015 5 commits
  3. 10 Dec, 2015 6 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · 0bd0f1e6
      Linus Torvalds authored
      Pull rdma fixes from Doug Ledford:
       "Most are minor to important fixes.
      
        There is one performance enhancement that I took on the grounds that
        failing to check if other processes can run before running what's
        intended to be a background, idle-time task is a bug, even though the
        primary effect of the fix is to improve performance (and it was a very
        simple patch)"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
        IB/mlx5: Postpone remove_keys under knowledge of coming preemption
        IB/mlx4: Use vmalloc for WR buffers when needed
        IB/mlx4: Use correct order of variables in log message
        iser-target: Remove explicit mlx4 work-around
        mlx4: Expose correct max_sge_rd limit
        IB/mad: Require CM send method for everything except ClassPortInfo
        IB/cma: Add a missing rcu_read_unlock()
        IB core: Fix ib_sg_to_pages()
        IB/srp: Fix srp_map_sg_fr()
        IB/srp: Fix indirect data buffer rkey endianness
        IB/srp: Initialize dma_length in srp_map_idb
        IB/srp: Fix possible send queue overflow
        IB/srp: Fix a memory leak
        IB/sa: Put netlink request into the request list before sending
        IB/iser: use sector_div instead of do_div
        IB/core: use RCU for uverbs id lookup
        IB/qib: Minor fixes to qib per SFF 8636
        IB/core: Fix user mode post wr corruption
        IB/qib: Fix qib_mr structure
      0bd0f1e6
    • Linus Torvalds's avatar
      Merge tag 'sound-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · a80c47da
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Again less intensive changes in this rc: you can find only a few
        HD-audio fixes (noise fixes for Intel Broxton chip and a few Thinkpad
        models, quirks for Alienware 17 and Packard Bell DOTS) in addition to
        a long-standing rme96 bug fix"
      
      * tag 'sound-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda/ca0132 - quirk for Alienware 17 2015
        ALSA: hda - Fix noise problems on Thinkpad T440s
        ALSA: hda - Fixing speaker noise on the two latest thinkpad models
        ALSA: hda - Add inverted dmic for Packard Bell DOTS
        ALSA: hda - Fix playback noise with 24/32 bit sample size on BXT
        ALSA: rme96: Fix unexpected volume reset after rate changes
      a80c47da
    • Joe Thornber's avatar
      dm btree: fix bufio buffer leaks in dm_btree_del() error path · ed8b45a3
      Joe Thornber authored
      If dm_btree_del()'s call to push_frame() fails, e.g. due to
      btree_node_validator finding invalid metadata, the dm_btree_del() error
      path must unlock all frames (which have active dm-bufio buffers) that
      were pushed onto the del_stack.
      
      Otherwise, dm_bufio_client_destroy() will BUG_ON() because dm-bufio
      buffers have leaked, e.g.:
        device-mapper: bufio: leaked buffer 3, hold count 1, list 0
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      ed8b45a3
    • Linus Torvalds's avatar
      Merge tag 'vfio-v4.4-rc5' of git://github.com/awilliam/linux-vfio · 6764e5eb
      Linus Torvalds authored
      Pull VFIO fixes from Alex Williamson:
      
       - Various fixes for removing redundancy, const'ifying structs, avoiding
         stack usage, fixing WARN usage (Krzysztof Kozlowski, Julia Lawall,
         Kees Cook, Dan Carpenter)
      
       - Revert No-IOMMU mode as the intended user has not emerged (Alex
         Williamson)
      
      * tag 'vfio-v4.4-rc5' of git://github.com/awilliam/linux-vfio:
        Revert: "vfio: Include No-IOMMU mode"
        vfio: fix a warning message
        vfio: platform: remove needless stack usage
        vfio-pci: constify pci_error_handlers structures
        vfio: Drop owner assignment from platform_driver
      6764e5eb
    • Linus Torvalds's avatar
      Merge tag 'devicetree-fixes-for-4.4-rc4' of... · eef121f4
      Linus Torvalds authored
      Merge tag 'devicetree-fixes-for-4.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
      
      Pull DT fixes from Rob Herring:
       "I think this should be all for 4.4:
      
         - Fix incorrect warning about overlapping memory regions
      
         - Export of_irq_find_parent again which was made static in 4.4, but
           has users pending for 4.5.
      
         - Fix of_msi_map_rid declaration location
      
         - Fix re-entrancy for of_fdt_unflatten_tree
      
         - Clean-up of phys_addr_t printks"
      
      * tag 'devicetree-fixes-for-4.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        of/irq: move of_msi_map_rid declaration to the correct ifdef section
        of/irq: Export of_irq_find_parent again
        of/fdt: Add mutex protection for calls to __unflatten_device_tree()
        of/address: fix typo in comment block of of_translate_one()
        of: do not use 0x in front of %pa
        of: Fix comparison of reserved memory regions
      eef121f4
    • Linus Torvalds's avatar
      Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · abb7e2b3
      Linus Torvalds authored
      Pull clk fixes from Stephen Boyd:
       "One small build fix, a couple do_div() fixes, and a fix for the gpio
        basic clock type are the major changes here.  There's also a couple
        fixes for the TI, sunxi, and scpi clock drivers"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: sunxi: pll2: Fix clock running too fast
        clk: scpi: add missing of_node_put
        clk: qoriq: fix memory leak
        imx/clk-pllv2: fix wrong do_div() usage
        imx/clk-pllv1: fix wrong do_div() usage
        clk: mmp: add linux/clk.h includes
        clk: ti: drop locking code from mux/divider drivers
        clk: ti816x: Add missing dmtimer clkdev entries
        clk: ti: fapll: fix wrong do_div() usage
        clk: ti: clkt_dpll: fix wrong do_div() usage
        clk: gpio: Get parent clk names in of_gpio_clk_setup()
      abb7e2b3
  4. 09 Dec, 2015 19 commits
  5. 08 Dec, 2015 4 commits
    • Leon Romanovsky's avatar
      IB/mlx5: Postpone remove_keys under knowledge of coming preemption · ab5cdc31
      Leon Romanovsky authored
      The remove_keys() logic is performed as garbage collection task. Such
      task is intended to be run when no other active processes are running.
      
      The need_resched() will return TRUE if there are user tasks to be
      activated in near future.
      
      In such case, we don't execute remove_keys() and postpone
      the garbage collection work to try to run in next cycle,
      in order to free CPU resources to other tasks.
      
      The possible pseudo-code to trigger such scenario:
      1. Allocate a lot of MR to fill the cache above the limit.
      2. Wait a small amount of time "to calm" the system.
      3. Start CPU extensive operations on multi-node cluster.
      4. Expect performance degradation during MR cache shrink operation.
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarEli Cohen <eli@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      ab5cdc31
    • Wengang Wang's avatar
      IB/mlx4: Use vmalloc for WR buffers when needed · 0ef2f05c
      Wengang Wang authored
      There are several hits that WR buffer allocation(kmalloc) failed.
      It failed at order 3 and/or 4 contigous pages allocation. At the same time
      there are actually 100MB+ free memory but well fragmented.
      So try vmalloc when kmalloc failed.
      Signed-off-by: default avatarWengang Wang <wen.gang.wang@oracle.com>
      Acked-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      0ef2f05c
    • Wengang Wang's avatar
      IB/mlx4: Use correct order of variables in log message · 73d4da7b
      Wengang Wang authored
      There is a mis-order in mlx4 log. Fix it.
      Signed-off-by: default avatarWengang Wang <wen.gang.wang@oracle.com>
      Acked-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      73d4da7b
    • Linus Torvalds's avatar
      Merge branch 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 5406812e
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "More change than I'd have liked at this stage.  The pids controller
        and the changes made to cgroup core to support it introduced and
        revealed several important issues.
      
         - Assigning membership to a newly created task and migrating it can
           race leading to incorrect accounting.  Oleg fixed it by widening
           threadgroup synchronization.  It looks like we'll be able to merge
           it with a different percpu rwsem which is used in fork path making
           things simpler and cheaper.
      
         - The recent change to extend cgroup membership to zombies (so that
           pid accounting can extend till the pid is actually released) missed
           pinning the underlying data structures leading to use-after-free.
           Fixed.
      
         - v2 hierarchy was calling subsystem callbacks with the wrong target
           cgroup_subsys_state based on the incorrect assumption that they
           share the same target.  pids is the first controller affected by
           this.  Subsys callbacks updated so that they can deal with
           multi-target migrations"
      
      * 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup_pids: don't account for the root cgroup
        cgroup: fix handling of multi-destination migration from subtree_control enabling
        cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach()
        cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()
        cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()
        cgroup: make css_set pin its css's to avoid use-afer-free
        cgroup: fix cftype->file_offset handling
      5406812e