1. 18 Jan, 2014 1 commit
  2. 16 Jan, 2014 2 commits
  3. 15 Jan, 2014 1 commit
    • Steven Whitehouse's avatar
      GFS2: Fix kbuild test robot reported warning · 1e3d3620
      Steven Whitehouse authored
      Well I don't get the same warning locally as the kbuild
      robot, but I guess this should fix the problem, anyway.
      Here is the warning:
      
      head:   2d9e7230
      commit: ee2411a8 [19/20] GFS2: Clean up quota slot allocation
      config: make ARCH=powerpc allmodconfig
      
      All error/warnings:
      
         fs/gfs2/quota.c: In function 'gfs2_quota_init':
      >> fs/gfs2/quota.c:1246:3: error: implicit declaration of function '__vmalloc' [-Werror=implicit-function-declaration]
            sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
            ^
      >> fs/gfs2/quota.c:1246:24: warning: assignment makes pointer from integer without a cast [enabled by default]
            sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
                                 ^
         fs/gfs2/quota.c: In function 'gfs2_quota_cleanup':
      >> fs/gfs2/quota.c:1361:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
             vfree(sdp->sd_quota_bitmap);
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      1e3d3620
  4. 14 Jan, 2014 5 commits
    • Steven Whitehouse's avatar
      GFS2: Move quota bitmap operations under their own lock · 2d9e7230
      Steven Whitehouse authored
      Gradually, the global qd_lock is being used for less and less.
      After this patch it will only be used for the per super block
      list whose purpose is to allow syncing of changes back to the
      master quota file from the local quota changes file. Fixing
      up that process to make it more efficient will be the subject
      of a later patch, however this patch removes another barrier
      to doing that.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Abhijith Das <adas@redhat.com>
      2d9e7230
    • Steven Whitehouse's avatar
      GFS2: Clean up quota slot allocation · ee2411a8
      Steven Whitehouse authored
      Quota slot allocation has historically used a vector of pages
      and a set of homegrown find/test/set/clear bit functions. Since
      the size of the bitmap is likely to be based on the default
      qc file size, thats a couple of pages at most. So we ought
      to be able to allocate that as a single chunk, with a vmalloc
      fallback, just in case of memory fragmentation.
      
      We are then able to use the kernel's own find/test/set/clear
      bit functions, rather than rolling our own.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Abhijith Das <adas@redhat.com>
      ee2411a8
    • Steven Whitehouse's avatar
      GFS2: Only run logd and quota when mounted read/write · 8ad151c2
      Steven Whitehouse authored
      While investigating a rather strange bit of code in the quota
      clean up function, I spotted that the reason for its existence
      was that when remounting read only, we were not stopping the
      quotad thread, and thus it was possible for it to still have
      a reference to some of the quotas in that case.
      
      This patch moves the logd and quota thread start and stop into
      the make_fs_rw/ro functions, so that we now stop those threads
      when mounted read only.
      
      This means that quotad will always be stopped before we call
      the quota clean up function, and we can thus dispose of the
      (rather hackish) code that waits for it to give up its
      reference on the quotas.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Abhijith Das <adas@redhat.com>
      8ad151c2
    • Steven Whitehouse's avatar
      GFS2: Use RCU/hlist_bl based hash for quotas · c754fbbb
      Steven Whitehouse authored
      Prior to this patch, GFS2 kept all the quotas for each
      super block in a single linked list. This is rather slow
      when there are large numbers of quotas.
      
      This patch introduces a hlist_bl based hash table, similar
      to the one used for glocks. The initial look up of the quota
      is now lockless in the case where it is already cached,
      although we still have to take the per quota spinlock in
      order to bump the ref count. Either way though, this is a
      big improvement on what was there before.
      
      The qd_lock and the per super block list is preserved, for
      the time being. However it is intended that since this is no
      longer used for its original role, it should be possible to
      shrink the number of items on that list in due course and
      remove the requirement to take qd_lock in qd_get.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Abhijith Das <adas@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c754fbbb
    • Steven Whitehouse's avatar
      GFS2: No need to invalidate pages for a dio read · 086352f1
      Steven Whitehouse authored
      We recently fixed the writeback of pages prior to performing
      direct i/o, however the initial fix was perhaps a bit heavy
      handed. There is no need to invalidate pages if the direct i/o
      is only a read, since they will be identical to what has been
      flushed to disk anyway.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      086352f1
  5. 09 Jan, 2014 1 commit
  6. 08 Jan, 2014 2 commits
    • Steven Whitehouse's avatar
      GFS2: Add hints to directory leaf blocks · 01bcb0de
      Steven Whitehouse authored
      This patch adds four new fields to directory leaf blocks.
      The intent is not to use them in the kernel itself, although
      perhaps we may be able to use them as hints at some later date,
      but instead to provide more information for debug/fsck use.
      
      One new field adds a pointer to the inode to which the leaf
      belongs. This can be useful if the pointer to the leaf block
      has become corrupt, as it will allow us to know which inode
      this block should be associated with. This field is set when
      the leaf is created and never changed over its lifetime.
      
      The second field is a "distance from the hash table" field.
      The meaning is as follows:
       0  = An old leaf in which this value has not been set
       1  = This leaf is pointed to directly from the hash table
       2+ = This leaf is part of a chain, pointed to by another leaf
            block, the value gives the position in the chain.
      
      The third and fourth fields combine to give a time stamp of
      the most recent directory insertion or deletion from this
      leaf block. The time stamp is not updated when a new leaf
      block is chained from the current one. The code is currently
      written such that the timestamp on the dir inode will match
      that of the leaf block for the most recent insertion/deletion.
      
      For backwards compatibility, any of these new fields which is
      zero should be considered to be "unknown".
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      01bcb0de
    • Steven Whitehouse's avatar
      GFS2: For exhash conversion, only one block is needed · 22b5a6c0
      Steven Whitehouse authored
      For most cases, only a single new block is needed when we reach
      the point of converting from stuffed to exhash directory. The
      exception being when the file name is so long that it will not
      fit within the new leaf block.
      
      So this patch adds a simple test for that situation so that we
      do not need to request the full reservation size in this case.
      
      Potentially we could calculate more accurately the value to use
      in other cases too, but that is much more complicated to do and
      it is doubtful that the benefit would outweigh the extra cost
      in code complexity.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      22b5a6c0
  7. 07 Jan, 2014 1 commit
  8. 06 Jan, 2014 3 commits
    • Steven Whitehouse's avatar
      GFS2: Remember directory insert point · 2b47dad8
      Steven Whitehouse authored
      When we look to see if there is enough space to add a dir
      entry without allocation, we have then been repeating the
      same search later when we do the actual insertion. This
      patch caches the details of the location in the gfs2_diradd
      structure, so that we do not have to repeat the search.
      
      This will provide a performance improvement which will be
      greater as the size of the directory increases.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      2b47dad8
    • Steven Whitehouse's avatar
      GFS2: Consolidate transaction blocks calculation for dir add · 534cf9ca
      Steven Whitehouse authored
      There are three cases where we need to calculate the number of
      blocks to reserve in a transaction involving linking an inode
      into a directory. The one in rename is a bit more complicated,
      but the basis of it is the same as for link and create. So it
      makes sense to move this calculation into a single function
      rather than repeating it three times.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      534cf9ca
    • Steven Whitehouse's avatar
      GFS2: Add directory addition info structure · 3c1c0ae1
      Steven Whitehouse authored
      The intent is that this structure will hold the information
      required when adding entries to a directory (linking). To
      start with, it will contain only the number of blocks which
      are required to link the new entry into the directory. The
      current calculation returns either 0 or the maximim number of
      blocks that can ever be requested by such a transaction.
      
      The intent is that in a later patch, we can update the dir
      code to calculate this value more accurately. In addition
      further patches will also add further fields to the new
      structure to increase its utility.
      
      In addition this patch fixes a bug where the link used during
      inode creation was adding requesting too many blocks in
      some cases. This is harmless unless the fs is close to being
      full.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      3c1c0ae1
  9. 03 Jan, 2014 8 commits
    • Steven Whitehouse's avatar
      GFS2: Use only a single address space for rgrps · 70d4ee94
      Steven Whitehouse authored
      Prior to this patch, GFS2 had one address space for each rgrp,
      stored in the glock. This patch changes them to use a single
      address space in the super block. This therefore saves
      (sizeof(struct address_space) * nr_of_rgrps) bytes of memory
      and for large filesystems, that can be significant.
      
      It would be nice to be able to do something similar and merge
      the inode metadata address space into the same global
      address space. However, that is rather more complicated as the
      on-disk location doesn't have a 1:1 mapping with the inodes in
      general. So while it could be done, it will be a more complicated
      operation as it requires changing a lot more code paths.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      70d4ee94
    • Steven Whitehouse's avatar
      GFS2: Use range based functions for rgrp sync/invalidation · 7005c3e4
      Steven Whitehouse authored
      Each rgrp header is represented as a single extent on disk, so we
      can calculate the position within the address space, since we are
      using address spaces mapped 1:1 to the disk. This means that it
      is possible to use the range based versions of filemap_fdatawrite/wait
      and for invalidating the page cache.
      
      Our eventual intent is to then be able to merge the address spaces
      used for rgrps into a single address space, rather than to have
      one for each glock, saving memory and reducing complexity.
      
      Since during umount, the rgrp structures are disposed of before
      the glocks, we need to store the extent information in the glock
      so that is is available for a final invalidation. This patch uses
      a field which is otherwise unused in rgrp glocks to do that, so
      that we do not have to expand the size of a glock.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      7005c3e4
    • Steven Whitehouse's avatar
      GFS2: Remove test which is always true · 7de41d36
      Steven Whitehouse authored
      Since gfs2_inplace_reserve() is always called with a valid
      alloc parms structure, there is no need to test for this
      within the function itself - and in any case, after we've
      all ready dereferenced it anyway.
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      7de41d36
    • Steven Whitehouse's avatar
      GFS2: Remove gfs2_quota_change_host structure · 7aed98fb
      Steven Whitehouse authored
      There is only one place this is used, when reading in the quota
      changes at mount time. It is not really required and much
      simpler to just convert the fields from the on-disk structure
      as required.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      7aed98fb
    • Steven Whitehouse's avatar
      GFS2: Clean up releasepage · e4f29206
      Steven Whitehouse authored
      For historical reasons, we drop and retake the log lock in ->releasepage()
      however, since there is no reason why we cannot hold the log lock over
      the whole function, this allows some simplification. In particular,
      pinning a buffer is only ever done under the log lock, so it is possible
      here to remove the test for pinned buffers in the second loop, since it
      is impossible for that to happen (it is also tested in the first loop).
      
      As a result, two tests made later in the second loop become constants
      and can also be reduced to the only possible branch. So the net result
      is to remove various bits of unreachable code and make this more
      readable.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      e4f29206
    • Bob Peterson's avatar
      GFS2: Implement a "rgrp has no extents longer than X" scheme · 5ea5050c
      Bob Peterson authored
      With the preceding patch, we started accepting block reservations
      smaller than the ideal size, which requires a lot more parsing of the
      bitmaps. To reduce the amount of bitmap searching, this patch
      implements a scheme whereby each rgrp keeps track of the point
      at this multi-block reservations will fail.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      5ea5050c
    • Bob Peterson's avatar
      GFS2: Drop inadequate rgrps from the reservation tree · 1330edbe
      Bob Peterson authored
      This is just basically a resend of a patch I posted earlier.
      It didn't change from its original, except in diff offsets, etc:
      
      This patch fixes a bug in the GFS2 block allocation code. The problem
      starts if a process already has a multi-block reservation, but for
      some reason, another process disqualifies it from further allocations.
      For example, the other process might set on the GFS2_RDF_ERROR bit.
      The process holding the reservation jumps to label skip_rgrp, but
      that label comes after the code that removes the reservation from the
      tree. Therefore, the no longer usable reservation is not removed from
      the rgrp's reservations tree; it's lost. Eventually, the lost reservation
      causes the count of reserved blocks to get off, and eventually that
      causes a BUG_ON(rs->rs_rbm.rgd->rd_reserved < rs->rs_free) to trigger.
      This patch moves the call to after label skip_rgrp so that the
      disqualified reservation is properly removed from the tree, thus keeping
      the rgrp rd_reserved count sane.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      1330edbe
    • Bob Peterson's avatar
      GFS2: If requested is too large, use the largest extent in the rgrp · 5ce13431
      Bob Peterson authored
      Here is a second try at a patch I posted earlier, which also implements
      suggestions Steve made:
      
      Before this patch, GFS2 would keep searching through all the rgrps
      until it found one that had a chunk of free blocks big enough to
      satisfy the size hint, which is based on the file write size,
      regardless of whether the chunk was big enough to perform the write.
      However, when doing big writes there may not be a large enough
      chunk of free blocks in any rgrp, due to file system fragmentation.
      The largest chunk may be big enough to satisfy the write request,
      but it may not meet the ideal reservation size from the "size hint".
      The writes would slow to a crawl because every write would search
      every rgrp, then finally give up and default to a single-block write.
      In my case, performance would drop from 425MB/s to 18KB/s, or 24000
      times slower.
      
      This patch basically makes it so that if we can't find a contiguous
      chunk of blocks big enough to satisfy the sizehint, we'll use the
      largest chunk of blocks we found that will still contain the write.
      It does so by keeping track of the largest run of blocks within the
      rgrp.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      5ce13431
  10. 02 Jan, 2014 16 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/virt/kvm/kvm · 7a262d2e
      Linus Torvalds authored
      Pull kvm bugfixes from Marcelo Tosatti.
      
      * git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: nVMX: Unconditionally uninit the MMU on nested vmexit
        KVM: x86: Fix APIC map calculation after re-enabling
      7a262d2e
    • Linus Torvalds's avatar
      Merge branch 'akpm' (incoming from Andrew) · 06f055f3
      Linus Torvalds authored
      Merge patches from Andrew Morton:
       "Ten fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL
        sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
        drivers/dma/ioat/dma.c: check DMA mapping error in ioat_dma_self_test()
        mm/memory-failure.c: transfer page count from head page to tail page after split thp
        MAINTAINERS: set up proper record for Xilinx Zynq
        mm: remove bogus warning in copy_huge_pmd()
        memcg: fix memcg_size() calculation
        mm: fix use-after-free in sys_remap_file_pages
        mm: munlock: fix deadlock in __munlock_pagevec()
        mm: munlock: fix a bug where THP tail page is encountered
      06f055f3
    • Jason Baron's avatar
      epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL · 4ff36ee9
      Jason Baron authored
      The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
      That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
      epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
      commmit 67347fe4 ("epoll: do not take global 'epmutex' for simple
      topologies").
      
      The acquistion of the ep->mtx for the destination 'ep' was added such
      that a concurrent EPOLL_CTL_ADD operation would see the correct state of
      the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')
      
      However, by simply not acquiring the lock, we do not serialize behind
      the ep->mtx from the add path, and thus may perform a full path check
      when if we had waited a little longer it may not have been necessary.
      However, this is a transient state, and performing the full loop
      checking in this case is not harmful.
      
      The important point is that we wouldn't miss doing the full loop
      checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
      its operating upon.  The reason we don't need to do lock ordering in the
      add path, is that we are already are holding the global 'epmutex'
      whenever we do the double lock.  Further, the original posting of this
      patch, which was tested for the intended performance gains, did not
      perform this additional locking.
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ff36ee9
    • Nobuhiro Iwamatsu's avatar
      sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c · ad70b029
      Nobuhiro Iwamatsu authored
      Min_low_pfn and max_low_pfn were used in pfn_valid macro if defined
      CONFIG_FLATMEM.  When the functions that use the pfn_valid is used in
      driver module, max_low_pfn and min_low_pfn is to undefined, and fail to
      build.
      
        ERROR: "min_low_pfn" [drivers/block/aoe/aoe.ko] undefined!
        ERROR: "max_low_pfn" [drivers/block/aoe/aoe.ko] undefined!
        make[2]: *** [__modpost] Error 1
        make[1]: *** [modules] Error 2
      
      This patch fix this problem.
      Signed-off-by: default avatarNobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
      Cc: Kuninori Morimoto <kuninori.morimoto.gx@gmail.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad70b029
    • Jiang Liu's avatar
      drivers/dma/ioat/dma.c: check DMA mapping error in ioat_dma_self_test() · 3532e566
      Jiang Liu authored
      Check DMA mapping return values in function ioat_dma_self_test() to get
      rid of following warning message.
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 1203 at lib/dma-debug.c:937 check_unmap+0x4c0/0x9a0()
        ioatdma 0000:00:04.0: DMA-API: device driver failed to check map error[device address=0x000000085191b000] [size=2000 bytes] [mapped as single]
        Modules linked in: ioatdma(+) mac_hid wmi acpi_pad lp parport hidd_generic usbhid hid ixgbe isci dca libsas ahci ptp libahci scsi_transport_sas meegaraid_sas pps_core mdio
        CPU: 0 PID: 1203 Comm: systemd-udevd Not tainted 3.13.0-rc4+ #8
        Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRIVTIIN1.86B.0044.L09.1311181644 11/18/2013
        Call Trace:
          dump_stack+0x4d/0x66
          warn_slowpath_common+0x7d/0xa0
          warn_slowpath_fmt+0x4c/0x50
          check_unmap+0x4c0/0x9a0
          debug_dma_unmap_page+0x81/0x90
          ioat_dma_self_test+0x3d2/0x680 [ioatdma]
          ioat3_dma_self_test+0x12/0x30 [ioatdma]
          ioat_probe+0xf4/0x110 [ioatdma]
          ioat3_dma_probe+0x268/0x410 [ioatdma]
          ioat_pci_probe+0x122/0x1b0 [ioatdma]
          local_pci_probe+0x45/0xa0
          pci_device_probe+0xd9/0x130
          driver_probe_device+0x171/0x490
          __driver_attach+0x93/0xa0
          bus_for_each_dev+0x6b/0xb0
          driver_attach+0x1e/0x20
          bus_add_driver+0x1f8/0x2b0
          driver_register+0x81/0x110
          __pci_register_driver+0x60/0x70
          ioat_init_module+0x89/0x1000 [ioatdma]
          do_one_initcall+0xe2/0x250
          load_module+0x2313/0x2a00
          SyS_init_module+0xd9/0x130
          system_call_fastpath+0x1a/0x1f
        ---[ end trace 990c591681d27c31 ]---
        Mapped at:
          debug_dma_map_page+0xbe/0x180
          ioat_dma_self_test+0x1ab/0x680 [ioatdma]
          ioat3_dma_self_test+0x12/0x30 [ioatdma]
          ioat_probe+0xf4/0x110 [ioatdma]
          ioat3_dma_probe+0x268/0x410 [ioatdma]
      Signed-off-by: default avatarJiang Liu <jiang.liu@linux.intel.com>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3532e566
    • Naoya Horiguchi's avatar
      mm/memory-failure.c: transfer page count from head page to tail page after split thp · a3e0f9e4
      Naoya Horiguchi authored
      Memory failures on thp tail pages cause kernel panic like below:
      
         mce: [Hardware Error]: Machine check events logged
         MCE exception done on CPU 7
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
         IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0
         PGD bae42067 PUD ba47d067 PMD 0
         Oops: 0000 [#1] SMP
        ...
         CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G   M       O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
        ...
         Call Trace:
           me_huge_page+0x3e/0x50
           memory_failure+0x4bb/0xc20
           mce_process_work+0x3e/0x70
           process_one_work+0x171/0x420
           worker_thread+0x11b/0x3a0
           ? manage_workers.isra.25+0x2b0/0x2b0
           kthread+0xe4/0x100
           ? kthread_create_on_node+0x190/0x190
           ret_from_fork+0x7c/0xb0
           ? kthread_create_on_node+0x190/0x190
        ...
         RIP   dequeue_hwpoisoned_huge_page+0x131/0x1e0
         CR2: 0000000000000058
      
      The reasoning of this problem is shown below:
       - when we have a memory error on a thp tail page, the memory error
         handler grabs a refcount of the head page to keep the thp under us.
       - Before unmapping the error page from processes, we split the thp,
         where page refcounts of both of head/tail pages don't change.
       - Then we call try_to_unmap() over the error page (which was a tail
         page before). We didn't pin the error page to handle the memory error,
         this error page is freed and removed from LRU list.
       - We never have the error page on LRU list, so the first page state
         check returns "unknown page," then we move to the second check
         with the saved page flag.
       - The saved page flag have PG_tail set, so the second page state check
         returns "hugepage."
       - We call me_huge_page() for freed error page, then we hit the above panic.
      
      The root cause is that we didn't move refcount from the head page to the
      tail page after split thp.  So this patch suggests to do this.
      
      This panic was introduced by commit 524fca1e ("HWPOISON: fix
      misjudgement of page_action() for errors on mlocked pages").  Note that we
      did have the same refcount problem before this commit, but it was just
      ignored because we had only first page state check which returned "unknown
      page." The commit changed the refcount problem from "doesn't work" to
      "kernel panic."
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>	[3.9+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3e0f9e4
    • Michal Simek's avatar
      MAINTAINERS: set up proper record for Xilinx Zynq · c2fd4e38
      Michal Simek authored
      Setup correct zynq entry.
       - Add missing cadence_ttc_timer maintainership
       - Add zynq wildcard
       - Add xilinx wildcard
      Signed-off-by: default avatarMichal Simek <michal.simek@xilinx.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c2fd4e38
    • Mel Gorman's avatar
      mm: remove bogus warning in copy_huge_pmd() · d0319bd5
      Mel Gorman authored
      Sasha Levin reported the following warning being triggered
      
        WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0()
        Call Trace:
          copy_huge_pmd+0x145/0x3a0
          copy_page_range+0x3f2/0x560
          dup_mmap+0x2c9/0x3d0
          dup_mm+0xad/0x150
          copy_process+0xa68/0x12e0
          do_fork+0x96/0x270
          SyS_clone+0x16/0x20
          stub_clone+0x69/0x90
      
      This warning was introduced by "mm: numa: Avoid unnecessary disruption
      of NUMA hinting during migration" for paranoia reasons but the warning
      is bogus.  I was thinking of parallel races between NUMA hinting faults
      and forks but this warning would also be triggered by a parallel reclaim
      splitting a THP during a fork.  Remote the bogus warning.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0319bd5
    • Vladimir Davydov's avatar
      memcg: fix memcg_size() calculation · 695c6083
      Vladimir Davydov authored
      The mem_cgroup structure contains nr_node_ids pointers to
      mem_cgroup_per_node objects, not the objects themselves.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      695c6083
    • Rik van Riel's avatar
      mm: fix use-after-free in sys_remap_file_pages · 4eb91982
      Rik van Riel authored
      remap_file_pages calls mmap_region, which may merge the VMA with other
      existing VMAs, and free "vma".  This can lead to a use-after-free bug.
      Avoid the bug by remembering vm_flags before calling mmap_region, and
      not trying to dereference vma later.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4eb91982
    • Vlastimil Babka's avatar
      mm: munlock: fix deadlock in __munlock_pagevec() · 3b25df93
      Vlastimil Babka authored
      Commit 7225522b ("mm: munlock: batch non-THP page isolation and
      munlock+putback using pagevec" introduced __munlock_pagevec() to speed
      up munlock by holding lru_lock over multiple isolated pages.  Pages that
      fail to be isolated are put_page()d immediately, also within the lock.
      
      This can lead to deadlock when __munlock_pagevec() becomes the holder of
      the last page pin and put_page() leads to __page_cache_release() which
      also locks lru_lock.  The deadlock has been observed by Sasha Levin
      using trinity.
      
      This patch avoids the deadlock by deferring put_page() operations until
      lru_lock is released.  Another pagevec (which is also used by later
      phases of the function is reused to gather the pages for put_page()
      operation.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b25df93
    • Vlastimil Babka's avatar
      mm: munlock: fix a bug where THP tail page is encountered · c424be1c
      Vlastimil Babka authored
      Since commit ff6a6da6 ("mm: accelerate munlock() treatment of THP
      pages") munlock skips tail pages of a munlocked THP page.  However, when
      the head page already has PageMlocked unset, it will not skip the tail
      pages.
      
      Commit 7225522b ("mm: munlock: batch non-THP page isolation and
      munlock+putback using pagevec") has added a PageTransHuge() check which
      contains VM_BUG_ON(PageTail(page)).  Sasha Levin found this triggered
      using trinity, on the first tail page of a THP page without PageMlocked
      flag.
      
      This patch fixes the issue by skipping tail pages also in the case when
      PageMlocked flag is unset.  There is still a possibility of race with
      THP page split between clearing PageMlocked and determining how many
      pages to skip.  The race might result in former tail pages not being
      skipped, which is however no longer a bug, as during the skip the
      PageTail flags are cleared.
      
      However this race also affects correctness of NR_MLOCK accounting, which
      is to be fixed in a separate patch.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c424be1c
    • Linus Torvalds's avatar
      Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes · 152b734a
      Linus Torvalds authored
      Pull GFS2 fixes from Steven Whitehouse:
       "Here is a set of small fixes for GFS2.  There is a fix to drop
        s_umount which is copied in from the core vfs, two patches relate to a
        hard to hit "use after free" and memory leak.  Two patches related to
        using DIO and buffered I/O on the same file to ensure correct
        operation in relation to glock state changes.  The final patch adds an
        RCU read lock to ensure correct locking on an error path"
      
      * tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
        GFS2: Fix unsafe dereference in dump_holder()
        GFS2: Wait for async DIO in glock state changes
        GFS2: Fix incorrect invalidation for DIO/buffered I/O
        GFS2: Fix slab memory leak in gfs2_bufdata
        GFS2: Fix use-after-free race when calling gfs2_remove_from_ail
        GFS2: don't hold s_umount over blkdev_put
      152b734a
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · b4796679
      Linus Torvalds authored
      Pull s390 fixes from Martin Schwidefsky:
       "Two small bug fixes and a follow-up to the CONFIG_NR_CPUS change.
      
        A kernel compiled with CONFIG_NR_CPUS=256 will waste quite a bit of
        memory for the per-cpu arrays.  Under z/VM the maximum number of CPUs
        is 64, the code now limits the possible cpu mask to 64 if running
        under z/VM"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/pci: obtain function handle in hotplug notifier
        s390/3270: fix allocation of tty3270_screen structure
        s390/smp: improve setup of possible cpu mask
      b4796679
    • Jan Kiszka's avatar
      KVM: nVMX: Unconditionally uninit the MMU on nested vmexit · 29bf08f1
      Jan Kiszka authored
      Three reasons for doing this: 1. arch.walk_mmu points to arch.mmu anyway
      in case nested EPT wasn't in use. 2. this aligns VMX with SVM. But 3. is
      most important: nested_cpu_has_ept(vmcs12) queries the VMCS page, and if
      one guest VCPU manipulates the page of another VCPU in L2, we may be
      fooled to skip over the nested_ept_uninit_mmu_context, leaving mmu in
      nested state. That can crash the host later on if nested_ept_get_cr3 is
      invoked while L1 already left vmxon and nested.current_vmcs12 became
      NULL therefore.
      
      Cc: stable@kernel.org
      Signed-off-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      29bf08f1
    • Tetsuo Handa's avatar
      GFS2: Fix unsafe dereference in dump_holder() · 0b3a2c99
      Tetsuo Handa authored
      GLOCK_BUG_ON() might call this function without RCU read lock. Make sure that
      RCU read lock is held when using task_struct returned from pid_task().
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      0b3a2c99