1. 28 Oct, 2013 13 commits
    • Chuck Lever's avatar
      NFS: Introduce a vector of migration recovery ops · ec011fe8
      Chuck Lever authored
      The differences between minor version 0 and minor version 1
      migration will be abstracted by the addition of a set of migration
      recovery ops.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      ec011fe8
    • Chuck Lever's avatar
      NFS: Add functions to swap transports during migration recovery · 800c06a5
      Chuck Lever authored
      Introduce functions that can walk through an array of returned
      fs_locations information and connect a transport to one of the
      destination servers listed therein.
      
      Note that NFS minor version 1 introduces "fs_locations_info" which
      extends the locations array sorting criteria available to clients.
      This is not supported yet.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      800c06a5
    • Chuck Lever's avatar
      NFS: Add nfs4_update_server · 32e62b7c
      Chuck Lever authored
      New function nfs4_update_server() moves an nfs_server to a different
      nfs_client.  This is done as part of migration recovery.
      
      Though it may be appealing to think of them as the same thing,
      migration recovery is not the same as following a referral.
      
      For a referral, the client has not descended into the file system
      yet: it has no nfs_server, no super block, no inodes or open state.
      It is enough to simply instantiate the nfs_server and super block,
      and perform a referral mount.
      
      For a migration, however, we have all of those things already, and
      they have to be moved to a different nfs_client.  No local namespace
      changes are needed here.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      32e62b7c
    • Trond Myklebust's avatar
      SUNRPC: Add a helper to switch the transport of an rpc_clnt · 40b00b6b
      Trond Myklebust authored
      Add an RPC client API to redirect an rpc_clnt's transport from a
      source server to a destination server during a migration event.
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      [ cel: forward ported to 3.12 ]
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      40b00b6b
    • Chuck Lever's avatar
      SUNRPC: Modify synopsis of rpc_client_register() · d746e545
      Chuck Lever authored
      The rpc_client_register() helper was added in commit e73f4cc0,
      "SUNRPC: split client creation routine into setup and registration,"
      Mon Jun 24 11:52:52 2013.  In a subsequent patch, I'd like to invoke
      rpc_client_register() from a context where a struct rpc_create_args
      is not available.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      d746e545
    • Weston Andros Adamson's avatar
      NFSv4: don't reprocess cached open CLAIM_PREVIOUS · d2bfda2e
      Weston Andros Adamson authored
      Cached opens have already been handled by _nfs4_opendata_reclaim_to_nfs4_state
      and can safely skip being reprocessed, but must still call update_open_stateid
      to make sure that all active fmodes are recovered.
      Signed-off-by: default avatarWeston Andros Adamson <dros@netapp.com>
      Cc: stable@vger.kernel.org # 3.7.x: f494a607: NFSv4: fix NULL dereference
      Cc: stable@vger.kernel.org # 3.7.x: a43ec98b: NFSv4: don't fail on missin
      Cc: stable@vger.kernel.org # 3.7.x
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      d2bfda2e
    • Trond Myklebust's avatar
      NFSv4: Fix state reference counting in _nfs4_opendata_reclaim_to_nfs4_state · d49f042a
      Trond Myklebust authored
      Currently, if the call to nfs_refresh_inode fails, then we end up leaking
      a reference count, due to the call to nfs4_get_open_state.
      While we're at it, replace nfs4_get_open_state with a simple call to
      atomic_inc(); there is no need to do a full lookup of the struct nfs_state
      since it is passed as an argument in the struct nfs4_opendata, and
      is already assigned to the variable 'state'.
      
      Cc: stable@vger.kernel.org # 3.7.x: a43ec98b: NFSv4: don't fail on missing
      Cc: stable@vger.kernel.org # 3.7.x
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      d49f042a
    • Weston Andros Adamson's avatar
      NFSv4: don't fail on missing fattr in open recover · a43ec98b
      Weston Andros Adamson authored
      This is an unneeded check that could cause the client to fail to recover
      opens.
      Signed-off-by: default avatarWeston Andros Adamson <dros@netapp.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      a43ec98b
    • Weston Andros Adamson's avatar
      NFSv4: fix NULL dereference in open recover · f494a607
      Weston Andros Adamson authored
      _nfs4_opendata_reclaim_to_nfs4_state doesn't expect to see a cached
      open CLAIM_PREVIOUS, but this can happen. An example is when there are
      RDWR openers and RDONLY openers on a delegation stateid. The recovery
      path will first try an open CLAIM_PREVIOUS for the RDWR openers, this
      marks the delegation as not needing RECLAIM anymore, so the open
      CLAIM_PREVIOUS for the RDONLY openers will not actually send an rpc.
      
      The NULL dereference is due to _nfs4_opendata_reclaim_to_nfs4_state
      returning PTR_ERR(rpc_status) when !rpc_done. When the open is
      cached, rpc_done == 0 and rpc_status == 0, thus
      _nfs4_opendata_reclaim_to_nfs4_state returns NULL - this is unexpected
      by callers of nfs4_opendata_to_nfs4_state().
      
      This can be reproduced easily by opening the same file two times on an
      NFSv4.0 mount with delegations enabled, once as RDWR and once as RDONLY then
      sleeping for a long time.  While the files are held open, kick off state
      recovery and this NULL dereference will be hit every time.
      
      An example OOPS:
      
      [   65.003602] BUG: unable to handle kernel NULL pointer dereference at 00000000
      00000030
      [   65.005312] IP: [<ffffffffa037d6ee>] __nfs4_close+0x1e/0x160 [nfsv4]
      [   65.006820] PGD 7b0ea067 PUD 791ff067 PMD 0
      [   65.008075] Oops: 0000 [#1] SMP
      [   65.008802] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache
      snd_ens1371 gameport nfsd snd_rawmidi snd_ac97_codec ac97_bus btusb snd_seq snd
      _seq_device snd_pcm ppdev bluetooth auth_rpcgss coretemp snd_page_alloc crc32_pc
      lmul crc32c_intel ghash_clmulni_intel microcode rfkill nfs_acl vmw_balloon serio
      _raw snd_timer lockd parport_pc e1000 snd soundcore parport i2c_piix4 shpchp vmw
      _vmci sunrpc ata_generic mperf pata_acpi mptspi vmwgfx ttm scsi_transport_spi dr
      m mptscsih mptbase i2c_core
      [   65.018684] CPU: 0 PID: 473 Comm: 192.168.10.85-m Not tainted 3.11.2-201.fc19
      .x86_64 #1
      [   65.020113] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop
      Reference Platform, BIOS 6.00 07/31/2013
      [   65.022012] task: ffff88003707e320 ti: ffff88007b906000 task.ti: ffff88007b906000
      [   65.023414] RIP: 0010:[<ffffffffa037d6ee>]  [<ffffffffa037d6ee>] __nfs4_close+0x1e/0x160 [nfsv4]
      [   65.025079] RSP: 0018:ffff88007b907d10  EFLAGS: 00010246
      [   65.026042] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [   65.027321] RDX: 0000000000000050 RSI: 0000000000000001 RDI: 0000000000000000
      [   65.028691] RBP: ffff88007b907d38 R08: 0000000000016f60 R09: 0000000000000000
      [   65.029990] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      [   65.031295] R13: 0000000000000050 R14: 0000000000000000 R15: 0000000000000001
      [   65.032527] FS:  0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
      [   65.033981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   65.035177] CR2: 0000000000000030 CR3: 000000007b27f000 CR4: 00000000000407f0
      [   65.036568] Stack:
      [   65.037011]  0000000000000000 0000000000000001 ffff88007b907d90 ffff88007a880220
      [   65.038472]  ffff88007b768de8 ffff88007b907d48 ffffffffa037e4a5 ffff88007b907d80
      [   65.039935]  ffffffffa036a6c8 ffff880037020e40 ffff88007a880000 ffff880037020e40
      [   65.041468] Call Trace:
      [   65.042050]  [<ffffffffa037e4a5>] nfs4_close_state+0x15/0x20 [nfsv4]
      [   65.043209]  [<ffffffffa036a6c8>] nfs4_open_recover_helper+0x148/0x1f0 [nfsv4]
      [   65.044529]  [<ffffffffa036a886>] nfs4_open_recover+0x116/0x150 [nfsv4]
      [   65.045730]  [<ffffffffa036d98d>] nfs4_open_reclaim+0xad/0x150 [nfsv4]
      [   65.046905]  [<ffffffffa037d979>] nfs4_do_reclaim+0x149/0x5f0 [nfsv4]
      [   65.048071]  [<ffffffffa037e1dc>] nfs4_run_state_manager+0x3bc/0x670 [nfsv4]
      [   65.049436]  [<ffffffffa037de20>] ? nfs4_do_reclaim+0x5f0/0x5f0 [nfsv4]
      [   65.050686]  [<ffffffffa037de20>] ? nfs4_do_reclaim+0x5f0/0x5f0 [nfsv4]
      [   65.051943]  [<ffffffff81088640>] kthread+0xc0/0xd0
      [   65.052831]  [<ffffffff81088580>] ? insert_kthread_work+0x40/0x40
      [   65.054697]  [<ffffffff8165686c>] ret_from_fork+0x7c/0xb0
      [   65.056396]  [<ffffffff81088580>] ? insert_kthread_work+0x40/0x40
      [   65.058208] Code: 5c 41 5d 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 89 d5 41 54 53 48 89 fb <4c> 8b 67 30 f0 41 ff 44 24 44 49 8d 7c 24 40 e8 0e 0a 2d e1 44
      [   65.065225] RIP  [<ffffffffa037d6ee>] __nfs4_close+0x1e/0x160 [nfsv4]
      [   65.067175]  RSP <ffff88007b907d10>
      [   65.068570] CR2: 0000000000000030
      [   65.070098] ---[ end trace 0d1fe4f5c7dd6f8b ]---
      
      Cc: <stable@vger.kernel.org> #3.7+
      Signed-off-by: default avatarWeston Andros Adamson <dros@netapp.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      f494a607
    • Trond Myklebust's avatar
      NFSv4.1: Don't change the security label as part of open reclaim. · 83c78eb0
      Trond Myklebust authored
      The current caching model calls for the security label to be set on
      first lookup and/or on any subsequent label changes. There is no
      need to do it as part of an open reclaim.
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      83c78eb0
    • Jeff Layton's avatar
      nfs: fix handling of invalid mount options in nfs_remount · 1966903f
      Jeff Layton authored
      nfs_parse_mount_options returns 0 on error, not -errno.
      Reported-by: default avatarKarel Zak <kzak@redhat.com>
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      1966903f
    • Jeff Layton's avatar
    • Andy Adamson's avatar
      NFSv4 Remove zeroing state kern warnings · 3660cd43
      Andy Adamson authored
      As of commit 5d422301 we no longer zero the
      state.
      Signed-off-by: default avatarAndy Adamson <andros@netapp.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      3660cd43
  2. 01 Oct, 2013 12 commits
  3. 30 Sep, 2013 15 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (fixes from Andrew Morton) · 522d6d38
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
        pidns: fix free_pid() to handle the first fork failure
        ipc,msg: prevent race with rmid in msgsnd,msgrcv
        ipc/sem.c: update sem_otime for all operations
        mm/hwpoison: fix the lack of one reference count against poisoned page
        mm/hwpoison: fix false report on 2nd attempt at page recovery
        mm/hwpoison: fix test for a transparent huge page
        mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
        block: change config option name for cmdline partition parsing
        mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
        mm: avoid reinserting isolated balloon pages into LRU lists
        arch/parisc/mm/fault.c: fix uninitialized variable usage
        include/asm-generic/vtime.h: avoid zero-length file
        nilfs2: fix issue with race condition of competition between segments for dirty blocks
        Documentation/kernel-parameters.txt: replace kernelcore with Movable
        mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
        kernel/kmod.c: check for NULL in call_usermodehelper_exec()
        ipc/sem.c: synchronize the proc interface
        ipc/sem.c: optimize sem_lock()
        ipc/sem.c: fix race in sem_lock()
        mm/compaction.c: periodically schedule when freeing pages
        ...
      522d6d38
    • Oleg Nesterov's avatar
      pidns: fix free_pid() to handle the first fork failure · 314a8ad0
      Oleg Nesterov authored
      "case 0" in free_pid() assumes that disable_pid_allocation() should
      clear PIDNS_HASH_ADDING before the last pid goes away.
      
      However this doesn't happen if the first fork() fails to create the
      child reaper which should call disable_pid_allocation().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      314a8ad0
    • Davidlohr Bueso's avatar
      ipc,msg: prevent race with rmid in msgsnd,msgrcv · 4271b05a
      Davidlohr Bueso authored
      This fixes a race in both msgrcv() and msgsnd() between finding the msg
      and actually dealing with the queue, as another thread can delete shmid
      underneath us if we are preempted before acquiring the
      kern_ipc_perm.lock.
      
      Manfred illustrates this nicely:
      
      Assume a preemptible kernel that is preempted just after
      
          msq = msq_obtain_object_check(ns, msqid)
      
      in do_msgrcv().  The only lock that is held is rcu_read_lock().
      
      Now the other thread processes IPC_RMID.  When the first task is
      resumed, then it will happily wait for messages on a deleted queue.
      
      Fix this by checking for if the queue has been deleted after taking the
      lock.
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reported-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: <stable@vger.kernel.org> 	[3.11]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4271b05a
    • Manfred Spraul's avatar
      ipc/sem.c: update sem_otime for all operations · 0e8c6656
      Manfred Spraul authored
      In commit 0a2b9d4c ("ipc/sem.c: move wake_up_process out of the
      spinlock section"), the update of semaphore's sem_otime(last semop time)
      was moved to one central position (do_smart_update).
      
      But since do_smart_update() is only called for operations that modify
      the array, this means that wait-for-zero semops do not update sem_otime
      anymore.
      
      The fix is simple:
      Non-alter operations must update sem_otime.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Reported-by: default avatarJia He <jiakernel@gmail.com>
      Tested-by: default avatarJia He <jiakernel@gmail.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e8c6656
    • Wanpeng Li's avatar
      mm/hwpoison: fix the lack of one reference count against poisoned page · fb31ba30
      Wanpeng Li authored
      The lack of one reference count against poisoned page for hwpoison_inject
      w/o hwpoison_filter enabled result in hwpoison detect -1 users still
      referenced the page, however, the number should be 0 except the poison
      handler held one after successfully unmap.  This patch fix it by hold one
      referenced count against poisoned page for hwpoison_inject w/ and w/o
      hwpoison_filter enabled.
      
      Before patch:
      
      [   71.902112] Injecting memory failure at pfn 224706
      [   71.902137] MCE 0x224706: dirty LRU page recovery: Failed
      [   71.902138] MCE 0x224706: dirty LRU page still referenced by -1 users
      
      After patch:
      
      [   94.710860] Injecting memory failure at pfn 215b68
      [   94.710885] MCE 0x215b68: dirty LRU page recovery: Recovered
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fb31ba30
    • Wanpeng Li's avatar
      mm/hwpoison: fix false report on 2nd attempt at page recovery · 2d421acd
      Wanpeng Li authored
      If the page is poisoned by software injection w/ MF_COUNT_INCREASED
      flag, there is a false report during the 2nd attempt at page recovery
      which is not truthful.
      
      This patch fixes it by reporting the first attempt to try free buddy
      page recovery if MF_COUNT_INCREASED is set.
      
      Before patch:
      
      [  346.332041] Injecting memory failure at pfn 200010
      [  346.332189] MCE 0x200010: free buddy, 2nd try page recovery: Delayed
      
      After patch:
      
      [  297.742600] Injecting memory failure at pfn 200010
      [  297.742941] MCE 0x200010: free buddy page recovery: Delayed
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d421acd
    • Wanpeng Li's avatar
      mm/hwpoison: fix test for a transparent huge page · e76d30e2
      Wanpeng Li authored
      PageTransHuge() can't guarantee the page is a transparent huge page
      since it returns true for both transparent huge and hugetlbfs pages.
      
      This patch fixes it by checking the page is also !hugetlbfs page.
      
      Before patch:
      
      [  121.571128] Injecting memory failure at pfn 23a200
      [  121.571141] MCE 0x23a200: huge page recovery: Delayed
      [  140.355100] MCE: Memory failure is now running on 0x23a200
      
      After patch:
      
      [   94.290793] Injecting memory failure at pfn 23a000
      [   94.290800] MCE 0x23a000: huge page recovery: Delayed
      [  105.722303] MCE: Software-unpoisoned page 0x23a000
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e76d30e2
    • Wanpeng Li's avatar
      mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood · 20cb6cab
      Wanpeng Li authored
      madvise_hwpoison won't check if the page is small page or huge page and
      traverses in small page granularity against the range unconditionally,
      which result in a printk flood "MCE xxx: already hardware poisoned" if
      the page is a huge page.
      
      This patch fixes it by using compound_order(compound_head(page)) for
      huge page iterator.
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <errno.h>
      
      #define PAGES_TO_TEST 3
      #define PAGE_SIZE	4096 * 512
      
      int main(void)
      {
      	char *mem;
      	int i;
      
      	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
      			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0, 0);
      
      	if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
      		return -1;
      
      	munmap(mem, PAGES_TO_TEST * PAGE_SIZE);
      
      	return 0;
      }
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      20cb6cab
    • Paul Gortmaker's avatar
      block: change config option name for cmdline partition parsing · 080506ad
      Paul Gortmaker authored
      Recently commit bab55417 ("block: support embedded device command
      line partition") introduced CONFIG_CMDLINE_PARSER.  However, that name
      is too generic and sounds like it enables/disables generic kernel boot
      arg processing, when it really is block specific.
      
      Before this option becomes a part of a full/final release, add the BLK_
      prefix to it so that it is clear in absence of any other context that it
      is block specific.
      
      In addition, fix up the following less critical items:
       - help text was not really at all helpful.
       - index file for Documentation was not updated
       - add the new arg to Documentation/kernel-parameters.txt
       - clarify wording in source comments
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Cai Zhiyong <caizhiyong@huawei.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      080506ad
    • Vlastimil Babka's avatar
      mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration · eadb41ae
      Vlastimil Babka authored
      The function __munlock_pagevec_fill() introduced in commit 7a8010cd
      ("mm: munlock: manual pte walk in fast path instead of
      follow_page_mask()") uses pmd_addr_end() for restricting its operation
      within current page table.
      
      This is insufficient on architectures/configurations where pmd is folded
      and pmd_addr_end() just returns the end of the full range to be walked.
      In this case, it allows pte++ to walk off the end of a page table
      resulting in unpredictable behaviour.
      
      This patch fixes the function by using pgd_addr_end() and pud_addr_end()
      before pmd_addr_end(), which will yield correct page table boundary on
      all configurations.  This is similar to what existing page walkers do
      when walking each level of the page table.
      
      Additionaly, the patch clarifies a comment for get_locked_pte() call in the
      function.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: default avatarBob Liu <bob.liu@oracle.com>
      Cc: Jörn Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eadb41ae
    • Rafael Aquini's avatar
      mm: avoid reinserting isolated balloon pages into LRU lists · 117aad1e
      Rafael Aquini authored
      Isolated balloon pages can wrongly end up in LRU lists when
      migrate_pages() finishes its round without draining all the isolated
      page list.
      
      The same issue can happen when reclaim_clean_pages_from_list() tries to
      reclaim pages from an isolated page list, before migration, in the CMA
      path.  Such balloon page leak opens a race window against LRU lists
      shrinkers that leads us to the following kernel panic:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
        IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
        PGD 3cda2067 PUD 3d713067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: shrink_page_list+0x24e/0x897
        RSP: 0000:ffff88003da499b8  EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
        RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
        RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
        R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
        R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
        FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
        Call Trace:
          shrink_inactive_list+0x240/0x3de
          shrink_lruvec+0x3e0/0x566
          __shrink_zone+0x94/0x178
          shrink_zone+0x3a/0x82
          balance_pgdat+0x32a/0x4c2
          kswapd+0x2f0/0x372
          kthread+0xa2/0xaa
          ret_from_fork+0x7c/0xb0
        Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
        RIP  [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
         RSP <ffff88003da499b8>
        CR2: 0000000000000028
        ---[ end trace 703d2451af6ffbfd ]---
        Kernel panic - not syncing: Fatal exception
      
      This patch fixes the issue, by assuring the proper tests are made at
      putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
      isolated balloon pages being wrongly reinserted in LRU lists.
      
      [akpm@linux-foundation.org: clarify awkward comment text]
      Signed-off-by: default avatarRafael Aquini <aquini@redhat.com>
      Reported-by: default avatarLuiz Capitulino <lcapitulino@redhat.com>
      Tested-by: default avatarLuiz Capitulino <lcapitulino@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      117aad1e
    • Felipe Pena's avatar
      arch/parisc/mm/fault.c: fix uninitialized variable usage · 0772dac1
      Felipe Pena authored
      The FAULT_FLAG_WRITE flag has been set based on uninitialized variable.
      
      Fixes a regression added by commit 759496ba ("arch: mm: pass
      userspace fault flag to generic fault handler")
      Signed-off-by: default avatarFelipe Pena <felipensp@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0772dac1
    • Andrew Morton's avatar
      include/asm-generic/vtime.h: avoid zero-length file · 2a156a6b
      Andrew Morton authored
      patch(1) can't handle zero-length files - it appears to simply not create
      the file, so my powerpc build fails.
      
      Put something in here to make life easier.
      
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a156a6b
    • Vyacheslav Dubeyko's avatar
      nilfs2: fix issue with race condition of competition between segments for dirty blocks · 7f42ec39
      Vyacheslav Dubeyko authored
      Many NILFS2 users were reported about strange file system corruption
      (for example):
      
         NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
         NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)
      
      But such error messages are consequence of file system's issue that takes
      place more earlier.  Fortunately, Jerome Poulin <jeromepoulin@gmail.com>
      and Anton Eliasson <devel@antoneliasson.se> were reported about another
      issue not so recently.  These reports describe the issue with segctor
      thread's crash:
      
        BUG: unable to handle kernel paging request at 0000000000004c83
        IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]
      
        Call Trace:
         nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
         nilfs_segctor_construct+0x17b/0x290 [nilfs2]
         nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
         kthread+0xc0/0xd0
         ret_from_fork+0x7c/0xb0
      
      These two issues have one reason.  This reason can raise third issue
      too.  Third issue results in hanging of segctor thread with eating of
      100% CPU.
      
      REPRODUCING PATH:
      
      One of the possible way or the issue reproducing was described by
      Jermoe me Poulin <jeromepoulin@gmail.com>:
      
      1. init S to get to single user mode.
      2. sysrq+E to make sure only my shell is running
      3. start network-manager to get my wifi connection up
      4. login as root and launch "screen"
      5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
      6. lscp | xz -9e > lscp.txt.xz
      7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
      8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
      9. start a screen and launch strace -f -o find-cat.log -t find
      /mnt/nilfs -type f -exec cat {} > /dev/null \;
      10. start a screen and launch strace -f -o apt-get.log -t apt-get update
      11. launch the last command again as it did not crash the first time
      12. apt-get crashes
      13. ps aux > ps-aux-crashed.log
      13. sysrq+W
      14. sysrq+E  wait for everything to terminate
      15. sysrq+SUSB
      
      Simplified way of the issue reproducing is starting kernel compilation
      task and "apt-get update" in parallel.
      
      REPRODUCIBILITY:
      
      The issue is reproduced not stable [60% - 80%].  It is very important to
      have proper environment for the issue reproducing.  The critical
      conditions for successful reproducing:
      
      (1) It should have big modified file by mmap() way.
      
      (2) This file should have the count of dirty blocks are greater that
          several segments in size (for example, two or three) from time to time
          during processing.
      
      (3) It should be intensive background activity of files modification
          in another thread.
      
      INVESTIGATION:
      
      First of all, it is possible to see that the reason of crash is not valid
      page address:
      
        NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
        NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783
      
      Moreover, value of b_page (0x1a82) is 6786.  This value looks like segment
      number.  And b_blocknr with b_size values look like block numbers.  So,
      buffer_head's pointer points on not proper address value.
      
      Detailed investigation of the issue is discovered such picture:
      
        [-----------------------------SEGMENT 6783-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783
      
        [-----------------------------SEGMENT 6784-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
        NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
        NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
        [----------] ditto
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15
      
        [-----------------------------SEGMENT 6785-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
        NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
        [----------] ditto
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12
      
        NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785
      
        NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
      
        BUG: unable to handle kernel paging request at 0000000000001a82
        IP: [<ffffffffa024d0f2>] nilfs_end_page_io+0x12/0xd0 [nilfs2]
      
      Usually, for every segment we collect dirty files in list.  Then, dirty
      blocks are gathered for every dirty file, prepared for write and
      submitted by means of nilfs_segbuf_submit_bh() call.  Finally, it takes
      place complete write phase after calling nilfs_end_bio_write() on the
      block layer.  Buffers/pages are marked as not dirty on final phase and
      processed files removed from the list of dirty files.
      
      It is possible to see that we had three prepare_write and submit_bio
      phases before segbuf_wait and complete_write phase.  Moreover, segments
      compete between each other for dirty blocks because on every iteration
      of segments processing dirty buffer_heads are added in several lists of
      payload_buffers:
      
        [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
        [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
      
      The next pointer is the same but prev pointer has changed.  It means
      that buffer_head has next pointer from one list but prev pointer from
      another.  Such modification can be made several times.  And, finally, it
      can be resulted in various issues: (1) segctor hanging, (2) segctor
      crashing, (3) file system metadata corruption.
      
      FIX:
      This patch adds:
      
      (1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
          for every proccessed dirty block;
      
      (2) checking of BH_Async_Write flag in
          nilfs_lookup_dirty_data_buffers() and
          nilfs_lookup_dirty_node_buffers();
      
      (3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
          nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().
      Reported-by: default avatarJerome Poulin <jeromepoulin@gmail.com>
      Reported-by: default avatarAnton Eliasson <devel@antoneliasson.se>
      Cc: Paul Fertser <fercerpav@gmail.com>
      Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
      Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
      Cc: Juan Barry Manuel Canham <Linux@riotingpacifist.net>
      Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
      Cc: Elmer Zhang <freeboy6716@gmail.com>
      Cc: Kenneth Langga <klangga@gmail.com>
      Signed-off-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f42ec39
    • Weiping Pan's avatar
      Documentation/kernel-parameters.txt: replace kernelcore with Movable · 675217fd
      Weiping Pan authored
      Han Pingtian found a typo in Documentation/kernel-parameters.txt about
      "kernelcore=", that "kernelcore" should be replaced with "Movable" here.
      Signed-off-by: default avatarWeiping Pan <wpan@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      675217fd