1. 07 Jul, 2017 8 commits
  2. 09 Jun, 2017 10 commits
    • Nicholas Bellinger's avatar
      iscsi-target: Avoid holding ->tpg_state_lock during param update · eceb4459
      Nicholas Bellinger authored
      As originally reported by Jia-Ju, iscsit_tpg_enable_portal_group()
      holds iscsi_portal_group->tpg_state_lock while updating AUTHMETHOD
      via iscsi_update_param_value(), which performs a GFP_KERNEL
      allocation.
      
      However, since iscsit_tpg_enable_portal_group() is already protected
      by iscsit_get_tpg() -> iscsi_portal_group->tpg_access_lock in it's
      parent caller, ->tpg_state_lock only needs to be held when setting
      TPG_STATE_ACTIVE.
      Reported-by: default avatarJia-Ju Bai <baijiaju1990@163.com>
      Reviewed-by: default avatarJia-Ju Bai <baijiaju1990@163.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      eceb4459
    • Nicholas Bellinger's avatar
      target/configfs: Kill se_lun->lun_link_magic · 9ae0e9ad
      Nicholas Bellinger authored
      Instead of using a hardcoded magic value in se_lun when verifying
      a target config_item symlink source during target_fabric_mappedlun_link(),
      go ahead and use target_fabric_port_item_ops directly instead.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      9ae0e9ad
    • Nicholas Bellinger's avatar
      target/configfs: Kill se_device->dev_link_magic · c17cd249
      Nicholas Bellinger authored
      Instead of using a hardcoded magic value in se_device when verifying
      a target config_item symlink source during target_fabric_port_link(),
      go ahead and use target_core_dev_item_ops directly instead.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      c17cd249
    • Nicholas Bellinger's avatar
      target/iblock: Convert WRITE_SAME to blkdev_issue_zeroout · 2237498f
      Nicholas Bellinger authored
      The people who are actively using iblock_execute_write_same_direct() are
      doing so in the context of ESX VAAI BlockZero, together with
      EXTENDED_COPY and COMPARE_AND_WRITE primitives.
      
      In practice though I've not seen any users of IBLOCK WRITE_SAME for
      anything other than VAAI BlockZero, so just using blkdev_issue_zeroout()
      when available, and falling back to iblock_execute_write_same() if the
      WRITE_SAME buffer contains anything other than zeros should be OK.
      
      (Hook up max_write_zeroes_sectors to signal LBPRZ feature bit in
       target_configure_unmap_from_queue - nab)
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      2237498f
    • Michael Cyr's avatar
      ibmvscsis: Enable Logical Partition Migration Support · 464fd641
      Michael Cyr authored
      Changes to support a new mechanism from phyp to better synchronize the
      logical partition migration (LPM) of the client partition.
      This includes a new VIOCTL to register that we support this new
      functionality, and 2 new Transport Event types, and finally another
      new VIOCTL to let phyp know once we're ready for the Suspend.
      Signed-off-by: default avatarMichael Cyr <mikecyr@us.ibm.com>
      Signed-off-by: default avatarBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      464fd641
    • Byungchul Park's avatar
      vhost/scsi: Don't reinvent the wheel but use existing llist API · 12bdcbd5
      Byungchul Park authored
      Although llist provides proper APIs, they are not used. Make them used.
      Signed-off-by: default avatarByungchul Park <byungchul.park@lge.com>
      Acked-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      12bdcbd5
    • Gustavo A. R. Silva's avatar
      target: remove dead code · fb418240
      Gustavo A. R. Silva authored
      Local variable _ret_ is assigned to a constant value and it is never
      updated again. Remove this variable and the dead code it guards.
      
      Addresses-Coverity-ID: 140761
      Signed-off-by: default avatarGustavo A. R. Silva <garsilva@embeddedor.com>
      Reviewed-by: default avatarTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      fb418240
    • Nicholas Bellinger's avatar
      iscsi-target: Reject immediate data underflow larger than SCSI transfer length · abb85a9b
      Nicholas Bellinger authored
      When iscsi WRITE underflow occurs there are two different scenarios
      that can happen.
      
      Normally in practice, when an EDTL vs. SCSI CDB TRANSFER LENGTH
      underflow is detected, the iscsi immediate data payload is the
      smaller SCSI CDB TRANSFER LENGTH.
      
      That is, when a host fabric LLD is using a fixed size EDTL for
      a specific control CDB, the SCSI CDB TRANSFER LENGTH and actual
      SCSI payload ends up being smaller than EDTL.  In iscsi, this
      means the received iscsi immediate data payload matches the
      smaller SCSI CDB TRANSFER LENGTH, because there is no more
      SCSI payload to accept beyond SCSI CDB TRANSFER LENGTH.
      
      However, it's possible for a malicous host to send a WRITE
      underflow where EDTL is larger than SCSI CDB TRANSFER LENGTH,
      but incoming iscsi immediate data actually matches EDTL.
      
      In the wild, we've never had a iscsi host environment actually
      try to do this.
      
      For this special case, it's wrong to truncate part of the
      control CDB payload and continue to process the command during
      underflow when immediate data payload received was larger than
      SCSI CDB TRANSFER LENGTH, so go ahead and reject and drop the
      bogus payload as a defensive action.
      
      Note this potential bug was originally relaxed by the following
      for allowing WRITE underflow in MSFT FCP host environments:
      
         commit c72c5250
         Author: Roland Dreier <roland@purestorage.com>
         Date:   Wed Jul 22 15:08:18 2015 -0700
      
            target: allow underflow/overflow for PR OUT etc. commands
      
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: <stable@vger.kernel.org> # v4.3+
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      abb85a9b
    • Nicholas Bellinger's avatar
      iscsi-target: Fix delayed logout processing greater than SECONDS_FOR_LOGOUT_COMP · 105fa2f4
      Nicholas Bellinger authored
      This patch fixes a BUG() in iscsit_close_session() that could be
      triggered when iscsit_logout_post_handler() execution from within
      tx thread context was not run for more than SECONDS_FOR_LOGOUT_COMP
      (15 seconds), and the TCP connection didn't already close before
      then forcing tx thread context to automatically exit.
      
      This would manifest itself during explicit logout as:
      
      [33206.974254] 1 connection(s) still exist for iSCSI session to iqn.1993-08.org.debian:01:3f5523242179
      [33206.980184] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 2100.772 msecs
      [33209.078643] ------------[ cut here ]------------
      [33209.078646] kernel BUG at drivers/target/iscsi/iscsi_target.c:4346!
      
      Normally when explicit logout attempt fails, the tx thread context
      exits and iscsit_close_connection() from rx thread context does the
      extra cleanup once it detects conn->conn_logout_remove has not been
      cleared by the logout type specific post handlers.
      
      To address this special case, if the logout post handler in tx thread
      context detects conn->tx_thread_active has already been cleared, simply
      return and exit in order for existing iscsit_close_connection()
      logic from rx thread context do failed logout cleanup.
      Reported-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Tested-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: stable@vger.kernel.org # 3.14+
      Tested-by: default avatarGary Guo <ghg@datera.io>
      Tested-by: default avatarChu Yuan Lin <cyl@datera.io>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      105fa2f4
    • Nicholas Bellinger's avatar
      target: Fix kref->refcount underflow in transport_cmd_finish_abort · 73d4e580
      Nicholas Bellinger authored
      This patch fixes a se_cmd->cmd_kref underflow during CMD_T_ABORTED
      when a fabric driver drops it's second reference from below the
      target_core_tmr.c based callers of transport_cmd_finish_abort().
      
      Recently with the conversion of kref to refcount_t, this bug was
      manifesting itself as:
      
      [705519.601034] refcount_t: underflow; use-after-free.
      [705519.604034] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 20116.512 msecs
      [705539.719111] ------------[ cut here ]------------
      [705539.719117] WARNING: CPU: 3 PID: 26510 at lib/refcount.c:184 refcount_sub_and_test+0x33/0x51
      
      Since the original kref atomic_t based kref_put() didn't check for
      underflow and only invoked the final callback when zero was reached,
      this bug did not manifest in practice since all se_cmd memory is
      using preallocated tags.
      
      To address this, go ahead and propigate the existing return from
      transport_put_cmd() up via transport_cmd_finish_abort(), and
      change transport_cmd_finish_abort() + core_tmr_handle_tas_abort()
      callers to only do their local target_put_sess_cmd() if necessary.
      Reported-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Tested-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: stable@vger.kernel.org # 3.14+
      Tested-by: default avatarGary Guo <ghg@datera.io>
      Tested-by: default avatarChu Yuan Lin <cyl@datera.io>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      73d4e580
  3. 31 May, 2017 2 commits
    • Jiang Yi's avatar
      iscsi-target: Always wait for kthread_should_stop() before kthread exit · 5e0cf5e6
      Jiang Yi authored
      There are three timing problems in the kthread usages of iscsi_target_mod:
      
       - np_thread of struct iscsi_np
       - rx_thread and tx_thread of struct iscsi_conn
      
      In iscsit_close_connection(), it calls
      
       send_sig(SIGINT, conn->tx_thread, 1);
       kthread_stop(conn->tx_thread);
      
      In conn->tx_thread, which is iscsi_target_tx_thread(), when it receive
      SIGINT the kthread will exit without checking the return value of
      kthread_should_stop().
      
      So if iscsi_target_tx_thread() exit right between send_sig(SIGINT...)
      and kthread_stop(...), the kthread_stop() will try to stop an already
      stopped kthread.
      
      This is invalid according to the documentation of kthread_stop().
      
      (Fix -ECONNRESET logout handling in iscsi_target_tx_thread and
       early iscsi_target_rx_thread failure case - nab)
      Signed-off-by: default avatarJiang Yi <jiangyilism@gmail.com>
      Cc: <stable@vger.kernel.org> # v3.12+
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      5e0cf5e6
    • Nicholas Bellinger's avatar
      iscsi-target: Fix initial login PDU asynchronous socket close OOPs · 25cdda95
      Nicholas Bellinger authored
      This patch fixes a OOPs originally introduced by:
      
         commit bb048357
         Author: Nicholas Bellinger <nab@linux-iscsi.org>
         Date:   Thu Sep 5 14:54:04 2013 -0700
      
         iscsi-target: Add sk->sk_state_change to cleanup after TCP failure
      
      which would trigger a NULL pointer dereference when a TCP connection
      was closed asynchronously via iscsi_target_sk_state_change(), but only
      when the initial PDU processing in iscsi_target_do_login() from iscsi_np
      process context was blocked waiting for backend I/O to complete.
      
      To address this issue, this patch makes the following changes.
      
      First, it introduces some common helper functions used for checking
      socket closing state, checking login_flags, and atomically checking
      socket closing state + setting login_flags.
      
      Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP
      connection has dropped via iscsi_target_sk_state_change(), but the
      initial PDU processing within iscsi_target_do_login() in iscsi_np
      context is still running.  For this case, it sets LOGIN_FLAGS_CLOSED,
      but doesn't invoke schedule_delayed_work().
      
      The original NULL pointer dereference case reported by MNC is now handled
      by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before
      transitioning to FFP to determine when the socket has already closed,
      or iscsi_target_start_negotiation() if the login needs to exchange
      more PDUs (eg: iscsi_target_do_login returned 0) but the socket has
      closed.  For both of these cases, the cleanup up of remaining connection
      resources will occur in iscsi_target_start_negotiation() from iscsi_np
      process context once the failure is detected.
      
      Finally, to handle to case where iscsi_target_sk_state_change() is
      called after the initial PDU procesing is complete, it now invokes
      conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once
      existing iscsi_target_sk_check_close() checks detect connection failure.
      For this case, the cleanup of remaining connection resources will occur
      in iscsi_target_do_login_rx() from delayed workqueue process context
      once the failure is detected.
      Reported-by: default avatarMike Christie <mchristi@redhat.com>
      Reviewed-by: default avatarMike Christie <mchristi@redhat.com>
      Tested-by: default avatarMike Christie <mchristi@redhat.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Reported-by: default avatarHannes Reinecke <hare@suse.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Varun Prakash <varun@chelsio.com>
      Cc: <stable@vger.kernel.org> # v3.12+
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      25cdda95
  4. 24 May, 2017 1 commit
    • Mike Christie's avatar
      tcmu: fix crash during device removal · f3cdbe39
      Mike Christie authored
      We currently do
      
      tcmu_free_device ->tcmu_netlink_event(TCMU_CMD_REMOVED_DEVICE) ->
      uio_unregister_device -> kfree(tcmu_dev).
      
      The problem is that the kernel does not wait for userspace to
      do the close() on the uio device before freeing the tcmu_dev.
      We can then hit a race where the kernel frees the tcmu_dev before
      userspace does close() and so when close() -> release -> tcmu_release
      is done, we try to access a freed tcmu_dev.
      
      This patch made over the target-pending master branch moves the freeing
      of the tcmu_dev to when the last reference has been dropped.
      
      This also fixes a leak where if tcmu_configure_device was not called on a
      device we did not free udev->name which was allocated at tcmu_alloc_device time.
      Signed-off-by: default avatarMike Christie <mchristi@redhat.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      f3cdbe39
  5. 16 May, 2017 3 commits
    • Nicholas Bellinger's avatar
      target: Re-add check to reject control WRITEs with overflow data · 4ff83daa
      Nicholas Bellinger authored
      During v4.3 when the overflow/underflow check was relaxed by
      commit c72c5250:
      
        commit c72c5250
        Author: Roland Dreier <roland@purestorage.com>
        Date:   Wed Jul 22 15:08:18 2015 -0700
      
             target: allow underflow/overflow for PR OUT etc. commands
      
      to allow underflow/overflow for Windows compliance + FCP, a
      consequence was to allow control CDBs to process overflow
      data for iscsi-target with immediate data as well.
      
      As per Roland's original change, continue to allow underflow
      cases for control CDBs to make Windows compliance + FCP happy,
      but until overflow for control CDBs is supported tree-wide,
      explicitly reject all control WRITEs with overflow following
      pre v4.3.y logic.
      Reported-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: <stable@vger.kernel.org> # v4.3+
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      4ff83daa
    • Bryant G. Ly's avatar
      ibmvscsis: Fix the incorrect req_lim_delta · 75dbf2d3
      Bryant G. Ly authored
      The current code is not correctly calculating the req_lim_delta.
      
      We want to make sure vscsi->credit is always incremented when
      we do not send a response for the scsi op. Thus for the case where
      there is a successfully aborted task we need to make sure the
      vscsi->credit is incremented.
      
      v2 - Moves the original location of the vscsi->credit increment
      to a better spot. Since if we increment credit, the next command
      we send back will have increased req_lim_delta. But we probably
      shouldn't be doing that until the aborted cmd is actually released.
      Otherwise the client will think that it can send a new command, and
      we could find ourselves short of command elements. Not likely, but could
      happen.
      
      This patch depends on both:
      commit 25e78531 ("ibmvscsis: Do not send aborted task response")
      commit 98883f1b ("ibmvscsis: Clear left-over abort_cmd pointers")
      Signed-off-by: default avatarBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichael Cyr <mikecyr@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      75dbf2d3
    • Bryant G. Ly's avatar
      ibmvscsis: Clear left-over abort_cmd pointers · 98883f1b
      Bryant G. Ly authored
      With the addition of ibmvscsis->abort_cmd pointer within
      commit 25e78531 ("ibmvscsis: Do not send aborted task response"),
      make sure to explicitly NULL these pointers when clearing
      DELAY_SEND flag.
      
      Do this for two cases, when getting the new new ibmvscsis
      descriptor in ibmvscsis_get_free_cmd() and before posting
      the response completion in ibmvscsis_send_messages().
      Signed-off-by: default avatarBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichael Cyr <mikecyr@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      98883f1b
  6. 13 May, 2017 5 commits
    • Linus Torvalds's avatar
      Linux 4.12-rc1 · 2ea659a9
      Linus Torvalds authored
      2ea659a9
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · cd636458
      Linus Torvalds authored
      Pull some more input subsystem updates from Dmitry Torokhov:
       "An updated xpad driver with a few more recognized device IDs, and a
        new psxpad-spi driver, allowing connecting Playstation 1 and 2 joypads
        via SPI bus"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: cros_ec_keyb - remove extraneous 'const'
        Input: add support for PlayStation 1/2 joypads connected via SPI
        Input: xpad - add USB IDs for Mad Catz Brawlstick and Razer Sabertooth
        Input: xpad - sync supported devices with xboxdrv
        Input: xpad - sort supported devices by USB ID
      cd636458
    • Linus Torvalds's avatar
      Merge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs · b53c4d5e
      Linus Torvalds authored
      Pull UBI/UBIFS updates from Richard Weinberger:
      
       - new config option CONFIG_UBIFS_FS_SECURITY
      
       - minor improvements
      
       - random fixes
      
      * tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
        ubi: Add debugfs file for tracking PEB state
        ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
        ubifs: Remove unnecessary assignment
        ubifs: Fix cut and paste error on sb type comparisons
        ubi: fastmap: Fix slab corruption
        ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
        ubi: Make mtd parameter readable
        ubi: Fix section mismatch
      b53c4d5e
    • Linus Torvalds's avatar
      Merge branch 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · ec059019
      Linus Torvalds authored
      Pull UML fixes from Richard Weinberger:
       "No new stuff, just fixes"
      
      * 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
        um: Add missing NR_CPUS include
        um: Fix to call read_initrd after init_bootmem
        um: Include kbuild.h instead of duplicating its macros
        um: Fix PTRACE_POKEUSER on x86_64
        um: Set number of CPUs
        um: Fix _print_addr()
      ec059019
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 1251704a
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "15 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, docs: update memory.stat description with workingset* entries
        mm: vmscan: scan until it finds eligible pages
        mm, thp: copying user pages must schedule on collapse
        dax: fix PMD data corruption when fault races with write
        dax: fix data corruption when fault races with write
        ext4: return to starting transaction in ext4_dax_huge_fault()
        mm: fix data corruption due to stale mmap reads
        dax: prevent invalidation of mapped DAX entries
        Tigran has moved
        mm, vmalloc: fix vmalloc users tracking properly
        mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
        gcov: support GCC 7.1
        mm, vmstat: Remove spurious WARN() during zoneinfo print
        time: delete current_fs_time()
        hwpoison, memcg: forcibly uncharge LRU pages
      1251704a
  7. 12 May, 2017 11 commits
    • Roman Gushchin's avatar
      mm, docs: update memory.stat description with workingset* entries · b340959e
      Roman Gushchin authored
      Commit 4b4cea91691d ("mm: vmscan: fix IO/refault regression in cache
      workingset transition") introduced three new entries in memory stat
      file:
      
       - workingset_refault
       - workingset_activate
       - workingset_nodereclaim
      
      This commit adds a corresponding description to the cgroup v2 docs.
      
      Link: http://lkml.kernel.org/r/1494530293-31236-1-git-send-email-guro@fb.comSigned-off-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b340959e
    • Minchan Kim's avatar
      mm: vmscan: scan until it finds eligible pages · 791b48b6
      Minchan Kim authored
      Although there are a ton of free swap and anonymous LRU page in elgible
      zones, OOM happened.
      
        balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
        CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
         oom_kill_process+0x21d/0x3f0
         out_of_memory+0xd8/0x390
         __alloc_pages_slowpath+0xbc1/0xc50
         __alloc_pages_nodemask+0x1a5/0x1c0
         pte_alloc_one+0x20/0x50
         __pte_alloc+0x1e/0x110
         __handle_mm_fault+0x919/0x960
         handle_mm_fault+0x77/0x120
         __do_page_fault+0x27a/0x550
         trace_do_page_fault+0x43/0x150
         do_async_page_fault+0x2c/0x90
         async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:424716 inactive_anon:65314 isolated_anon:0
         active_file:52 inactive_file:46 isolated_file:0
         unevictable:0 dirty:27 writeback:0 unstable:0
         slab_reclaimable:3967 slab_unreclaimable:4125
         mapped:133 shmem:43 pagetables:1674 bounce:0
         free:4637 free_pcp:225 free_cma:0
        Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 992 992 1952
        DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 959
        Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
        DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
        Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
        378 total pagecache pages
        17 pages in swap cache
        Swap cache stats: add 17325, delete 17302, find 0/27
        Free swap  = 978940kB
        Total swap = 1048572kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        12629 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
        [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
        [  433]     0   433     4904        5      14       3       82             0 upstart-udev-br
        [  438]     0   438    12371        5      27       3      191         -1000 systemd-udevd
      
      With investigation, skipping page of isolate_lru_pages makes reclaim
      void because it returns zero nr_taken easily so LRU shrinking is
      effectively nothing and just increases priority aggressively.  Finally,
      OOM happens.
      
      The problem is that get_scan_count determines nr_to_scan with eligible
      zones so although priority drops to zero, it couldn't reclaim any pages
      if the LRU contains mostly ineligible pages.
      
      get_scan_count:
      
              size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
      	size = size >> sc->priority;
      
      Assumes sc->priority is 0 and LRU list is as follows.
      
      	N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H
      
      (Ie, small eligible pages are in the head of LRU but others are
       almost ineligible pages)
      
      In that case, size becomes 4 so VM want to scan 4 pages but 4 pages from
      tail of the LRU are not eligible pages.  If get_scan_count counts
      skipped pages, it doesn't reclaim any pages remained after scanning 4
      pages so it ends up OOM happening.
      
      This patch makes isolate_lru_pages try to scan pages until it encounters
      eligible zones's pages.
      
      [akpm@linux-foundation.org: clean up mind-bending `for' statement.  Tweak comment text]
      Fixes: 3db65812 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
      Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.orgSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      791b48b6
    • David Rientjes's avatar
      mm, thp: copying user pages must schedule on collapse · 338a16ba
      David Rientjes authored
      We have encountered need_resched warnings in __collapse_huge_page_copy()
      while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.
      
      mm->mmap_sem is held for write, but the iteration is well bounded.
      
      Reschedule as needed.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      338a16ba
    • Ross Zwisler's avatar
      dax: fix PMD data corruption when fault races with write · 876f2946
      Ross Zwisler authored
      This is based on a patch from Jan Kara that fixed the equivalent race in
      the DAX PTE fault path.
      
      Currently DAX PMD read fault can race with write(2) in the following
      way:
      
      CPU1 - write(2)                 CPU2 - read fault
                                      dax_iomap_pmd_fault()
                                        ->iomap_begin() - sees hole
      
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      
                                        grab_mapping_entry()
      				  - we add huge zero page to the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e ("dax: Call ->iomap_begin without entry lock during dax fault")
      Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.comSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      876f2946
    • Jan Kara's avatar
      dax: fix data corruption when fault races with write · 13e451fd
      Jan Kara authored
      Currently DAX read fault can race with write(2) in the following way:
      
      CPU1 - write(2)			CPU2 - read fault
      				dax_iomap_pte_fault()
      				  ->iomap_begin() - sees hole
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      				  grab_mapping_entry()
      				  - we add zero page in the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e
      Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.czSigned-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      13e451fd
    • Jan Kara's avatar
      ext4: return to starting transaction in ext4_dax_huge_fault() · fb26a1cb
      Jan Kara authored
      DAX will return to locking exceptional entry before mapping blocks for a
      page fault to fix possible races with concurrent writes.  To avoid lock
      inversion between exceptional entry lock and transaction start, start
      the transaction already in ext4_dax_huge_fault().
      
      Fixes: 9f141d6e
      Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.czSigned-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fb26a1cb
    • Jan Kara's avatar
      mm: fix data corruption due to stale mmap reads · cd656375
      Jan Kara authored
      Currently, we didn't invalidate page tables during invalidate_inode_pages2()
      for DAX.  That could result in e.g. 2MiB zero page being mapped into
      page tables while there were already underlying blocks allocated and
      thus data seen through mmap were different from data seen by read(2).
      The following sequence reproduces the problem:
      
       - open an mmap over a 2MiB hole
      
       - read from a 2MiB hole, faulting in a 2MiB zero page
      
       - write to the hole with write(3p). The write succeeds but we
         incorrectly leave the 2MiB zero page mapping intact.
      
       - via the mmap, read the data that was just written. Since the zero
         page mapping is still intact we read back zeroes instead of the new
         data.
      
      Fix the problem by unconditionally calling invalidate_inode_pages2_range()
      in dax_iomap_actor() for new block allocations and by properly
      invalidating page tables in invalidate_inode_pages2_range() for DAX
      mappings.
      
      Fixes: c6dcf52c
      Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.czSigned-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd656375
    • Ross Zwisler's avatar
      dax: prevent invalidation of mapped DAX entries · 4636e70b
      Ross Zwisler authored
      Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
      v4.
      
      This series fixes data corruption that can happen for DAX mounts when
      page faults race with write(2) and as a result page tables get out of
      sync with block mappings in the filesystem and thus data seen through
      mmap is different from data seen through read(2).
      
      The series passes testing with t_mmap_stale test program from Ross and
      also other mmap related tests on DAX filesystem.
      
      This patch (of 4):
      
      dax_invalidate_mapping_entry() currently removes DAX exceptional entries
      only if they are clean and unlocked.  This is done via:
      
        invalidate_mapping_pages()
          invalidate_exceptional_entry()
            dax_invalidate_mapping_entry()
      
      However, for page cache pages removed in invalidate_mapping_pages()
      there is an additional criteria which is that the page must not be
      mapped.  This is noted in the comments above invalidate_mapping_pages()
      and is checked in invalidate_inode_page().
      
      For DAX entries this means that we can can end up in a situation where a
      DAX exceptional entry, either a huge zero page or a regular DAX entry,
      could end up mapped but without an associated radix tree entry.  This is
      inconsistent with the rest of the DAX code and with what happens in the
      page cache case.
      
      We aren't able to unmap the DAX exceptional entry because according to
      its comments invalidate_mapping_pages() isn't allowed to block, and
      unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
      
      Since we essentially never have unmapped DAX entries to evict from the
      radix tree, just remove dax_invalidate_mapping_entry().
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.czSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reported-by: default avatarJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4636e70b
    • Andrew Morton's avatar
      Tigran has moved · cea58224
      Andrew Morton authored
      Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cea58224
    • Michal Hocko's avatar
      mm, vmalloc: fix vmalloc users tracking properly · 8594a21c
      Michal Hocko authored
      Commit 1f5307b1 ("mm, vmalloc: properly track vmalloc users") has
      pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
      turned out to be a bad idea for some architectures.  E.g.  m68k fails
      with
      
         In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
                          from arch/m68k/include/asm/pgtable.h:4,
                          from include/linux/vmalloc.h:9,
                          from arch/m68k/kernel/module.c:9:
         arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
      >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
          #define pgd_offset_k(address) pgd_offset(&init_mm, address)
      
      as spotted by kernel build bot. nios2 fails for other reason
      
        In file included from include/asm-generic/io.h:767:0,
                         from arch/nios2/include/asm/io.h:61,
                         from include/linux/io.h:25,
                         from arch/nios2/include/asm/pgtable.h:18,
                         from include/linux/mm.h:70,
                         from include/linux/pid_namespace.h:6,
                         from include/linux/ptrace.h:9,
                         from arch/nios2/include/uapi/asm/elf.h:23,
                         from arch/nios2/include/asm/elf.h:22,
                         from include/linux/elf.h:4,
                         from include/linux/module.h:15,
                         from init/main.c:16:
        include/linux/vmalloc.h: In function '__vmalloc_node_flags':
        include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?
      
      which is due to the newly added #include <asm/pgtable.h>, which on nios2
      includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h> which
      again includes <linux/vmalloc.h>.
      
      Tweaking that around just turns out a bigger headache than necessary.
      This patch reverts 1f5307b1 and reimplements the original fix in a
      different way.  __vmalloc_node_flags can stay static inline which will
      cover vmalloc* functions.  We only have one external user
      (kvmalloc_node) and we can export __vmalloc_node_flags_caller and
      provide the caller directly.  This is much simpler and it doesn't really
      need any games with header files.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@kernel.org: revert old comment]
        Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
      Fixes: 1f5307b1 ("mm, vmalloc: properly track vmalloc users")
      Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.czSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tobias Klauser <tklauser@distanz.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8594a21c
    • SeongJae Park's avatar
      mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin · 835152a2
      SeongJae Park authored
      One return case of `__collapse_huge_page_swapin()` does not invoke
      tracepoint while every other return case does.  This commit adds a
      tracepoint invocation for the case.
      
      Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.comSigned-off-by: default avatarSeongJae Park <sj38.park@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      835152a2