1. 16 Mar, 2017 3 commits
  2. 15 Mar, 2017 7 commits
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 69eea5a4
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Four small fixes for this cycle:
      
         - followup fix from Neil for a fix that went in before -rc2, ensuring
           that we always see the full per-task bio_list.
      
         - fix for blk-mq-sched from me that ensures that we retain similar
           direct-to-issue behavior on running the queue.
      
         - fix from Sagi fixing a potential NULL pointer dereference in blk-mq
           on spurious CPU unplug.
      
         - a memory leak fix in writeback from Tahsin, fixing a case where
           device removal of a mounted device can leak a struct
           wb_writeback_work"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
        writeback: fix memory leak in wb_queue_work()
        blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
        blk: Ensure users for current->bio_list can see the full list.
    • Merge tag 'perf-core-for-mingo-4.12-20170314' of... · ffa86c2f
      Ingo Molnar authored
      Merge tag 'perf-core-for-mingo-4.12-20170314' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
      
      Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
      
      New features:
      
      - Add PERF_RECORD_NAMESPACES so that the kernel can record information
        required to associate samples to namespaces, helping in container
        problem characterization.
      
        'perf record' now has a --namespaces option to ask for such info,
        and when present, it can be used, initially, via a new sort order,
        'cgroup_id', allowing histogram entry bucketization by a (device, inode)
        based cgroup identifier (Hari Bathini)
      
      - Add --next option to 'perf sched timehist', showing what is the next
        thread to run (Brendan Gregg)
      
      Fixes:
      
      - Fix segfault with basic block 'cycles' sort dimension (Changbin Du)
      
      - Add c2c to command-list.txt, making it appear in the 'perf help'
        output (Changbin Du)
      
      - Fix zeroing of 'abs_path' variable in the perf hists browser switch
        file code (Changbin Du)
      
      - Hide tips messages when -q/--quiet is given to 'perf report' (Namhyung Kim)
      
      Infrastructure changes:
      
      - Use ref_reloc_sym + offset to setup kretprobes (Naveen Rao)
      
      - Ignore generated files pmu-events/{jevents,pmu-events.c} for git (Changbin Du)
      
      Documentation changes:
      
      - Document +field style argument support for --field option (Changbin Du)
      
      - Clarify 'perf c2c --stats' help message (Namhyung Kim)
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 95422dec
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "This is a rather large set of fixes. The bulk are for lpfc correcting
        a lot of issues in the new NVME driver code which just went in in the
        merge window.
      
        The others are:
      
         - fix a hang in the vmware paravirt driver caused by incorrect
           handling of the new MSI vector allocation
      
         - long standing bug in storvsc, which recent block changes turned
           from being a harmless annoyance into a hang
      
         - yet more fallout (in mpt3sas) from the changes to device blocking
      
        The remainder are small fixes and updates"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (34 commits)
        scsi: lpfc: Add shutdown method for kexec
        scsi: storvsc: Workaround for virtual DVD SCSI version
        scsi: lpfc: revise version number to 11.2.0.10
        scsi: lpfc: code cleanups in NVME initiator discovery
        scsi: lpfc: code cleanups in NVME initiator base
        scsi: lpfc: correct rdp diag portnames
        scsi: lpfc: remove dead sli3 nvme code
        scsi: lpfc: correct double print
        scsi: lpfc: Rename LPFC_MAX_EQ_DELAY to LPFC_MAX_EQ_DELAY_EQID_CNT
        scsi: lpfc: Rework lpfc Kconfig for NVME options
        scsi: lpfc: add transport eh_timed_out reference
        scsi: lpfc: Fix eh_deadline setting for sli3 adapters.
        scsi: lpfc: add NVME exchange aborts
        scsi: lpfc: Fix nvme allocation bug on failed nvme_fc_register_localport
        scsi: lpfc: Fix IO submission if WQ is full
        scsi: lpfc: Fix NVME CMD IU byte swapped word 1 problem
        scsi: lpfc: Fix RCTL value on NVME LS request and response
        scsi: lpfc: Fix crash during Hardware error recovery on SLI3 adapters
        scsi: lpfc: fix missing spin_unlock on sql_list_lock
        scsi: lpfc: don't dereference dma_buf->iocbq before null check
        ...
    • Merge tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · aabcf5fc
      Linus Torvalds authored
      Pull gfs2 fix from Bob Peterson:
       "This is an emergency patch for 4.11-rc3
      
        The GFS2 developers uncovered a really nasty problem that can lead to
        random corruption and kernel panic, much like the last one. Andreas
        Gruenbacher wrote a simple one-line patch to fix the problem."
      
      * tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
        gfs2: Avoid alignment hole in struct lm_lockname
    • Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · defc7d75
      Linus Torvalds authored
      Pull crypto fixes from Herbert Xu:
      
       - self-test failure of crc32c on powerpc
      
       - regressions of ecb(aes) when used with xts/lrw in s5p-sss
      
       - a number of bugs in the omap RNG driver
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: s5p-sss - Fix spinlock recursion on LRW(AES)
        hwrng: omap - Do not access INTMASK_REG on EIP76
        hwrng: omap - use devm_clk_get() instead of of_clk_get()
        hwrng: omap - write registers after enabling the clock
        crypto: s5p-sss - Fix completing crypto request in IRQ handler
        crypto: powerpc - Fix initialisation of crc32c context
    • gfs2: Avoid alignment hole in struct lm_lockname · 28ea06c4
      Andreas Gruenbacher authored
      Commit 88ffbf3e switches to using rhashtables for glocks, hashing over
      the entire struct lm_lockname instead of its individual fields.  On some
      architectures, struct lm_lockname contains a hole of uninitialized
      memory due to alignment rules, which now leads to incorrect hash values.
      Get rid of that hole.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      CC: <stable@vger.kernel.org> #v4.3+
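To illustrate the hole described in the gfs2 commit message, here is a minimal user-space sketch. The struct `demo_lockname`, its fields, and the helpers `fnv1a()`, `demo_key_len()` and `demo_hash_with_junk()` are hypothetical names invented for this example, not the gfs2 definitions. The only point is that hashing `sizeof(struct ...)` bytes can pick up padding, while hashing up to the end of the last real field does not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lock name: on typical LP64 ABIs the 4-byte 'type' is
 * followed by 4 bytes of trailing padding, so sizeof() is 24, not 20. */
struct demo_lockname {
	uint64_t number;
	void *owner;
	uint32_t type;
};

/* Number of bytes that carry real key data (everything up to the end
 * of the last field; this layout has no interior holes on common ABIs). */
static size_t demo_key_len(void)
{
	return offsetof(struct demo_lockname, type) + sizeof(uint32_t);
}

/* 32-bit FNV-1a over an arbitrary byte range. */
static uint32_t fnv1a(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Fill a name whose bytes were pre-set to 'junk', mimicking
 * uninitialized memory, then hash only the meaningful prefix. */
static uint32_t demo_hash_with_junk(unsigned char junk)
{
	struct demo_lockname name;

	memset(&name, junk, sizeof(name));
	name.number = 42;
	name.owner = NULL;
	name.type = 7;
	return fnv1a(&name, demo_key_len());
}
```

Calling `demo_hash_with_junk(0x00)` and `demo_hash_with_junk(0xAA)` yields the same value because the padding never enters the hash. Hashing `sizeof(name)` bytes instead would give no such guarantee.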
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ae50dfd6
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
          from Steffen Klassert.
      
       2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
          Alexei Starovoitov.
      
       3) Fix detection of VLAN filtering feature for bnx2x VF devices, from
          Michal Schmidt.
      
       4) We can get a divide by zero when TCP sockets are morphed into
          listening state, fix from Eric Dumazet.
      
       5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
          skb_complete_tx_timestamp(). From Eric Dumazet.
      
       6) Use after free in dccp_feat_activate_values(), also from Eric
          Dumazet.
      
       7) Like bonding, team needs to use ETH_MAX_MTU as netdev->max_mtu, from
          Jarod Wilson.
      
       8) Fix use after free in vrf_xmit(), from David Ahern.
      
       9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
          Alexey Kodanev.
      
      10) Properly check napi_complete_done() return value in order to decide
          whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
          Lendacky.
      
      11) Fix double free of hwmon device in marvell phy driver, from Andrew
          Lunn.
      
      12) Don't crash on malformed netlink attributes in act_connmark, from
          Etienne Noss.
      
      13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
          from Sabrina Dubroca.
      
      14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
          Florian Westphal.
      
      15) Fix routing redirect races in dccp and tcp, basically the ICMP
          handler can't modify the socket's cached route if it's locked by the
          user at this moment. From Jon Maxwell.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
        qed: Enable iSCSI Out-of-Order
        qed: Correct out-of-bound access in OOO history
        qed: Fix interrupt flags on Rx LL2
        qed: Free previous connections when releasing iSCSI
        qed: Fix mapping leak on LL2 rx flow
        qed: Prevent creation of too-big u32-chains
        qed: Align CIDs according to DORQ requirement
        mlxsw: reg: Fix SPVMLR max record count
        mlxsw: reg: Fix SPVM max record count
        net: Resend IGMP memberships upon peer notification.
        dccp: fix memory leak during tear-down of unsuccessful connection request
        tun: fix premature POLLOUT notification on tun devices
        dccp/tcp: fix routing redirect race
        ucc/hdlc: fix two little issue
        vxlan: fix ovs support
        net: use net->count to check whether a netns is alive or not
        bridge: drop netfilter fake rtable unconditionally
        ipv6: avoid write to a possibly cloned skb
        net: wimax/i2400m: fix NULL-deref at probe
        isdn/gigaset: fix NULL-deref at probe
        ...
  3. 14 Mar, 2017 30 commits
    • Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 352526f4
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Three cgroup fixes.  Nothing critical:
      
         - the pids controller could trigger suspicious RCU warning
           spuriously. Fixed.
      
         - in the debug controller, %p -> %pK to protect kernel pointer
           from getting exposed.
      
         - documentation formatting fix"
      
      * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroups: censor kernel pointer in debug files
        cgroup/pids: remove spurious suspicious RCU usage warning
        cgroup: Fix indenting in PID controller documentation
    • Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · 6517569d
      Linus Torvalds authored
      Pull libata fixes from Tejun Heo:
       "Three libata fixes:
      
         - fix for a circular reference bug in sysfs code which prevented
           pata_legacy devices from being released after probe failure, which
           in turn prevented devres from releasing the associated resources.
      
         - drop spurious WARN in the command issue path which can be triggered
           by a legitimate passthrough command.
      
         - an ahci_qoriq specific fix"
      
      * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
        ahci: qoriq: correct the sata ecc setting error
        libata: drop WARN from protocol error in ata_sff_qc_issue()
        libata: transport: Remove circular dependency at free time
    • Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · bc258879
      Linus Torvalds authored
      Pull workqueue fix from Tejun Heo:
       "If a delayed work is queued with NULL @wq, workqueue code explodes
        after the timer expires at which point it's difficult to tell who the
        culprit was.
      
        This actually happened and the offender was net/smc this time.
      
        Add an explicit sanity check for it in the queueing path"
      
      * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
    • Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu · 83e63226
      Linus Torvalds authored
      Pull percpu fixes from Tejun Heo:
      
       - the allocation path was updating pcpu_nr_empty_pop_pages without the
         required locking which can lead to incorrect handling of empty chunks
         (e.g. keeping too many around), which is buggy but shouldn't lead to
         critical failures. Fixed by adding the locking
      
       - a trivial patch to drop an unused param from pcpu_get_pages()
      
      * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
        percpu: remove unused chunk_alloc parameter from pcpu_get_pages()
        percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages
    • Merge branch 'qed-fixes' · 1e6a1cd8
      David S. Miller authored
      Yuval Mintz says:
      
      ====================
      qed: Fixes series
      
      This addresses several different issues in qed.
      The more significant portions:
      
      Patch #1 would cause timeout when qedr utilizes the highest
      CIDs available for it [or when future qede adapters would utilize
      queues in some constellations].
      
      Patch #4 fixes a leak of mapped addresses; When iommu is enabled,
      offloaded storage protocols might eventually run out of resources
      and fail to map additional buffers.
      
      Patches #6,#7 were missing in the initial iSCSI infrastructure
      submissions, and would hamper qedi's stability when it reaches
      out-of-order scenarios.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Enable iSCSI Out-of-Order · 6b116b1d
      Mintz, Yuval authored
      Missing in the initial submission, qed fails to propagate qedi's
      request to enable OOO to firmware.
      
      Fixes: fc831825 ("qed: Add support for hardware offloaded iSCSI")
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Correct out-of-bound access in OOO history · db31d330
      Mintz, Yuval authored
      Need to set the number of entries in database, otherwise the logic
      would quickly surpass the array.
      
      Fixes: 1d6cff4f ("qed: Add iSCSI out of order packet handling")
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Fix interrupt flags on Rx LL2 · 1df2aded
      Ram Amrani authored
      Before iterating over the LL2 Rx ring, the ring's
      spinlock is taken via spin_lock_irqsave().
      The actual processing of the packet [including handling
      by the protocol driver] is done without said lock,
      so qed releases the spinlock and re-claims it afterwards.
      
      Problem is that the final spin_unlock_irqrestore() at the end
      of the iteration uses the original flags saved from the
      initial irqsave() instead of the flags from the most recent
      irqsave(). So it's possible that the interrupt status would
      be incorrect at the end of the processing.
      
      Fixes: 0a7fb11c ("qed: Add Light L2 support");
      CC: Ram Amrani <Ram.Amrani@cavium.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Free previous connections when releasing iSCSI · 4621ceb2
      Mintz, Yuval authored
      Fixes: fc831825 ("qed: Add support for hardware offloaded iSCSI")
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Fix mapping leak on LL2 rx flow · 752ecb2d
      Mintz, Yuval authored
      When receiving an Rx LL2 packet, qed fails to unmap the previous buffer.
      
      Fixes: 0a7fb11c ("qed: Add Light L2 support");
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Prevent creation of too-big u32-chains · 3ef310a7
      Tomer Tayar authored
      The current logic would allow the creation of a chain with U32_MAX + 1
      elements, when the actual maximum supported by the driver infrastructure
      is U32_MAX.
      
      Fixes: a91eb52a ("qed: Revisit chain implementation")
      Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
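The off-by-one described in this commit message is easy to reproduce in isolation. The following sketch is not the qed code; `chain_size_fits_u32()` and its parameters are hypothetical. It computes the element count in 64 bits and rejects anything above `UINT32_MAX`, so a chain of exactly `UINT32_MAX + 1` elements is refused.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a chain of page_cnt pages, each holding elems_per_page
 * elements, can be indexed with a 32-bit counter. The product is formed
 * in 64 bits so that it cannot wrap before the comparison. */
static bool chain_size_fits_u32(uint32_t elems_per_page, uint32_t page_cnt)
{
	uint64_t chain_size = (uint64_t)elems_per_page * page_cnt;

	/* A bound of UINT32_MAX + 1 here would let 2^32 elements through. */
	return chain_size <= UINT32_MAX;
}
```

For instance, 65535 elements per page times 65537 pages is exactly `UINT32_MAX` and is accepted, while 65536 times 65536 is one element too many and is rejected.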
    • qed: Align CIDs according to DORQ requirement · f3e48119
      Ram Amrani authored
      The Doorbell HW block can be configured at a granularity
      of 16 x CIDs, so we need to make sure that the actual number
      of CIDs configured would be a multiple of 16.
      
      Today, when RoCE is enabled - given that the number is unaligned,
      doorbelling the higher CIDs would fail to reach the firmware and
      would eventually timeout.
      
      Fixes: dbb799c3 ("qed: Initialize hardware for new protocols")
      Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
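Rounding a count up to the doorbell granularity can be sketched as follows. `DEMO_DORQ_CID_GRANULARITY` and `align_cid_count()` are hypothetical names for this example; only the value 16 comes from the commit message, and the real driver may well use an existing kernel helper instead.

```c
#include <stdint.h>

/* Granularity taken from the commit message: the doorbell block is
 * configured in units of 16 CIDs. The macro name is made up here. */
#define DEMO_DORQ_CID_GRANULARITY 16u

/* Round count up to the next multiple of the granularity. Works because
 * the granularity is a power of two; counts within 15 of UINT32_MAX
 * would wrap and are not handled in this sketch. */
static uint32_t align_cid_count(uint32_t count)
{
	const uint32_t mask = DEMO_DORQ_CID_GRANULARITY - 1u;

	return (count + mask) & ~mask;
}
```

With this helper, a request for 17 CIDs is configured as 32, so the highest CID still falls inside a fully configured doorbell unit.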
    • Merge branch 'mlxsw-small-fixes' · a8aa3953
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      mlxsw: Couple of fixes
      
      Couple of small fixes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Fix SPVMLR max record count · e9093b11
      Jiri Pirko authored
      The num_rec field is 8 bit, so the maximal count number is 255.
      This fixes vlans learning not being enabled for wider ranges than 255.
      
      Fixes: a4feea74 ("mlxsw: reg: Add Switch Port VLAN MAC Learning register definition")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Fix SPVM max record count · f004ec06
      Jiri Pirko authored
      The num_rec field is 8 bit, so the maximal count number is 255. This
      fixes vlans not being enabled for wider ranges than 255.
      
      Fixes: b2e345f9 ("mlxsw: reg: Add Switch Port VID and Switch Port VLAN Membership registers definitions")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
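Why 255 rather than 256: an 8-bit field cannot represent 256 (the value wraps to 0), so a caller that wants to cover a wider range has to split it into several register writes. The sketch below assumes a hypothetical `count_reg_writes()` helper and `DEMO_REC_MAX_COUNT` constant; it is not the mlxsw code, only an illustration of the chunking this limit implies.

```c
#include <stdint.h>

/* Largest value that fits the 8-bit num_rec field (name is made up). */
#define DEMO_REC_MAX_COUNT 255u

/* Number of register writes needed to cover VIDs vid_begin..vid_end
 * (inclusive) when each write carries at most DEMO_REC_MAX_COUNT
 * records. Returns 0 for an empty (reversed) range. */
static unsigned int count_reg_writes(uint16_t vid_begin, uint16_t vid_end)
{
	unsigned int total, writes = 0;

	if (vid_begin > vid_end)
		return 0;

	total = (unsigned int)vid_end - vid_begin + 1u;
	while (total > 0) {
		unsigned int chunk = total < DEMO_REC_MAX_COUNT ?
				     total : DEMO_REC_MAX_COUNT;
		uint8_t num_rec = (uint8_t)chunk; /* always fits: chunk <= 255 */

		total -= num_rec;
		writes++;
	}
	return writes;
}
```

Covering VIDs 1 through 4094 this way takes 17 writes: sixteen full chunks of 255 records and a final chunk of 14.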
    • net: Resend IGMP memberships upon peer notification. · 37c343b4
      Vlad Yasevich authored
      When we notify peers of potential changes, it's also good to update
      IGMP memberships.  For example, during VM migration, updating IGMP
      memberships will redirect existing multicast streams to the VM at the
      new location.
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL · 5f6bee34
      Naveen N. Rao authored
      commit fc62d020 ("kprobes: Introduce weak variant of
      kprobe_exceptions_notify()") used the __kprobes annotation to exclude
      kprobe_exceptions_notify from being probed. Since NOKPROBE_SYMBOL() is a
      better way to do this, enabling the symbol to be discovered as being
      blacklisted, change over to using NOKPROBE_SYMBOL().
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/3f25bf400da5c222cd9b10eec6ded2d6b58209f8.1488991670.git.naveen.n.rao@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • doc: trace/kprobes: add information about NOKPROBE_SYMBOL · c1ac094d
      Naveen N. Rao authored
      Update kprobe tracer documentation to also mention that
      NOKPROBE_SYMBOL() and nokprobe_inline add symbols to the kprobes
      blacklist.
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/d924e20de099579ace4286e610304f054cd798db.1488991670.git.naveen.n.rao@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf powerpc: Choose local entry point with kretprobes · 44ca9341
      Naveen N. Rao authored
      perf now uses an offset from _text/_stext for kretprobes if the kernel
      supports it, rather than the actual function name. As such, let's choose
      the LEP for powerpc ABIv2 so as to ensure the probe gets hit. Do it only
      if the kernel supports specifying offsets with kretprobes.
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/7445b5334673ef5404ac1d12609bad4d73d2b567.1488961018.git.naveen.n.rao@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf kretprobes: Offset from reloc_sym if kernel supports it · 7ab31d94
      Naveen N. Rao authored
      We indicate support for accepting sym+offset with kretprobes through a
      line in ftrace README. Parse the same to identify support and choose the
      appropriate format for kprobe_events.
      
      As an example, without this perf patch, but with the ftrace changes:
      
        naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/tracing/README | grep kretprobe
        place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>
        naveen@ubuntu:~/linux/tools/perf$
        naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
        probe-definition(0): do_open%return
        symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
        0 arguments
        Looking at the vmlinux_path (8 entries long)
        Using /boot/vmlinux for symbols
        Open Debuginfo file: /boot/vmlinux
        Try to find probe point from debuginfo.
        Matched function: do_open [2d0c7d8]
        Probe point found: do_open+0
        Matched function: do_open [35d76b5]
        found inline addr: 0xc0000000004ba984
        Failed to find "do_open%return",
         because do_open is an inlined function and has no return point.
        An error occurred in debuginfo analysis (-22).
        Trying to use symbols.
        Opening /sys/kernel/debug/tracing//kprobe_events write=1
        Writing event: r:probe/do_open do_open+0
        Writing event: r:probe/do_open_1 do_open+0
        Added new events:
          probe:do_open        (on do_open%return)
          probe:do_open_1      (on do_open%return)
      
        You can now use it in all perf tools, such as:
      
      	  perf record -e probe:do_open_1 -aR sleep 1
      
        naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
        c000000000041370  k  kretprobe_trampoline+0x0    [OPTIMIZED]
        c0000000004433d0  r  do_open+0x0    [DISABLED]
        c0000000004433d0  r  do_open+0x0    [DISABLED]
      
      And after this patch (and the subsequent powerpc patch):
      
        naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
        probe-definition(0): do_open%return
        symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
        0 arguments
        Looking at the vmlinux_path (8 entries long)
        Using /boot/vmlinux for symbols
        Open Debuginfo file: /boot/vmlinux
        Try to find probe point from debuginfo.
        Matched function: do_open [2d0c7d8]
        Probe point found: do_open+0
        Matched function: do_open [35d76b5]
        found inline addr: 0xc0000000004ba984
        Failed to find "do_open%return",
         because do_open is an inlined function and has no return point.
        An error occurred in debuginfo analysis (-22).
        Trying to use symbols.
        Opening /sys/kernel/debug/tracing//README write=0
        Opening /sys/kernel/debug/tracing//kprobe_events write=1
        Writing event: r:probe/do_open _text+4469712
        Writing event: r:probe/do_open_1 _text+4956248
        Added new events:
          probe:do_open        (on do_open%return)
          probe:do_open_1      (on do_open%return)
      
        You can now use it in all perf tools, such as:
      
      	  perf record -e probe:do_open_1 -aR sleep 1
      
        naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
        c000000000041370  k  kretprobe_trampoline+0x0    [OPTIMIZED]
        c0000000004433d0  r  do_open+0x0    [DISABLED]
        c0000000004ba058  r  do_open+0x8    [DISABLED]
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/496ef9f33c1ab16286ece9dd62aa672807aef91c.1488961018.git.naveen.n.rao@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
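The README check described at the top of this commit message can be approximated in user space as follows. This is a simplified sketch, not perf's implementation: the function name `readme_has_kretprobe_offset()` is hypothetical, and plain substring matching stands in for whatever matching perf actually uses. The line it looks for is the one shown in the `grep kretprobe` output of the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Scan README text line by line and report whether any line both
 * describes the kretprobe 'place' syntax and mentions '+<offset>'. */
static bool readme_has_kretprobe_offset(const char *readme)
{
	const char *line = readme;

	while (*line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);
		char buf[256];

		if (len >= sizeof(buf))
			len = sizeof(buf) - 1;	/* truncate overlong lines */
		memcpy(buf, line, len);
		buf[len] = '\0';

		if (strstr(buf, "place (kretprobe):") &&
		    strstr(buf, "+<offset>"))
			return true;

		if (!eol)
			break;
		line = eol + 1;
	}
	return false;
}
```

A caller would read the tracing README into a buffer once, pass it to this function, and pick the `_text+offset` form for `kprobe_events` only when it returns true, falling back to the symbol name otherwise.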
    • perf probe: Factor out the ftrace README scanning · 3da3ea7a
      Naveen N. Rao authored
      Simplify and separate out the ftrace README scanning logic into a
      separate helper. This is used subsequently to scan for all patterns of
      interest and to cache the result.
      
      Since we are only interested in availability of probe argument type x,
      we will only scan for that.
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/6dc30edc747ba82a236593be6cf3a046fa9453b5.1488961018.git.naveen.n.rao@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf sched timehist: Add --next option · 292c4a8f
      Brendan Gregg authored
      The --next option shows the next task for each context switch, providing
      more context for the sequence of scheduler events.
      
        $ perf sched timehist --next | head
        Samples do not have callchains.
             time  cpu task name  waittime schdelay run time
                       [tid/pid]     (msec) (msec) (msec)
        ---------- --- ---------- --------- ------ -----
        374.793792 [0] <idle>         0.000  0.000 0.000 next: rngd[1524]
        374.793801 [0] rngd[1524]     0.000  0.000 0.009 next: swapper/0[0]
        374.794048 [7] <idle>         0.000  0.000 0.000 next: yes[30884]
        374.794066 [7] yes[30884]     0.000  0.000 0.018 next: swapper/7[0]
        374.794126 [2] <idle>         0.000  0.000 0.000 next: rngd[1524]
        374.794140 [2] rngd[1524]     0.325  0.006 0.013 next: swapper/2[0]
        374.794281 [3] <idle>         0.000  0.000 0.000 next: perf[31070]
      Signed-off-by: Brendan Gregg <bgregg@netflix.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1489456589-32555-1-git-send-email-bgregg@netflix.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf tools: Add 'cgroup_id' sort order keyword · d890a98c
      Hari Bathini authored
      This patch introduces a cgroup identifier entry field in perf report to
      identify or distinguish data of different cgroups. It uses the device
      number and inode number of cgroup namespace, included in perf data with
      the new PERF_RECORD_NAMESPACES event, as cgroup identifier.
      
      With the assumption that each container is created with its own cgroup
      namespace, this allows assessment/analysis of multiple containers at
      once.
      
      A simple test for this would be to clone a few processes passing
      SIGCHLD & CLONE_NEWCGROUP flags to each of them, execute shell and run
      different workloads on each of those contexts, while running perf
      record command with --namespaces option.
      
      Shown below is the output of perf report, sorted with cgroup identifier,
      on perf.data generated with the above test scenario, clearly indicating
      one context's considerable use of kernel memory in comparison with
      others:
      
      	$ perf report -s cgroup_id,sample --stdio
      	#
      	# Total Lost Samples: 0
      	#
      	# Samples: 5K of event 'kmem:kmalloc'
      	# Event count (approx.): 5965
      	#
      	# Overhead  cgroup id (dev/inode)       Samples
      	# ........  .....................  ............
      	#
      	    81.27%  3/0xeffffffb                   4848
      	    16.24%  3/0xf00000d0                    969
      	     1.16%  3/0xf00000ce                     69
      	     0.82%  3/0xf00000cf                     49
      	     0.50%  0/0x0                            30
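
The Overhead column above can be cross-checked against the Samples column. The following is a minimal sketch, assuming every kmem:kmalloc sample carries the same weight so that overhead is simply a row's share of all samples; the names REPORT, parse_rows and overheads are made up for this illustration and are not part of perf:

```python
import re

# Data rows copied from the report above (header and comment lines omitted).
REPORT = """\
    81.27%  3/0xeffffffb                   4848
    16.24%  3/0xf00000d0                    969
     1.16%  3/0xf00000ce                     69
     0.82%  3/0xf00000cf                     49
     0.50%  0/0x0                            30
"""

# overhead%, dev/0xinode, samples
ROW = re.compile(r"^\s*([\d.]+)%\s+(\d+)/0x([0-9a-f]+)\s+(\d+)\s*$")


def parse_rows(text):
    """Return a list of (dev, inode, samples) tuples, one per data row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # anything that is not a data row is skipped
            rows.append((int(m.group(2)), int(m.group(3), 16), int(m.group(4))))
    return rows


def overheads(rows):
    """Recompute each row's share of the total sample count, in percent."""
    total = sum(samples for _, _, samples in rows)
    return [round(100.0 * samples / total, 2) for _, _, samples in rows]


rows = parse_rows(REPORT)
print(sum(samples for _, _, samples in rows))
print(overheads(rows))
```

The sample counts add up to the 5965 reported as the approximate event count, and the recomputed shares agree with the printed Overhead values.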
      
      While this is a start, there is scope for further improvement. For
      example, instead of the cgroup namespace's device and inode numbers, the
      device and inode numbers of some or all namespaces may be used to
      distinguish which processes are running in a given container context.
      
      Also, scripts that map device and inode info to containers sound
      plausible for better tracing of containers.
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891933338.25309.756882900782042645.stgit@hbathini.in.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      d890a98c
    • Hari Bathini's avatar
      perf script: Add script print support for namespace events · 96a44bbc
      Hari Bathini authored
      Introduce a new option to display events of type PERF_RECORD_NAMESPACES
      and update perf-script documentation accordingly.
      
      Shown below is output (trimmed) of perf script command with the newly
      introduced option, on perf.data generated with perf record command using
      --namespaces option.
      
        $ perf script --show-namespace-events
            swapper   0 [000]     0.000000: PERF_RECORD_NAMESPACES 1/1 - nr_namespaces: 7
                      [0/net: 3/0xf000001c, 1/uts: 3/0xeffffffe, 2/ipc: 3/0xefffffff, 3/pid: 3/0xeffffffc,
                       4/user: 3/0xeffffffd, 5/mnt: 3/0xf0000000, 6/cgroup: 3/0xeffffffb]
            swapper   0 [000]     0.000000: PERF_RECORD_NAMESPACES 2/2 - nr_namespaces: 7
                      [0/net: 3/0xf000001c, 1/uts: 3/0xeffffffe, 2/ipc: 3/0xefffffff, 3/pid: 3/0xeffffffc,
                       4/user: 3/0xeffffffd, 5/mnt: 3/0xf0000000, 6/cgroup: 3/0xeffffffb]
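
For post-processing, the bracketed namespace list printed above can be turned into a lookup table. This is only a sketch that assumes the "index/name: dev/0xinode" text layout shown in the output; ENTRY and parse_namespaces are hypothetical names, not perf APIs:

```python
import re

# One "index/name: dev/0xinode" entry of the bracketed list.
ENTRY = re.compile(r"(\d+)/(\w+): (\d+)/0x([0-9a-f]+)")


def parse_namespaces(text):
    """Map namespace name -> (device, inode) for one printed record."""
    return {
        name: (int(dev), int(ino, 16))
        for _idx, name, dev, ino in ENTRY.findall(text)
    }


# Namespace list copied from the first record above.
record = (
    "[0/net: 3/0xf000001c, 1/uts: 3/0xeffffffe, 2/ipc: 3/0xefffffff, "
    "3/pid: 3/0xeffffffc, 4/user: 3/0xeffffffd, 5/mnt: 3/0xf0000000, "
    "6/cgroup: 3/0xeffffffb]"
)

ns = parse_namespaces(record)
dev, ino = ns["cgroup"]
print(f"{dev}/{ino:#x}")
```

The cgroup entry is printed back in the same dev/inode notation that the records use.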
      
      Committer notes:
      
      Testing it:
      
      Investigating that double PERF_RECORD_NAMESPACES for the 19155
      pid/tid... It's more than that, there are two PERF_RECORD_COMM as well,
      and with zeroed timestamps, so probably a synthesizing artifact...
      
        # perf script --show-task --show-namespace
        <SNIP>
            perf     0 [000]     0.000000: PERF_RECORD_COMM: perf:19154/19154
            perf     0 [000]     0.000000: PERF_RECORD_FORK(19155:19155):(19154:19154)
            perf     0 [000]     0.000000: PERF_RECORD_NAMESPACES 19155/19155 - nr_namespaces: 7
                [0/net: 3/0xf0000081, 1/uts: 3/0xeffffffe, 2/ipc: 3/0xefffffff, 3/pid: 3/0xeffffffc,
                 4/user: 3/0xeffffffd, 5/mnt: 3/0xf0000000, 6/cgroup: 3/0xeffffffb]
            perf     0 [000]     0.000000: PERF_RECORD_COMM: perf:19155/19155
            perf     0 [000]     0.000000: PERF_RECORD_COMM: perf:19155/19155
            perf     0 [000]     0.000000: PERF_RECORD_NAMESPACES 19155/19155 - nr_namespaces: 7
                [0/net: 3/0xf0000081, 1/uts: 3/0xeffffffe, 2/ipc: 3/0xefffffff, 3/pid: 3/0xeffffffc,
                 4/user: 3/0xeffffffd, 5/mnt: 3/0xf0000000, 6/cgroup: 3/0xeffffffb]
         swapper     0 [000]  3110.881834:          1 cycles:  ffffffffa7060bf6 native_write_msr (/lib/modules/4.11.0-rc1+/build/vmlinux)
      
        <SNIP>
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891932627.25309.1941587059154176221.stgit@hbathini.in.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      96a44bbc
    • Hari Bathini's avatar
      perf record: Synthesize namespace events for current processes · e907caf3
      Hari Bathini authored
      Synthesize PERF_RECORD_NAMESPACES events for processes that were running prior
      to invocation of perf record. The data for this is taken from /proc/$PID/ns.
      These changes make way for analyzing events with regard to namespaces.
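
A rough userspace illustration of reading that data, not the perf implementation: it stats each /proc/$PID/ns entry and reports its device and inode numbers. The namespace order follows the index order seen in the printed records (0/net through 6/cgroup); the function name, the proc_root parameter and the 0/0 fallback for unreadable entries are assumptions of this sketch.

```python
import os

# Namespace names in the index order used by the printed records.
NS_NAMES = ["net", "uts", "ipc", "pid", "user", "mnt", "cgroup"]


def namespace_ids(pid, proc_root="/proc"):
    """Return [(index, name, dev, inode)] for one task.

    An entry that cannot be stat()ed is reported as 0/0; that fallback
    is an assumption of this sketch.
    """
    result = []
    for idx, name in enumerate(NS_NAMES):
        path = os.path.join(proc_root, str(pid), "ns", name)
        try:
            st = os.stat(path)  # follows the symlink to the namespace object
            result.append((idx, name, st.st_dev, st.st_ino))
        except OSError:
            result.append((idx, name, 0, 0))
    return result


if __name__ == "__main__":
    # Print the current process's namespaces in the dev/0xinode notation.
    for idx, name, dev, ino in namespace_ids(os.getpid()):
        print(f"{idx}/{name}: {dev}/{ino:#x}")
```

Running this for every task listed under /proc would give one record per thread, which is the kind of count the test below compares against 'ps axH'.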
      
      Committer notes:
      
      Check if 'tool' is NULL in perf_event__synthesize_namespaces(), as in the
      test__mmap_thread_lookup case, i.e. 'perf test Lookup mmap thread'.
      
      Testing it:
      
        # ps axH > /tmp/allthreads
        # perf record -a --namespaces usleep 1
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 1.169 MB perf.data (8 samples) ]
        # perf report -D | grep PERF_RECORD_NAMESPACES | wc -l
        602
        # wc -l /tmp/allthreads
        601 /tmp/allthreads
        # tail /tmp/allthreads
        16951 pts/4    T      0:00 git rebase -i a033bf1bfacdaa25642e6bcc857a7d0f67cc3c92^
        16952 pts/4    T      0:00 /bin/sh /usr/libexec/git-core/git-rebase -i a033bf1bfacdaa25642e6bcc857a7d0f67cc3c92^
        17176 pts/4    T      0:00 git commit --amend --no-post-rewrite
        17204 pts/4    T      0:00 vim /home/acme/git/linux/.git/COMMIT_EDITMSG
        18939 ?        S      0:00 [kworker/2:1]
        18947 ?        S      0:00 [kworker/3:0]
        18974 ?        S      0:00 [kworker/1:0]
        19047 ?        S      0:00 [kworker/0:1]
        19152 pts/6    S+     0:00 weechat
        19153 pts/7    R+     0:00 ps axH
        # perf report -D | grep PERF_RECORD_NAMESPACES | tail
        0 0 0x125068 [0xa0]: PERF_RECORD_NAMESPACES 17176/17176 - nr_namespaces: 7
        0 0 0x1255b8 [0xa0]: PERF_RECORD_NAMESPACES 17204/17204 - nr_namespaces: 7
        0 0 0x125df0 [0xa0]: PERF_RECORD_NAMESPACES 18939/18939 - nr_namespaces: 7
        0 0 0x125f00 [0xa0]: PERF_RECORD_NAMESPACES 18947/18947 - nr_namespaces: 7
        0 0 0x126010 [0xa0]: PERF_RECORD_NAMESPACES 18974/18974 - nr_namespaces: 7
        0 0 0x126120 [0xa0]: PERF_RECORD_NAMESPACES 19047/19047 - nr_namespaces: 7
        0 0 0x126230 [0xa0]: PERF_RECORD_NAMESPACES 19152/19152 - nr_namespaces: 7
        0 0 0x129330 [0xa0]: PERF_RECORD_NAMESPACES 19154/19154 - nr_namespaces: 7
        0 0 0x12a1f8 [0xa0]: PERF_RECORD_NAMESPACES 19155/19155 - nr_namespaces: 7
        0 0 0x12b0b8 [0xa0]: PERF_RECORD_NAMESPACES 19155/19155 - nr_namespaces: 7
        #
      
      Humm, investigate why we got two records for the 19155 pid/tid...
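
One way to spot such duplicates mechanically is to count records per pid/tid in the 'perf report -D' output. A small sketch, assuming the line format shown above; duplicate_tasks is a made-up helper name:

```python
import re
from collections import Counter

PIDTID = re.compile(r"PERF_RECORD_NAMESPACES (\d+)/(\d+)")


def duplicate_tasks(lines):
    """Return {(pid, tid): count} for tasks with more than one record."""
    counts = Counter()
    for line in lines:
        m = PIDTID.search(line)
        if m:
            counts[(int(m.group(1)), int(m.group(2)))] += 1
    return {task: n for task, n in counts.items() if n > 1}


# Last four lines of the dump above.
dump_tail = """\
0 0 0x126230 [0xa0]: PERF_RECORD_NAMESPACES 19152/19152 - nr_namespaces: 7
0 0 0x129330 [0xa0]: PERF_RECORD_NAMESPACES 19154/19154 - nr_namespaces: 7
0 0 0x12a1f8 [0xa0]: PERF_RECORD_NAMESPACES 19155/19155 - nr_namespaces: 7
0 0 0x12b0b8 [0xa0]: PERF_RECORD_NAMESPACES 19155/19155 - nr_namespaces: 7
""".splitlines()

print(duplicate_tasks(dump_tail))
```

For the excerpt this reports only the 19155/19155 task, with a count of 2.
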
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891931111.25309.11073854609798681633.stgit@hbathini.in.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      e907caf3
    • Jens Axboe's avatar
      blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly() · 9c621104
      Jens Axboe authored
      If we have scheduling enabled, we jump directly to insert-and-run.
      That's fine, but we run the queue async and we don't pass in information
      on whether we can block from this context or not. Fix up both of these
      cases.
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9c621104
    • Hari Bathini's avatar
      perf tools: Add PERF_RECORD_NAMESPACES to include namespaces related info · f3b3614a
      Hari Bathini authored
      Introduce a new option to record PERF_RECORD_NAMESPACES events emitted
      by the kernel when fork, clone, setns or unshare are invoked, and update
      the perf-record documentation with the new option to record namespace
      events.
      
      Committer notes:
      
      Combined it with a later patch to allow printing it via 'perf report -D'
      and be able to test the feature introduced in this patch. Also had to
      move perf_ns__name() here, which was introduced in another later patch.
      
      Also used PRIu64 and PRIx64 to fix the build in some environments wrt:
      
        util/event.c:1129:39: error: format '%lx' expects argument of type 'long unsigned int', but argument 6 has type 'long long unsigned int' [-Werror=format=]
           ret  += fprintf(fp, "%u/%s: %lu/0x%lx%s", idx
                                               ^
      Testing it:
      
        # perf record --namespaces -a
        ^C[ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 1.083 MB perf.data (423 samples) ]
        #
        # perf report -D
        <SNIP>
        3 2028902078892 0x115140 [0xa0]: PERF_RECORD_NAMESPACES 14783/14783 - nr_namespaces: 7
                      [0/net: 3/0xf0000081, 1/uts: 3/0xeffffffe, 2/ipc: 3/0xefffffff, 3/pid: 3/0xeffffffc,
                       4/user: 3/0xeffffffd, 5/mnt: 3/0xf0000000, 6/cgroup: 3/0xeffffffb]
      
        0x1151e0 [0x30]: event: 9
        .
        . ... raw event: size 48 bytes
        .  0000:  09 00 00 00 02 00 30 00 c4 71 82 68 0c 7f 00 00  ......0..q.h....
        .  0010:  a9 39 00 00 a9 39 00 00 94 28 fe 63 d8 01 00 00  .9...9...(.c....
        .  0020:  03 00 00 00 00 00 00 00 ce c4 02 00 00 00 00 00  ................
        <SNIP>
              NAMESPACES events:          1
        <SNIP>
        #
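
The first eight bytes of the raw event in the dump above are the event header. A quick decode, as a sketch that assumes the struct perf_event_header layout (u32 type, u16 misc, u16 size) and little-endian byte order as on the x86 machine that produced the dump; decode_header is a name made up here:

```python
import struct


def decode_header(raw):
    """Split the first 8 bytes of a raw event into (type, misc, size).

    Assumed layout: u32 type, u16 misc, u16 size, little-endian.
    """
    return struct.unpack("<IHH", raw[:8])


# First 8 bytes of the raw event dump shown above.
raw = bytes.fromhex("09 00 00 00 02 00 30 00")

ev_type, misc, size = decode_header(raw)
print(ev_type, misc, size)
```

The decoded type and size agree with the "event: 9" and "size 48 bytes" lines of the dump, and 48 is the same value as the [0x30] in the record prefix.
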
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891930386.25309.18412039920746995488.stgit@hbathini.in.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      f3b3614a
    • Hannes Frederic Sowa's avatar
      dccp: fix memory leak during tear-down of unsuccessful connection request · 72ef9c41
      Hannes Frederic Sowa authored
      This patch fixes a memory leak, which happens if the connection request
      is not fulfilled between parsing the DCCP options and handling the SYN
      (because e.g. the backlog is full), because we forgot to free the
      list of ack vectors.
      Reported-by: Jianwen Ji <jiji@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72ef9c41
    • Hannes Frederic Sowa's avatar
      tun: fix premature POLLOUT notification on tun devices · b20e2d54
      Hannes Frederic Sowa authored
      aszlig observed failing ssh tunnels (-w) during initialization since
      commit cc9da6cc ("ipv6: addrconf: use stable address generator for
      ARPHRD_NONE"). We already had reports that the mentioned commit breaks
      Juniper VPN connections. I can't clearly say that the Juniper VPN client
      has the same problem, but it is worth trying this patch there as well.
      
      Because of the early generation of link local addresses, the kernel now
      can start asking for routers on the local subnet much earlier than usual.
      Those router solicitation packets arrive inside the ssh channels and
      are to be transmitted to the tun fd, possibly before the configuration
      scripts have brought the interface up and made it ready for transmission.
      
      ssh polls on the interface and receives back a POLLOUT. It tries to send
      the early router solicitation packet to the tun interface. Unfortunately
      the interface hasn't been brought up yet by the config scripts, so the
      write fails with -EIO. ssh doesn't retry and considers the tun interface
      broken forever.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=121131
      Fixes: cc9da6cc ("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
      Cc: Bjørn Mork <bjorn@mork.no>
      Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Reported-by: Jonas Lippuner <jonas@lippuner.ca>
      Cc: Jonas Lippuner <jonas@lippuner.ca>
      Reported-by: aszlig <aszlig@redmoonstudios.org>
      Cc: aszlig <aszlig@redmoonstudios.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b20e2d54
    • Jon Maxwell's avatar
      dccp/tcp: fix routing redirect race · 45caeaa5
      Jon Maxwell authored
      As Eric Dumazet pointed out, this also needs to be fixed in IPv6.
      v2: Contains the IPv6 TCP/IPv6 DCCP patches as well.
      
      We have seen a few incidents lately where a dst_entry has been freed
      with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
      dst_entry. If the conditions/timings are right, a crash then ensues when the
      freed dst_entry is referenced later on. A common crashing backtrace is:
      
       #8 [] page_fault at ffffffff8163e648
          [exception RIP: __tcp_ack_snd_check+74]
      .
      .
       #9 [] tcp_rcv_established at ffffffff81580b64
      #10 [] tcp_v4_do_rcv at ffffffff8158b54a
      #11 [] tcp_v4_rcv at ffffffff8158cd02
      #12 [] ip_local_deliver_finish at ffffffff815668f4
      #13 [] ip_local_deliver at ffffffff81566bd9
      #14 [] ip_rcv_finish at ffffffff8156656d
      #15 [] ip_rcv at ffffffff81566f06
      #16 [] __netif_receive_skb_core at ffffffff8152b3a2
      #17 [] __netif_receive_skb at ffffffff8152b608
      #18 [] netif_receive_skb at ffffffff8152b690
      #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
      #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
      #21 [] net_rx_action at ffffffff8152bac2
      #22 [] __do_softirq at ffffffff81084b4f
      #23 [] call_softirq at ffffffff8164845c
      #24 [] do_softirq at ffffffff81016fc5
      #25 [] irq_exit at ffffffff81084ee5
      #26 [] do_IRQ at ffffffff81648ff8
      
      Of course it may happen with other NIC drivers as well.
      
      The freed dst_entry is referenced here:
      
        static bool tcp_in_quickack_mode(struct sock *sk)
        {
                const struct inet_connection_sock *icsk = inet_csk(sk);
                const struct dst_entry *dst = __sk_dst_get(sk);

                return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
                        (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
        }
      
      But there are other backtraces attributed to the same freed dst_entry in
      netfilter code as well.
      
      All the vmcores showed 2 significant clues:
      
      - Remote hosts behind the default gateway had always been redirected to a
      different gateway. A rtable/dst_entry is added for each such host, which
      creates more dst_entries with lower reference counts and makes the
      problem more probable.
      
      - All vmcores showed a positive LockDroppedIcmps value, e.g.:
      
      LockDroppedIcmps                  267
      
      A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
      regardless of whether user space has the socket locked. This can result in a
      race condition where the same dst_entry cached in sk->sk_dst_cache can have
      its reference count decremented twice for the same socket via:
      
      do_redirect()->__sk_dst_check()-> dst_release().
      
      Which leads to the dst_entry being prematurely freed with another socket
      pointing to it via sk->sk_dst_cache and a subsequent crash.
      
      To fix this, skip do_redirect() if user space has the socket locked. Instead let
      the redirect take place later when user space does not have the socket
      locked.
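
The shape of that check can be sketched with a toy, single-threaded model. This is not kernel code and it does not reproduce the race itself; Dst, Sock, do_redirect and handle_icmp_redirect are simplified stand-ins invented for illustration, and the refcount handling is an assumption of the model:

```python
class Dst:
    """Toy stand-in for a reference-counted dst_entry."""

    def __init__(self):
        self.refcnt = 1
        self.obsolete = False

    def release(self):
        self.refcnt -= 1
        # A negative count would model the double release described above.
        assert self.refcnt >= 0, "dst released twice"


class Sock:
    """Toy socket holding one cached route and a user-space lock flag."""

    def __init__(self, dst):
        self.dst_cache = dst
        self.owned_by_user = False


def do_redirect(sk):
    """Drop the cached route if it has been invalidated."""
    dst = sk.dst_cache
    if dst is not None and dst.obsolete:
        sk.dst_cache = None
        dst.release()


def handle_icmp_redirect(sk):
    """Run the redirect only when user space does not hold the socket lock."""
    if sk.owned_by_user:
        return False  # skipped for now; a later redirect can be processed
    do_redirect(sk)
    return True


dst = Dst()
dst.obsolete = True
sk = Sock(dst)

sk.owned_by_user = True
print(handle_icmp_redirect(sk), dst.refcnt)

sk.owned_by_user = False
print(handle_icmp_redirect(sk), dst.refcnt)
```

While the socket is marked as owned by user space the cached entry is left alone; once it is unlocked, the redirect drops the reference exactly once.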
      
      The dccp/IPv6 code is very similar in this respect, so fixing it there too.
      
      As Eric Garver pointed out, the following commit now invalidates routes, which
      can set the dst->obsolete flag so that ipv4_dst_check() returns NULL and
      triggers the dst_release().
      
      Fixes: ceb33206 ("ipv4: Kill routes during PMTU/redirect updates.")
      Cc: Eric Garver <egarver@redhat.com>
      Cc: Hannes Sowa <hsowa@redhat.com>
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45caeaa5