1. 21 Feb, 2020 22 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 3dc55dba
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
          Cong Wang.
      
       2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
          is finished, from Magnus Karlsson.
      
       3) Set table field properly to RT_TABLE_COMPAT when necessary, from
          Jethro Beekman.
      
       4) NLA_STRING attributes are not necessarily NULL terminated, deal with
          that in IFLA_ALT_IFNAME. From Eric Dumazet.
      
       5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.
      
       6) Handle mtu==0 devices properly in wireguard, from Jason A.
          Donenfeld.
      
       7) Fix several lockdep warnings in bonding, from Taehee Yoo.
      
       8) Fix cls_flower port blocking, from Jason Baron.
      
       9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.
      
      10) Fix RDMA race in qede driver, from Michal Kalderon.
      
      11) Fix several false lockdep warnings by adding conditions to
          list_for_each_entry_rcu(), from Madhuparna Bhowmik.
      
      12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.
      
      13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.
      
      14) Hey, variables declared in a switch statement before any case
          statements are not initialized. I learn something every day. Get
          rid of this stuff in several parts of the networking code, from
          Kees Cook.
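
      A minimal sketch of the pitfall in item 14, using hypothetical names
      (scale_by_kind() and its factors are illustrative, not kernel code). In
      C, a declaration placed between the opening brace of a switch and the
      first case label is in scope for every case, but its initializer is
      never executed because control jumps straight to a label. The sketch
      shows the "distributed" form, where each case declares what it needs in
      its own block:

```c
/*
 * Problematic pattern (shown only as a comment, because reading the
 * variable would be undefined behaviour):
 *
 *     switch (kind) {
 *             int factor = 2;          // initializer is skipped by the jump
 *     case 0:
 *             return value * factor;   // factor is indeterminate here
 *     }
 */

/* Hypothetical helper: every case owns a scoped, initialized variable. */
int scale_by_kind(int kind, int value)
{
	switch (kind) {
	case 0: {
		int factor = 2;	/* initialized when this case is entered */

		return value * factor;
	}
	case 1: {
		int factor = 3;

		return value * factor;
	}
	default:
		return value;
	}
}
```

      Lifting the declaration to function scope would work as well; which
      form reads better depends on how many cases share the variable.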
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
        bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
        bnxt_en: Improve device shutdown method.
        net: netlink: cap max groups which will be considered in netlink_bind()
        net: thunderx: workaround BGX TX Underflow issue
        ionic: fix fw_status read
        net: disable BRIDGE_NETFILTER by default
        net: macb: Properly handle phylink on at91rm9200
        s390/qeth: fix off-by-one in RX copybreak check
        s390/qeth: don't warn for napi with 0 budget
        s390/qeth: vnicc Fix EOPNOTSUPP precedence
        openvswitch: Distribute switch variables for initialization
        net: ip6_gre: Distribute switch variables for initialization
        net: core: Distribute switch variables for initialization
        udp: rehash on disconnect
        net/tls: Fix to avoid gettig invalid tls record
        bpf: Fix a potential deadlock with bpf_map_do_batch
        bpf: Do not grab the bucket spinlock by default on htab batch ops
        ice: Wait for VF to be reset/ready before configuration
        ice: Don't tell the OS that link is going down
        ice: Don't reject odd values of usecs set by user
        ...
    • Merge branch 'akpm' (patches from Andrew) · b0dd1eb2
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
      
       - A few y2038 fixes which missed the merge window while dependencies
         in NFS were being sorted out.
      
       - A bunch of fixes. Some minor, some not.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        MAINTAINERS: use tabs for SAFESETID
        lib/stackdepot.c: fix global out-of-bounds in stack_slabs
        mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
        mm/vmscan.c: don't round up scan size for online memory cgroup
        lib/string.c: update match_string() doc-strings with correct behavior
        mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()
        mm/swapfile.c: fix a comment in sys_swapon()
        scripts/get_maintainer.pl: deprioritize old Fixes: addresses
        get_maintainer: remove uses of P: for maintainer name
        selftests/vm: add missed tests in run_vmtests
        include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for swap
        Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"
        y2038: hide timeval/timespec/itimerval/itimerspec types
        y2038: remove unused time32 interfaces
        y2038: remove ktime to/from timespec/timeval conversion
    • MAINTAINERS: use tabs for SAFESETID · bb8d00ff
      Randy Dunlap authored
      Use tabs for indentation instead of spaces for SAFESETID.  All (!) other
      entries in MAINTAINERS use tabs (according to my simple grepping).
      
      Link: http://lkml.kernel.org/r/2bb2e52a-2694-816d-57b4-6cabfadd6c1a@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Micah Morton <mortonm@chromium.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/stackdepot.c: fix global out-of-bounds in stack_slabs · 305e519c
      Alexander Potapenko authored
      Walter Wu has reported a potential case in which init_stack_slab() is
      called after stack_slabs[STACK_ALLOC_MAX_SLABS - 1] has already been
      initialized.  In that case init_stack_slab() will overwrite
      stack_slabs[STACK_ALLOC_MAX_SLABS], which may result in memory
      corruption.
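
      The overflow described above can be modelled with a simplified
      userspace sketch. All names here (struct slab_pool, MAX_SLABS,
      pool_take_prealloc()) are hypothetical stand-ins, and this is not the
      actual patch; it only illustrates guarding the "next slot" write with a
      bounds check:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SLABS 4	/* stands in for STACK_ALLOC_MAX_SLABS */

struct slab_pool {
	void *slabs[MAX_SLABS];
	int cur;	/* index of the slab currently in use */
};

/*
 * Hand a preallocated buffer to the pool. Returns true if the buffer was
 * consumed. The "next" slot is only written when cur + 1 is a valid index.
 */
bool pool_take_prealloc(struct slab_pool *pool, void **prealloc)
{
	if (!*prealloc)
		return false;

	if (!pool->slabs[pool->cur]) {
		pool->slabs[pool->cur] = *prealloc;
	} else if (pool->cur + 1 < MAX_SLABS && !pool->slabs[pool->cur + 1]) {
		pool->slabs[pool->cur + 1] = *prealloc;
	} else {
		return false;	/* last slab already set up, or next slot taken */
	}

	*prealloc = NULL;
	return true;
}
```

      Without the `cur + 1 < MAX_SLABS` test, a call made while `cur` is the
      last valid index would store one element past the end of the array.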
      
      Link: http://lkml.kernel.org/r/20200218102950.260263-1-glider@google.com
      Fixes: cd11016e ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM · 18e19f19
      Wei Yang authored
      When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
      doesn't work before sparse_init_one_section() is called.
      
      This leads to a crash when hotplugging memory:
      
          BUG: unable to handle page fault for address: 0000000006400000
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G        W         5.5.0-next-20200205+ #343
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:__memset+0x24/0x30
          Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
          RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
          RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
          RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
          RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
          R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
          R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
          FS:  0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           sparse_add_section+0x1c9/0x26a
           __add_pages+0xbf/0x150
           add_pages+0x12/0x60
           add_memory_resource+0xc8/0x210
           __add_memory+0x62/0xb0
           acpi_memory_device_add+0x13f/0x300
           acpi_bus_attach+0xf6/0x200
           acpi_bus_scan+0x43/0x90
           acpi_device_hotplug+0x275/0x3d0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x1a7/0x370
           worker_thread+0x30/0x380
           kthread+0x112/0x130
           ret_from_fork+0x35/0x40
      
      We should use memmap instead, as the code did previously.
      
      On x86 the impact is limited to x86_32 builds, or x86_64 configurations
      that override the default setting for SPARSEMEM_VMEMMAP.
      
      Other memory hotplug archs (arm64, ia64, and ppc) also default to
      SPARSEMEM_VMEMMAP=y.
      
      [dan.j.williams@intel.com: changelog update]
      [rppt@linux.ibm.com: changelog update]
      Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
      Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan.c: don't round up scan size for online memory cgroup · 76073c64
      Gavin Shan authored
      Commit 68600f62 ("mm: don't miss the last page because of round-off
      error") makes the scan size round up to @denominator regardless of the
      memory cgroup's state, online or offline.  This affects the overall
      reclaiming behavior: the corresponding LRU list is eligible for
      reclaiming only when its size logically right shifted by @sc->priority
      is bigger than zero in the former formula.
      
      For example, the inactive anonymous LRU list should have at least 0x4000
      pages to be eligible for reclaiming when we have 60/12 for
      swappiness/priority and without taking scan/rotation ratio into account.
      
      After the roundup is applied, the inactive anonymous LRU list becomes
      eligible for reclaiming when its size is bigger than or equal to 0x1000
      in the same condition.
      
          (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
          (((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
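
      The two formulas can be checked with a small userspace sketch. The
      function names are hypothetical; the values 60 (swappiness as the
      fraction), 12 (priority) and 201 (60 + 140 + 1 as the denominator) are
      taken from the example above:

```c
#include <stdint.h>

/* Scan target with plain (round-down) integer division. */
uint64_t scan_round_down(uint64_t lru_size, unsigned int priority,
			 uint64_t fraction, uint64_t denominator)
{
	return ((lru_size >> priority) * fraction) / denominator;
}

/* Scan target with the quotient rounded up (ceiling division). */
uint64_t scan_round_up(uint64_t lru_size, unsigned int priority,
		       uint64_t fraction, uint64_t denominator)
{
	return ((lru_size >> priority) * fraction + denominator - 1) /
	       denominator;
}
```

      With these inputs, scan_round_down() stays at zero for an LRU size of
      0x1000 and first reaches 1 at 0x4000, whereas scan_round_up() already
      yields 1 at 0x1000.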
      
      aarch64 has 512MB huge page size when the base page size is 64KB.  The
      memory cgroup that has a huge page is always eligible for reclaiming in
      that case.
      
      The reclaiming is likely to stop after the huge page is reclaimed,
      meaning the further iteration on @sc->priority and the sibling and child
      memory cgroups will be skipped.  The overall behaviour has been changed.
      This fixes the issue by applying the roundup to offlined memory cgroups
      only, to give more preference to reclaiming memory from offlined memory
      cgroups.  This sounds reasonable as that memory is unlikely to be used
      by anyone.
      
      The issue was found by starting up 8 VMs on an Ampere Mustang machine,
      which has 8 CPUs and 16 GB memory.  Each VM is given 2 vCPUs and
      2GB memory.  It took 264 seconds for all VMs to be completely up and
      784MB swap was consumed after that.  With this patch applied, it took 236
      seconds and 60MB swap to do the same thing.  So there is a 10%
      performance improvement for my case.  Note that KSM is disabled while
      THP is enabled in the testing.
      
         Without the patch:
               total     used    free   shared  buff/cache   available
         Mem:  16196    10065    2049       16        4081        3749
         Swap:  8175      784    7391

         With the patch:
               total     used    free   shared  buff/cache   available
         Mem:  16196    11324    3656       24        1215        2936
         Swap:  8175       60    8115
      
      Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
      Fixes: 68600f62 ("mm: don't miss the last page because of round-off error")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/string.c: update match_string() doc-strings with correct behavior · c11d3fa0
      Alexandru Ardelean authored
      There were a few attempts at changing behavior of the match_string()
      helpers (i.e.  'match_string()' & 'sysfs_match_string()'), to change &
      extend the behavior according to the doc-string.
      
      But the simplest approach is to just fix the doc-strings.  The current
      behavior is fine as-is, and some bugs were introduced trying to fix it.
      
      As for extending the behavior, new helpers can always be introduced if
      needed.
      
      The match_string() helpers behave more like 'strncmp()' in the sense
      that they go up to n elements or until the first NULL element in the
      array of strings.
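
      A userspace sketch of the behavior described above. The function name
      and the -1 "not found" return value are assumptions made for
      illustration; the in-kernel helper reports failure with a negative
      errno instead:

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the index of the first element equal to @string, or -1 if none
 * is found. The search stops after @n elements or at the first NULL
 * element, whichever comes first.
 */
int match_string_sketch(const char *const *array, size_t n,
			const char *string)
{
	for (size_t i = 0; i < n; i++) {
		const char *item = array[i];

		if (!item)
			break;	/* NULL element ends the search early */
		if (strcmp(item, string) == 0)
			return (int)i;
	}
	return -1;
}
```

      Entries placed after a NULL element are never examined, and a
      NULL-terminated array can be searched without knowing its length by
      passing a very large n.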
      
      This change updates the doc-strings with this info.
      
      Link: http://lkml.kernel.org/r/20200213072722.8249-1-alexandru.ardelean@analog.com
      Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Tobin C . Harding" <tobin@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps() · 75866af6
      Vasily Averin authored
      for_each_mem_cgroup() increases the css reference counter for the memory
      cgroup and requires the use of mem_cgroup_iter_break() if the walk is
      cancelled.
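
      The rule can be illustrated with a toy reference-counted iterator. The
      types and functions below (struct node, iter_next(), iter_break(),
      walk_until()) are hypothetical and only mimic the shape of the memcg
      iterator; they are not the kernel API:

```c
#include <stddef.h>

struct node {
	int refs;		/* references held by iterators */
	struct node *next;
};

/*
 * Advance the walk: take a reference on the next node and drop the one
 * held on the previous node. Returns NULL at the end of the list.
 */
struct node *iter_next(struct node *head, struct node *prev)
{
	struct node *pos = prev ? prev->next : head;

	if (pos)
		pos->refs++;
	if (prev)
		prev->refs--;
	return pos;
}

/* Drop the reference still held when a walk stops early. */
void iter_break(struct node *pos)
{
	if (pos)
		pos->refs--;
}

/* Returns 0 after a full walk, -1 if the walk is cancelled at @fail_at. */
int walk_until(struct node *head, const struct node *fail_at)
{
	struct node *pos;

	for (pos = iter_next(head, NULL); pos; pos = iter_next(head, pos)) {
		if (pos == fail_at) {
			iter_break(pos);	/* otherwise pos->refs leaks */
			return -1;
		}
	}
	return 0;
}
```

      A full walk releases every reference on its own; only the early-exit
      path needs the explicit iter_break() call.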
      
      Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
      Fixes: 0a4465d3 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Christoph Hellwig's avatar
    • scripts/get_maintainer.pl: deprioritize old Fixes: addresses · 0ef82fce
      Douglas Anderson authored
      Recently, I found that get_maintainer was causing me to send emails to
      the old addresses for maintainers.  Since I usually just trust the
      output of get_maintainer to know the right email address, I didn't even
      look carefully and fired off two patch series that went to the wrong
      place.  Oops.
      
      The problem was introduced recently when trying to add signatures from
      Fixes.  The problem was that these email addresses were added too early
      in the process of compiling our list of places to send.  Things added to
      the list earlier are considered more canonical and when we later added
      maintainer entries we ended up deduplicating to the old address.
      
      Here are two examples using mainline commits (to make it easier to
      replicate) for the two maintainers that I messed up recently:
      
        $ git format-patch d8549bcd~..d8549bcd
        $ ./scripts/get_maintainer.pl 0001-clk-Add-clk_hw*.patch | grep Boyd
        Stephen Boyd <sboyd@codeaurora.org>...
      
        $ git format-patch 6d1238aa~..6d1238aa
        $ ./scripts/get_maintainer.pl 0001-arm64-dts-qcom-qcs404*.patch | grep Andy
        Andy Gross <andy.gross@linaro.org>
      
      Let's move the adding of addresses from Fixes: to the end since the
      email addresses from these are much more likely to be older.
      
      After this patch, the two examples above get the right addresses.
      
      Link: http://lkml.kernel.org/r/20200127095001.1.I41fba9f33590bfd92cd01960161d8384268c6569@changeid
      Fixes: 2f5bd343 ("scripts/get_maintainer.pl: add signatures from Fixes: <badcommit> lines in commit message")
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Acked-by: Joe Perches <joe@perches.com>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
      Cc: Andy Gross <agross@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • get_maintainer: remove uses of P: for maintainer name · ef0c0819
      Joe Perches authored
      Commit 1ca84ed6 ("MAINTAINERS: Reclaim the P: tag for Maintainer
      Entry Profile") changed the use of the "P:" tag from "Person" to
      "Profile (ie: special subsystem coding styles and characteristics)"
      
      Change how get_maintainer.pl parses the "P:" tag to match.
      
      Link: http://lkml.kernel.org/r/ca53823fc5d25c0be32ad937d0207a0589c08643.camel@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Dan Williams <dan.j.william@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests/vm: add missed tests in run_vmtests · 9e69fa46
      SeongJae Park authored
      The commits introducing 'mlock-random-test'[1], 'map_fixed_noreplace'[2],
      and 'thuge-gen'[3] have not added those in the 'run_vmtests' script and
      thus the 'run_tests' command of kselftests doesn't run those.  This
      commit adds those in the script.
      
      'gup_benchmark' and 'transhuge-stress' are also not included in the
      'run_vmtests', but this commit does not add those because those are for
      performance measurement rather than pass/fail tests.
      
      [1] commit 26b4224d ("selftests: expanding more mlock selftest")
      [2] commit 91cbacc3 ("tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE")
      [3] commit fcc1f2d5 ("selftests: add a test program for variable huge page sizes in mmap/shmget")
      
      Link: http://lkml.kernel.org/r/20200206085144.29126-1-sj38.park@gmail.com
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for swap · 467d12f5
      Christian Borntraeger authored
      QEMU has a funny new build error message when I use the upstream kernel
      headers:
      
            CC      block/file-posix.o
          In file included from /home/cborntra/REPOS/qemu/include/qemu/timer.h:4,
                           from /home/cborntra/REPOS/qemu/include/qemu/timed-average.h:29,
                           from /home/cborntra/REPOS/qemu/include/block/accounting.h:28,
                           from /home/cborntra/REPOS/qemu/include/block/block_int.h:27,
                           from /home/cborntra/REPOS/qemu/block/file-posix.c:30:
          /usr/include/linux/swab.h: In function `__swab':
          /home/cborntra/REPOS/qemu/include/qemu/bitops.h:20:34: error: "sizeof" is not defined, evaluates to 0 [-Werror=undef]
             20 | #define BITS_PER_LONG           (sizeof (unsigned long) * BITS_PER_BYTE)
                |                                  ^~~~~~
          /home/cborntra/REPOS/qemu/include/qemu/bitops.h:20:41: error: missing binary operator before token "("
             20 | #define BITS_PER_LONG           (sizeof (unsigned long) * BITS_PER_BYTE)
                |                                         ^
          cc1: all warnings being treated as errors
          make: *** [/home/cborntra/REPOS/qemu/rules.mak:69: block/file-posix.o] Error 1
          rm tests/qemu-iotests/socket_scm_helper.o
      
      This was triggered by commit d5767057 ("uapi: rename ext2_swab() to
      swab() and share globally in swab.h").  That patch is doing
      
        #include <asm/bitsperlong.h>
      
      but it uses BITS_PER_LONG.
      
      The kernel file asm/bitsperlong.h provides only __BITS_PER_LONG.
      
      Let us use the __ variant in swab.h.
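
      The build error above arises because the preprocessor cannot evaluate
      sizeof inside #if. A userspace sketch of the same idea follows; the
      SKETCH_BITS_PER_LONG macro and the *_sketch functions are hypothetical,
      and the sketch assumes unsigned long is either 32 or 64 bits wide:

```c
#include <limits.h>
#include <stdint.h>

/*
 * The word size tested in #if has to come from a plain integer macro.
 * ULONG_MAX from <limits.h> serves that purpose in this sketch.
 */
#if ULONG_MAX == 0xffffffffUL
#define SKETCH_BITS_PER_LONG 32
#else
#define SKETCH_BITS_PER_LONG 64
#endif

uint32_t swab32_sketch(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

uint64_t swab64_sketch(uint64_t v)
{
	return ((uint64_t)swab32_sketch((uint32_t)v) << 32) |
	       swab32_sketch((uint32_t)(v >> 32));
}

/* Byte-swap an unsigned long, picking the width at preprocessing time. */
unsigned long swab_long_sketch(unsigned long v)
{
#if SKETCH_BITS_PER_LONG == 64
	return (unsigned long)swab64_sketch(v);
#else
	return (unsigned long)swab32_sketch((uint32_t)v);
#endif
}
```

      A macro defined as `sizeof (unsigned long) * 8` would compile in
      ordinary expressions but fail in the #if lines, which is the situation
      the QEMU build ran into.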
      
      Link: http://lkml.kernel.org/r/20200213142147.17604-1-borntraeger@de.ibm.com
      Fixes: d5767057 ("uapi: rename ext2_swab() to swab() and share globally in swab.h")
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yury Norov <yury.norov@gmail.com>
      Cc: Allison Randal <allison@lohutok.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: William Breathitt Gray <vilhelm.gray@gmail.com>
      Cc: Torsten Hilbrich <torsten.hilbrich@secunet.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()" · edf28f40
      Ioanna Alifieraki authored
      This reverts commit a9795584.
      
      Commit a9795584 ("ipc,sem: remove uneeded sem_undo_list lock usage
      in exit_sem()") removes a lock that is needed.  This leads to a process
      looping infinitely in exit_sem() and can also lead to a crash.  There is
      a reproducer available in [1] and with the commit reverted the issue
      does not reproduce anymore.
      
      Using the reproducer found in [1], it is fairly easy to reach a point where
      one of the child processes is looping infinitely in exit_sem between
      for(;;) and if (semid == -1) block, while it's trying to free its last
      sem_undo structure which has already been freed by freeary().
      
      Each sem_undo struct is on two lists: one per semaphore set (list_id)
      and one per process (list_proc).  The list_id list tracks undos by
      semaphore set, and the list_proc by process.
      
      Undo structures are removed either by freeary() or by exit_sem().  The
      freeary function is invoked when the user invokes a syscall to remove a
      semaphore set.  During this operation freeary() traverses the list_id
      associated with the semaphore set and removes the undo structures from
      both the list_id and list_proc lists.
      
      For this case, exit_sem() is called at process exit.  Each process
      contains a struct sem_undo_list (referred to as "ulp") which contains
      the head for the list_proc list.  When the process exits, exit_sem()
      traverses this list to remove each sem_undo struct.  As in freeary(),
      whenever a sem_undo struct is removed from list_proc, it is also removed
      from the list_id list.
      
      Removing elements from list_id is safe for both exit_sem() and freeary()
      due to sem_lock().  Removing elements from list_proc is not safe;
      freeary() locks &un->ulp->lock when it performs
      list_del_rcu(&un->list_proc) but exit_sem() does not (locking was
      removed by commit a9795584 ("ipc,sem: remove uneeded sem_undo_list
      lock usage in exit_sem()")).
      
      This can result in the following situation while executing the
      reproducer [1] : Consider a child process in exit_sem() and the parent
      in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)).
      
       - The list_proc for the child contains the last two undo structs A and
         B (the rest have been removed either by exit_sem() or freeary()).
      
       - The semid for A is 1 and semid for B is 2.
      
       - exit_sem() removes A and at the same time freeary() removes B.
      
       - Since A and B have different semid sem_lock() will acquire different
         locks for each process and both can proceed.
      
      The bug is that they remove A and B from the same list_proc at the same
      time because only freeary() acquires the ulp lock. When exit_sem()
      removes A it makes ulp->list_proc.next to point at B and at the same
      time freeary() removes B setting B->semid=-1.
      
      At the next iteration of for(;;) loop exit_sem() will try to remove B.
      
      The only way to break from for(;;) is for (&un->list_proc ==
      &ulp->list_proc) to be true, which it is not. Then exit_sem() will check
      whether B->semid == -1, which it is, and will continue looping in for(;;)
      until the memory for B is reallocated and the value at B->semid is changed.
      
      At that point, exit_sem() will crash attempting to unlink B from the
      lists (this can be easily triggered by running the reproducer [1] a
      second time).
      
      To prove this scenario instrumentation was added to keep information
      about each sem_undo (un) struct that is removed per process and per
      semaphore set (sma).
      
                CPU0                                CPU1
        [caller holds sem_lock(sma for A)]      ...
        freeary()                               exit_sem()
        ...                                     ...
        ...                                     sem_lock(sma for B)
        spin_lock(A->ulp->lock)                 ...
        list_del_rcu(un_A->list_proc)           list_del_rcu(un_B->list_proc)
      
      Undo structures A and B have different semid and sem_lock() operations
      proceed.  However they belong to the same list_proc list and they are
      removed at the same time.  This results in ulp->list_proc.next
      pointing to the address of B which is already removed.
      
      After reverting commit a9795584 ("ipc,sem: remove uneeded
      sem_undo_list lock usage in exit_sem()") the issue was no longer
      reproducible.
      
      [1] https://bugzilla.redhat.com/show_bug.cgi?id=1694779
      
      Link: http://lkml.kernel.org/r/20191211191318.11860-1-ioanna-maria.alifieraki@canonical.com
      Fixes: a9795584 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()")
      Signed-off-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Acked-by: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: Herton R. Krzesinski <herton@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: <malat@debian.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • y2038: hide timeval/timespec/itimerval/itimerspec types · c766d147
      Arnd Bergmann authored
      There are no in-kernel users remaining, but there may still be users that
      include linux/time.h instead of sys/time.h from user space, so leave the
      types available to user space while hiding them from kernel space.
      
      Only the __kernel_old_* versions of these types remain now.
      
      Link: http://lkml.kernel.org/r/20200110154232.4104492-4-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • y2038: remove unused time32 interfaces · 412c53a6
      Arnd Bergmann authored
      No users remain, so kill these off before we grow new ones.
      
      Link: http://lkml.kernel.org/r/20200110154232.4104492-3-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • y2038: remove ktime to/from timespec/timeval conversion · 595abbaf
      Arnd Bergmann authored
      A couple of helpers are now obsolete and can be removed, so drivers can no
      longer start using them and instead use y2038-safe interfaces.
      
      Link: http://lkml.kernel.org/r/20200110154232.4104492-2-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ACPI: PM: s2idle: Check fixed wakeup events in acpi_s2idle_wake() · 63fb9623
      Rafael J. Wysocki authored
      Commit fdde0ff8 ("ACPI: PM: s2idle: Prevent spurious SCIs from
      waking up the system") overlooked the fact that fixed events can wake
      up the system too and broke RTC wakeup from suspend-to-idle as a
      result.
      
      Fix this issue by checking the fixed events in acpi_s2idle_wake() in
      addition to checking wakeup GPEs and break out of the suspend-to-idle
      loop if the status bits of any enabled fixed events are set then.
      
      Fixes: fdde0ff8 ("ACPI: PM: s2idle: Prevent spurious SCIs from waking up the system")
      Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'bnxt_en-shutdown-and-kexec-kdump-related-fixes' · 36a44bcd
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: shutdown and kexec/kdump related fixes.
      
      2 small patches to fix kexec shutdown and kdump kernel driver init issues.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36a44bcd
    • Vasundhara Volam's avatar
      bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs. · 8743db4a
      Vasundhara Volam authored
      If the crashed kernel does not shut down the NIC properly, a PCIe FLR
      is required in the kdump kernel in order to initialize all the
      functions properly.
      
      Fixes: d629522e ("bnxt_en: Reduce memory usage when running in kdump kernel.")
      Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8743db4a
    • Vasundhara Volam's avatar
      bnxt_en: Improve device shutdown method. · 5567ae4a
      Vasundhara Volam authored
      Especially when bnxt_shutdown() is called during kexec, we need to
      disable MSIX and disable Bus Master to completely quiesce the device.
      Make these 2 calls unconditionally in the shutdown method.
      
      Fixes: c20dc142 ("bnxt_en: Disable bus master during PCI shutdown and driver unload.")
      Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5567ae4a
    • Nikolay Aleksandrov's avatar
      net: netlink: cap max groups which will be considered in netlink_bind() · 3a20773b
      Nikolay Aleksandrov authored
      Since nl_groups is a u32 we can't bind more groups via ->bind
      (netlink_bind) call, but netlink has supported more groups via
      setsockopt() for a long time and thus nlk->ngroups could be over 32.
      Recently I added support for per-vlan notifications and increased the
      groups to 33 for NETLINK_ROUTE which exposed an old bug in the
      netlink_bind() code causing out-of-bounds access on archs where unsigned
      long is 32 bits via test_bit() on a local variable. Fix this by capping the
      maximum groups in netlink_bind() to BITS_PER_TYPE(u32), effectively
      capping them at 32 which is the minimum of allocated groups and the
      maximum groups which can be bound via netlink_bind().
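
The capping idea can be sketched in plain userspace C. This is an illustration under assumptions, not the netlink_bind() code: `count_bound_groups()` is a hypothetical helper, and `BITS_PER_TYPE` is redefined locally to mirror the kernel macro. The point is that a loop over a 32-bit mask must never probe bit 32 or higher, even when the protocol has more groups.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Local stand-in for the kernel macro of the same name. */
#define BITS_PER_TYPE(type) (sizeof(type) * CHAR_BIT)

/*
 * Count the groups selected in a 32-bit bind mask. 'ngroups' may exceed 32
 * (groups beyond that are only reachable via setsockopt()), so the loop
 * bound is capped at the width of the mask before any bit is tested.
 */
static unsigned int count_bound_groups(uint32_t nl_groups, unsigned int ngroups)
{
	unsigned int limit = ngroups;
	unsigned int group, count = 0;

	if (limit > BITS_PER_TYPE(uint32_t))
		limit = BITS_PER_TYPE(uint32_t);

	for (group = 0; group < limit; group++)
		if (nl_groups & (UINT32_C(1) << group))
			count++;

	return count;
}
```

Without the cap, a caller with `ngroups == 33` would test a bit that does not exist in the 32-bit mask, which is the out-of-bounds access described above.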
      
      CC: Christophe Leroy <christophe.leroy@c-s.fr>
      CC: Richard Guy Briggs <rgb@redhat.com>
      Fixes: 4f520900 ("netlink: have netlink per-protocol bind function return an error code.")
      Reported-by: Erhard F. <erhard_f@mailbox.org>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a20773b
  2. 20 Feb, 2020 18 commits
    • Tim Harvey's avatar
      net: thunderx: workaround BGX TX Underflow issue · 971617c3
      Tim Harvey authored
      While it is not yet understood why a TX underflow can easily occur
      for SGMII interfaces, resulting in a TX wedge, it has been found that
      disabling and re-enabling the LMAC resolves the issue.
      
      Signed-off-by: Tim Harvey <tharvey@gateworks.com>
      Reviewed-by: Robert Jones <rjones@gateworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      971617c3
    • Shannon Nelson's avatar
      ionic: fix fw_status read · 68b759a7
      Shannon Nelson authored
      The fw_status field is only 8 bits, so fix the read.  Also,
      we only want to look at the one status bit, to allow for future
      use of the other bits, and watch for a bad PCI read.
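
A small sketch of the three points above, assuming hypothetical constant names and bit positions (the actual register layout is not given here): the status is treated as an 8-bit value, an all-ones value is taken as a failed PCI read, and only one bit is examined so the remaining bits stay free for future use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values chosen for illustration only. */
#define FW_STS_RUNNING	0x01u	/* the single status bit that is examined */
#define FW_STS_BAD_READ	0xffu	/* all ones, as a failed PCI read returns */

/* Decide whether the firmware looks alive from an 8-bit status value. */
static bool fw_is_running(uint8_t fw_status)
{
	/* A failed PCI read yields all ones; do not mistake it for "running". */
	if (fw_status == FW_STS_BAD_READ)
		return false;

	/* Mask so that other bits, whatever they mean later, are ignored. */
	return (fw_status & FW_STS_RUNNING) != 0;
}
```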
      
      Fixes: 97ca4865 ("ionic: add heartbeat check")
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68b759a7
    • Linus Torvalds's avatar
      Merge branch 'next-integrity' of... · ebe7acad
      Linus Torvalds authored
      Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
      
      Pull IMA fixes from Mimi Zohar:
       "Two bug fixes and an associated change for each.
      
        The one that adds SM3 to the IMA list of supported hash algorithms is
        a simple change, but could be considered a new feature"
      
      * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
        ima: add sm3 algorithm to hash algorithm configuration list
        crypto: rename sm3-256 to sm3 in hash_algo_name
        efi: Only print errors about failing to get certs if EFI vars are found
        x86/ima: use correct identifier for SetupMode variable
      ebe7acad
    • Roman Kiryanov's avatar
      net: disable BRIDGE_NETFILTER by default · 98bda63e
      Roman Kiryanov authored
      The description says 'If unsure, say N.' but
      the module is built as M by default (once
      the dependencies are satisfied).
      
      When the module is selected (Y or M), it enables
      NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS
      which alter kernel internal structures.
      
      We (Android Studio Emulator) currently do not
      use this module and think it is more consistent
      to have it disabled by default, as opposed to
      disabling it explicitly to prevent enabling
      NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS.
      
      Signed-off-by: Roman Kiryanov <rkir@google.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98bda63e
    • Alexandre Belloni's avatar
      net: macb: Properly handle phylink on at91rm9200 · ac2fcfa9
      Alexandre Belloni authored
      at91ether_init was handling the phy mode and speed but since the switch to
      phylink, the NCFGR register got overwritten by macb_mac_config(). The issue
      is that the RM9200_RMII bit and the MACB_CLK_DIV32 field are cleared
      but never restored as they conflict with the PAE, GBE and PCSSEL bits.
      
      Add new capability to differentiate between EMAC and the other versions of
      the IP and use it to set and avoid clearing the relevant bits.
      
      Also, this fixes a NULL pointer dereference in macb_mac_link_up as the EMAC
      doesn't use any rings/buffers/queues.
      
      Fixes: 7897b071 ("net: macb: convert to phylink")
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac2fcfa9
    • David S. Miller's avatar
      Merge branch 's390-fixes' · 0d5b8d70
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: fixes 2020-02-20
      
      please apply the following patch series for qeth to netdev's net tree.
      
      This corrects three minor issues:
      1) return a more fitting errno when VNICC cmds are not supported,
      2) remove a bogus WARN in the NAPI code, and
      3) be _very_ pedantic about the RX copybreak.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d5b8d70
    • Julian Wiedmann's avatar
      s390/qeth: fix off-by-one in RX copybreak check · 54a61fbc
      Julian Wiedmann authored
      The RX copybreak is intended as the _max_ value where the frame's data
      should be copied. So for frame_len == copybreak, don't build an SG skb.
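
The boundary condition can be stated as a one-line predicate. The helper name is an assumption for illustration; the only point is that the comparison is inclusive, so a frame whose length equals the copybreak is still copied rather than turned into a scatter-gather skb.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Copybreak is the maximum length at which frame data is copied.
 * Returns true when the frame should be copied, false when an SG skb
 * should be built instead.
 */
static bool should_copy_frame(unsigned int frame_len, unsigned int copybreak)
{
	return frame_len <= copybreak;	/* inclusive: len == copybreak copies */
}
```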
      
      Fixes: 4a71df50 ("qeth: new qeth device driver")
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54a61fbc
    • Julian Wiedmann's avatar
      s390/qeth: don't warn for napi with 0 budget · 420579db
      Julian Wiedmann authored
      Calling napi->poll() with 0 budget is a legitimate use by netpoll.
      
      Fixes: a1c3ed4c ("qeth: NAPI support for l2 and l3 discipline")
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      420579db
    • Alexandra Winter's avatar
      s390/qeth: vnicc Fix EOPNOTSUPP precedence · 6f3846f0
      Alexandra Winter authored
      When getting or setting VNICC parameters, the error code EOPNOTSUPP
      should have precedence over EBUSY.
      
      EBUSY is used because vnicc feature and bridgeport feature are mutually
      exclusive, which is a temporary condition.
      Whereas EOPNOTSUPP indicates that the HW does not support all or parts of
      the vnicc feature.
      This issue causes the vnicc sysfs params to show 'blocked by bridgeport'
      for HW that does not support VNICC at all.
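
The precedence rule amounts to ordering two checks. The sketch below uses a hypothetical function and boolean parameters rather than the driver's real data structures: the permanent condition (no hardware support) is tested first, and the temporary one (bridgeport active) second.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Return 0 if a VNICC parameter may be accessed, or a negative errno.
 * EOPNOTSUPP is checked first so that unsupported hardware is never
 * reported as merely "busy".
 */
static int vnicc_access_check(bool hw_supports_vnicc, bool bridgeport_active)
{
	if (!hw_supports_vnicc)
		return -EOPNOTSUPP;	/* permanent: HW lacks the feature */
	if (bridgeport_active)
		return -EBUSY;		/* temporary: mutually exclusive feature in use */
	return 0;
}
```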
      
      Fixes: caa1f0b1 ("s390/qeth: add VNICC enable/disable support")
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f3846f0
    • Kees Cook's avatar
      openvswitch: Distribute switch variables for initialization · 16a556ee
      Kees Cook authored
      Variables declared in a switch statement before any case statements
      cannot be automatically initialized with compiler instrumentation (as
      they are not part of any execution flow). With GCC's proposed automatic
      stack variable initialization feature, this triggers a warning (and they
      don't get initialized). Clang's automatic stack variable initialization
      (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
      doesn't initialize such variables[1]. Note that these warnings (or silent
      skipping) happen before the dead-store elimination optimization phase,
      so even when the automatic initializations are later elided in favor of
      direct initializations, the warnings remain.
      
      To avoid these problems, move such variables into the "case" where
      they're used or lift them up into the main function body.
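
As an illustration of the pattern being changed, here is a made-up function (not taken from the patched files) in both shapes. In the first, the declaration sits between the opening brace of the switch and the first case label, a spot that control flow jumps over. In the second, the variable is lifted into the function body, where instrumentation can initialize it.

```c
#include <assert.h>

/* Before: 'len' is declared inside the switch, ahead of any case label. */
static int proto_header_len_before(int proto)
{
	switch (proto) {
		int len;	/* never part of any execution flow */
	case 4:
		len = 20;
		return len;
	case 6:
		len = 40;
		return len;
	default:
		return -1;
	}
}

/* After: the declaration is lifted up into the main function body. */
static int proto_header_len_after(int proto)
{
	int len;

	switch (proto) {
	case 4:
		len = 20;
		break;
	case 6:
		len = 40;
		break;
	default:
		len = -1;
		break;
	}
	return len;
}
```

Both versions behave the same here; the difference only matters to the compiler's automatic initialization.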
      
      net/openvswitch/flow_netlink.c: In function ‘validate_set’:
      net/openvswitch/flow_netlink.c:2711:29: warning: statement will never be executed [-Wswitch-unreachable]
       2711 |  const struct ovs_key_ipv4 *ipv4_key;
            |                             ^~~~~~~~
      
      [1] https://bugs.llvm.org/show_bug.cgi?id=44916
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16a556ee
    • Kees Cook's avatar
      net: ip6_gre: Distribute switch variables for initialization · 46d30cb1
      Kees Cook authored
      Variables declared in a switch statement before any case statements
      cannot be automatically initialized with compiler instrumentation (as
      they are not part of any execution flow). With GCC's proposed automatic
      stack variable initialization feature, this triggers a warning (and they
      don't get initialized). Clang's automatic stack variable initialization
      (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
      doesn't initialize such variables[1]. Note that these warnings (or silent
      skipping) happen before the dead-store elimination optimization phase,
      so even when the automatic initializations are later elided in favor of
      direct initializations, the warnings remain.
      
      To avoid these problems, move such variables into the "case" where
      they're used or lift them up into the main function body.
      
      net/ipv6/ip6_gre.c: In function ‘ip6gre_err’:
      net/ipv6/ip6_gre.c:440:32: warning: statement will never be executed [-Wswitch-unreachable]
        440 |   struct ipv6_tlv_tnl_enc_lim *tel;
            |                                ^~~
      
      net/ipv6/ip6_tunnel.c: In function ‘ip6_tnl_err’:
      net/ipv6/ip6_tunnel.c:520:32: warning: statement will never be executed [-Wswitch-unreachable]
        520 |   struct ipv6_tlv_tnl_enc_lim *tel;
            |                                ^~~
      
      [1] https://bugs.llvm.org/show_bug.cgi?id=44916
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46d30cb1
    • Kees Cook's avatar
      net: core: Distribute switch variables for initialization · 161d1792
      Kees Cook authored
      Variables declared in a switch statement before any case statements
      cannot be automatically initialized with compiler instrumentation (as
      they are not part of any execution flow). With GCC's proposed automatic
      stack variable initialization feature, this triggers a warning (and they
      don't get initialized). Clang's automatic stack variable initialization
      (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
      doesn't initialize such variables[1]. Note that these warnings (or silent
      skipping) happen before the dead-store elimination optimization phase,
      so even when the automatic initializations are later elided in favor of
      direct initializations, the warnings remain.
      
      To avoid these problems, move such variables into the "case" where
      they're used or lift them up into the main function body.
      
      net/core/skbuff.c: In function ‘skb_checksum_setup_ip’:
      net/core/skbuff.c:4809:7: warning: statement will never be executed [-Wswitch-unreachable]
       4809 |   int err;
            |       ^~~
      
      [1] https://bugs.llvm.org/show_bug.cgi?id=44916
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      161d1792
    • Linus Torvalds's avatar
      Merge tag 'linux-kselftest-5.6-rc3' of... · ca7e1fd1
      Linus Torvalds authored
      Merge tag 'linux-kselftest-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
       "Fixes to build failures and other test bugs"
      
      * tag 'linux-kselftest-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests: openat2: fix build error on newer glibc
        selftests: use LDLIBS for libraries instead of LDFLAGS
        selftests: fix too long argument
        selftests: allow detection of build failures
        Kernel selftests: tpm2: check for tpm support
        selftests/ftrace: Have pid filter test use instance flag
        selftests: fix spelling mistaked "chaigned" -> "chained"
      ca7e1fd1
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 41f57cfd
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2020-02-19
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 10 non-merge commits during the last 10 day(s) which contain
      a total of 10 files changed, 93 insertions(+), 31 deletions(-).
      
      The main changes are:
      
      1) batched bpf hashtab fixes from Brian and Yonghong.
      
      2) various selftests and libbpf fixes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41f57cfd
    • David S. Miller's avatar
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · fca07a93
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2020-02-19
      
      This series contains fixes to the ice driver.
      
      Brett fixes an issue where if a user sets an odd [tx|rx]-usecs value
      through ethtool, the request is denied because the hardware is set to
      have an ITR with 2us granularity.  Also fix an issue where the VF has
      not been completely removed/reset after being unbound from the host
      driver, so resolve this by waiting for the VF remove/reset process to
      happen before checking if the VF is disabled.
      
      Michal fixes an issue, where when the user changes flow control via
      ethtool, the OS is told the link is going down when that may not be the
      case.  Before the fix, the only way to get out of this state was to take
      the interface down and up again.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fca07a93
    • Willem de Bruijn's avatar
      udp: rehash on disconnect · 303d0403
      Willem de Bruijn authored
      As of the below commit, udp sockets bound to a specific address can
      coexist with one bound to the any addr for the same port.
      
      The commit also phased out the use of socket hashing based only on
      port (hslot), in favor of always hashing on {addr, port} (hslot2).
      
      The change broke the following behavior with disconnect (AF_UNSPEC):
      
          server binds to 0.0.0.0:1337
          server connects to 127.0.0.1:80
          server disconnects
          client connects to 127.0.0.1:1337
          client sends "hello"
          server reads "hello"	// times out, packet did not find sk
      
      On connect the server acquires a specific source addr suitable for
      routing to its destination. On disconnect it reverts to the any addr.
      
      The connect call triggers a rehash to a different hslot2. On
      disconnect, add the same to return to the original hslot2.
      
      Skip this step if the socket is going to be unhashed completely.
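
The failure can be reproduced in a toy model. Everything below is an assumption made for illustration (the hash function, the single-socket "table", and all names); it is not the kernel's UDP lookup. A socket is filed under a slot derived from {addr, port}, and a lookup probes the exact-address slot first and then the any-address slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS		8u
#define ADDR_ANY	0u

struct sock_model {
	uint32_t addr;		/* local address, ADDR_ANY if not specific */
	uint16_t port;
	unsigned int slot;	/* slot the socket is currently filed under */
};

/* Toy stand-in for hashing on {addr, port}. */
static unsigned int hash2(uint32_t addr, uint16_t port)
{
	return (addr * 31u + port) % NSLOTS;
}

static void rehash(struct sock_model *sk)
{
	sk->slot = hash2(sk->addr, sk->port);
}

/* connect(): acquire a specific source address and move to its slot. */
static void model_connect(struct sock_model *sk, uint32_t src_addr)
{
	sk->addr = src_addr;
	rehash(sk);
}

/* disconnect(): revert to the any address; 'fixed' selects the rehash. */
static void model_disconnect(struct sock_model *sk, bool fixed)
{
	sk->addr = ADDR_ANY;
	if (fixed)
		rehash(sk);
}

/* Probe the exact-address slot, then the any-address slot. */
static bool lookup_finds(const struct sock_model *sk, uint32_t daddr, uint16_t port)
{
	if (sk->port != port)
		return false;
	if (sk->slot == hash2(daddr, port) && sk->addr == daddr)
		return true;
	return sk->slot == hash2(ADDR_ANY, port) && sk->addr == ADDR_ANY;
}
```

With `fixed` set to false, the socket stays in the slot for the connected address while its address is already the any address, so neither probe matches it. With the rehash it returns to the any-address slot and is found again.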
      
      Fixes: 4cdeeee9 ("net: udp: prefer listeners bound to an address")
      Reported-by: Pavel Roskin <plroskin@gmail.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      303d0403
    • Rohit Maheshwari's avatar
      net/tls: Fix to avoid gettig invalid tls record · 06f5201c
      Rohit Maheshwari authored
      The current code doesn't check whether the TCP sequence number is at or
      after the 1st record's start sequence number. It only checks whether the
      seq number is before the 1st record's end sequence number. This problem
      is always possible in the retransmit case. If the record that a requested
      seq number belongs to has already been deleted, tls_get_record() starts
      walking the list and, as per the check, tests whether the seq number is
      before the end seq of the 1st record. That is always true, so the 1st
      record is always returned when NULL should be returned instead.
      As part of the fix, start looking at each record only if the sequence
      number lies in the list; otherwise return NULL.
      One more check is added: the driver looks for the start marker record to
      handle TCP packets that are before the TLS offload start sequence number,
      hence return the 1st record if it is the TLS start marker and the seq
      number is before the 1st record's starting sequence number.
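
The lookup rule can be sketched over a simple array of records. The types and names here are assumptions for illustration and do not mirror the kernel structures: a sequence number before the first record's start yields NULL, unless the first record is the start marker, and otherwise the first record whose end lies beyond the sequence number is returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record covering sequence numbers [start_seq, end_seq). */
struct record_model {
	uint32_t start_seq;
	uint32_t end_seq;
	bool is_start_marker;
};

/* Wrap-safe "a is before b" for 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* 'recs' is ordered by sequence number; returns NULL if no record matches. */
static const struct record_model *
get_record(const struct record_model *recs, size_t n, uint32_t seq)
{
	size_t i;

	if (n == 0)
		return NULL;

	/* Before the list: only the start marker may stand in for it. */
	if (seq_before(seq, recs[0].start_seq))
		return recs[0].is_start_marker ? &recs[0] : NULL;

	for (i = 0; i < n; i++)
		if (seq_before(seq, recs[i].end_seq))
			return &recs[i];

	return NULL;
}
```

Dropping the first check reproduces the bug described above: any sequence number older than the list would match the first record, because it is trivially before that record's end.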
      
      Fixes: e8f69799 ("net/tls: Add generic NIC offload infrastructure")
      Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06f5201c
    • Yonghong Song's avatar
      bpf: Fix a potential deadlock with bpf_map_do_batch · b9aff38d
      Yonghong Song authored
      Commit 05799638 ("bpf: Add batch ops to all htab bpf map")
      added lookup_and_delete batch operation for hash table.
      The current implementation has bpf_lru_push_free() inside
      the bucket lock, which may cause a deadlock.
      
      syzbot reports:
         -> #2 (&htab->buckets[i].lock#2){....}:
             __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
             _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
             htab_lru_map_delete_node+0xce/0x2f0 kernel/bpf/hashtab.c:593
             __bpf_lru_list_shrink_inactive kernel/bpf/bpf_lru_list.c:220 [inline]
             __bpf_lru_list_shrink+0xf9/0x470 kernel/bpf/bpf_lru_list.c:266
             bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:340 [inline]
             bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
             bpf_lru_pop_free+0x87c/0x1670 kernel/bpf/bpf_lru_list.c:499
             prealloc_lru_pop+0x2c/0xa0 kernel/bpf/hashtab.c:132
             __htab_lru_percpu_map_update_elem+0x67e/0xa90 kernel/bpf/hashtab.c:1069
             bpf_percpu_hash_update+0x16e/0x210 kernel/bpf/hashtab.c:1585
             bpf_map_update_value.isra.0+0x2d7/0x8e0 kernel/bpf/syscall.c:181
             generic_map_update_batch+0x41f/0x610 kernel/bpf/syscall.c:1319
             bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
             __do_sys_bpf+0x9b7/0x41e0 kernel/bpf/syscall.c:3460
             __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
             __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
             do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
             entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
         -> #0 (&loc_l->lock){....}:
             check_prev_add kernel/locking/lockdep.c:2475 [inline]
             check_prevs_add kernel/locking/lockdep.c:2580 [inline]
             validate_chain kernel/locking/lockdep.c:2970 [inline]
             __lock_acquire+0x2596/0x4a00 kernel/locking/lockdep.c:3954
             lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4484
             __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
             _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
             bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:516 [inline]
             bpf_lru_push_free+0x250/0x5b0 kernel/bpf/bpf_lru_list.c:555
             __htab_map_lookup_and_delete_batch+0x8d4/0x1540 kernel/bpf/hashtab.c:1374
             htab_lru_map_lookup_and_delete_batch+0x34/0x40 kernel/bpf/hashtab.c:1491
             bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
             __do_sys_bpf+0x1f7d/0x41e0 kernel/bpf/syscall.c:3456
             __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
             __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
             do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
             entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          Possible unsafe locking scenario:
      
                CPU0                    CPU2
                ----                    ----
           lock(&htab->buckets[i].lock#2);
                                        lock(&l->lock);
                                        lock(&htab->buckets[i].lock#2);
           lock(&loc_l->lock);
      
          *** DEADLOCK ***
      
      To fix the issue, for htab_lru_map_lookup_and_delete_batch() in CPU0,
      let us do bpf_lru_push_free() out of the htab bucket lock. This can
      avoid the above deadlock scenario.
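
The shape of the fix can be sketched with toy locks. All types and names below are assumptions made for illustration, and the "locks" are plain flags rather than real spinlocks: elements are unlinked under the bucket lock onto a private list, and the push to the free list, which takes its own lock, happens only after the bucket lock is released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_lock { bool held; };

static void lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
static void lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

struct node { struct node *next; int key; };

struct bucket { struct toy_lock lock; struct node *head; };

struct free_list {
	struct toy_lock lock;
	struct node *head;
	size_t count;
	const struct toy_lock *outer;	/* lock that must not be held on push */
};

/* Takes the free-list lock; must not nest inside the bucket lock. */
static void push_free(struct free_list *fl, struct node *n)
{
	assert(!fl->outer->held);	/* the lock-ordering rule of this sketch */
	lock_acquire(&fl->lock);
	n->next = fl->head;
	fl->head = n;
	fl->count++;
	lock_release(&fl->lock);
}

/* Delete every element of the bucket; returns how many were freed. */
static size_t delete_all(struct bucket *b, struct free_list *fl)
{
	struct node *pending, *n;
	size_t freed = 0;

	/* Under the bucket lock: only unlink, onto a private list. */
	lock_acquire(&b->lock);
	pending = b->head;
	b->head = NULL;
	lock_release(&b->lock);

	/* Bucket lock dropped: now it is safe to take the free-list lock. */
	while (pending) {
		n = pending;
		pending = n->next;
		push_free(fl, n);
		freed++;
	}
	return freed;
}
```

Because the two locks are never held at the same time on this path, the bucket-lock-then-LRU-lock ordering that conflicted with the update path no longer occurs.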
      
      Fixes: 05799638 ("bpf: Add batch ops to all htab bpf map")
      Reported-by: syzbot+a38ff3d9356388f2fb83@syzkaller.appspotmail.com
      Reported-by: syzbot+122b5421d14e68f29cd1@syzkaller.appspotmail.com
      Suggested-by: Hillf Danton <hdanton@sina.com>
      Suggested-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Brian Vazquez <brianvv@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200219234757.3544014-1-yhs@fb.com
      b9aff38d