1. 16 Jun, 2021 28 commits
    • Aya Levin's avatar
      net/mlx5: Reset mkey index on creation · 0232fc2d
      Aya Levin authored
      Reset only the index part of the mkey and keep the variant part. On
      devlink reload, driver recreates mkeys, so the mkey index may change.
      Trying to preserve the variant part of the mkey, driver mistakenly
      merged the mkey index with current value. In case of a devlink reload,
      current value of index part is dirty, so the index may be corrupted.
      
      Fixes: 54c62e13 ("{IB,net}/mlx5: Setup mkey variant before mr create command invocation")
      Signed-off-by: default avatarAya Levin <ayal@nvidia.com>
      Signed-off-by: default avatarAmir Tzin <amirtz@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      0232fc2d
    • Dmytro Linkin's avatar
      net/mlx5e: Don't create devices during unload flow · a5ae8fc9
      Dmytro Linkin authored
      Running devlink reload command for port in switchdev mode cause
      resources to corrupt: driver can't release allocated EQ and reclaim
      memory pages, because "rdma" auxiliary device had add CQs which blocks
      EQ from deletion.
      Erroneous sequence happens during reload-down phase, and is following:
      
      1. detach device - suspends auxiliary devices which support it, destroys
         others. During this step "eth-rep" and "rdma-rep" are destroyed,
         "eth" - suspended.
      2. disable SRIOV - moves device to legacy mode; as part of disablement -
         rescans drivers. This step adds "rdma" auxiliary device.
      3. destroy EQ table - <failure>.
      
      Driver shouldn't create any device during unload flows. To handle that
      implement MLX5_PRIV_FLAGS_DETACH flag, set it on device detach and unset
      on device attach. If flag is set do no-op on drivers rescan.
      
      Fixes: a925b5e3 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus")
      Signed-off-by: default avatarDmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      a5ae8fc9
    • Alex Vesker's avatar
      net/mlx5: DR, Fix STEv1 incorrect L3 decapsulation padding · 65fb7d10
      Alex Vesker authored
      Decapsulation L3 on small inner packets which are less than
      64 Bytes was done incorrectly. In small packets there is an
      extra padding added in L2 which should not be included in L3
      length. The issue was that after decapL3 the extra L2 padding
      caused an update on the L3 length.
      
      To avoid this issue the new header is pushed to the beginning
      of the packet (offset 0) which should not cause a HW reparse
      and update the L3 length.
      
      Fixes: c349b413 ("net/mlx5: DR, Add STEv1 modify header logic")
      Reviewed-by: default avatarErez Shitrit <erezsh@nvidia.com>
      Reviewed-by: default avatarYevgeny Kliteynik <kliteyn@nvidia.com>
      Signed-off-by: default avatarAlex Vesker <valex@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      65fb7d10
    • Parav Pandit's avatar
      net/mlx5: SF_DEV, remove SF device on invalid state · c7d6c19b
      Parav Pandit authored
      When auxiliary bus autoprobe is disabled and SF is in ACTIVE state,
      on SF port deletion it transitions from ACTIVE->ALLOCATED->INVALID.
      
      When VHCA event handler queries the state, it is already transition
      to INVALID state.
      
      In this scenario, event handler missed to delete the SF device.
      
      Fix it by deleting the SF when SF state is INVALID.
      
      Fixes: 90d010b8 ("net/mlx5: SF, Add auxiliary device support")
      Signed-off-by: default avatarParav Pandit <parav@nvidia.com>
      Reviewed-by: default avatarVu Pham <vuhuong@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      c7d6c19b
    • Parav Pandit's avatar
      net/mlx5: E-Switch, Allow setting GUID for host PF vport · ca36fc4d
      Parav Pandit authored
      E-switch should be able to set the GUID of host PF vport.
      Currently it returns an error. This results in below error
      when user attempts to configure MAC address of the PF of an
      external controller.
      
      $ devlink port function set pci/0000:03:00.0/196608 \
         hw_addr 00:00:00:11:22:33
      
      mlx5_core 0000:03:00.0: mlx5_esw_set_vport_mac_locked:1876:(pid 6715):\
      "Failed to set vport 0 node guid, err = -22.
      RDMA_CM will not function properly for this VF."
      
      Check for zero vport is no longer needed.
      
      Fixes: 330077d1 ("net/mlx5: E-switch, Supporting setting devlink port function mac address")
      Signed-off-by: default avatarYuval Avnery <yuvalav@nvidia.com>
      Signed-off-by: default avatarParav Pandit <parav@nvidia.com>
      Reviewed-by: default avatarBodong Wang <bodong@nvidia.com>
      Reviewed-by: default avatarAlaa Hleihel <alaa@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      ca36fc4d
    • Parav Pandit's avatar
      net/mlx5: E-Switch, Read PF mac address · bbc8222d
      Parav Pandit authored
      External controller PF's MAC address is not read from the device during
      vport setup. Fail to read this results in showing all zeros to user
      while the factory programmed MAC is a valid value.
      
      $ devlink port show eth1 -jp
      {
          "port": {
              "pci/0000:03:00.0/196608": {
                  "type": "eth",
                  "netdev": "eth1",
                  "flavour": "pcipf",
                  "controller": 1,
                  "pfnum": 0,
                  "splittable": false,
                  "function": {
                      "hw_addr": "00:00:00:00:00:00"
                  }
              }
          }
      }
      
      Hence, read it when enabling a vport.
      
      After the fix,
      
      $ devlink port show eth1 -jp
      {
          "port": {
              "pci/0000:03:00.0/196608": {
                  "type": "eth",
                  "netdev": "eth1",
                  "flavour": "pcipf",
                  "controller": 1,
                  "pfnum": 0,
                  "splittable": false,
                  "function": {
                      "hw_addr": "98:03:9b:a0:60:11"
                  }
              }
          }
      }
      
      Fixes: f099fde1 ("net/mlx5: E-switch, Support querying port function mac address")
      Signed-off-by: default avatarBodong Wang <bodong@nvidia.com>
      Signed-off-by: default avatarParav Pandit <parav@nvidia.com>
      Reviewed-by: default avatarAlaa Hleihel <alaa@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      bbc8222d
    • Leon Romanovsky's avatar
      net/mlx5: Check that driver was probed prior attaching the device · 2058cc9c
      Leon Romanovsky authored
      The device can be requested to be attached despite being not probed.
      This situation is possible if devlink reload races with module removal,
      and the following kernel panic is an outcome of such race.
      
       mlx5_core 0000:00:09.0: firmware version: 4.7.9999
       mlx5_core 0000:00:09.0: 0.000 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x255 link)
       BUG: unable to handle page fault for address: fffffffffffffff0
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 3218067 P4D 3218067 PUD 321a067 PMD 0
       Oops: 0000 [#1] SMP KASAN NOPTI
       CPU: 7 PID: 250 Comm: devlink Not tainted 5.12.0-rc2+ #2836
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:mlx5_attach_device+0x80/0x280 [mlx5_core]
       Code: f8 48 c1 e8 03 42 80 3c 38 00 0f 85 80 01 00 00 48 8b 45 68 48 8d 78 f0 48 89 fe 48 c1 ee 03 42 80 3c 3e 00 0f 85 70 01 00 00 <48> 8b 40 f0 48 85 c0 74 0d 48 89 ef ff d0 85 c0 0f 85 84 05 0e 00
       RSP: 0018:ffff8880129675f0 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff827407f1
       RDX: 1ffff110011336cf RSI: 1ffffffffffffffe RDI: fffffffffffffff0
       RBP: ffff888008e0c000 R08: 0000000000000008 R09: ffffffffa0662ee7
       R10: fffffbfff40cc5dc R11: 0000000000000000 R12: ffff88800ea002e0
       R13: ffffed1001d459f7 R14: ffffffffa05ef4f8 R15: dffffc0000000000
       FS:  00007f51dfeaf740(0000) GS:ffff88806d5c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: fffffffffffffff0 CR3: 000000000bc82006 CR4: 0000000000370ea0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        mlx5_load_one+0x117/0x1d0 [mlx5_core]
        devlink_reload+0x2d5/0x520
        ? devlink_remote_reload_actions_performed+0x30/0x30
        ? mutex_trylock+0x24b/0x2d0
        ? devlink_nl_cmd_reload+0x62b/0x1070
        devlink_nl_cmd_reload+0x66d/0x1070
        ? devlink_reload+0x520/0x520
        ? devlink_nl_pre_doit+0x64/0x4d0
        genl_family_rcv_msg_doit+0x1e9/0x2f0
        ? mutex_lock_io_nested+0x1130/0x1130
        ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240
        ? security_capable+0x51/0x90
        genl_rcv_msg+0x27f/0x4a0
        ? genl_get_cmd+0x3c0/0x3c0
        ? lock_acquire+0x1a9/0x6d0
        ? devlink_reload+0x520/0x520
        ? lock_release+0x6c0/0x6c0
        netlink_rcv_skb+0x11d/0x340
        ? genl_get_cmd+0x3c0/0x3c0
        ? netlink_ack+0x9f0/0x9f0
        ? lock_release+0x1f9/0x6c0
        genl_rcv+0x24/0x40
        netlink_unicast+0x433/0x700
        ? netlink_attachskb+0x730/0x730
        ? _copy_from_iter_full+0x178/0x650
        ? __alloc_skb+0x113/0x2b0
        netlink_sendmsg+0x6f1/0xbd0
        ? netlink_unicast+0x700/0x700
        ? netlink_unicast+0x700/0x700
        sock_sendmsg+0xb0/0xe0
        __sys_sendto+0x193/0x240
        ? __x64_sys_getpeername+0xb0/0xb0
        ? copy_page_range+0x2300/0x2300
        ? __up_read+0x1a1/0x7b0
        ? do_user_addr_fault+0x219/0xdc0
        __x64_sys_sendto+0xdd/0x1b0
        ? syscall_enter_from_user_mode+0x1d/0x50
        do_syscall_64+0x2d/0x40
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f51dffb514a
       Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
       RSP: 002b:00007ffcaef22e78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f51dffb514a
       RDX: 0000000000000030 RSI: 000055750daf2440 RDI: 0000000000000003
       RBP: 000055750daf2410 R08: 00007f51e0081200 R09: 000000000000000c
       R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       Modules linked in: mlx5_core(-) ptp pps_core ib_ipoib rdma_ucm rdma_cm iw_cm ib_cm ib_umad ib_uverbs ib_core [last unloaded: mlx5_ib]
       CR2: fffffffffffffff0
       ---[ end trace 7789831bfe74fa42 ]---
      
      Fixes: a925b5e3 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus")
      Signed-off-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: default avatarParav Pandit <parav@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      2058cc9c
    • Leon Romanovsky's avatar
      net/mlx5: Fix error path for set HCA defaults · 94a4b841
      Leon Romanovsky authored
      In the case of the failure to execute mlx5_core_set_hca_defaults(),
      we used wrong goto label to execute error unwind flow.
      
      Fixes: 5bef709d ("net/mlx5: Enable host PF HCA after eswitch is initialized")
      Reviewed-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: default avatarParav Pandit <parav@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      94a4b841
    • Kees Cook's avatar
      r8169: Avoid memcpy() over-reading of ETH_SS_STATS · da5ac772
      Kees Cook authored
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally reading across neighboring array fields.
      
      The memcpy() is copying the entire structure, not just the first array.
      Adjust the source argument so the compiler can do appropriate bounds
      checking.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      da5ac772
    • Kees Cook's avatar
      sh_eth: Avoid memcpy() over-reading of ETH_SS_STATS · 224004fb
      Kees Cook authored
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally reading across neighboring array fields.
      
      The memcpy() is copying the entire structure, not just the first array.
      Adjust the source argument so the compiler can do appropriate bounds
      checking.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      224004fb
    • Kees Cook's avatar
      r8152: Avoid memcpy() over-reading of ETH_SS_STATS · 99718abd
      Kees Cook authored
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally reading across neighboring array fields.
      
      The memcpy() is copying the entire structure, not just the first array.
      Adjust the source argument so the compiler can do appropriate bounds
      checking.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      99718abd
    • Andrea Righi's avatar
      selftests: net: use bash to run udpgro_fwd test case · 1b29df0e
      Andrea Righi authored
      udpgro_fwd.sh contains many bash specific operators ("[[", "local -r"),
      but it's using /bin/sh; in some distro /bin/sh is mapped to /bin/dash,
      that doesn't support such operators.
      
      Force the test to use /bin/bash explicitly and prevent false positive
      test failures.
      Signed-off-by: default avatarAndrea Righi <andrea.righi@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1b29df0e
    • Eric Dumazet's avatar
      net/af_unix: fix a data-race in unix_dgram_sendmsg / unix_release_sock · a494bd64
      Eric Dumazet authored
      While unix_may_send(sk, osk) is called while osk is locked, it appears
      unix_release_sock() can overwrite unix_peer() after this lock has been
      released, making KCSAN unhappy.
      
      Changing unix_release_sock() to access/change unix_peer()
      before lock is released should fix this issue.
      
      BUG: KCSAN: data-race in unix_dgram_sendmsg / unix_release_sock
      
      write to 0xffff88810465a338 of 8 bytes by task 20852 on cpu 1:
       unix_release_sock+0x4ed/0x6e0 net/unix/af_unix.c:558
       unix_release+0x2f/0x50 net/unix/af_unix.c:859
       __sock_release net/socket.c:599 [inline]
       sock_close+0x6c/0x150 net/socket.c:1258
       __fput+0x25b/0x4e0 fs/file_table.c:280
       ____fput+0x11/0x20 fs/file_table.c:313
       task_work_run+0xae/0x130 kernel/task_work.c:164
       tracehook_notify_resume include/linux/tracehook.h:189 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
       exit_to_user_mode_prepare+0x156/0x190 kernel/entry/common.c:209
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:302
       do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88810465a338 of 8 bytes by task 20888 on cpu 0:
       unix_may_send net/unix/af_unix.c:189 [inline]
       unix_dgram_sendmsg+0x923/0x1610 net/unix/af_unix.c:1712
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0xffff888167905400 -> 0x0000000000000000
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 20888 Comm: syz-executor.0 Not tainted 5.13.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a494bd64
    • Andrea Righi's avatar
      selftests: net: veth: make test compatible with dash · 0fd158b8
      Andrea Righi authored
      veth.sh is a shell script that uses /bin/sh; some distro (Ubuntu for
      example) use dash as /bin/sh and in this case the test reports the
      following error:
      
       # ./veth.sh: 21: local: -r: bad variable name
       # ./veth.sh: 21: local: -r: bad variable name
      
      This happens because dash doesn't support the option "-r" with local.
      
      Moreover, in case of missing bpf object, the script is exiting -1, that
      is an illegal number for dash:
      
       exit: Illegal number: -1
      
      Change the script to be compatible both with bash and dash and prevent
      the errors above.
      Signed-off-by: default avatarAndrea Righi <andrea.righi@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0fd158b8
    • David S. Miller's avatar
      Merge branch 'net-packet-data-races' · 1d2ac203
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net/packet: annotate data races
      
      KCSAN sent two reports about data races in af_packet.
      Nothing serious, but worth fixing.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1d2ac203
    • Eric Dumazet's avatar
      net/packet: annotate accesses to po->ifindex · e032f7c9
      Eric Dumazet authored
      Like prior patch, we need to annotate lockless accesses to po->ifindex
      For instance, packet_getname() is reading po->ifindex (twice) while
      another thread is able to change po->ifindex.
      
      KCSAN reported:
      
      BUG: KCSAN: data-race in packet_do_bind / packet_getname
      
      write to 0xffff888143ce3cbc of 4 bytes by task 25573 on cpu 1:
       packet_do_bind+0x420/0x7e0 net/packet/af_packet.c:3191
       packet_bind+0xc3/0xd0 net/packet/af_packet.c:3255
       __sys_bind+0x200/0x290 net/socket.c:1637
       __do_sys_bind net/socket.c:1648 [inline]
       __se_sys_bind net/socket.c:1646 [inline]
       __x64_sys_bind+0x3d/0x50 net/socket.c:1646
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888143ce3cbc of 4 bytes by task 25578 on cpu 0:
       packet_getname+0x5b/0x1a0 net/packet/af_packet.c:3525
       __sys_getsockname+0x10e/0x1a0 net/socket.c:1887
       __do_sys_getsockname net/socket.c:1902 [inline]
       __se_sys_getsockname net/socket.c:1899 [inline]
       __x64_sys_getsockname+0x3e/0x50 net/socket.c:1899
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x00000000 -> 0x00000001
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 25578 Comm: syz-executor.5 Not tainted 5.13.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e032f7c9
    • Eric Dumazet's avatar
      net/packet: annotate accesses to po->bind · c7d2ef5d
      Eric Dumazet authored
      tpacket_snd(), packet_snd(), packet_getname() and packet_seq_show()
      can read po->num without holding a lock. This means other threads
      can change po->num at the same time.
      
      KCSAN complained about this known fact [1]
      Add READ_ONCE()/WRITE_ONCE() to address the issue.
      
      [1] BUG: KCSAN: data-race in packet_do_bind / packet_sendmsg
      
      write to 0xffff888131a0dcc0 of 2 bytes by task 24714 on cpu 0:
       packet_do_bind+0x3ab/0x7e0 net/packet/af_packet.c:3181
       packet_bind+0xc3/0xd0 net/packet/af_packet.c:3255
       __sys_bind+0x200/0x290 net/socket.c:1637
       __do_sys_bind net/socket.c:1648 [inline]
       __se_sys_bind net/socket.c:1646 [inline]
       __x64_sys_bind+0x3d/0x50 net/socket.c:1646
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888131a0dcc0 of 2 bytes by task 24719 on cpu 1:
       packet_snd net/packet/af_packet.c:2899 [inline]
       packet_sendmsg+0x317/0x3570 net/packet/af_packet.c:3040
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmsg+0x1ed/0x270 net/socket.c:2433
       __do_sys_sendmsg net/socket.c:2442 [inline]
       __se_sys_sendmsg net/socket.c:2440 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2440
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000 -> 0x1200
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 24719 Comm: syz-executor.5 Not tainted 5.13.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c7d2ef5d
    • David S. Miller's avatar
      Merge tag 'linux-can-fixes-for-5.13-20210616' of... · e82a35ae
      David S. Miller authored
      Merge tag 'linux-can-fixes-for-5.13-20210616' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
      
      Marc Kleine-Budde says:
      
      ====================
      pull-request: can 2021-06-16
      
      this is a pull request of 4 patches for net/master.
      
      The first patch is by Oleksij Rempel and fixes a Use-after-Free found
      by syzbot in the j1939 stack.
      
      The next patch is by Tetsuo Handa and fixes hung task detected by
      syzbot in the bcm, raw and isotp protocols.
      
      Norbert Slusarek's patch fixes a infoleak in bcm's struct
      bcm_msg_head.
      
      Pavel Skripkin's patch fixes a memory leak in the mcba_usb driver.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e82a35ae
    • Chengyang Fan's avatar
      net: ipv4: fix memory leak in ip_mc_add1_src · d8e29730
      Chengyang Fan authored
      BUG: memory leak
      unreferenced object 0xffff888101bc4c00 (size 32):
        comm "syz-executor527", pid 360, jiffies 4294807421 (age 19.329s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
          01 00 00 00 00 00 00 00 ac 14 14 bb 00 00 02 00 ................
        backtrace:
          [<00000000f17c5244>] kmalloc include/linux/slab.h:558 [inline]
          [<00000000f17c5244>] kzalloc include/linux/slab.h:688 [inline]
          [<00000000f17c5244>] ip_mc_add1_src net/ipv4/igmp.c:1971 [inline]
          [<00000000f17c5244>] ip_mc_add_src+0x95f/0xdb0 net/ipv4/igmp.c:2095
          [<000000001cb99709>] ip_mc_source+0x84c/0xea0 net/ipv4/igmp.c:2416
          [<0000000052cf19ed>] do_ip_setsockopt net/ipv4/ip_sockglue.c:1294 [inline]
          [<0000000052cf19ed>] ip_setsockopt+0x114b/0x30c0 net/ipv4/ip_sockglue.c:1423
          [<00000000477edfbc>] raw_setsockopt+0x13d/0x170 net/ipv4/raw.c:857
          [<00000000e75ca9bb>] __sys_setsockopt+0x158/0x270 net/socket.c:2117
          [<00000000bdb993a8>] __do_sys_setsockopt net/socket.c:2128 [inline]
          [<00000000bdb993a8>] __se_sys_setsockopt net/socket.c:2125 [inline]
          [<00000000bdb993a8>] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2125
          [<000000006a1ffdbd>] do_syscall_64+0x40/0x80 arch/x86/entry/common.c:47
          [<00000000b11467c4>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      In commit 24803f38 ("igmp: do not remove igmp souce list info when set
      link down"), the ip_mc_clear_src() in ip_mc_destroy_dev() was removed,
      because it was also called in igmpv3_clear_delrec().
      
      Rough callgraph:
      
      inetdev_destroy
      -> ip_mc_destroy_dev
           -> igmpv3_clear_delrec
              -> ip_mc_clear_src
      -> RCU_INIT_POINTER(dev->ip_ptr, NULL)
      
      However, ip_mc_clear_src() called in igmpv3_clear_delrec() doesn't
      release in_dev->mc_list->sources. And RCU_INIT_POINTER() assigns the
      NULL to dev->ip_ptr. As a result, in_dev cannot be obtained through
      inetdev_by_index() and then in_dev->mc_list->sources cannot be released
      by ip_mc_del1_src() in the sock_close. Rough call sequence goes like:
      
      sock_close
      -> __sock_release
         -> inet_release
            -> ip_mc_drop_socket
               -> inetdev_by_index
               -> ip_mc_leave_src
                  -> ip_mc_del_src
                     -> ip_mc_del1_src
      
      So we still need to call ip_mc_clear_src() in ip_mc_destroy_dev() to free
      in_dev->mc_list->sources.
      
      Fixes: 24803f38 ("igmp: do not remove igmp souce list info ...")
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarChengyang Fan <cy.fan@huawei.com>
      Acked-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d8e29730
    • David S. Miller's avatar
      Merge branch 'fec-ptp-fixes' · c0d982bf
      David S. Miller authored
      Joakim Zhang says:
      
      ====================
      net: fixes for fec ptp
      
      Small fixes for fec ptp.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c0d982bf
    • Joakim Zhang's avatar
      net: fec_ptp: fix issue caused by refactor the fec_devtype · d2376564
      Joakim Zhang authored
      Commit da722186 ("net: fec: set GPR bit on suspend by DT configuration.")
      refactor the fec_devtype, need adjust ptp driver accordingly.
      
      Fixes: da722186 ("net: fec: set GPR bit on suspend by DT configuration.")
      Signed-off-by: default avatarJoakim Zhang <qiangqing.zhang@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d2376564
    • Fugang Duan's avatar
      net: fec_ptp: add clock rate zero check · cb3cefe3
      Fugang Duan authored
      Add clock rate zero check to fix coverity issue of "divide by 0".
      
      Fixes: commit 85bd1798 ("net: fec: fix spin_lock dead lock")
      Signed-off-by: default avatarFugang Duan <fugang.duan@nxp.com>
      Signed-off-by: default avatarJoakim Zhang <qiangqing.zhang@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cb3cefe3
    • Dongliang Mu's avatar
      net: usb: fix possible use-after-free in smsc75xx_bind · 56b786d8
      Dongliang Mu authored
      The commit 46a8b29c ("net: usb: fix memory leak in smsc75xx_bind")
      fails to clean up the work scheduled in smsc75xx_reset->
      smsc75xx_set_multicast, which leads to use-after-free if the work is
      scheduled to start after the deallocation. In addition, this patch
      also removes a dangling pointer - dev->data[0].
      
      This patch calls cancel_work_sync to cancel the scheduled work and set
      the dangling pointer to NULL.
      
      Fixes: 46a8b29c ("net: usb: fix memory leak in smsc75xx_bind")
      Signed-off-by: default avatarDongliang Mu <mudongliangabcd@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      56b786d8
    • Joakim Zhang's avatar
      net: stmmac: disable clocks in stmmac_remove_config_dt() · 8f269102
      Joakim Zhang authored
      Platform drivers may call stmmac_probe_config_dt() to parse dt, could
      call stmmac_remove_config_dt() in error handing after dt parsed, so need
      disable clocks in stmmac_remove_config_dt().
      
      Go through all platforms drivers which use stmmac_probe_config_dt(),
      none of them disable clocks manually, so it's safe to disable them in
      stmmac_remove_config_dt().
      
      Fixes: commit d2ed0a77 ("net: ethernet: stmmac: fix of-node and fixed-link-phydev leaks")
      Signed-off-by: default avatarJoakim Zhang <qiangqing.zhang@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8f269102
    • Pavel Skripkin's avatar
      can: mcba_usb: fix memory leak in mcba_usb · 91c02557
      Pavel Skripkin authored
      Syzbot reported memory leak in SocketCAN driver for Microchip CAN BUS
      Analyzer Tool. The problem was in unfreed usb_coherent.
      
      In mcba_usb_start() 20 coherent buffers are allocated and there is
      nothing, that frees them:
      
      1) In callback function the urb is resubmitted and that's all
      2) In disconnect function urbs are simply killed, but URB_FREE_BUFFER
         is not set (see mcba_usb_start) and this flag cannot be used with
         coherent buffers.
      
      Fail log:
      | [ 1354.053291][ T8413] mcba_usb 1-1:0.0 can0: device disconnected
      | [ 1367.059384][ T8420] kmemleak: 20 new suspected memory leaks (see /sys/kernel/debug/kmem)
      
      So, all allocated buffers should be freed with usb_free_coherent()
      explicitly
      
      NOTE:
      The same pattern for allocating and freeing coherent buffers
      is used in drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c
      
      Fixes: 51f3baad ("can: mcba_usb: Add support for Microchip CAN BUS Analyzer")
      Link: https://lore.kernel.org/r/20210609215833.30393-1-paskripkin@gmail.com
      Cc: linux-stable <stable@vger.kernel.org>
      Reported-and-tested-by: syzbot+57281c762a3922e14dfe@syzkaller.appspotmail.com
      Signed-off-by: default avatarPavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      91c02557
    • Norbert Slusarek's avatar
      can: bcm: fix infoleak in struct bcm_msg_head · 5e87ddbe
      Norbert Slusarek authored
      On 64-bit systems, struct bcm_msg_head has an added padding of 4 bytes between
      struct members count and ival1. Even though all struct members are initialized,
      the 4-byte hole will contain data from the kernel stack. This patch zeroes out
      struct bcm_msg_head before usage, preventing infoleaks to userspace.
      
      Fixes: ffd980f9 ("[CAN]: Add broadcast manager (bcm) protocol")
      Link: https://lore.kernel.org/r/trinity-7c1b2e82-e34f-4885-8060-2cd7a13769ce-1623532166177@3c-app-gmx-bs52
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: default avatarNorbert Slusarek <nslusarek@gmx.net>
      Acked-by: default avatarOliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      5e87ddbe
    • Tetsuo Handa's avatar
      can: bcm/raw/isotp: use per module netdevice notifier · 8d0caedb
      Tetsuo Handa authored
      syzbot is reporting hung task at register_netdevice_notifier() [1] and
      unregister_netdevice_notifier() [2], for cleanup_net() might perform
      time consuming operations while CAN driver's raw/bcm/isotp modules are
      calling {register,unregister}_netdevice_notifier() on each socket.
      
      Change raw/bcm/isotp modules to call register_netdevice_notifier() from
      module's __init function and call unregister_netdevice_notifier() from
      module's __exit function, as with gw/j1939 modules are doing.
      
      Link: https://syzkaller.appspot.com/bug?id=391b9498827788b3cc6830226d4ff5be87107c30 [1]
      Link: https://syzkaller.appspot.com/bug?id=1724d278c83ca6e6df100a2e320c10d991cf2bce [2]
      Link: https://lore.kernel.org/r/54a5f451-05ed-f977-8534-79e7aa2bcc8f@i-love.sakura.ne.jp
      Cc: linux-stable <stable@vger.kernel.org>
      Reported-by: default avatarsyzbot <syzbot+355f8edb2ff45d5f95fa@syzkaller.appspotmail.com>
      Reported-by: default avatarsyzbot <syzbot+0f1827363a305f74996f@syzkaller.appspotmail.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Tested-by: default avatarsyzbot <syzbot+355f8edb2ff45d5f95fa@syzkaller.appspotmail.com>
      Tested-by: default avatarOliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      8d0caedb
    • Oleksij Rempel's avatar
      can: j1939: fix Use-after-Free, hold skb ref while in use · 2030043e
      Oleksij Rempel authored
      This patch fixes a Use-after-Free found by the syzbot.
      
      The problem is that a skb is taken from the per-session skb queue,
      without incrementing the ref count. This leads to a Use-after-Free if
      the skb is taken concurrently from the session queue due to a CTS.
      
      Fixes: 9d71dd0c ("can: add support of SAE J1939 protocol")
      Link: https://lore.kernel.org/r/20210521115720.7533-1-o.rempel@pengutronix.de
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Reported-by: syzbot+220c1a29987a9a490903@syzkaller.appspotmail.com
      Reported-by: syzbot+45199c1b73b4013525cf@syzkaller.appspotmail.com
      Signed-off-by: default avatarOleksij Rempel <o.rempel@pengutronix.de>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      2030043e
  2. 15 Jun, 2021 6 commits
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · a4f0377d
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-06-15
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 5 non-merge commits during the last 11 day(s) which contain
      a total of 10 files changed, 115 insertions(+), 16 deletions(-).
      
      The main changes are:
      
      1) Fix marking incorrect umem ring as done in libbpf's
         xsk_socket__create_shared() helper, from Kev Jackson.
      
      2) Fix oob leakage under a spectre v1 type confusion
         attack, from Daniel Borkmann.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a4f0377d
    • Aleksander Jan Bajkowski's avatar
      lantiq: net: fix duplicated skb in rx descriptor ring · 7ea6cd16
      Aleksander Jan Bajkowski authored
      The previous commit didn't fix the bug properly. By mistake, it replaces
      the pointer of the next skb in the descriptor ring instead of the current
      one. As a result, the two descriptors are assigned the same SKB. The error
      is seen during the iperf test when skb_put tries to insert a second packet
      and exceeds the available buffer.
      
      Fixes: c7718ee9 ("net: lantiq: fix memory corruption in RX ring ")
      Signed-off-by: default avatarAleksander Jan Bajkowski <olek2@wp.pl>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7ea6cd16
    • Kristian Evensen's avatar
      qmi_wwan: Do not call netif_rx from rx_fixup · 057d4933
      Kristian Evensen authored
      When the QMI_WWAN_FLAG_PASS_THROUGH is set, netif_rx() is called from
      qmi_wwan_rx_fixup(). When the call to netif_rx() is successful (which is
      most of the time), usbnet_skb_return() is called (from rx_process()).
      usbnet_skb_return() will then call netif_rx() a second time for the same
      skb.
      
      Simplify the code and avoid the redundant netif_rx() call by changing
      qmi_wwan_rx_fixup() to always return 1 when QMI_WWAN_FLAG_PASS_THROUGH
      is set. We then leave it up to the existing infrastructure to call
      netif_rx().
      Suggested-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarKristian Evensen <kristian.evensen@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      057d4933
    • Maciej Żenczykowski's avatar
      net: cdc_ncm: switch to eth%d interface naming · c1a3d406
      Maciej Żenczykowski authored
      This is meant to make the host side cdc_ncm interface consistently
      named just like the older CDC protocols: cdc_ether & cdc_ecm
      (and even rndis_host), which all use 'FLAG_ETHER | FLAG_POINTTOPOINT'.
      
      include/linux/usb/usbnet.h:
        #define FLAG_ETHER	0x0020		/* maybe use "eth%d" names */
        #define FLAG_WLAN	0x0080		/* use "wlan%d" names */
        #define FLAG_WWAN	0x0400		/* use "wwan%d" names */
        #define FLAG_POINTTOPOINT 0x1000	/* possibly use "usb%d" names */
      
      drivers/net/usb/usbnet.c @ line 1711:
        strcpy (net->name, "usb%d");
        ...
        // heuristic:  "usb%d" for links we know are two-host,
        // else "eth%d" when there's reasonable doubt.  userspace
        // can rename the link if it knows better.
        if ((dev->driver_info->flags & FLAG_ETHER) != 0 &&
            ((dev->driver_info->flags & FLAG_POINTTOPOINT) == 0 ||
             (net->dev_addr [0] & 0x02) == 0))
                strcpy (net->name, "eth%d");
        /* WLAN devices should always be named "wlan%d" */
        if ((dev->driver_info->flags & FLAG_WLAN) != 0)
                strcpy(net->name, "wlan%d");
        /* WWAN devices should always be named "wwan%d" */
        if ((dev->driver_info->flags & FLAG_WWAN) != 0)
                strcpy(net->name, "wwan%d");
      
      So by using ETHER | POINTTOPOINT the interface naming is
      either usb%d or eth%d based on the global uniqueness of the
      mac address of the device.
      
      Without this 2.5gbps ethernet dongles which all seem to use the cdc_ncm
      driver end up being called usb%d instead of eth%d even though they're
      definitely not two-host.  (All 1gbps & 5gbps ethernet usb dongles I've
      tested don't hit this problem due to use of different drivers, primarily
      r8152 and aqc111)
      
      Fixes tag is based purely on git blame, and is really just here to make
      sure this hits LTS branches newer than v4.5.
      
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Fixes: 4d06dd53 ("cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind")
      Signed-off-by: default avatarMaciej Żenczykowski <maze@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c1a3d406
    • Changbin Du's avatar
      net: inline function get_net_ns_by_fd if NET_NS is disabled · e34492de
      Changbin Du authored
      The function get_net_ns_by_fd() could be inlined when NET_NS is not
      enabled.
      Signed-off-by: default avatarChangbin Du <changbin.du@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e34492de
    • Jakub Kicinski's avatar
      ptp: improve max_adj check against unreasonable values · 475b92f9
      Jakub Kicinski authored
      Scaled PPM conversion to PPB may (on 64bit systems) result
      in a value larger than s32 can hold (freq/scaled_ppm is a long).
      This means the kernel will not correctly reject unreasonably
      high ->freq values (e.g. > 4294967295ppb, 281474976645 scaled PPM).
      
      The conversion is equivalent to a division by ~66 (65.536),
      so the value of ppb is always smaller than ppm, but not small
      enough to assume narrowing the type from long -> s32 is okay.
      
      Note that reasonable user space (e.g. ptp4l) will not use such
      high values, anyway, 4289046510ppb ~= 4.3x, so the fix is
      somewhat pedantic.
      
      Fixes: d39a7435 ("ptp: validate the requested frequency adjustment.")
      Fixes: d94ba80e ("ptp: Added a brand new class driver for ptp clocks.")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Acked-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      475b92f9
  3. 14 Jun, 2021 6 commits
    • Subash Abhinov Kasiviswanathan's avatar
      net: mhi_net: Update the transmit handler prototype · 2214fb53
      Subash Abhinov Kasiviswanathan authored
      Update the function prototype of mhi_ndo_xmit to match
      ndo_start_xmit. This otherwise leads to run time failures when
      CFI is enabled in kernel.
      
      Fixes: 3ffec6a1 ("net: Add mhi-net driver")
      Signed-off-by: default avatarSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2214fb53
    • Daniel Borkmann's avatar
      bpf, selftests: Adjust few selftest outcomes wrt unreachable code · 973377ff
      Daniel Borkmann authored
      In almost all cases from test_verifier that have been changed in here, we've
      had an unreachable path with a load from a register which has an invalid
      address on purpose. This was basically to make sure that we never walk this
      path and to have the verifier complain if it would otherwise. Change it to
      match on the right error for unprivileged given we now test these paths
      under speculative execution.
      
      There's one case where we match on exact # of insns_processed. Due to the
      extra path, this will of course mismatch on unprivileged. Thus, restrict the
      test->insn_processed check to privileged-only.
      
      In one other case, we result in a 'pointer comparison prohibited' error. This
      is similarly due to verifying an 'invalid' branch where we end up with a value
      pointer on one side of the comparison.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      973377ff
    • Daniel Borkmann's avatar
      bpf: Fix leakage under speculation on mispredicted branches · 9183671a
      Daniel Borkmann authored
      The verifier only enumerates valid control-flow paths and skips paths that
      are unreachable in the non-speculative domain. And so it can miss issues
      under speculative execution on mispredicted branches.
      
      For example, a type confusion has been demonstrated with the following
      crafted program:
      
        // r0 = pointer to a map array entry
        // r6 = pointer to readable stack slot
        // r9 = scalar controlled by attacker
        1: r0 = *(u64 *)(r0) // cache miss
        2: if r0 != 0x0 goto line 4
        3: r6 = r9
        4: if r0 != 0x1 goto line 6
        5: r9 = *(u8 *)(r6)
        6: // leak r9
      
      Since line 3 runs iff r0 == 0 and line 5 runs iff r0 == 1, the verifier
      concludes that the pointer dereference on line 5 is safe. But: if the
      attacker trains both the branches to fall-through, such that the following
      is speculatively executed ...
      
        r6 = r9
        r9 = *(u8 *)(r6)
        // leak r9
      
      ... then the program will dereference an attacker-controlled value and could
      leak its content under speculative execution via side-channel. This requires
      to mistrain the branch predictor, which can be rather tricky, because the
      branches are mutually exclusive. However such training can be done at
      congruent addresses in user space using different branches that are not
      mutually exclusive. That is, by training branches in user space ...
      
        A:  if r0 != 0x0 goto line C
        B:  ...
        C:  if r0 != 0x0 goto line D
        D:  ...
      
      ... such that addresses A and C collide to the same CPU branch prediction
      entries in the PHT (pattern history table) as those of the BPF program's
      lines 2 and 4, respectively. A non-privileged attacker could simply brute
      force such collisions in the PHT until observing the attack succeeding.
      
      Alternative methods to mistrain the branch predictor are also possible that
      avoid brute forcing the collisions in the PHT. A reliable attack has been
      demonstrated, for example, using the following crafted program:
      
        // r0 = pointer to a [control] map array entry
        // r7 = *(u64 *)(r0 + 0), training/attack phase
        // r8 = *(u64 *)(r0 + 8), oob address
        // [...]
        // r0 = pointer to a [data] map array entry
        1: if r7 == 0x3 goto line 3
        2: r8 = r0
        // crafted sequence of conditional jumps to separate the conditional
        // branch in line 193 from the current execution flow
        3: if r0 != 0x0 goto line 5
        4: if r0 == 0x0 goto exit
        5: if r0 != 0x0 goto line 7
        6: if r0 == 0x0 goto exit
        [...]
        187: if r0 != 0x0 goto line 189
        188: if r0 == 0x0 goto exit
        // load any slowly-loaded value (due to cache miss in phase 3) ...
        189: r3 = *(u64 *)(r0 + 0x1200)
        // ... and turn it into known zero for verifier, while preserving slowly-
        // loaded dependency when executing:
        190: r3 &= 1
        191: r3 &= 2
        // speculatively bypassed phase dependency
        192: r7 += r3
        193: if r7 == 0x3 goto exit
        194: r4 = *(u8 *)(r8 + 0)
        // leak r4
      
      As can be seen, in training phase (phase != 0x3), the condition in line 1
      turns into false and therefore r8 with the oob address is overridden with
      the valid map value address, which in line 194 we can read out without
      issues. However, in attack phase, line 2 is skipped, and due to the cache
      miss in line 189 where the map value is (zeroed and later) added to the
      phase register, the condition in line 193 takes the fall-through path due
      to prior branch predictor training, where under speculation, it'll load the
      byte at oob address r8 (unknown scalar type at that point) which could then
      be leaked via side-channel.
      
      One way to mitigate these is to 'branch off' an unreachable path, meaning,
      the current verification path keeps following the is_branch_taken() path
      and we push the other branch to the verification stack. Given this is
      unreachable from the non-speculative domain, this branch's vstate is
      explicitly marked as speculative. This is needed for two reasons: i) if
      this path is solely seen from speculative execution, then we later on still
      want the dead code elimination to kick in in order to sanitize these
      instructions with jmp-1s, and ii) to ensure that paths walked in the
      non-speculative domain are not pruned from earlier walks of paths walked in
      the speculative domain. Additionally, for robustness, we mark the registers
      which have been part of the conditional as unknown in the speculative path
      given there should be no assumptions made on their content.
      
      The fix in here mitigates type confusion attacks described earlier due to
      i) all code paths in the BPF program being explored and ii) existing
      verifier logic already ensuring that given memory access instruction
      references one specific data structure.
      
      An alternative to this fix that has also been looked at in this scope was to
      mark aux->alu_state at the jump instruction with a BPF_JMP_TAKEN state as
      well as direction encoding (always-goto, always-fallthrough, unknown), such
      that mixing of different always-* directions themselves as well as mixing of
      always-* with unknown directions would cause a program rejection by the
      verifier, e.g. programs with constructs like 'if ([...]) { x = 0; } else
      { x = 1; }' with subsequent 'if (x == 1) { [...] }'. For unprivileged, this
      would result in only single direction always-* taken paths, and unknown taken
      paths being allowed, such that the former could be patched from a conditional
      jump to an unconditional jump (ja). Compared to this approach here, it would
      have two downsides: i) valid programs that otherwise are not performing any
      pointer arithmetic, etc, would potentially be rejected/broken, and ii) we are
      required to turn off path pruning for unprivileged, where both can be avoided
      in this work through pushing the invalid branch to the verification stack.
      
      The issue was originally discovered by Adam and Ofek, and later independently
      discovered and reported as a result of Benedict and Piotr's research work.
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: default avatarAdam Morrison <mad@cs.tau.ac.il>
      Reported-by: default avatarOfek Kirzner <ofekkir@gmail.com>
      Reported-by: default avatarBenedict Schlueter <benedict.schlueter@rub.de>
      Reported-by: default avatarPiotr Krysiuk <piotras@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Reviewed-by: default avatarBenedict Schlueter <benedict.schlueter@rub.de>
      Reviewed-by: default avatarPiotr Krysiuk <piotras@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      9183671a
    • Daniel Borkmann's avatar
      bpf: Do not mark insn as seen under speculative path verification · fe9a5ca7
      Daniel Borkmann authored
      ... in such circumstances, we do not want to mark the instruction as seen given
      the goal is still to jmp-1 rewrite/sanitize dead code, if it is not reachable
      from the non-speculative path verification. We do however want to verify it for
      safety regardless.
      
      With the patch as-is all the insns that have been marked as seen before the
      patch will also be marked as seen after the patch (just with a potentially
      different non-zero count). An upcoming patch will also verify paths that are
      unreachable in the non-speculative domain, hence this extension is needed.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Reviewed-by: default avatarBenedict Schlueter <benedict.schlueter@rub.de>
      Reviewed-by: default avatarPiotr Krysiuk <piotras@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      fe9a5ca7
    • Daniel Borkmann's avatar
      bpf: Inherit expanded/patched seen count from old aux data · d203b0fd
      Daniel Borkmann authored
      Instead of relying on current env->pass_cnt, use the seen count from the
      old aux data in adjust_insn_aux_data(), and expand it to the new range of
      patched instructions. This change is valid given we always expand 1:n
      with n>=1, so what applies to the old/original instruction needs to apply
      for the replacement as well.
      
      Not relying on env->pass_cnt is a prerequisite for a later change where we
      want to avoid marking an instruction seen when verified under speculative
      execution path.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Reviewed-by: default avatarBenedict Schlueter <benedict.schlueter@rub.de>
      Reviewed-by: default avatarPiotr Krysiuk <piotras@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d203b0fd
    • David S. Miller's avatar
      Merge tag 'for-net-2021-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · 45deacc7
      David S. Miller authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
       - Fix crash on SMP when debug is enabled
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      45deacc7