1. 20 Dec, 2022 1 commit
  2. 19 Dec, 2022 17 commits
    • Hou Tao's avatar
      bpf: Define sock security related BTF IDs under CONFIG_SECURITY_NETWORK · cc074822
      Hou Tao authored
      There are warnings reported from resolve_btfids when building vmlinux
      with CONFIG_SECURITY_NETWORK disabled:
      
        WARN: resolve_btfids: unresolved symbol bpf_lsm_sk_free_security
        WARN: resolve_btfids: unresolved symbol bpf_lsm_sk_alloc_security
      
      So only define BTF IDs for these LSM hooks when CONFIG_SECURITY_NETWORK
      is enabled.
      
      Fixes: c0c852dd ("bpf: Do not mark certain LSM hook arguments as trusted")
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20221217062144.2507222-1-houtao@huaweicloud.com
      cc074822
    • Jeremy Kerr's avatar
      mctp: serial: Fix starting value for frame check sequence · 2856a627
      Jeremy Kerr authored
      RFC1662 defines the start state for the crc16 FCS to be 0xffff, but
      we're currently starting at zero.
      
      This change uses the correct start state. We're only early in the
      adoption for the serial binding, so there aren't yet any other users to
      interface to.
      
      Fixes: a0c2ccd9 ("mctp: Add MCTP-over-serial transport binding")
      Reported-by: default avatarHarsh Tyagi <harshtya@google.com>
      Tested-by: default avatarHarsh Tyagi <harshtya@google.com>
      Signed-off-by: default avatarJeremy Kerr <jk@codeconstruct.com.au>
      Reviewed-by: default avatarAlexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2856a627
    • Huanhuan Wang's avatar
      nfp: fix unaligned io read of capabilities word · 1b0c84a3
      Huanhuan Wang authored
      The address of 32-bit extend capability is not qword aligned,
      and may cause exception in some arch.
      
      Fixes: 484963ce ("nfp: extend capability and control words")
      Signed-off-by: default avatarHuanhuan Wang <huanhuan.wang@corigine.com>
      Reviewed-by: default avatarNiklas Söderlund <niklas.soderlund@corigine.com>
      Signed-off-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1b0c84a3
    • Eric Dumazet's avatar
      net: stream: purge sk_error_queue in sk_stream_kill_queues() · e0c8bccd
      Eric Dumazet authored
      Changheon Lee reported TCP socket leaks, with a nice repro.
      
      It seems we leak TCP sockets with the following sequence:
      
      1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket.
      
         Each ACK will cook an skb put in error queue, from __skb_tstamp_tx().
         __skb_tstamp_tx() is using skb_clone(), unless
         SOF_TIMESTAMPING_OPT_TSONLY was also requested.
      
      2) If the application is also using MSG_ZEROCOPY, then we put in the
         error queue cloned skbs that had a struct ubuf_info attached to them.
      
         Whenever an struct ubuf_info is allocated, sock_zerocopy_alloc()
         does a sock_hold().
      
         As long as the cloned skbs are still in sk_error_queue,
         socket refcount is kept elevated.
      
      3) Application closes the socket, while error queue is not empty.
      
      Since tcp_close() no longer purges the socket error queue,
      we might end up with a TCP socket with at least one skb in
      error queue keeping the socket alive forever.
      
      This bug can be (ab)used to consume all kernel memory
      and freeze the host.
      
      We need to purge the error queue, with proper synchronization
      against concurrent writers.
      
      Fixes: 24bcbe1c ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()")
      Reported-by: default avatarChangheon Lee <darklight2357@icloud.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e0c8bccd
    • Christophe JAILLET's avatar
      myri10ge: Fix an error handling path in myri10ge_probe() · d83b950d
      Christophe JAILLET authored
      Some memory allocated in myri10ge_probe_slices() is not released in the
      error handling path of myri10ge_probe().
      
      Add the corresponding kfree(), as already done in the remove function.
      
      Fixes: 0dcffac1 ("myri10ge: add multislices support")
      Signed-off-by: default avatarChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d83b950d
    • Horatiu Vultur's avatar
      net: microchip: vcap: Fix initialization of value and mask · 10073399
      Horatiu Vultur authored
      Fix the following smatch warning:
      
      smatch warnings:
      drivers/net/ethernet/microchip/vcap/vcap_api_debugfs.c:103 vcap_debugfs_show_rule_keyfield() error: uninitialized symbol 'value'.
      drivers/net/ethernet/microchip/vcap/vcap_api_debugfs.c:106 vcap_debugfs_show_rule_keyfield() error: uninitialized symbol 'mask'.
      
      In case the vcap field was VCAP_FIELD_U128 and the key was different
      than IP6_S/DIP then the value and mask were not initialized, therefore
      initialize them.
      
      Fixes: 610c32b2 ("net: microchip: vcap: Add vcap_get_rule")
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Reported-by: default avatarDan Carpenter <error27@gmail.com>
      Reviewed-by: default avatarSaeed Mahameed <saeed@kernel.org>
      Signed-off-by: default avatarHoratiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: default avatarMichal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      10073399
    • David S. Miller's avatar
      Merge branch 'rxrpc-fixes' · 98dbec0a
      David S. Miller authored
      David Howells says:
      
      ====================
      rxrpc: Fixes for I/O thread conversion/SACK table expansion
      
      Here are some fixes for AF_RXRPC:
      
       (1) Fix missing unlock in rxrpc's sendmsg.
      
       (2) Fix (lack of) propagation of security settings to rxrpc_call.
      
       (3) Fix NULL ptr deref in rxrpc_unuse_local().
      
       (4) Fix problem with kthread_run() not invoking the I/O thread function if
           the kthread gets stopped first.  Possibly this should actually be
           fixed in the kthread code.
      
       (5) Fix locking problem as putting a peer (which may be done from RCU) may
           now invoke kthread_stop().
      
       (6) Fix switched parameters in a couple of trace calls.
      
       (7) Fix I/O thread's checking for kthread stop to make sure it completes
           all outstanding work before returning so that calls are cleaned up.
      
       (8) Fix an uninitialised var in the new rxperf test server.
      
       (9) Fix the return value of rxrpc_new_incoming_call() so that the checks
           on it work correctly.
      
      The patches fix at least one syzbot bug[1] and probably some others that
      don't have reproducers[2][3][4].  I think it also fixes another[5], but
      that showed another failure during testing that was different to the
      original.
      
      There's also an outstanding bug in rxrpc_put_peer()[6] that is fixed by a
      combination of several patches in my rxrpc-next branch, but I haven't
      included that here.
      ====================
      Tested-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      Tested-by: kafs-testing+fedora36_64checkkafs-build-164@auristor.com
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      98dbec0a
    • David Howells's avatar
      rxrpc: Fix the return value of rxrpc_new_incoming_call() · 31d35a02
      David Howells authored
      Dan Carpenter sayeth[1]:
      
        The patch 5e6ef4f1: "rxrpc: Make the I/O thread take over the
        call and local processor work" from Jan 23, 2020, leads to the
        following Smatch static checker warning:
      
      	net/rxrpc/io_thread.c:283 rxrpc_input_packet()
      	warn: bool is not less than zero.
      
      Fix this (for now) by changing rxrpc_new_incoming_call() to return an int
      with 0 or error code rather than bool.  Note that the actual return value
      of rxrpc_input_packet() is currently ignored.  I have a separate patch to
      clean that up.
      
      Fixes: 5e6ef4f1 ("rxrpc: Make the I/O thread take over the call and local processor work")
      Reported-by: default avatarDan Carpenter <error27@gmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: http://lists.infradead.org/pipermail/linux-afs/2022-December/006123.html [1]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      31d35a02
    • David Howells's avatar
      rxrpc: rxperf: Fix uninitialised variable · 11e1706b
      David Howells authored
      Dan Carpenter sayeth[1]:
      
        The patch 75bfdbf2: "rxrpc: Implement an in-kernel rxperf server
        for testing purposes" from Nov 3, 2022, leads to the following Smatch
        static checker warning:
      
      	net/rxrpc/rxperf.c:337 rxperf_deliver_to_call()
      	error: uninitialized symbol 'ret'.
      
      Fix this by initialising ret to 0.  The value is only used for tracing
      purposes in the rxperf server.
      
      Fixes: 75bfdbf2 ("rxrpc: Implement an in-kernel rxperf server for testing purposes")
      Reported-by: default avatarDan Carpenter <error27@gmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: http://lists.infradead.org/pipermail/linux-afs/2022-December/006124.html [1]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      11e1706b
    • David Howells's avatar
      rxrpc: Fix I/O thread stop · 743d1768
      David Howells authored
      The rxrpc I/O thread checks to see if there's any work it needs to do, and
      if not, checks kthread_should_stop() before scheduling, and if it should
      stop, breaks out of the loop and tries to clean up and exit.
      
      This can, however, race with socket destruction, wherein outstanding calls
      are aborted and released from the socket and then the socket unuses the
      local endpoint, causing kthread_stop() to be issued.  The abort is deferred
      to the I/O thread and the event can by issued between the I/O thread
      checking if there's any work to be done (such as processing call aborts)
      and the stop being seen.
      
      This results in the I/O thread stopping processing of events whilst call
      cleanup events are still outstanding, leading to connections or other
      objects still being around and uncleaned up, which can result in assertions
      being triggered, e.g.:
      
          rxrpc: AF_RXRPC: Leaked client conn 00000000e8009865 {2}
          ------------[ cut here ]------------
          kernel BUG at net/rxrpc/conn_client.c:64!
      
      Fix this by retrieving the kthread_should_stop() indication, then checking
      to see if there's more work to do, and going back round the loop if there
      is, and breaking out of the loop only if there wasn't.
      
      This was triggered by a syzbot test that produced some other symptom[1].
      
      Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/0000000000002b4a9f05ef2b616f@google.com/ [1]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      743d1768
    • David Howells's avatar
      rxrpc: Fix switched parameters in peer tracing · c838f1a7
      David Howells authored
      Fix the switched parameters on rxrpc_alloc_peer() and rxrpc_get_peer().
      The ref argument and the why argument got mixed.
      
      Fixes: 47c810a7 ("rxrpc: trace: Don't use __builtin_return_address for rxrpc_peer tracing")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c838f1a7
    • David Howells's avatar
      rxrpc: Fix locking issues in rxrpc_put_peer_locked() · 608aecd1
      David Howells authored
      Now that rxrpc_put_local() may call kthread_stop(), it can't be called
      under spinlock as it might sleep.  This can cause a problem in the peer
      keepalive code in rxrpc as it tries to avoid dropping the peer_hash_lock
      from the point it needs to re-add peer->keepalive_link to going round the
      loop again in rxrpc_peer_keepalive_dispatch().
      
      Fix this by just dropping the lock when we don't need it and accepting that
      we'll have to take it again.  This code is only called about every 20s for
      each peer, so not very often.
      
      This allows rxrpc_put_peer_unlocked() to be removed also.
      
      If triggered, this bug produces an oops like the following, as reproduced
      by a syzbot reproducer for a different oops[1]:
      
      BUG: sleeping function called from invalid context at kernel/sched/completion.c:101
      ...
      RCU nest depth: 0, expected: 0
      3 locks held by kworker/u9:0/50:
       #0: ffff88810e74a138 ((wq_completion)krxrpcd){+.+.}-{0:0}, at: process_one_work+0x294/0x636
       #1: ffff8881013a7e20 ((work_completion)(&rxnet->peer_keepalive_work)){+.+.}-{0:0}, at: process_one_work+0x294/0x636
       #2: ffff88817d366390 (&rxnet->peer_hash_lock){+.+.}-{2:2}, at: rxrpc_peer_keepalive_dispatch+0x2bd/0x35f
      ...
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4c/0x5f
       __might_resched+0x2cf/0x2f2
       __wait_for_common+0x87/0x1e8
       kthread_stop+0x14d/0x255
       rxrpc_peer_keepalive_dispatch+0x333/0x35f
       rxrpc_peer_keepalive_worker+0x2e9/0x449
       process_one_work+0x3c1/0x636
       worker_thread+0x25f/0x359
       kthread+0x1a6/0x1b5
       ret_from_fork+0x1f/0x30
      
      Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/0000000000002b4a9f05ef2b616f@google.com/ [1]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      608aecd1
    • David Howells's avatar
      rxrpc: Fix I/O thread startup getting skipped · 8fbcc833
      David Howells authored
      When starting a kthread, the __kthread_create_on_node() function, as called
      from kthread_run(), waits for a completion to indicate that the task_struct
      (or failure state) of the new kernel thread is available before continuing.
      
      This does not wait, however, for the thread function to be invoked and,
      indeed, will skip it if kthread_stop() gets called before it gets there.
      
      If this happens, though, kthread_run() will have returned successfully,
      indicating that the thread was started and returning the task_struct
      pointer.  The actual error indication is returned by kthread_stop().
      
      Note that this is ambiguous, as the caller cannot tell whether the -EINTR
      error code came from kthread() or from the thread function.
      
      This was encountered in the new rxrpc I/O thread, where if the system is
      being pounded hard by, say, syzbot, the check of KTHREAD_SHOULD_STOP can be
      delayed long enough for kthread_stop() to get called when rxrpc releases a
      socket - and this causes an oops because the I/O thread function doesn't
      get started and thus doesn't remove the rxrpc_local struct from the
      local_endpoints list.
      
      Fix this by using a completion to wait for the thread to actually enter
      rxrpc_io_thread().  This makes sure the thread can't be prematurely
      stopped and makes sure the relied-upon cleanup is done.
      
      Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
      Reported-by: syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: Hillf Danton <hdanton@sina.com>
      Link: https://lore.kernel.org/r/000000000000229f1505ef2b6159@google.com/Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8fbcc833
    • David Howells's avatar
      rxrpc: Fix NULL deref in rxrpc_unuse_local() · eaa02390
      David Howells authored
      Fix rxrpc_unuse_local() to get the debug_id *after* checking to see if
      local is NULL.
      
      Fixes: a2cf3264 ("rxrpc: Fold __rxrpc_unuse_local() into rxrpc_unuse_local()")
      Reported-by: syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eaa02390
    • David Howells's avatar
      rxrpc: Fix security setting propagation · fdb99487
      David Howells authored
      Fix the propagation of the security settings from sendmsg to the rxrpc_call
      struct.
      
      Fixes: f3441d41 ("rxrpc: Copy client call parameters into rxrpc_call earlier")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fdb99487
    • David Howells's avatar
      rxrpc: Fix missing unlock in rxrpc_do_sendmsg() · 4feb2c44
      David Howells authored
      One of the error paths in rxrpc_do_sendmsg() doesn't unlock the call mutex
      before returning.  Fix it to do this.
      
      Note that this still doesn't get rid of the checker warning:
      
         ../net/rxrpc/sendmsg.c:617:5: warning: context imbalance in 'rxrpc_do_sendmsg' - wrong count at exit
      
      I think the interplay between the socket lock and the call's user_mutex may
      be too complicated for checker to analyse, especially as
      rxrpc_new_client_call_for_sendmsg(), which it calls, returns with the
      call's user_mutex if successful but unconditionally drops the socket lock.
      
      Fixes: e754eba6 ("rxrpc: Provide a cmsg to specify the amount of Tx data for a call")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4feb2c44
    • Cong Wang's avatar
      net_sched: reject TCF_EM_SIMPLE case for complex ematch module · 9cd3fd20
      Cong Wang authored
      When TCF_EM_SIMPLE was introduced, it is supposed to be convenient
      for ematch implementation:
      
      https://lore.kernel.org/all/20050105110048.GO26856@postel.suug.ch/
      
      "You don't have to, providing a 32bit data chunk without TCF_EM_SIMPLE
      set will simply result in allocating & copy. It's an optimization,
      nothing more."
      
      So if an ematch module provides ops->datalen that means it wants a
      complex data structure (saved in its em->data) instead of a simple u32
      value. We should simply reject such a combination, otherwise this u32
      could be misinterpreted as a pointer.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-and-tested-by: syzbot+4caeae4c7103813598ae@syzkaller.appspotmail.com
      Reported-by: default avatarJun Nie <jun.nie@linaro.org>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarCong Wang <cong.wang@bytedance.com>
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9cd3fd20
  3. 18 Dec, 2022 1 commit
    • David S. Miller's avatar
      Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 89529367
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-12-15 (igc)
      
      Muhammad Husaini Zulkifli says:
      
      This patch series fixes bugs for the Time-Sensitive Networking(TSN)
      Qbv Scheduling features.
      
      An overview of each patch series is given below:
      
      Patch 1: Using a first flag bit to schedule a packet to the next cycle if
      packet cannot fit in current Qbv cycle.
      Patch 2: Enable strict cycle for Qbv scheduling.
      Patch 3: Prevent user to set basetime less than zero during tc config.
      Patch 4: Allow the basetime enrollment with zero value.
      Patch 5: Calculate the new end time value to exclude the time interval that
      exceed the cycle time as user can specify the cycle time in tc config.
      Patch 6: Resolve the HW bugs where the gate is not fully closed.
      ---
      This contains the net patches from this original pull request:
      https://lore.kernel.org/netdev/20221205212414.3197525-1-anthony.l.nguyen@intel.com/
      
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      89529367
  4. 17 Dec, 2022 3 commits
  5. 16 Dec, 2022 7 commits
    • Jakub Kicinski's avatar
      Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 13e3c779
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-12-16
      
      We've added 7 non-merge commits during the last 2 day(s) which contain
      a total of 9 files changed, 119 insertions(+), 36 deletions(-).
      
      1) Fix for recent syzkaller XDP dispatcher update splat, from Jiri Olsa.
      
      2) Fix BPF program refcount leak in LSM attachment failure path,
         from Milan Landaverde.
      
      3) Fix BPF program type in map compatibility check for fext,
         from Toke Høiland-Jørgensen.
      
      4) Fix a BPF selftest compilation error under !CONFIG_SMP config,
         from Yonghong Song.
      
      5) Fix CI to enable CONFIG_FUNCTION_ERROR_INJECTION after it got changed
         to a prompt, from Song Liu.
      
      6) Various BPF documentation fixes for socket local storage,
         from Donald Hunter.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        selftests/bpf: Add a test for using a cpumap from an freplace-to-XDP program
        bpf: Resolve fext program type when checking map compatibility
        bpf: Synchronize dispatcher update with bpf_dispatcher_xdp_func
        bpf: prevent leak of lsm program after failed attach
        selftests/bpf: Select CONFIG_FUNCTION_ERROR_INJECTION
        selftests/bpf: Fix a selftest compilation error with CONFIG_SMP=n
        docs/bpf: Reword docs for BPF_MAP_TYPE_SK_STORAGE
      ====================
      
      Link: https://lore.kernel.org/r/20221216174540.16598-1-daniel@iogearbox.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      13e3c779
    • Eelco Chaudron's avatar
      openvswitch: Fix flow lookup to use unmasked key · 68bb1010
      Eelco Chaudron authored
      The commit mentioned below causes the ovs_flow_tbl_lookup() function
      to be called with the masked key. However, it's supposed to be called
      with the unmasked key. This due to the fact that the datapath supports
      installing wider flows, and OVS relies on this behavior. For example
      if ipv4(src=1.1.1.1/192.0.0.0, dst=1.1.1.2/192.0.0.0) exists, a wider
      flow (smaller mask) of ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/
      128.0.0.0) is allowed to be added.
      
      However, if we try to add a wildcard rule, the installation fails:
      
      $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
        ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
      $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
        ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
      ovs-vswitchd: updating flow table (File exists)
      
      The reason is that the key used to determine if the flow is already
      present in the system uses the original key ANDed with the mask.
      This results in the IP address not being part of the (miniflow) key,
      i.e., being substituted with an all-zero value. When doing the actual
      lookup, this results in the key wrongfully matching the first flow,
      and therefore the flow does not get installed.
      
      This change reverses the commit below, but rather than having the key
      on the stack, it's allocated.
      
      Fixes: 190aa3e7 ("openvswitch: Fix Frame-size larger than 1024 bytes warning.")
      Signed-off-by: default avatarEelco Chaudron <echaudro@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      68bb1010
    • David S. Miller's avatar
      Merge branch 'devlink-fixes' · 3e31d209
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      devlink: region snapshot locking fix and selftest adjustments
      
      Minor fix for region snapshot locking and adjustments to selftests.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3e31d209
    • Jakub Kicinski's avatar
      selftests: devlink: add a warning for interfaces coming up · d1c4a346
      Jakub Kicinski authored
      NetworkManager (and other daemons) may bring the interface up
      and cause failures in quiescence checks. Print a helpful warning,
      and take the interface down again.
      
      I seem to forget about this every time I run these tests on a new VM.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d1c4a346
    • Jakub Kicinski's avatar
      selftests: devlink: fix the fd redirect in dummy_reporter_test · 2fc60e2f
      Jakub Kicinski authored
      $number + > bash means redirect FD $number, e.g. commonly
      used 2> redirects stderr (fd 2). The test uses 8192> to
      write the number 8192 to a file, this results in:
      
        ./devlink.sh: line 499: 8192: Bad file descriptor
      
      Oddly the test also papers over this issue by checking
      for failure (expecting an error rather than success)
      so it passes, anyway.
      
      Fixes: ff18176a ("selftests: Add a test of large binary to devlink health test")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2fc60e2f
    • Jakub Kicinski's avatar
      devlink: hold region lock when flushing snapshots · b4cafb3d
      Jakub Kicinski authored
      Netdevsim triggers a splat on reload, when it destroys regions
      with snapshots pending:
      
        WARNING: CPU: 1 PID: 787 at net/core/devlink.c:6291 devlink_region_snapshot_del+0x12e/0x140
        CPU: 1 PID: 787 Comm: devlink Not tainted 6.1.0-07460-g7ae9888d #580
        RIP: 0010:devlink_region_snapshot_del+0x12e/0x140
        Call Trace:
         <TASK>
         devl_region_destroy+0x70/0x140
         nsim_dev_reload_down+0x2f/0x60 [netdevsim]
         devlink_reload+0x1f7/0x360
         devlink_nl_cmd_reload+0x6ce/0x860
         genl_family_rcv_msg_doit.isra.0+0x145/0x1c0
      
      This is the locking assert in devlink_region_snapshot_del(),
      we're supposed to be holding the region->snapshot_lock here.
      
      Fixes: 2dec18ad ("net: devlink: remove region snapshots list dependency on devlink->lock")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b4cafb3d
    • Daniel Golle's avatar
      net: dsa: mt7530: remove redundant assignment · 32f1002e
      Daniel Golle authored
      Russell King correctly pointed out that the MAC_2500FD capability is
      already added for port 5 (if not in RGMII mode) and port 6 (which only
      supports SGMII) by mt7531_mac_port_get_caps. Remove the reduntant
      setting of this capability flag which was added by a previous commit.
      
      Fixes: e19de30d ("net: dsa: mt7530: add support for in-band link status")
      Reported-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDaniel Golle <daniel@makrotopia.org>
      Link: https://lore.kernel.org/r/Y5qY7x6la5TxZxzX@makrotopia.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      32f1002e
  6. 15 Dec, 2022 11 commits
    • Tan Tee Min's avatar
      igc: Set Qbv start_time and end_time to end_time if not being configured in GCL · 72abeedd
      Tan Tee Min authored
      The default setting of end_time minus start_time is whole 1 second.
      Thus, if it's not being configured in any GCL entry then it will be
      staying at original 1 second.
      
      This patch is changing the start_time and end_time to be end_time as
      if setting zero will be having weird HW behavior where the gate will
      not be fully closed.
      
      Fixes: ec50a9d4 ("igc: Add support for taprio offloading")
      Signed-off-by: default avatarTan Tee Min <tee.min.tan@linux.intel.com>
      Signed-off-by: default avatarMuhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      72abeedd
    • Tan Tee Min's avatar
      igc: recalculate Qbv end_time by considering cycle time · 6d05251d
      Tan Tee Min authored
      Qbv users can specify a cycle time that is not equal to the total GCL
      intervals. Hence, recalculation is necessary here to exclude the time
      interval that exceeds the cycle time. As those GCL which exceeds the
      cycle time will be truncated.
      
      According to IEEE Std. 802.1Q-2018 section 8.6.9.2, once the end of
      the list is reached, it will switch to the END_OF_CYCLE state and
      leave the gates in the same state until the next cycle is started.
      
      Fixes: ec50a9d4 ("igc: Add support for taprio offloading")
      Signed-off-by: default avatarTan Tee Min <tee.min.tan@linux.intel.com>
      Signed-off-by: default avatarMuhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      6d05251d
    • Tan Tee Min's avatar
      igc: allow BaseTime 0 enrollment for Qbv · e17090eb
      Tan Tee Min authored
      Introduce qbv_enable flag in igc_adapter struct to store the Qbv on/off.
      So this allow the BaseTime to enroll with zero value.
      
      Fixes: 61572d5f ("igc: Simplify TSN flags handling")
      Signed-off-by: default avatarMuhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
      Signed-off-by: default avatarTan Tee Min <tee.min.tan@linux.intel.com>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      e17090eb
    • Muhammad Husaini Zulkifli's avatar
      igc: Add checking for basetime less than zero · 3b61764f
      Muhammad Husaini Zulkifli authored
      Using the tc qdisc command, the user can set basetime to any value.
      Checking should be done on the driver's side to prevent registering
      basetime values that are less than zero.
      
      Fixes: ec50a9d4 ("igc: Add support for taprio offloading")
      Signed-off-by: default avatarMuhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      3b61764f
    • Vinicius Costa Gomes's avatar
      igc: Use strict cycles for Qbv scheduling · d8f45be0
      Vinicius Costa Gomes authored
      Configuring strict cycle mode in the controller forces more well
      behaved transmissions when taprio is offloaded.
      
      When set this strict_cycle and strict_end, transmission is not
      enabled if the whole packet cannot be completed before end of
      the Qbv cycle.
      
      Fixes: 82faa9b7 ("igc: Add support for ETF offloading")
      Signed-off-by: default avatarVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: default avatarAravindhan Gunasekaran <aravindhan.gunasekaran@intel.com>
      Signed-off-by: default avatarMuhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      d8f45be0
    • Vinicius Costa Gomes's avatar
      igc: Enhance Qbv scheduling by using first flag bit · db0b124f
      Vinicius Costa Gomes authored
      The I225 hardware has a limitation that packets can only be scheduled
      in the [0, cycle-time] interval. So, scheduling a packet to the start
      of the next cycle doesn't usually work.
      
      To overcome this, we use the Transmit Descriptor first flag to indicates
      that a packet should be the first packet (from a queue) in a cycle
      according to the section 7.5.2.9.3.4 The First Packet on Each QBV Cycle
      in Intel Discrete I225/6 User Manual.
      
      But this only works if there was any packet from that queue during the
      current cycle, to avoid this issue, we issue an empty packet if that's
      not the case. Also require one more descriptor to be available, to take
      into account the empty packet that might be issued.
      
      Test Setup:
      
      Talker: Use l2_tai to generate the launchtime into packet load.
      
      Listener: Use timedump.c to compute the delta between packet arrival
      and LaunchTime packet payload.
      
      Test Result:
      
      Before:
      
      1666000610127300000,1666000610127300096,96,621273
      1666000610127400000,1666000610127400192,192,621274
      1666000610127500000,1666000610127500032,32,621275
      1666000610127600000,1666000610127600128,128,621276
      1666000610127700000,1666000610127700224,224,621277
      1666000610127800000,1666000610127800064,64,621278
      1666000610127900000,1666000610127900160,160,621279
      1666000610128000000,1666000610128000000,0,621280
      1666000610128100000,1666000610128100096,96,621281
      1666000610128200000,1666000610128200192,192,621282
      1666000610128300000,1666000610128300032,32,621283
      1666000610128400000,1666000610128301056,-98944,621284
      1666000610128500000,1666000610128302080,-197920,621285
      1666000610128600000,1666000610128302848,-297152,621286
      1666000610128700000,1666000610128303872,-396128,621287
      1666000610128800000,1666000610128304896,-495104,621288
      1666000610128900000,1666000610128305664,-594336,621289
      1666000610129000000,1666000610128306688,-693312,621290
      1666000610129100000,1666000610128307712,-792288,621291
      1666000610129200000,1666000610128308480,-891520,621292
      1666000610129300000,1666000610128309504,-990496,621293
      1666000610129400000,1666000610128310528,-1089472,621294
      1666000610129500000,1666000610128311296,-1188704,621295
      1666000610129600000,1666000610128312320,-1287680,621296
      1666000610129700000,1666000610128313344,-1386656,621297
      1666000610129800000,1666000610128314112,-1485888,621298
      1666000610129900000,1666000610128315136,-1584864,621299
      1666000610130000000,1666000610128316160,-1683840,621300
      1666000610130100000,1666000610128316928,-1783072,621301
      1666000610130200000,1666000610128317952,-1882048,621302
      1666000610130300000,1666000610128318976,-1981024,621303
      1666000610130400000,1666000610128319744,-2080256,621304
      1666000610130500000,1666000610128320768,-2179232,621305
      1666000610130600000,1666000610128321792,-2278208,621306
      1666000610130700000,1666000610128322816,-2377184,621307
      1666000610130800000,1666000610128323584,-2476416,621308
      1666000610130900000,1666000610128324608,-2575392,621309
      1666000610131000000,1666000610128325632,-2674368,621310
      1666000610131100000,1666000610128326400,-2773600,621311
      1666000610131200000,1666000610128327424,-2872576,621312
      1666000610131300000,1666000610128328448,-2971552,621313
      1666000610131400000,1666000610128329216,-3070784,621314
      1666000610131500000,1666000610131500032,32,621315
      1666000610131600000,1666000610131600128,128,621316
      1666000610131700000,1666000610131700224,224,621317
      
      After:
      
      1666073510646200000,1666073510646200064,64,2676462
      1666073510646300000,1666073510646300160,160,2676463
      1666073510646400000,1666073510646400256,256,2676464
      1666073510646500000,1666073510646500096,96,2676465
      1666073510646600000,1666073510646600192,192,2676466
      1666073510646700000,1666073510646700032,32,2676467
      1666073510646800000,1666073510646800128,128,2676468
      1666073510646900000,1666073510646900224,224,2676469
      1666073510647000000,1666073510647000064,64,2676470
      1666073510647100000,1666073510647100160,160,2676471
      1666073510647200000,1666073510647200256,256,2676472
      1666073510647300000,1666073510647300096,96,2676473
      1666073510647400000,1666073510647400192,192,2676474
      1666073510647500000,1666073510647500032,32,2676475
      1666073510647600000,1666073510647600128,128,2676476
      1666073510647700000,1666073510647700224,224,2676477
      1666073510647800000,1666073510647800064,64,2676478
      1666073510647900000,1666073510647900160,160,2676479
      1666073510648000000,1666073510648000000,0,2676480
      1666073510648100000,1666073510648100096,96,2676481
      1666073510648200000,1666073510648200192,192,2676482
      1666073510648300000,1666073510648300032,32,2676483
      1666073510648400000,1666073510648400128,128,2676484
      1666073510648500000,1666073510648500224,224,2676485
      1666073510648600000,1666073510648600064,64,2676486
      1666073510648700000,1666073510648700160,160,2676487
      1666073510648800000,1666073510648800000,0,2676488
      1666073510648900000,1666073510648900096,96,2676489
      1666073510649000000,1666073510649000192,192,2676490
      1666073510649100000,1666073510649100032,32,2676491
      1666073510649200000,1666073510649200128,128,2676492
      1666073510649300000,1666073510649300224,224,2676493
      1666073510649400000,1666073510649400064,64,2676494
      1666073510649500000,1666073510649500160,160,2676495
      1666073510649600000,1666073510649600000,0,2676496
      1666073510649700000,1666073510649700096,96,2676497
      1666073510649800000,1666073510649800192,192,2676498
      1666073510649900000,1666073510649900032,32,2676499
      1666073510650000000,1666073510650000128,128,2676500
      
      Fixes: 82faa9b7 ("igc: Add support for ETF offloading")
      Signed-off-by: default avatarVinicius Costa Gomes <vinicius.gomes@intel.com>
      Co-developed-by: default avatarAravindhan Gunasekaran <aravindhan.gunasekaran@intel.com>
      Signed-off-by: default avatarAravindhan Gunasekaran <aravindhan.gunasekaran@intel.com>
      Co-developed-by: default avatarMuhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
      Signed-off-by: default avatarMuhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
      Signed-off-by: default avatarMalli C <mallikarjuna.chilakala@intel.com>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      db0b124f
    • Vladimir Oltean's avatar
      net: dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port() · a7d82367
      Vladimir Oltean authored
      In the blamed commit, it was not noticed that one implementation of
      chip->info->ops->phylink_get_caps(), called by mv88e6xxx_get_caps(),
      may access hardware registers, and in doing so, it takes the
      mv88e6xxx_reg_lock(). Namely, this is mv88e6352_phylink_get_caps().
      
      This is a problem because mv88e6xxx_get_caps(), apart from being
      a top-level function (method invoked by dsa_switch_ops), is now also
      directly called from mv88e6xxx_setup_port(), which runs under the
      mv88e6xxx_reg_lock() taken by mv88e6xxx_setup(). Therefore, when running
      on mv88e6352, the reg_lock would be acquired a second time and the
      system would deadlock on driver probe.
      
      The things that mv88e6xxx_setup() can compete with in terms of register
      access with are the IRQ handlers and MDIO bus operations registered by
      mv88e6xxx_probe(). So there is a real need to acquire the register lock.
      
      The register lock can, in principle, be dropped and re-acquired pretty
      much at will within the driver, as long as no operations that involve
      waiting for indirect access to complete (essentially, callers of
      mv88e6xxx_smi_direct_wait() and mv88e6xxx_wait_mask()) are interrupted
      with the lock released. However, I would guess that in mv88e6xxx_setup(),
      the critical section is kept open for such a long time just in order to
      optimize away multiple lock/unlock operations on the registers.
      
      We could, in principle, drop the reg_lock right before the
      mv88e6xxx_setup_port() -> mv88e6xxx_get_caps() call, and
      re-acquire it immediately afterwards. But this would look ugly, because
      mv88e6xxx_setup_port() would release a lock which it didn't acquire, but
      the caller did.
      
      A cleaner solution to this issue comes from the observation that struct
      mv88e6xxxx_ops methods generally assume they are called with the
      reg_lock already acquired. Whereas mv88e6352_phylink_get_caps() is more
      the exception rather than the norm, in that it acquires the lock itself.
      
      Let's enforce the same locking pattern/convention for
      chip->info->ops->phylink_get_caps() as well, and make
      mv88e6xxx_get_caps(), the top-level function, acquire the register lock
      explicitly, for this one implementation that will access registers for
      port 4 to work properly.
      
      This means that mv88e6xxx_setup_port() will no longer call the top-level
      function, but the low-level mv88e6xxx_ops method which expects the
      correct calling context (register lock held).
      
      Compared to chip->info->ops->phylink_get_caps(), mv88e6xxx_get_caps()
      also fixes up the supported_interfaces bitmap for internal ports, since
      that can be done generically and does not require per-switch knowledge.
      That's code which will no longer execute, however mv88e6xxx_setup_port()
      doesn't need that. It just needs to look at the mac_capabilities bitmap.
      
      Fixes: cc1049cc ("net: dsa: mv88e6xxx: fix speed setting for CPU/DSA ports")
      Reported-by: default avatarMaksim Kiselev <bigunclemax@gmail.com>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: default avatarMaksim Kiselev <bigunclemax@gmail.com>
      Link: https://lore.kernel.org/r/20221214110120.3368472-1-vladimir.oltean@nxp.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      a7d82367
    • Biju Das's avatar
      ravb: Fix "failed to switch device to config mode" message during unbind · c72a7e42
      Biju Das authored
      This patch fixes the error "ravb 11c20000.ethernet eth0: failed to switch
      device to config mode" during unbind.
      
      We are doing register access after pm_runtime_put_sync().
      
      We usually do cleanup in reverse order of init. Currently in
      remove(), the "pm_runtime_put_sync" is not in reverse order.
      
      Probe
      	reset_control_deassert(rstc);
      	pm_runtime_enable(&pdev->dev);
      	pm_runtime_get_sync(&pdev->dev);
      
      remove
      	pm_runtime_put_sync(&pdev->dev);
      	unregister_netdev(ndev);
      	..
      	ravb_mdio_release(priv);
      	pm_runtime_disable(&pdev->dev);
      
      Consider the call to unregister_netdev()
      unregister_netdev->unregister_netdevice_queue->rollback_registered_many
      that calls the below functions which access the registers after
      pm_runtime_put_sync()
       1) ravb_get_stats
       2) ravb_close
      
      Fixes: c156633f ("Renesas Ethernet AVB driver proper")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarBiju Das <biju.das.jz@bp.renesas.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20221214105118.2495313-1-biju.das.jz@bp.renesas.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      c72a7e42
    • Gaosheng Cui's avatar
      net: stmmac: fix errno when create_singlethread_workqueue() fails · 2cb815cf
      Gaosheng Cui authored
      We should set the return value to -ENOMEM explicitly when
      create_singlethread_workqueue() fails in stmmac_dvr_probe(),
      otherwise we'll lose the error value.
      
      Fixes: a137f3f2 ("net: stmmac: fix possible memory leak in stmmac_dvr_probe()")
      Signed-off-by: default avatarGaosheng Cui <cuigaosheng1@huawei.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20221214080117.3514615-1-cuigaosheng1@huawei.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      2cb815cf
    • Li Zetao's avatar
      r6040: Fix kmemleak in probe and remove · 7e43039a
      Li Zetao authored
      There is a memory leaks reported by kmemleak:
      
        unreferenced object 0xffff888116111000 (size 2048):
          comm "modprobe", pid 817, jiffies 4294759745 (age 76.502s)
          hex dump (first 32 bytes):
            00 c4 0a 04 81 88 ff ff 08 10 11 16 81 88 ff ff  ................
            08 10 11 16 81 88 ff ff 00 00 00 00 00 00 00 00  ................
          backtrace:
            [<ffffffff815bcd82>] kmalloc_trace+0x22/0x60
            [<ffffffff827e20ee>] phy_device_create+0x4e/0x90
            [<ffffffff827e6072>] get_phy_device+0xd2/0x220
            [<ffffffff827e7844>] mdiobus_scan+0xa4/0x2e0
            [<ffffffff827e8be2>] __mdiobus_register+0x482/0x8b0
            [<ffffffffa01f5d24>] r6040_init_one+0x714/0xd2c [r6040]
            ...
      
      The problem occurs in probe process as follows:
        r6040_init_one:
          mdiobus_register
            mdiobus_scan    <- alloc and register phy_device,
                               the reference count of phy_device is 3
          r6040_mii_probe
            phy_connect     <- connect to the first phy_device,
                               so the reference count of the first
                               phy_device is 4, others are 3
          register_netdev   <- fault inject succeeded, goto error handling path
      
          // error handling path
          err_out_mdio_unregister:
            mdiobus_unregister(lp->mii_bus);
          err_out_mdio:
            mdiobus_free(lp->mii_bus);    <- the reference count of the first
                                             phy_device is 1, it is not released
                                             and other phy_devices are released
        // similarly, the remove process also has the same problem
      
      The root cause is traced to the phy_device is not disconnected when
      removes one r6040 device in r6040_remove_one() or on error handling path
      after r6040_mii probed successfully. In r6040_mii_probe(), a net ethernet
      device is connected to the first PHY device of mii_bus, in order to
      notify the connected driver when the link status changes, which is the
      default behavior of the PHY infrastructure to handle everything.
      Therefore the phy_device should be disconnected when removes one r6040
      device or on error handling path.
      
      Fix it by adding phy_disconnect() when removes one r6040 device or on
      error handling path after r6040_mii probed successfully.
      
      Fixes: 3831861b ("r6040: implement phylib")
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20221213125614.927754-1-lizetao1@huawei.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      7e43039a
    • Kirill Tkhai's avatar
      unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg() · 3ff8bff7
      Kirill Tkhai authored
      There is a race resulting in alive SOCK_SEQPACKET socket
      may change its state from TCP_ESTABLISHED to TCP_CLOSE:
      
      unix_release_sock(peer)                  unix_dgram_sendmsg(sk)
        sock_orphan(peer)
          sock_set_flag(peer, SOCK_DEAD)
                                                 sock_alloc_send_pskb()
                                                   if !(sk->sk_shutdown & SEND_SHUTDOWN)
                                                     OK
                                                 if sock_flag(peer, SOCK_DEAD)
                                                   sk->sk_state = TCP_CLOSE
        sk->sk_shutdown = SHUTDOWN_MASK
      
      After that socket sk remains almost normal: it is able to connect, listen, accept
      and recvmsg, while it can't sendmsg.
      
      Since this is the only possibility for alive SOCK_SEQPACKET to change
      the state in such way, we should better fix this strange and potentially
      danger corner case.
      
      Note, that we will return EPIPE here like this is normally done in sock_alloc_send_pskb().
      Originally used ECONNREFUSED looks strange, since it's strange to return
      a specific retval in dependence of race in kernel, when user can't affect on this.
      
      Also, move TCP_CLOSE assignment for SOCK_DGRAM sockets under state lock
      to fix race with unix_dgram_connect():
      
      unix_dgram_connect(other)            unix_dgram_sendmsg(sk)
                                             unix_peer(sk) = NULL
                                             unix_state_unlock(sk)
        unix_state_double_lock(sk, other)
        sk->sk_state  = TCP_ESTABLISHED
        unix_peer(sk) = other
        unix_state_double_unlock(sk, other)
                                             sk->sk_state  = TCP_CLOSED
      
      This patch fixes both of these races.
      
      Fixes: 83301b53 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
      Signed-off-by: default avatarKirill Tkhai <tkhai@ya.ru>
      Link: https://lore.kernel.org/r/135fda25-22d5-837a-782b-ceee50e19844@ya.ruSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      3ff8bff7