• Daniel Borkmann's avatar
    bpf: Fix too early release of tcx_entry · 1cb6f0ba
    Daniel Borkmann authored
    Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported
    an issue that the tcx_entry can be released too early leading to a use
    after free (UAF) when an active old-style ingress or clsact qdisc with a
    shared tc block is later replaced by another ingress or clsact instance.
    
    Essentially, the sequence to trigger the UAF (one example) can be as follows:
    
      1. A network namespace is created
      2. An ingress qdisc is created. This allocates a tcx_entry, and
         &tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the
         same time, a tcf block with index 1 is created.
      3. chain0 is attached to the tcf block. chain0 must be connected to
         the block linked to the ingress qdisc to later reach the function
         tcf_chain0_head_change_cb_del() which triggers the UAF.
      4. Create and graft a clsact qdisc. This causes the ingress qdisc
         created in step 1 to be removed, thus freeing the previously linked
         tcx_entry:
    
         rtnetlink_rcv_msg()
           => tc_modify_qdisc()
             => qdisc_create()
               => clsact_init() [a]
             => qdisc_graft()
               => qdisc_destroy()
                 => __qdisc_destroy()
                   => ingress_destroy() [b]
                     => tcx_entry_free()
                       => kfree_rcu() // tcx_entry freed
    
      5. Finally, the network namespace is closed. This registers the
         cleanup_net worker, and during the process of releasing the
         remaining clsact qdisc, it accesses the tcx_entry that was
         already freed in step 4, causing the UAF to occur:
    
         cleanup_net()
           => ops_exit_list()
             => default_device_exit_batch()
               => unregister_netdevice_many()
                 => unregister_netdevice_many_notify()
                   => dev_shutdown()
                     => qdisc_put()
                       => clsact_destroy() [c]
                         => tcf_block_put_ext()
                           => tcf_chain0_head_change_cb_del()
                             => tcf_chain_head_change_item()
                               => clsact_chain_head_change()
                                 => mini_qdisc_pair_swap() // UAF
    
    There are also other variants, the gist is to add an ingress (or clsact)
    qdisc with a specific shared block, then to replace that qdisc, waiting
    for the tcx_entry kfree_rcu() to be executed and subsequently accessing
    the current active qdisc's miniq one way or another.
    
    The correct fix is to turn the miniq_active boolean into a counter. What
    can be observed, at step 2 above, the counter transitions from 0->1, at
    step [a] from 1->2 (in order for the miniq object to remain active during
    the replacement), then in [b] from 2->1 and finally [c] 1->0 with the
    eventual release. The reference counter in general ranges from [0,2] and
    it does not need to be atomic since all access to the counter is protected
    by the rtnl mutex. With this in place, there is no longer a UAF happening
    and the tcx_entry is freed at the correct time.
    
    Fixes: e420bed0 ("bpf: Add fd-based tcx multi-prog infra with link support")
    Reported-by: default avatarPedro Pinto <xten@osec.io>
    Co-developed-by: default avatarPedro Pinto <xten@osec.io>
    Signed-off-by: default avatarPedro Pinto <xten@osec.io>
    Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Cc: Hyunwoo Kim <v4bel@theori.io>
    Cc: Wongi Lee <qwerty@theori.io>
    Cc: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20240708133130.11609-1-daniel@iogearbox.netSigned-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
    1cb6f0ba
sch_ingress.c 9.19 KB