• Yufeng Mo's avatar
    bonding: 3ad: fix the concurrency between __bond_release_one() and bond_3ad_state_machine_handler() · 220ade77
    Yufeng Mo authored
    Some time ago, I reported a calltrace issue
    "did not find a suitable aggregator", please see[1].
    After a period of analysis and reproduction, I find
    that this problem is caused by concurrency.
    
    Before the problem occurs, the bond structure is like follows:
    
    bond0 - slaver0(eth0) - agg0.lag_ports -> port0 - port1
                          \
                            port0
          \
            slaver1(eth1) - agg1.lag_ports -> NULL
                          \
                            port1
    
    If we run 'ifenslave bond0 -d eth1', the process is like below:
    
    excuting __bond_release_one()
    |
    bond_upper_dev_unlink()[step1]
    |                       |                       |
    |                       |                       bond_3ad_lacpdu_recv()
    |                       |                       ->bond_3ad_rx_indication()
    |                       |                       spin_lock_bh()
    |                       |                       ->ad_rx_machine()
    |                       |                       ->__record_pdu()[step2]
    |                       |                       spin_unlock_bh()
    |                       |                       |
    |                       bond_3ad_state_machine_handler()
    |                       spin_lock_bh()
    |                       ->ad_port_selection_logic()
    |                       ->try to find free aggregator[step3]
    |                       ->try to find suitable aggregator[step4]
    |                       ->did not find a suitable aggregator[step5]
    |                       spin_unlock_bh()
    |                       |
    |                       |
    bond_3ad_unbind_slave() |
    spin_lock_bh()
    spin_unlock_bh()
    
    step1: already removed slaver1(eth1) from list, but port1 remains
    step2: receive a lacpdu and update port0
    step3: port0 will be removed from agg0.lag_ports. The struct is
           "agg0.lag_ports -> port1" now, and agg0 is not free. At the
    	   same time, slaver1/agg1 has been removed from the list by step1.
    	   So we can't find a free aggregator now.
    step4: can't find suitable aggregator because of step2
    step5: cause a calltrace since port->aggregator is NULL
    
    To solve this concurrency problem, put bond_upper_dev_unlink()
    after bond_3ad_unbind_slave(). In this way, we can invalid the port
    first and skip this port in bond_3ad_state_machine_handler(). This
    eliminates the situation that the slaver has been removed from the
    list but the port is still valid.
    
    [1]https://lore.kernel.org/netdev/10374.1611947473@famine/Signed-off-by: default avatarYufeng Mo <moyufeng@huawei.com>
    Acked-by: default avatarJay Vosburgh <jay.vosburgh@canonical.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    220ade77
bond_main.c 156 KB