    ipvlan, l3mdev: fix broken l3s mode wrt local routes · d5256083
    Daniel Borkmann authored
    While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
    I ran into the issue that while l3 mode is working fine, l3s mode
    does not have any connectivity to kube-apiserver and hence all pods
    end up in Error state as well. The ipvlan master device sits on
    top of a bond device and hostns traffic to kube-apiserver (also running
    in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
    where the latter is the address of the bond0. While in l3 mode, a
    curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
    works fine from hostns, neither of them does in case of l3s. In the
    latter only a curl to https://127.0.0.1:37573 appeared to work where
    for local addresses of bond0 I saw kernel suddenly starting to emit
    ARP requests to query HW address of bond0 which remained unanswered
    and neighbor entries in INCOMPLETE state. These ARP requests only
    happen while in l3s.
    
    Debugging this further, I found the issue is that l3s mode is piggy-
    backing on l3 master device, and in this case local routes are using
    l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
    f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
    if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
    a loopback"). I found that reverting them back into using the
    net->loopback_dev fixed ipvlan l3s connectivity and got everything
    working for the CNI.
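
    The device choice described above can be modelled in userspace as a
    rough sketch. The struct and the model_* helpers are hypothetical
    stand-ins, not kernel code, and the sketch assumes the ipvlan master
    is treated as an l3 master while in l3s mode:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in; NOT the kernel's struct net_device. */
struct model_dev {
    const char *name;
    bool is_l3_master;            /* models the device being an l3 master */
    struct model_dev *l3_master;  /* upper l3 master if enslaved, else NULL */
};

/* Models the idea behind l3mdev_master_dev_rcu(): the l3 master, or NULL. */
static struct model_dev *model_l3_master_of(struct model_dev *dev)
{
    assert(dev != NULL);
    if (dev->is_l3_master)
        return dev;
    return dev->l3_master;
}

/*
 * Device that a local input route would be bound to in this model:
 * the l3 master if there is one, otherwise the loopback device.
 */
static struct model_dev *model_local_route_dev(struct model_dev *dev,
                                               struct model_dev *loopback)
{
    struct model_dev *master = model_l3_master_of(dev);

    return master ? master : loopback;
}
```

    In this model, a bond0 flagged as l3 master gets local routes bound
    to bond0 itself, while clearing the flag falls back to loopback,
    which is the behaviour the revert restored.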
    
    Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
    l3mdev paper in [0], the sole reason why ipvlan l3s is relying
    on l3 master device is to get the l3mdev_ip_rcv() receive hook for
    setting the dst entry of the input route without adding its own
    ipvlan specific hacks into the receive path, however, any l3 domain
    semantics beyond just that are breaking l3s operation. Note that
    ipvlan also has the ability to dynamically switch its internal
    operation from l3 to l3s for all ports via ipvlan_set_port_mode()
    at runtime. In any case, l3 vs l3s solely distinguishes itself by
    'de-confusing' netfilter through switching skb->dev to ipvlan slave
    device late in NF_INET_LOCAL_IN before handing the skb to L4.
    
    The minimal fix taken here is to add an IFF_L3MDEV_RX_HANDLER flag which,
    if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
    without any additional l3mdev semantics on top. This should also have
    minimal impact since dev->priv_flags is already hot in cache. With
    this set, l3s mode is working fine and I also get things like
    masquerading pod traffic on the ipvlan master properly working.
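
    As a minimal userspace sketch of the idea (not the actual patch),
    the receive hook can be gated on either of two private flags, so a
    device can opt into the hook without being treated as an l3 master.
    The MODEL_* constants, struct model_rx_dev and the int "packet" are
    hypothetical illustrations only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the real values live in the kernel headers. */
#define MODEL_L3MDEV_MASTER      (1u << 0)
#define MODEL_L3MDEV_RX_HANDLER  (1u << 1)

/* Hypothetical stand-in for a net device with an l3 receive hook. */
struct model_rx_dev {
    unsigned int priv_flags;
    int (*l3_rcv)(int pkt);  /* models the l3mdev receive callback */
};

static bool model_is_l3_master(const struct model_rx_dev *dev)
{
    return (dev->priv_flags & MODEL_L3MDEV_MASTER) != 0;
}

static bool model_has_l3_rx_handler(const struct model_rx_dev *dev)
{
    return (dev->priv_flags & MODEL_L3MDEV_RX_HANDLER) != 0;
}

/* Run the hook if the device is an l3 master OR only wants the hook. */
static int model_l3_rcv(struct model_rx_dev *dev, int pkt)
{
    assert(dev != NULL);
    if ((model_is_l3_master(dev) || model_has_l3_rx_handler(dev)) &&
        dev->l3_rcv)
        return dev->l3_rcv(pkt);
    return pkt;  /* untouched when neither flag is set */
}

/* Example hook: marks the packet so the caller can see that it ran. */
static int model_tag_packet(int pkt)
{
    return pkt + 1000;
}
```

    A device carrying only MODEL_L3MDEV_RX_HANDLER still has its hook
    invoked, yet model_is_l3_master() stays false for it, so code that
    keys off the master check (such as local route setup) is unaffected.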
    
      [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
    
    Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
    Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
    Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Mahesh Bandewar <maheshb@google.com>
    Cc: David Ahern <dsa@cumulusnetworks.com>
    Cc: Florian Westphal <fw@strlen.de>
    Cc: Martynas Pumputis <m@lambda.lt>
    Acked-by: David Ahern <dsa@cumulusnetworks.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
ipvlan_main.c 29.2 KB