• Lahav Schlesinger's avatar
    vrf: Reset skb conntrack connection on VRF rcv · 09e856d5
    Lahav Schlesinger authored
    To fix the "reverse-NAT" for replies.
    
    When a packet is sent over a VRF, the POST_ROUTING hooks are called
    twice: Once from the VRF interface, and once from the "actual"
    interface the packet will be sent from:
    1) First SNAT: l3mdev_l3_out() -> vrf_l3_out() -> .. -> vrf_output_direct()
         This causes the POST_ROUTING hooks to run.
    2) Second SNAT: 'ip_output()' calls POST_ROUTING hooks again.
    
    Similarly for replies, first ip_rcv() calls PRE_ROUTING hooks, and
    second vrf_l3_rcv() calls them again.
    
    As an example, consider the following SNAT rule:
    > iptables -t nat -A POSTROUTING -p udp -m udp --dport 53 -j SNAT --to-source 2.2.2.2 -o vrf_1
    
    In this case sending over a VRF will create 2 conntrack entries.
    The first is from the VRF interface, which performs the IP SNAT.
    The second will run the SNAT, but since the "expected reply" will remain
    the same, conntrack randomizes the source port of the packet:
    e..g With a socket bound to 1.1.1.1:10000, sending to 3.3.3.3:53, the conntrack
    rules are:
    udp      17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=0 bytes=0 mark=0 use=1
    udp      17 29 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 packets=0 bytes=0 mark=0 use=1
    
    i.e. First SNAT IP from 1.1.1.1 --> 2.2.2.2, and second the src port is
    SNAT-ed from 10000 --> 61033.
    
    But when a reply is sent (3.3.3.3:53 -> 2.2.2.2:61033) only the later
    conntrack entry is matched:
    udp      17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=1 bytes=49 mark=0 use=1
    udp      17 28 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 packets=0 bytes=0 mark=0 use=1
    
    And a "port 61033 unreachable" ICMP packet is sent back.
    
    The issue is that when PRE_ROUTING hooks are called from vrf_l3_rcv(),
    the skb already has a conntrack flow attached to it, which means
    nf_conntrack_in() will not resolve the flow again.
    
    This means only the dest port is "reverse-NATed" (61033 -> 10000) but
    the dest IP remains 2.2.2.2, and since the socket is bound to 1.1.1.1 it's
    not received.
    This can be verified by logging the 4-tuple of the packet in '__udp4_lib_rcv()'.
    
    The fix is then to reset the flow when skb is received on a VRF, to let
    conntrack resolve the flow again (which now will hit the earlier flow).
    
    To reproduce: (Without the fix "Got pkt_to_nat_port" will not be printed by
      running 'bash ./repro'):
      $ cat run_in_A1.py
      import logging
      logging.getLogger("scapy.runtime").setLevel(logging.ERROR)
      from scapy.all import *
      import argparse
    
      def get_packet_to_send(udp_dst_port, msg_name):
          return Ether(src='11:22:33:44:55:66', dst=iface_mac)/ \
              IP(src='3.3.3.3', dst='2.2.2.2')/ \
              UDP(sport=53, dport=udp_dst_port)/ \
              Raw(f'{msg_name}\x0012345678901234567890')
    
      parser = argparse.ArgumentParser()
      parser.add_argument('-iface_mac', dest="iface_mac", type=str, required=True,
                          help="From run_in_A3.py")
      parser.add_argument('-socket_port', dest="socket_port", type=str,
                          required=True, help="From run_in_A3.py")
      parser.add_argument('-v1_mac', dest="v1_mac", type=str, required=True,
                          help="From script")
    
      args, _ = parser.parse_known_args()
      iface_mac = args.iface_mac
      socket_port = int(args.socket_port)
      v1_mac = args.v1_mac
    
      print(f'Source port before NAT: {socket_port}')
    
      while True:
          pkts = sniff(iface='_v0', store=True, count=1, timeout=10)
          if 0 == len(pkts):
              print('Something failed, rerun the script :(', flush=True)
              break
          pkt = pkts[0]
          if not pkt.haslayer('UDP'):
              continue
    
          pkt_sport = pkt.getlayer('UDP').sport
          print(f'Source port after NAT: {pkt_sport}', flush=True)
    
          pkt_to_send = get_packet_to_send(pkt_sport, 'pkt_to_nat_port')
          sendp(pkt_to_send, '_v0', verbose=False) # Will not be received
    
          pkt_to_send = get_packet_to_send(socket_port, 'pkt_to_socket_port')
          sendp(pkt_to_send, '_v0', verbose=False)
          break
    
      $ cat run_in_A2.py
      import socket
      import netifaces
    
      print(f"{netifaces.ifaddresses('e00000')[netifaces.AF_LINK][0]['addr']}",
            flush=True)
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                   str('vrf_1' + '\0').encode('utf-8'))
      s.connect(('3.3.3.3', 53))
      print(f'{s. getsockname()[1]}', flush=True)
      s.settimeout(5)
    
      while True:
          try:
              # Periodically send in order to keep the conntrack entry alive.
              s.send(b'a'*40)
              resp = s.recvfrom(1024)
              msg_name = resp[0].decode('utf-8').split('\0')[0]
              print(f"Got {msg_name}", flush=True)
          except Exception as e:
              pass
    
      $ cat repro.sh
      ip netns del A1 2> /dev/null
      ip netns del A2 2> /dev/null
      ip netns add A1
      ip netns add A2
    
      ip -n A1 link add _v0 type veth peer name _v1 netns A2
      ip -n A1 link set _v0 up
    
      ip -n A2 link add e00000 type bond
      ip -n A2 link add lo0 type dummy
      ip -n A2 link add vrf_1 type vrf table 10001
      ip -n A2 link set vrf_1 up
      ip -n A2 link set e00000 master vrf_1
    
      ip -n A2 addr add 1.1.1.1/24 dev e00000
      ip -n A2 link set e00000 up
      ip -n A2 link set _v1 master e00000
      ip -n A2 link set _v1 up
      ip -n A2 link set lo0 up
      ip -n A2 addr add 2.2.2.2/32 dev lo0
    
      ip -n A2 neigh add 1.1.1.10 lladdr 77:77:77:77:77:77 dev e00000
      ip -n A2 route add 3.3.3.3/32 via 1.1.1.10 dev e00000 table 10001
    
      ip netns exec A2 iptables -t nat -A POSTROUTING -p udp -m udp --dport 53 -j \
    	SNAT --to-source 2.2.2.2 -o vrf_1
    
      sleep 5
      ip netns exec A2 python3 run_in_A2.py > x &
      XPID=$!
      sleep 5
    
      IFACE_MAC=`sed -n 1p x`
      SOCKET_PORT=`sed -n 2p x`
      V1_MAC=`ip -n A2 link show _v1 | sed -n 2p | awk '{print $2'}`
      ip netns exec A1 python3 run_in_A1.py -iface_mac ${IFACE_MAC} -socket_port \
              ${SOCKET_PORT} -v1_mac ${SOCKET_PORT}
      sleep 5
    
      kill -9 $XPID
      wait $XPID 2> /dev/null
      ip netns del A1
      ip netns del A2
      tail x -n 2
      rm x
      set +x
    
    Fixes: 73e20b76 ("net: vrf: Add support for PREROUTING rules on vrf device")
    Signed-off-by: default avatarLahav Schlesinger <lschlesinger@drivenets.com>
    Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20210815120002.2787653-1-lschlesinger@drivenets.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
    09e856d5
vrf.c 46.4 KB