• Ido Schimmel's avatar
    nexthop: Fix infinite nexthop dump when using maximum nexthop ID · 913f60ca
    Ido Schimmel authored
    A netlink dump callback can return a positive number to signal that more
    information needs to be dumped or zero to signal that the dump is
    complete. In the second case, the core netlink code will append the
    NLMSG_DONE message to the skb in order to indicate to user space that
    the dump is complete.
    
    The nexthop dump callback always returns a positive number if nexthops
    were filled in the provided skb, even if the dump is complete. This
    means that a dump will span at least two recvmsg() calls as long as
    nexthops are present. In the last recvmsg() call the dump callback will
    not fill in any nexthops because the previous call indicated that the
    dump should restart from the last dumped nexthop ID plus one.
    
     # ip nexthop add id 1 blackhole
     # strace -e sendto,recvmsg -s 5 ip nexthop
     sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394315, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 1], {nla_len=4, nla_type=NHA_BLACKHOLE}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
     id 1 blackhole
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
     +++ exited with 0 +++
    
    This behavior is both inefficient and buggy. If the last nexthop to be
    dumped had the maximum ID of 0xffffffff, then the dump will restart from
    0 (0xffffffff + 1) and never end:
    
     # ip nexthop add id $((2**32-1)) blackhole
     # ip nexthop
     id 4294967295 blackhole
     id 4294967295 blackhole
     [...]
    
    Fix by adjusting the dump callback to return zero when the dump is
    complete. After the fix only one recvmsg() call is made and the
    NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response:
    
     # ip nexthop add id $((2**32-1)) blackhole
     # strace -e sendto,recvmsg -s 5 ip nexthop
     sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394080, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 56
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 4294967295], {nla_len=4, nla_type=NHA_BLACKHOLE}]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 56
     id 4294967295 blackhole
     +++ exited with 0 +++
    
    Note that if the NLMSG_DONE message cannot be appended because of size
    limitations, then another recvmsg() will be needed, but the core netlink
    code will not invoke the dump callback and simply reply with a
    NLMSG_DONE message since it knows that the callback previously returned
    zero.
    
    Add a test that fails before the fix:
    
     # ./fib_nexthops.sh -t basic
     [...]
     TEST: Maximum nexthop ID dump                                       [FAIL]
     [...]
    
    And passes after it:
    
     # ./fib_nexthops.sh -t basic
     [...]
     TEST: Maximum nexthop ID dump                                       [ OK ]
     [...]
    
    Fixes: ab84be7e ("net: Initial nexthop code")
    Reported-by: default avatarPetr Machata <petrm@nvidia.com>
    Closes: https://lore.kernel.org/netdev/87sf91enuf.fsf@nvidia.com/Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
    Reviewed-by: default avatarPetr Machata <petrm@nvidia.com>
    Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20230808075233.3337922-2-idosch@nvidia.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
    913f60ca
fib_nexthops.sh 68.9 KB