Commit 7e225619 authored by David S. Miller

Merge branch 'vrf-allow-simultaneous-service-instances-in-default-and-other-VRFs'

Mike Manning says:

====================
vrf: allow simultaneous service instances in default and other VRFs

Services currently have to be VRF-aware if they are using an unbound
socket. One cannot have multiple service instances running in the
default and other VRFs for services that are not VRF-aware and listen
on an unbound socket. This is because there is no easy way of isolating
packets received in the default VRF from those arriving in other VRFs.

This series provides this isolation for stream sockets subject to the
existing kernel parameter net.ipv4.tcp_l3mdev_accept not being set,
given that this is documented as allowing a single service instance to
work across all VRF domains. Similarly, net.ipv4.udp_l3mdev_accept is
checked for datagram sockets, and net.ipv4.raw_l3mdev_accept is
introduced for raw sockets. The functionality applies to UDP & TCP
services as well as those using raw sockets, and is for IPv4 and IPv6.

Example of running ssh instances in default and blue VRF:

$ /usr/sbin/sshd -D
$ ip vrf exec vrf-blue /usr/sbin/sshd
$ ss -ta | egrep 'State|ssh'
State   Recv-Q   Send-Q           Local Address:Port       Peer Address:Port
LISTEN  0        128           0.0.0.0%vrf-blue:ssh             0.0.0.0:*
LISTEN  0        128                    0.0.0.0:ssh             0.0.0.0:*
ESTAB   0        0              192.168.122.220:ssh       192.168.122.1:50282
LISTEN  0        128              [::]%vrf-blue:ssh                [::]:*
LISTEN  0        128                       [::]:ssh                [::]:*
ESTAB   0        0           [3000::2]%vrf-blue:ssh           [3000::9]:45896
ESTAB   0        0                    [2000::2]:ssh           [2000::9]:46398

v1:
   - Address Paolo Abeni's comments (patch 4/5)
   - Fix build when CONFIG_NET_L3_MASTER_DEV not defined (patch 1/5)
v2:
   - Address David Ahern's comments (patches 4/5 and 5/5)
   - Remove patches 3/5 and 5/5 from series for individual submissions
   - Include a sysctl for raw sockets as recommended by David Ahern
   - Expand series into 10 patches and provide improved descriptions
v3:
   - Update description for patch 1/10 and remove patch 6/10
v4:
   - Set default to enabled for raw socket sysctl as recommended by David Ahern
v5:
   - Address review comments from David Ahern in patches 2-5
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parents f601a85b 7bd2db40
@@ -370,6 +370,7 @@ tcp_l3mdev_accept - BOOLEAN
 	derived from the listen socket to be bound to the L3 domain in
 	which the packets originated. Only valid when the kernel was
 	compiled with CONFIG_NET_L3_MASTER_DEV.
+	Default: 0 (disabled)
 
 tcp_low_latency - BOOLEAN
 	This is a legacy option, it has no effect anymore.
@@ -773,6 +774,7 @@ udp_l3mdev_accept - BOOLEAN
 	being received regardless of the L3 domain in which they
 	originated. Only valid when the kernel was compiled with
 	CONFIG_NET_L3_MASTER_DEV.
+	Default: 0 (disabled)
 
 udp_mem - vector of 3 INTEGERs: min, pressure, max
 	Number of pages allowed for queueing by all UDP sockets.
@@ -799,6 +801,16 @@ udp_wmem_min - INTEGER
 	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
 	Default: 4K
 
+RAW variables:
+
+raw_l3mdev_accept - BOOLEAN
+	Enabling this option allows a "global" bound socket to work
+	across L3 master domains (e.g., VRFs) with packets capable of
+	being received regardless of the L3 domain in which they
+	originated. Only valid when the kernel was compiled with
+	CONFIG_NET_L3_MASTER_DEV.
+	Default: 1 (enabled)
+
 CIPSOv4 Variables:
 
 cipso_cache_enable - BOOLEAN
...
@@ -103,19 +103,33 @@ VRF device:
 
     or to specify the output device using cmsg and IP_PKTINFO.
 
+By default the scope of the port bindings for unbound sockets is
+limited to the default VRF. That is, an unbound socket will not be matched
+by packets arriving on interfaces enslaved to an l3mdev, and processes may
+bind to the same port if they bind to an l3mdev.
+
 TCP & UDP services running in the default VRF context (ie., not bound
 to any VRF device) can work across all VRF domains by enabling the
 tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
 
     sysctl -w net.ipv4.tcp_l3mdev_accept=1
     sysctl -w net.ipv4.udp_l3mdev_accept=1
 
+These options are disabled by default so that a socket in a VRF is only
+selected for packets in that VRF. There is a similar option for RAW
+sockets, which is enabled by default for reasons of backwards compatibility:
+it allows the output device to be specified with cmsg and IP_PKTINFO using
+a socket that is not bound to the corresponding VRF. This lets e.g. older
+ping implementations specify the device without being executed in the VRF.
+The option can be disabled so that packets received in a VRF context are
+only handled by a raw socket bound to the VRF, and packets in the default
+VRF are only handled by a socket not bound to any VRF:
+
+    sysctl -w net.ipv4.raw_l3mdev_accept=0
+
 netfilter rules on the VRF device can be used to limit access to services
 running in the default VRF context as well.
 
-The default VRF does not have limited scope with respect to port bindings.
-That is, if a process does a wildcard bind to a port in the default VRF it
-owns the port across all VRF domains within the network namespace.
-
 ################################################################################
 
 Using iproute2 for VRFs
...
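The port-binding behaviour documented above is easy to exercise from
userspace. Below is a minimal sketch, not part of this series; the VRF name
"vrf-blue" and port 5000 are illustrative, and SO_BINDTODEVICE needs
CAP_NET_RAW. It creates two TCP listeners on the same port, one unbound
(default VRF) and one scoped to a VRF; with tcp_l3mdev_accept=0 both binds
now succeed:

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int tcp_listener(const char *dev, unsigned short port)
{
	struct sockaddr_in a = { .sin_family = AF_INET,
				 .sin_addr.s_addr = htonl(INADDR_ANY),
				 .sin_port = htons(port) };
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	/* scope the socket to the VRF device, if one was given */
	if (dev && setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
			      dev, strlen(dev)) < 0)
		perror("SO_BINDTODEVICE");
	/* with per-l3mdev bind buckets, the second bind to the same
	 * port no longer fails with EADDRINUSE
	 */
	if (bind(fd, (struct sockaddr *)&a, sizeof(a)) < 0)
		perror("bind");
	else
		listen(fd, 16);
	return fd;
}

int main(void)
{
	int dflt = tcp_listener(NULL, 5000);       /* default VRF   */
	int blue = tcp_listener("vrf-blue", 5000); /* VRF vrf-blue  */

	pause();                                   /* inspect with ss -ta */
	close(dflt);
	close(blue);
	return 0;
}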
@@ -981,24 +981,23 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev,
 				   struct sk_buff *skb)
 {
 	int orig_iif = skb->skb_iif;
-	bool need_strict;
+	bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
+	bool is_ndisc = ipv6_ndisc_frame(skb);
 
-	/* loopback traffic; do not push through packet taps again.
-	 * Reset pkt_type for upper layers to process skb
+	/* loopback, multicast & non-ND link-local traffic; do not push through
+	 * packet taps again. Reset pkt_type for upper layers to process skb
 	 */
-	if (skb->pkt_type == PACKET_LOOPBACK) {
+	if (skb->pkt_type == PACKET_LOOPBACK || (need_strict && !is_ndisc)) {
 		skb->dev = vrf_dev;
 		skb->skb_iif = vrf_dev->ifindex;
 		IP6CB(skb)->flags |= IP6SKB_L3SLAVE;
-		skb->pkt_type = PACKET_HOST;
+		if (skb->pkt_type == PACKET_LOOPBACK)
+			skb->pkt_type = PACKET_HOST;
 		goto out;
 	}
 
-	/* if packet is NDISC or addressed to multicast or link-local
-	 * then keep the ingress interface
-	 */
-	need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
-	if (!ipv6_ndisc_frame(skb) && !need_strict) {
+	/* if packet is NDISC then keep the ingress interface */
+	if (!is_ndisc) {
 		vrf_rx_stats(vrf_dev, skb->len);
 		skb->dev = vrf_dev;
 		skb->skb_iif = vrf_dev->ifindex;
...
@@ -115,8 +115,7 @@ int inet6_hash(struct sock *sk);
 	((__sk)->sk_family == AF_INET6) &&			\
 	ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr)) &&	\
 	ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr)) &&	\
-	(!(__sk)->sk_bound_dev_if ||				\
-	 ((__sk)->sk_bound_dev_if == (__dif)) ||		\
+	(((__sk)->sk_bound_dev_if == (__dif)) ||		\
 	 ((__sk)->sk_bound_dev_if == (__sdif))) &&		\
 	net_eq(sock_net(__sk), (__net)))
...
@@ -79,6 +79,7 @@ struct inet_ehash_bucket {
 struct inet_bind_bucket {
 	possible_net_t		ib_net;
+	int			l3mdev;
 	unsigned short		port;
 	signed char		fastreuse;
 	signed char		fastreuseport;
@@ -188,10 +189,21 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 	hashinfo->ehash_locks = NULL;
 }
 
+static inline bool inet_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+					int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(!!net->ipv4.sysctl_tcp_l3mdev_accept,
+				 bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
+#endif
+}
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
-			const unsigned short snum);
+			const unsigned short snum, int l3mdev);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
 			      struct inet_bind_bucket *tb);
@@ -282,8 +294,7 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
 	(((__sk)->sk_portpair == (__ports)) &&		\
 	 ((__sk)->sk_addrpair == (__cookie)) &&		\
-	 (!(__sk)->sk_bound_dev_if ||			\
-	  ((__sk)->sk_bound_dev_if == (__dif)) ||	\
+	 (((__sk)->sk_bound_dev_if == (__dif)) ||	\
 	  ((__sk)->sk_bound_dev_if == (__sdif))) &&	\
 	 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
@@ -294,8 +305,7 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 	(((__sk)->sk_portpair == (__ports)) &&		\
 	 ((__sk)->sk_daddr == (__saddr)) &&		\
 	 ((__sk)->sk_rcv_saddr == (__daddr)) &&		\
-	 (!(__sk)->sk_bound_dev_if ||			\
-	  ((__sk)->sk_bound_dev_if == (__dif)) ||	\
+	 (((__sk)->sk_bound_dev_if == (__dif)) ||	\
	  ((__sk)->sk_bound_dev_if == (__sdif))) &&	\
	 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */
...
@@ -130,6 +130,27 @@ static inline int inet_request_bound_dev_if(const struct sock *sk,
 	return sk->sk_bound_dev_if;
 }
 
+static inline int inet_sk_bound_l3mdev(const struct sock *sk)
+{
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	struct net *net = sock_net(sk);
+
+	if (!net->ipv4.sysctl_tcp_l3mdev_accept)
+		return l3mdev_master_ifindex_by_index(net,
+						      sk->sk_bound_dev_if);
+#endif
+	return 0;
+}
+
+static inline bool inet_bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
+				     int dif, int sdif)
+{
+	if (!bound_dev_if)
+		return !sdif || l3mdev_accept;
+
+	return bound_dev_if == dif || bound_dev_if == sdif;
+}
+
 struct inet_cork {
 	unsigned int		flags;
 	__be32			addr;
...
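The inet_bound_dev_eq() helper added above is the predicate at the heart of
the series. A small userspace walk-through of its truth table; the function
body is copied from the hunk, and the ifindex values are illustrative (VRF
master 2, enslaved interface 3):

#include <stdbool.h>
#include <stdio.h>

/* copied from the inet_sock.h hunk above */
static bool inet_bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
			      int dif, int sdif)
{
	if (!bound_dev_if)
		return !sdif || l3mdev_accept;

	return bound_dev_if == dif || bound_dev_if == sdif;
}

int main(void)
{
	/* packet from an interface (ifindex 3) enslaved to a VRF
	 * (ifindex 2): dif is the master, sdif the slave
	 */
	printf("unbound, sysctl off: %d\n",
	       inet_bound_dev_eq(false, 0, 2, 3)); /* 0: no match */
	printf("unbound, sysctl on:  %d\n",
	       inet_bound_dev_eq(true, 0, 2, 3));  /* 1: match    */
	printf("unbound, no VRF:     %d\n",
	       inet_bound_dev_eq(false, 0, 4, 0)); /* 1: match    */
	printf("bound to the VRF:    %d\n",
	       inet_bound_dev_eq(false, 2, 2, 3)); /* 1: match    */
	return 0;
}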
@@ -103,6 +103,9 @@ struct netns_ipv4 {
 	/* Shall we try to damage output packets if routing dev changes? */
 	int sysctl_ip_dynaddr;
 	int sysctl_ip_early_demux;
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	int sysctl_raw_l3mdev_accept;
+#endif
 	int sysctl_tcp_early_demux;
 	int sysctl_udp_early_demux;
...
@@ -17,7 +17,7 @@
 #ifndef _RAW_H
 #define _RAW_H
 
+#include <net/inet_sock.h>
 #include <net/protocol.h>
 #include <linux/icmp.h>
@@ -61,6 +61,7 @@ void raw_seq_stop(struct seq_file *seq, void *v);
 
 int raw_hash_sk(struct sock *sk);
 void raw_unhash_sk(struct sock *sk);
+void raw_init(void);
 
 struct raw_sock {
 	/* inet_sock has to be the first member */
@@ -74,4 +75,15 @@ static inline struct raw_sock *raw_sk(const struct sock *sk)
 	return (struct raw_sock *)sk;
 }
 
+static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+				       int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(!!net->ipv4.sysctl_raw_l3mdev_accept,
+				 bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
+#endif
+}
+
 #endif	/* _RAW_H */
@@ -252,6 +252,17 @@ static inline int udp_rqueue_get(struct sock *sk)
 	return sk_rmem_alloc_get(sk) - READ_ONCE(udp_sk(sk)->forward_deficit);
 }
 
+static inline bool udp_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+				       int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(!!net->ipv4.sysctl_udp_l3mdev_accept,
+				 bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
+#endif
+}
+
 /* net/ipv4/udp.c */
 void udp_destruct_sock(struct sock *sk);
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
...
@@ -567,6 +567,8 @@ static int sock_setbindtodevice(struct sock *sk, char __user *optval,
 
 	lock_sock(sk);
 	sk->sk_bound_dev_if = index;
+	if (sk->sk_prot->rehash)
+		sk->sk_prot->rehash(sk);
 	sk_dst_reset(sk);
 	release_sock(sk);
...
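The new rehash call matters when a socket is scoped to a device only after
it has been bound, since the socket was originally hashed with
sk_bound_dev_if == 0. A hypothetical userspace sequence that relies on it
("vrf-blue" and port 5353 are illustrative):

#include <arpa/inet.h>
#include <sys/socket.h>

int bind_then_scope(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET,
				 .sin_addr.s_addr = htonl(INADDR_ANY),
				 .sin_port = htons(5353) };
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	/* the socket is hashed here, with sk_bound_dev_if == 0 ... */
	if (bind(fd, (struct sockaddr *)&a, sizeof(a)) < 0)
		return -1;
	/* ... and rehashed here now that sock_setbindtodevice()
	 * calls sk->sk_prot->rehash()
	 */
	if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
		       "vrf-blue", sizeof("vrf-blue") - 1) < 0)
		return -1;
	return fd;
}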
@@ -1964,6 +1964,8 @@ static int __init inet_init(void)
 	/* Add UDP-Lite (RFC 3828) */
 	udplite4_register();
 
+	raw_init();
+
 	ping_init();
 
 	/*
...
@@ -183,7 +183,9 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 	int i, low, high, attempt_half;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
+	int l3mdev;
 
+	l3mdev = inet_sk_bound_l3mdev(sk);
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
 	inet_get_local_port_range(net, &low, &high);
@@ -219,7 +221,8 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 				      hinfo->bhash_size)];
 		spin_lock_bh(&head->lock);
 		inet_bind_bucket_for_each(tb, &head->chain)
-			if (net_eq(ib_net(tb), net) && tb->port == port) {
+			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+			    tb->port == port) {
 				if (!inet_csk_bind_conflict(sk, tb, false, false))
 					goto success;
 				goto next_port;
@@ -293,6 +296,9 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb = NULL;
 	kuid_t uid = sock_i_uid(sk);
+	int l3mdev;
+
+	l3mdev = inet_sk_bound_l3mdev(sk);
 
 	if (!port) {
 		head = inet_csk_find_open_port(sk, &tb, &port);
@@ -306,11 +312,12 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 			     hinfo->bhash_size)];
 	spin_lock_bh(&head->lock);
 	inet_bind_bucket_for_each(tb, &head->chain)
-		if (net_eq(ib_net(tb), net) && tb->port == port)
+		if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+		    tb->port == port)
 			goto tb_found;
 tb_not_found:
 	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-				     net, head, port);
+				     net, head, port, l3mdev);
 	if (!tb)
 		goto fail_unlock;
 tb_found:
...
@@ -65,12 +65,14 @@ static u32 sk_ehashfn(const struct sock *sk)
 struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 						 struct net *net,
 						 struct inet_bind_hashbucket *head,
-						 const unsigned short snum)
+						 const unsigned short snum,
+						 int l3mdev)
 {
 	struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
 
 	if (tb) {
 		write_pnet(&tb->ib_net, net);
+		tb->l3mdev    = l3mdev;
 		tb->port      = snum;
 		tb->fastreuse = 0;
 		tb->fastreuseport = 0;
@@ -135,6 +137,7 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 			table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
 	struct inet_bind_bucket *tb;
+	int l3mdev;
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
@@ -143,6 +146,8 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		return -ENOENT;
 	}
 	if (tb->port != port) {
+		l3mdev = inet_sk_bound_l3mdev(sk);
+
 		/* NOTE: using tproxy and redirecting skbs to a proxy
 		 * on a different listener port breaks the assumption
 		 * that the listener socket's icsk_bind_hash is the same
@@ -150,12 +155,13 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		 * create a new bind bucket for the child here. */
 		inet_bind_bucket_for_each(tb, &head->chain) {
 			if (net_eq(ib_net(tb), sock_net(sk)) &&
-			    tb->port == port)
+			    tb->l3mdev == l3mdev && tb->port == port)
 				break;
 		}
 		if (!tb) {
 			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
-						     sock_net(sk), head, port);
+						     sock_net(sk), head, port,
+						     l3mdev);
 			if (!tb) {
 				spin_unlock(&head->lock);
 				return -ENOMEM;
@@ -229,6 +235,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
 {
 	int score = -1;
 	struct inet_sock *inet = inet_sk(sk);
+	bool dev_match;
 
 	if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
 			!ipv6_only_sock(sk)) {
@@ -239,15 +246,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
 				return -1;
 			score += 4;
 		}
-		if (sk->sk_bound_dev_if || exact_dif) {
-			bool dev_match = (sk->sk_bound_dev_if == dif ||
-					  sk->sk_bound_dev_if == sdif);
-
-			if (!dev_match)
-				return -1;
-			if (sk->sk_bound_dev_if)
-				score += 4;
-		}
+		dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+						 dif, sdif);
+		if (!dev_match)
+			return -1;
+		score += 4;
+
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
 			score++;
 	}
@@ -675,6 +679,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	u32 remaining, offset;
 	int ret, i, low, high;
 	static u32 hint;
+	int l3mdev;
 
 	if (port) {
 		head = &hinfo->bhash[inet_bhashfn(net, port,
@@ -693,6 +698,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		return ret;
 	}
 
+	l3mdev = inet_sk_bound_l3mdev(sk);
+
 	inet_get_local_port_range(net, &low, &high);
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	remaining = high - low;
@@ -719,7 +726,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		 * the established check is already unique enough.
 		 */
 		inet_bind_bucket_for_each(tb, &head->chain) {
-			if (net_eq(ib_net(tb), net) && tb->port == port) {
+			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+			    tb->port == port) {
 				if (tb->fastreuse >= 0 ||
 				    tb->fastreuseport >= 0)
 					goto next_port;
@@ -732,7 +740,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	}
 
 	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-				     net, head, port);
+				     net, head, port, l3mdev);
 	if (!tb) {
 		spin_unlock_bh(&head->lock);
 		return -ENOMEM;
...
@@ -131,8 +131,7 @@ struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
 		if (net_eq(sock_net(sk), net) && inet->inet_num == num &&
 		    !(inet->inet_daddr && inet->inet_daddr != raddr) &&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif &&
-		      sk->sk_bound_dev_if != sdif))
+		    raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
 			goto found; /* gotcha */
 	}
 	sk = NULL;
@@ -805,7 +804,7 @@ static int raw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	return copied;
 }
 
-static int raw_init(struct sock *sk)
+static int raw_sk_init(struct sock *sk)
 {
 	struct raw_sock *rp = raw_sk(sk);
@@ -970,7 +969,7 @@ struct proto raw_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = __udp_disconnect,
 	.ioctl		   = raw_ioctl,
-	.init		   = raw_init,
+	.init		   = raw_sk_init,
 	.setsockopt	   = raw_setsockopt,
 	.getsockopt	   = raw_getsockopt,
 	.sendmsg	   = raw_sendmsg,
@@ -1133,4 +1132,28 @@ void __init raw_proc_exit(void)
 {
 	unregister_pernet_subsys(&raw_net_ops);
 }
+
+static void raw_sysctl_init_net(struct net *net)
+{
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	net->ipv4.sysctl_raw_l3mdev_accept = 1;
+#endif
+}
+
+static int __net_init raw_sysctl_init(struct net *net)
+{
+	raw_sysctl_init_net(net);
+	return 0;
+}
+
+static struct pernet_operations __net_initdata raw_sysctl_ops = {
+	.init	= raw_sysctl_init,
+};
+
+void __init raw_init(void)
+{
+	raw_sysctl_init_net(&init_net);
+	if (register_pernet_subsys(&raw_sysctl_ops))
+		panic("RAW: failed to init sysctl parameters.\n");
+}
 #endif /* CONFIG_PROC_FS */
@@ -602,6 +602,17 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_ping_group_range,
 	},
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	{
+		.procname	= "raw_l3mdev_accept",
+		.data		= &init_net.ipv4.sysctl_raw_l3mdev_accept,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{
 		.procname	= "tcp_ecn",
 		.data		= &init_net.ipv4.sysctl_tcp_ecn,
...
@@ -371,6 +371,7 @@ static int compute_score(struct sock *sk, struct net *net,
 {
 	int score;
 	struct inet_sock *inet;
+	bool dev_match;
 
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
@@ -398,15 +399,11 @@ static int compute_score(struct sock *sk, struct net *net,
 		score += 4;
 	}
 
-	if (sk->sk_bound_dev_if || exact_dif) {
-		bool dev_match = (sk->sk_bound_dev_if == dif ||
-				  sk->sk_bound_dev_if == sdif);
-
-		if (!dev_match)
-			return -1;
-		if (sk->sk_bound_dev_if)
-			score += 4;
-	}
+	dev_match = udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+					dif, sdif);
+	if (!dev_match)
+		return -1;
+	score += 4;
 
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
...
@@ -772,6 +772,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 		case IPV6_2292PKTINFO:
 		    {
 			struct net_device *dev = NULL;
+			int src_idx;
 
 			if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct in6_pktinfo))) {
 				err = -EINVAL;
@@ -779,12 +780,15 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 			}
 
 			src_info = (struct in6_pktinfo *)CMSG_DATA(cmsg);
+			src_idx = src_info->ipi6_ifindex;
 
-			if (src_info->ipi6_ifindex) {
+			if (src_idx) {
 				if (fl6->flowi6_oif &&
-				    src_info->ipi6_ifindex != fl6->flowi6_oif)
+				    src_idx != fl6->flowi6_oif &&
+				    (sk->sk_bound_dev_if != fl6->flowi6_oif ||
+				     !sk_dev_equal_l3scope(sk, src_idx)))
 					return -EINVAL;
-				fl6->flowi6_oif = src_info->ipi6_ifindex;
+				fl6->flowi6_oif = src_idx;
 			}
 
 			addr_type = __ipv6_addr_type(&src_info->ipi6_addr);
...
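For reference, the userspace side of this code path is an IPV6_PKTINFO
ancillary message whose ipi6_ifindex is the src_idx checked above. A
sketch, assuming glibc (where struct in6_pktinfo requires _GNU_SOURCE); the
helper name is made up:

#define _GNU_SOURCE		/* struct in6_pktinfo on glibc */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

ssize_t send_with_pktinfo(int fd, const struct sockaddr_in6 *dst,
			  const void *buf, size_t len, int ifindex)
{
	union {			/* aligned cmsg buffer */
		char buf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
		struct cmsghdr align;
	} u = { 0 };
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = {
		.msg_name = (void *)dst, .msg_namelen = sizeof(*dst),
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
	struct in6_pktinfo pi = { .ipi6_ifindex = ifindex };

	cm->cmsg_level = IPPROTO_IPV6;
	cm->cmsg_type = IPV6_PKTINFO;
	cm->cmsg_len = CMSG_LEN(sizeof(pi));
	memcpy(CMSG_DATA(cm), &pi, sizeof(pi));

	/* with the change above, a mismatch against flowi6_oif is
	 * tolerated when the index is within the socket's L3 scope
	 */
	return sendmsg(fd, &msg, 0);
}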
@@ -99,6 +99,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
 				const int dif, const int sdif, bool exact_dif)
 {
 	int score = -1;
+	bool dev_match;
 
 	if (net_eq(sock_net(sk), net) && inet_sk(sk)->inet_num == hnum &&
 	    sk->sk_family == PF_INET6) {
@@ -109,15 +110,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
 				return -1;
 			score++;
 		}
-		if (sk->sk_bound_dev_if || exact_dif) {
-			bool dev_match = (sk->sk_bound_dev_if == dif ||
-					  sk->sk_bound_dev_if == sdif);
-
-			if (!dev_match)
-				return -1;
-			if (sk->sk_bound_dev_if)
-				score++;
-		}
+		dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+						 dif, sdif);
+		if (!dev_match)
+			return -1;
+
+		score++;
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
 			score++;
 	}
...
@@ -359,6 +359,8 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
 			}
 		} else if (ipprot->flags & INET6_PROTO_FINAL) {
 			const struct ipv6hdr *hdr;
+			int sdif = inet6_sdif(skb);
+			struct net_device *dev;
 
 			/* Only do this once for first final protocol */
 			have_final = true;
@@ -371,8 +373,18 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
 			skb_postpull_rcsum(skb, skb_network_header(skb),
 					   skb_network_header_len(skb));
 			hdr = ipv6_hdr(skb);
+
+			/* skb->dev passed may be master dev for vrfs. */
+			if (sdif) {
+				dev = dev_get_by_index_rcu(net, sdif);
+				if (!dev)
+					goto discard;
+			} else {
+				dev = skb->dev;
+			}
+
 			if (ipv6_addr_is_multicast(&hdr->daddr) &&
-			    !ipv6_chk_mcast_addr(skb->dev, &hdr->daddr,
+			    !ipv6_chk_mcast_addr(dev, &hdr->daddr,
 						 &hdr->saddr) &&
 			    !ipv6_is_mld(skb, nexthdr, skb_network_header_len(skb)))
 				goto discard;
@@ -432,15 +444,32 @@ EXPORT_SYMBOL_GPL(ip6_input);
 int ip6_mc_input(struct sk_buff *skb)
 {
+	int sdif = inet6_sdif(skb);
 	const struct ipv6hdr *hdr;
+	struct net_device *dev;
 	bool deliver;
 
 	__IP6_UPD_PO_STATS(dev_net(skb_dst(skb)->dev),
 			 __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INMCAST,
 			 skb->len);
 
+	/* skb->dev passed may be master dev for vrfs. */
+	if (sdif) {
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(dev_net(skb->dev), sdif);
+		if (!dev) {
+			rcu_read_unlock();
+			kfree_skb(skb);
+			return -ENODEV;
+		}
+	} else {
+		dev = skb->dev;
+	}
+
 	hdr = ipv6_hdr(skb);
-	deliver = ipv6_chk_mcast_addr(skb->dev, &hdr->daddr, NULL);
+	deliver = ipv6_chk_mcast_addr(dev, &hdr->daddr, NULL);
+
+	if (sdif)
+		rcu_read_unlock();
 
 #ifdef CONFIG_IPV6_MROUTE
 	/*
...
@@ -486,7 +486,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 				retv = -EFAULT;
 				break;
 			}
-			if (sk->sk_bound_dev_if && pkt.ipi6_ifindex != sk->sk_bound_dev_if)
+			if (!sk_dev_equal_l3scope(sk, pkt.ipi6_ifindex))
 				goto e_inval;
 
 			np->sticky_pktinfo.ipi6_ifindex = pkt.ipi6_ifindex;
...
@@ -86,9 +86,8 @@ struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
 		    !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr))
 			continue;
 
-		if (sk->sk_bound_dev_if &&
-		    sk->sk_bound_dev_if != dif &&
-		    sk->sk_bound_dev_if != sdif)
+		if (!raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+					 dif, sdif))
 			continue;
 
 		if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
...
@@ -117,6 +117,7 @@ static int compute_score(struct sock *sk, struct net *net,
 {
 	int score;
 	struct inet_sock *inet;
+	bool dev_match;
 
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
@@ -144,15 +145,10 @@ static int compute_score(struct sock *sk, struct net *net,
 		score++;
 	}
 
-	if (sk->sk_bound_dev_if || exact_dif) {
-		bool dev_match = (sk->sk_bound_dev_if == dif ||
-				  sk->sk_bound_dev_if == sdif);
-
-		if (!dev_match)
-			return -1;
-		if (sk->sk_bound_dev_if)
-			score++;
-	}
+	dev_match = udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif);
+	if (!dev_match)
+		return -1;
+	score++;
 
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
@@ -641,7 +637,7 @@ static int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 static bool __udp_v6_is_mcast_sock(struct net *net, struct sock *sk,
 				   __be16 loc_port, const struct in6_addr *loc_addr,
 				   __be16 rmt_port, const struct in6_addr *rmt_addr,
-				   int dif, unsigned short hnum)
+				   int dif, int sdif, unsigned short hnum)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -653,7 +649,7 @@ static bool __udp_v6_is_mcast_sock(struct net *net, struct sock *sk,
 	    (inet->inet_dport && inet->inet_dport != rmt_port) ||
 	    (!ipv6_addr_any(&sk->sk_v6_daddr) &&
 	     !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr)) ||
-	    (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif) ||
+	    !udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif) ||
 	    (!ipv6_addr_any(&sk->sk_v6_rcv_saddr) &&
 	     !ipv6_addr_equal(&sk->sk_v6_rcv_saddr, loc_addr)))
 		return false;
@@ -687,6 +683,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	unsigned int offset = offsetof(typeof(*sk), sk_node);
 	unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
 	int dif = inet6_iif(skb);
+	int sdif = inet6_sdif(skb);
 	struct hlist_node *node;
 	struct sk_buff *nskb;
@@ -701,7 +698,8 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	sk_for_each_entry_offset_rcu(sk, node, &hslot->head, offset) {
 		if (!__udp_v6_is_mcast_sock(net, sk, uh->dest, daddr,
-					    uh->source, saddr, dif, hnum))
+					    uh->source, saddr, dif, sdif,
+					    hnum))
 			continue;
 		/* If zero checksum and no_check is not on for
 		 * the socket then skip it.
...