Commit 42bbcb78 authored by David S. Miller

Merge branch 'netlink-mmap'

Patrick McHardy says:

====================
The following patches contain an implementation of memory mapped I/O for
netlink. The implementation is modelled after AF_PACKET memory mapped I/O
with a few differences:

- In order to perform memory mapped I/O to userspace, the kernel allocates
  skbs with the data area pointing to the data area of the mapped frames.
  All netlink subsystems assume a linear data area, so for the sake of
  simplicity, the mapped data area is not attached to the paged area but
  to skb->data. This requires the introduction of a special skb allocation
  function that just allocates an skb head without the data area. Since this
  is a quite rare use case, I introduced a new function based on __alloc_skb
  instead of splitting it up into head and data allocation. The alternative
  would be to introduce an __alloc_skb_head and __alloc_skb_data function,
  which would actually be useful for a specific error case in memory mapped
  netlink, but would require a couple of extra instructions for the common
  skb allocation case, so it doesn't really seem worth it.

  In order to get the destination memory area for skb->data before message
  construction, memory mapped netlink I/O needs to look up the destination
  socket during allocation instead of during transmission, because the
  ring is owned by the receiving socket/process. A special skb allocation
  function (netlink_alloc_skb) taking the destination pid as an argument is
  used for this; all subsystems that want to support memory mapped I/O need
  to use this function, and automatic fallback to the receive queue happens
  for unconverted subsystems. Dumps automatically use memory mapped I/O if
  the receiving socket has enabled it.

  The visible effect of looking up the destination socket during allocation
  instead of transmission is that message ordering in userspace might
  change in case allocation and transmission aren't performed atomically.
  This usually doesn't matter since most subsystems have a BKL-like lock
  like the rtnl mutex; to my knowledge the only currently existing case
  where it might matter is nfnetlink_queue combined with the recently
  introduced batched verdicts, but a) that subsystem already includes
  sequence numbers which allow userspace to reorder messages in case it
  cares to, and the reordering window is quite small, and b) with memory
  mapped transmission batching can be performed in a subsystem independent
  manner.

- AF_NETLINK contains flow control for database dumps; with regular I/O,
  dump continuations are triggered based on the socket's receive queue space
  and by recvmsg() calls. Since with memory mapped I/O there are no
  recvmsg() calls under normal operation, this is done in netlink_poll(),
  under the assumption that userspace has processed all pending frames
  before invoking poll(), so the ring is expected to have room for new
  messages. Dumps currently don't benefit as much as they could from
  memory mapped I/O because each single continuation requires a poll()
  call. A more aggressive approach seems like a good idea to me, especially
  in case the socket is not subscribed to any multicast groups (IOW only
  receiving explicitly requested data). A sketch of the userspace side of
  the ring follows this list.
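
To make the above concrete, here is a minimal sketch of the userspace side:
the ring is configured via the nl_mmap_req structure introduced by these
patches, mapped, and then walked frame by frame, with poll() only used when
the ring is empty. This is a simplified illustration, not the reference code
from the libmnl examples; error handling is omitted and the sizing values
are arbitrary.

#include <linux/netlink.h>
#include <poll.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK	270		/* from <linux/socket.h> */
#endif

static void rx_ring_loop(int fd)
{
	struct nl_mmap_req req = {
		.nm_block_size	= 16 * getpagesize(),
		.nm_block_nr	= 64,
		.nm_frame_size	= 16384,
		/* block_nr * frames per block */
		.nm_frame_nr	= 64 * 16 * getpagesize() / 16384,
	};
	unsigned int frame = 0;
	size_t ring_size = (size_t)req.nm_frame_nr * req.nm_frame_size;
	char *ring, buf[16384];

	setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req));
	ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	for (;;) {
		struct nl_mmap_hdr *hdr;

		hdr = (struct nl_mmap_hdr *)(ring + frame * req.nm_frame_size);
		if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
			/* message starts after the aligned frame header */
			struct nlmsghdr *nlh;

			nlh = (struct nlmsghdr *)((char *)hdr + NL_MMAP_HDRLEN);
			printf("type %u, %u bytes\n", nlh->nlmsg_type, hdr->nm_len);
			hdr->nm_status = NL_MMAP_STATUS_UNUSED;	/* return frame */
			frame = (frame + 1) % req.nm_frame_nr;
		} else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
			/* message didn't fit a frame: fetch it via recv() */
			recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
			hdr->nm_status = NL_MMAP_STATUS_UNUSED;
			frame = (frame + 1) % req.nm_frame_nr;
		} else {
			/* ring empty: poll() also drives dump continuation */
			struct pollfd pfd = { .fd = fd, .events = POLLIN };

			poll(&pfd, 1, -1);
		}
	}
}

The NL_MMAP_STATUS_COPY branch covers messages that were too large for a
ring frame and therefore still have to be fetched with a copy via recv().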

Besides that, the memory mapped netlink implementation extends the states
defined by AF_PACKET between userspace and the kernel by a SKIP status; this
is intended for the case that userspace wants to queue frames (specifically
when using nfnetlink_queue, an IDS and stream reassembly, as requested by
Eric Leblond) for a longer period of time. The kernel skips over all frames
marked with SKIP when looking for unused frames and only fails when it finds
no free frame or when it has skipped the entire ring.
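
From userspace the SKIP state is just a different value written back into
the frame header; a sketch, assuming the nl_mmap_hdr layout added by this
series:

#include <linux/netlink.h>

/* Keep a frame in the ring (e.g. while stream reassembly is pending) by
 * marking it SKIP instead of UNUSED; the kernel steps over it when it
 * hunts for a free frame. */
static void hold_frame(struct nl_mmap_hdr *hdr)
{
	hdr->nm_status = NL_MMAP_STATUS_SKIP;
}

/* Hand the frame back once userspace is done with it. */
static void release_frame(struct nl_mmap_hdr *hdr)
{
	hdr->nm_status = NL_MMAP_STATUS_UNUSED;
}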

Also noteworthy is memory mapped sendmsg: the kernel performs validation
of messages before accepting and processing them. In order to prevent
userspace from changing the message contents after validation, the
kernel checks that the ring is only mapped once and the file descriptor
is not shared (in order to avoid having userspace set up another mapping
after the first mentioned check). If either condition does not hold, the
message is copied to an allocated skb and processed as with regular I/O.
I'd especially appreciate review of this part since I'm not really versed
in memory, file and process management.
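
For reference, transmission through the TX ring roughly looks as follows
from userspace. This is a sketch under the assumption that a frame is
handed over by setting NL_MMAP_STATUS_VALID and that sendmsg() is called
with an empty message to trigger processing (the "sendmsg(, NULL, 0)"
shorthand used further down); frame bookkeeping is simplified.

#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>

static int tx_frame(int fd, char *tx_ring, unsigned int frame_size,
		    unsigned int frame, const struct nlmsghdr *nlh)
{
	struct nl_mmap_hdr *hdr;
	struct msghdr msg = { .msg_iov = NULL, .msg_iovlen = 0 };

	hdr = (struct nl_mmap_hdr *)(tx_ring + frame * frame_size);
	if (hdr->nm_status != NL_MMAP_STATUS_UNUSED)
		return -1;				/* no free frame */

	/* message is written straight into the mapped frame */
	memcpy((char *)hdr + NL_MMAP_HDRLEN, nlh, nlh->nlmsg_len);
	hdr->nm_len = nlh->nlmsg_len;
	hdr->nm_status = NL_MMAP_STATUS_VALID;		/* hand frame to kernel */

	return sendmsg(fd, &msg, 0);			/* payload not copied */
}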

The remaining interesting details are included in the changelogs of the
individual patches and the documentation, so I won't repeat them here.

As an example, nfnetlink_queue is converted to support memory mapped
I/O. Other subsystems that would probably benefit are nfnetlink_log,
audit and maybe iSCSI, not sure.

Following are some numbers collected by Florian Westphal based on a
slightly older version, which included an experimental patch for the
nfnetlink_queue ordering issue.

===

Test hardware is a 12-core machine
Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
ixgbe interfaces are used (i.e., multiqueue nics).
irqs are distributed across the cpus.

I've made several tests.

The simple one consists of 3GBit UDP traffic, packets are 1500 bytes
in size (i.e., no fragmentation), with a single nfqueue
and the test client programs in libmnl examples directory.
Packets are sent from one /24 net to another /24 net, i.e.
there are a few hundred flows active at any given time.

I've also tested with snort, but I disabled all rules.
6Gbit UDP traffic is generated in the snort case, and
6 nfqueues are used (i.e., 6 snorts run in parallel).

I've tested with 3 different kernels, all based on 3.7.1.
- 3.7.1, without the mmap patches
- 3.7.1, with Patrick's mmap patches
- 3.7.1, with mmap patches and extended spinlock to ensure packet ids are
  monotonically increasing and cannot be re-ordered.  This is what we
  currently ship in our product.

  [ the spinlock that is extended is the per nfqueue spinlock, it will
    be held from the time the netlink skb is allocated until the netlink
    skb is sent to userspace:

    http://1984.lsi.us.es/git/nf-next/commit/?h=mmap-netlink3&id=b8eb19c46650fef4e9e4fe53f367f99bbf72afc9
  ]
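
  Schematically (an illustration only, not the exact code behind the link
  above), the extended section spans id assignment, skb construction and
  delivery, so packet ids cannot reach the ring out of order:

    spin_lock_bh(&queue->lock);		/* extended section begins */
    entry->id = ++queue->id_sequence;	/* id fixed under the lock */
    nskb = nfqnl_build_packet_message(queue, entry, &packet_id_ptr);
    if (nskb != NULL)
            nfnetlink_unicast(nskb, &init_net, queue->peer_portid,
                              MSG_DONTWAIT);
    spin_unlock_bh(&queue->lock);	/* ...and ends after delivery */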

snort is normally used in "batch mode", i.e., after processing 25 packets
a single "batch verdict" is sent to accept the packets seen so far.
"mmap snort" means RX_RING + sendmsg(), i.e. TX_RING is not used at this
time (except where noted below).

One reason is that snort has a reload thread, so the kernel needs to copy
anyway; also, in the snort case no payload rewrite takes place, so compared
to the rx path the tx path is cheap.

Results:

3.7.1, without mmap patches, i.e. recv()+sendmsg() for everyone
nfq-queue:           1.7 gbit out
snort-recv-batch-25  5.1 gbit out
snort-recv-no-batch  3.1 gbit out

3.7.1 + mmap, without extended spinlocked section
nfq-queue:           1.7 gbit out (recv/sendmsg)
nfq-queue-mmap:      2.4 gbit out
snort-mmap-batch-25  5.6 gbit out (warning: since ids can be re-ordered,
                                   this version is "broken")
snort-recv-batch-25  5.1 gbit out
snort-mmap-no-batch  4.6 gbit out (i.e., one verdict per packet)

Kernel 3.7.1 + mmap + extended spinlock section:
nfq-queue:           1.4 gbit out
nfq-queue-mmap:      2.3 gbit out
snort:               5.6 gbit out

Conclusions:
- The "extended spinlocked section" hurts performance in the
  single queue case; with 6 snorts there is no measureable slowdown.
- I tried to re-write the mmap-snort to work without batch verdicts, but
  results were not very encouraging:

kernel 3.7.1 + mmap (without extended spinlocked section):

snort-mmap-batch-25      5.6 gbit out (what we currently ship)
snort-recv-batch-25      5.1 gbit out (without using mmap)
snort-mmap-batch-1       4.6 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-25     5.2 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-1      4.6 gbit out (with mmap but without batch verdicts)

The difference between the last two is that in the txring-25 case, we
put a verdict into the tx ring after every packet, but will only
invoke sendmsg(, NULL, 0) after processing 25 packets.  So the only
difference is the number of sendmsg calls/context switches.
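
In code, the txring-25 variant is roughly the following; put_verdict_in_txring()
is a hypothetical helper standing in for the code that fills an
NFQNL_MSG_VERDICT message into the next free TX ring frame and marks it
NL_MMAP_STATUS_VALID:

#include <linux/netfilter.h>
#include <stdint.h>
#include <sys/socket.h>

/* hypothetical helper, see comment above */
extern void put_verdict_in_txring(uint32_t packet_id, int verdict);

static void verdict_packet(int fd, uint32_t packet_id)
{
	static unsigned int pending;
	struct msghdr msg = { .msg_iov = NULL, .msg_iovlen = 0 };

	put_verdict_in_txring(packet_id, NF_ACCEPT);

	if (++pending == 25) {		/* one sendmsg() per 25 packets */
		sendmsg(fd, &msg, 0);
		pending = 0;
	}
}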

So, in other words, kernel 3.7.1 + mmap + the extra locking crap is faster
than 3.7.1 + mmap-without-extra-locking with a single verdict per packet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parents 447b816f 3ab1f683
[diff of one file collapsed in the web view]
@@ -29,10 +29,13 @@ extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
 extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n);
 extern int nfnetlink_has_listeners(struct net *net, unsigned int group);
-extern int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 pid, unsigned int group,
-			  int echo, gfp_t flags);
-extern int nfnetlink_set_err(struct net *net, u32 pid, u32 group, int error);
-extern int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u_int32_t pid, int flags);
+extern struct sk_buff *nfnetlink_alloc_skb(struct net *net, unsigned int size,
+					   u32 dst_portid, gfp_t gfp_mask);
+extern int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 portid,
+			  unsigned int group, int echo, gfp_t flags);
+extern int nfnetlink_set_err(struct net *net, u32 portid, u32 group, int error);
+extern int nfnetlink_unicast(struct sk_buff *skb, struct net *net,
+			     u32 portid, int flags);
 
 extern void nfnl_lock(__u8 subsys_id);
 extern void nfnl_unlock(__u8 subsys_id);
@@ -15,11 +15,18 @@ static inline struct nlmsghdr *nlmsg_hdr(const struct sk_buff *skb)
 	return (struct nlmsghdr *)skb->data;
 }
 
+enum netlink_skb_flags {
+	NETLINK_SKB_MMAPED	= 0x1,	/* Packet data is mmaped */
+	NETLINK_SKB_TX		= 0x2,	/* Packet was sent by userspace */
+	NETLINK_SKB_DELIVERED	= 0x4,	/* Packet was delivered */
+};
+
 struct netlink_skb_parms {
 	struct scm_creds	creds;		/* Skb credentials	*/
 	__u32			portid;
 	__u32			dst_group;
-	struct sock		*ssk;
+	__u32			flags;
+	struct sock		*sk;
 };
 
 #define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))

@@ -57,6 +64,8 @@ extern void __netlink_clear_multicast_users(struct sock *sk, unsigned int group)
 extern void netlink_clear_multicast_users(struct sock *sk, unsigned int group);
 extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
 extern int netlink_has_listeners(struct sock *sk, unsigned int group);
+extern struct sk_buff *netlink_alloc_skb(struct sock *ssk, unsigned int size,
+					 u32 dst_portid, gfp_t gfp_mask);
 extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 portid, int nonblock);
 extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 portid,
 			     __u32 group, gfp_t allocation);
@@ -651,6 +651,12 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
 }
 
+extern struct sk_buff *__alloc_skb_head(gfp_t priority, int node);
+static inline struct sk_buff *alloc_skb_head(gfp_t priority)
+{
+	return __alloc_skb_head(priority, -1);
+}
+
 extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 extern int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 extern struct sk_buff *skb_clone(struct sk_buff *skb,
@@ -184,7 +184,7 @@ extern int nf_conntrack_hash_check_insert(struct nf_conn *ct);
 extern void nf_ct_delete_from_lists(struct nf_conn *ct);
 extern void nf_ct_dying_timeout(struct nf_conn *ct);
 
-extern void nf_conntrack_flush_report(struct net *net, u32 pid, int report);
+extern void nf_conntrack_flush_report(struct net *net, u32 portid, int report);
 
 extern bool nf_ct_get_tuplepr(const struct sk_buff *skb,
 			      unsigned int nhoff, u_int16_t l3num,

@@ -88,7 +88,7 @@ nf_ct_find_expectation(struct net *net, u16 zone,
 			const struct nf_conntrack_tuple *tuple);
 
 void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
-				u32 pid, int report);
+				u32 portid, int report);
 
 static inline void nf_ct_unlink_expect(struct nf_conntrack_expect *exp)
 {
 	nf_ct_unlink_expect_report(exp, 0, 0);

@@ -106,7 +106,7 @@ void nf_ct_expect_init(struct nf_conntrack_expect *, unsigned int, u_int8_t,
 		       u_int8_t, const __be16 *, const __be16 *);
 void nf_ct_expect_put(struct nf_conntrack_expect *exp);
 int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
-				u32 pid, int report);
+				u32 portid, int report);
 
 static inline int nf_ct_expect_related(struct nf_conntrack_expect *expect)
 {
 	return nf_ct_expect_related_report(expect, 0, 0);
 #ifndef _UAPI__LINUX_NETLINK_H
 #define _UAPI__LINUX_NETLINK_H
 
+#include <linux/kernel.h>
 #include <linux/socket.h> /* for __kernel_sa_family_t */
 #include <linux/types.h>

@@ -105,11 +106,42 @@ struct nlmsgerr {
 #define NETLINK_PKTINFO		3
 #define NETLINK_BROADCAST_ERROR	4
 #define NETLINK_NO_ENOBUFS	5
+#define NETLINK_RX_RING		6
+#define NETLINK_TX_RING		7
 
 struct nl_pktinfo {
 	__u32	group;
 };
 
+struct nl_mmap_req {
+	unsigned int	nm_block_size;
+	unsigned int	nm_block_nr;
+	unsigned int	nm_frame_size;
+	unsigned int	nm_frame_nr;
+};
+
+struct nl_mmap_hdr {
+	unsigned int	nm_status;
+	unsigned int	nm_len;
+	__u32		nm_group;
+	/* credentials */
+	__u32		nm_pid;
+	__u32		nm_uid;
+	__u32		nm_gid;
+};
+
+enum nl_mmap_status {
+	NL_MMAP_STATUS_UNUSED,
+	NL_MMAP_STATUS_RESERVED,
+	NL_MMAP_STATUS_VALID,
+	NL_MMAP_STATUS_COPY,
+	NL_MMAP_STATUS_SKIP,
+};
+
+#define NL_MMAP_MSG_ALIGNMENT	NLMSG_ALIGNTO
+#define NL_MMAP_MSG_ALIGN(sz)	__ALIGN_KERNEL(sz, NL_MMAP_MSG_ALIGNMENT)
+#define NL_MMAP_HDRLEN		NL_MMAP_MSG_ALIGN(sizeof(struct nl_mmap_hdr))
+
 #define NET_MAJOR 36		/* Major 36 is reserved for networking */
 
 enum {
@@ -25,9 +25,18 @@ struct netlink_diag_msg {
 	__u32	ndiag_cookie[2];
 };
 
+struct netlink_diag_ring {
+	__u32	ndr_block_size;
+	__u32	ndr_block_nr;
+	__u32	ndr_frame_size;
+	__u32	ndr_frame_nr;
+};
+
 enum {
 	NETLINK_DIAG_MEMINFO,
 	NETLINK_DIAG_GROUPS,
+	NETLINK_DIAG_RX_RING,
+	NETLINK_DIAG_TX_RING,
 
 	__NETLINK_DIAG_MAX,
 };

@@ -38,5 +47,6 @@ enum {
 
 #define NDIAG_SHOW_MEMINFO	0x00000001 /* show memory info of a socket */
 #define NDIAG_SHOW_GROUPS	0x00000002 /* show groups of a netlink socket */
+#define NDIAG_SHOW_RING_CFG	0x00000004 /* show ring configuration */
 
 #endif
@@ -23,6 +23,15 @@ menuconfig NET
 
 if NET
 
+config NETLINK_MMAP
+	bool "Netlink: mmaped IO"
+	help
+	  This option enables support for memory mapped netlink IO. This
+	  reduces overhead by avoiding copying data between kernel- and
+	  userspace.
+
+	  If unsure, say N.
+
 config WANT_COMPAT_NETLINK_MESSAGES
 	bool
 	help
@@ -179,6 +179,33 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
  *
  */
 
+struct sk_buff *__alloc_skb_head(gfp_t gfp_mask, int node)
+{
+	struct sk_buff *skb;
+
+	/* Get the HEAD */
+	skb = kmem_cache_alloc_node(skbuff_head_cache,
+				    gfp_mask & ~__GFP_DMA, node);
+	if (!skb)
+		goto out;
+
+	/*
+	 * Only clear those fields we need to clear, not those that we will
+	 * actually initialise below. Hence, don't put any more fields after
+	 * the tail pointer in struct sk_buff!
+	 */
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->data = NULL;
+	skb->truesize = sizeof(struct sk_buff);
+	atomic_set(&skb->users, 1);
+
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	skb->mac_header = ~0U;
+#endif
+out:
+	return skb;
+}
+
 /**
  *	__alloc_skb	-	allocate a network buffer
  *	@size: size to allocate

@@ -584,7 +611,8 @@ static void skb_release_head_state(struct sk_buff *skb)
 static void skb_release_all(struct sk_buff *skb)
 {
 	skb_release_head_state(skb);
-	skb_release_data(skb);
+	if (likely(skb->data))
+		skb_release_data(skb);
 }
 
 /**
@@ -324,7 +324,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *in_s
 	}
 
 	err = sk_diag_fill(sk, rep, req,
-			   sk_user_ns(NETLINK_CB(in_skb).ssk),
+			   sk_user_ns(NETLINK_CB(in_skb).sk),
 			   NETLINK_CB(in_skb).portid,
 			   nlh->nlmsg_seq, 0, nlh);
 	if (err < 0) {

@@ -630,7 +630,7 @@ static int inet_csk_diag_dump(struct sock *sk,
 		return 0;
 
 	return inet_csk_diag_fill(sk, skb, r,
-				  sk_user_ns(NETLINK_CB(cb->skb).ssk),
+				  sk_user_ns(NETLINK_CB(cb->skb).sk),
 				  NETLINK_CB(cb->skb).portid,
 				  cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh);
 }

@@ -805,7 +805,7 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
 		}
 
 		err = inet_diag_fill_req(skb, sk, req,
-					 sk_user_ns(NETLINK_CB(cb->skb).ssk),
+					 sk_user_ns(NETLINK_CB(cb->skb).sk),
 					 NETLINK_CB(cb->skb).portid,
 					 cb->nlh->nlmsg_seq, cb->nlh);
 		if (err < 0) {

@@ -25,7 +25,7 @@ static int sk_diag_dump(struct sock *sk, struct sk_buff *skb,
 		return 0;
 
 	return inet_sk_diag_fill(sk, NULL, skb, req,
-				 sk_user_ns(NETLINK_CB(cb->skb).ssk),
+				 sk_user_ns(NETLINK_CB(cb->skb).sk),
 				 NETLINK_CB(cb->skb).portid,
 				 cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh);
 }

@@ -71,7 +71,7 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
 		goto out;
 
 	err = inet_sk_diag_fill(sk, NULL, rep, req,
-				sk_user_ns(NETLINK_CB(in_skb).ssk),
+				sk_user_ns(NETLINK_CB(in_skb).sk),
 				NETLINK_CB(in_skb).portid,
 				nlh->nlmsg_seq, 0, nlh);
 	if (err < 0) {
@@ -1260,7 +1260,7 @@ void nf_ct_iterate_cleanup(struct net *net,
 EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup);
 
 struct __nf_ct_flush_report {
-	u32 pid;
+	u32 portid;
 	int report;
 };

@@ -1275,7 +1275,7 @@ static int kill_report(struct nf_conn *i, void *data)
 	/* If we fail to deliver the event, death_by_timeout() will retry */
 	if (nf_conntrack_event_report(IPCT_DESTROY, i,
-				      fr->pid, fr->report) < 0)
+				      fr->portid, fr->report) < 0)
 		return 1;
 
 	/* Avoid the delivery of the destroy event in death_by_timeout(). */

@@ -1298,10 +1298,10 @@ void nf_ct_free_hashtable(void *hash, unsigned int size)
 }
 EXPORT_SYMBOL_GPL(nf_ct_free_hashtable);
 
-void nf_conntrack_flush_report(struct net *net, u32 pid, int report)
+void nf_conntrack_flush_report(struct net *net, u32 portid, int report)
 {
 	struct __nf_ct_flush_report fr = {
-		.pid = pid,
+		.portid = portid,
 		.report = report,
 	};
 	nf_ct_iterate_cleanup(net, kill_report, &fr);

@@ -40,7 +40,7 @@ static struct kmem_cache *nf_ct_expect_cachep __read_mostly;
 
 /* nf_conntrack_expect helper functions */
 void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
-				u32 pid, int report)
+				u32 portid, int report)
 {
 	struct nf_conn_help *master_help = nfct_help(exp->master);
 	struct net *net = nf_ct_exp_net(exp);

@@ -54,7 +54,7 @@ void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
 	hlist_del(&exp->lnode);
 	master_help->expecting[exp->class]--;
 
-	nf_ct_expect_event_report(IPEXP_DESTROY, exp, pid, report);
+	nf_ct_expect_event_report(IPEXP_DESTROY, exp, portid, report);
 	nf_ct_expect_put(exp);
 
 	NF_CT_STAT_INC(net, expect_delete);

@@ -412,7 +412,7 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect)
 }
 
 int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
-				u32 pid, int report)
+				u32 portid, int report)
 {
 	int ret;

@@ -425,7 +425,7 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
 	if (ret < 0)
 		goto out;
 	spin_unlock_bh(&nf_conntrack_lock);
-	nf_ct_expect_event_report(IPEXP_NEW, expect, pid, report);
+	nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report);
 	return ret;
 out:
 	spin_unlock_bh(&nf_conntrack_lock);
@@ -112,22 +112,30 @@ int nfnetlink_has_listeners(struct net *net, unsigned int group)
 }
 EXPORT_SYMBOL_GPL(nfnetlink_has_listeners);
 
-int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 pid,
+struct sk_buff *nfnetlink_alloc_skb(struct net *net, unsigned int size,
+				    u32 dst_portid, gfp_t gfp_mask)
+{
+	return netlink_alloc_skb(net->nfnl, size, dst_portid, gfp_mask);
+}
+EXPORT_SYMBOL_GPL(nfnetlink_alloc_skb);
+
+int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 portid,
 		   unsigned int group, int echo, gfp_t flags)
 {
-	return nlmsg_notify(net->nfnl, skb, pid, group, echo, flags);
+	return nlmsg_notify(net->nfnl, skb, portid, group, echo, flags);
 }
 EXPORT_SYMBOL_GPL(nfnetlink_send);
 
-int nfnetlink_set_err(struct net *net, u32 pid, u32 group, int error)
+int nfnetlink_set_err(struct net *net, u32 portid, u32 group, int error)
 {
-	return netlink_set_err(net->nfnl, pid, group, error);
+	return netlink_set_err(net->nfnl, portid, group, error);
 }
 EXPORT_SYMBOL_GPL(nfnetlink_set_err);
 
-int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u_int32_t pid, int flags)
+int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid,
+		      int flags)
 {
-	return netlink_unicast(net->nfnl, skb, pid, flags);
+	return netlink_unicast(net->nfnl, skb, portid, flags);
 }
 EXPORT_SYMBOL_GPL(nfnetlink_unicast);
@@ -318,7 +318,7 @@ nfulnl_set_flags(struct nfulnl_instance *inst, u_int16_t flags)
 }
 
 static struct sk_buff *
-nfulnl_alloc_skb(unsigned int inst_size, unsigned int pkt_size)
+nfulnl_alloc_skb(u32 peer_portid, unsigned int inst_size, unsigned int pkt_size)
 {
 	struct sk_buff *skb;
 	unsigned int n;

@@ -327,13 +327,14 @@ nfulnl_alloc_skb(unsigned int inst_size, unsigned int pkt_size)
 	 * message.  WARNING: has to be <= 128k due to slab restrictions */
 	n = max(inst_size, pkt_size);
-	skb = alloc_skb(n, GFP_ATOMIC);
+	skb = nfnetlink_alloc_skb(&init_net, n, peer_portid, GFP_ATOMIC);
 	if (!skb) {
 		if (n > pkt_size) {
 			/* try to allocate only as much as we need for current
 			 * packet */
 
-			skb = alloc_skb(pkt_size, GFP_ATOMIC);
+			skb = nfnetlink_alloc_skb(&init_net, pkt_size,
+						  peer_portid, GFP_ATOMIC);
 			if (!skb)
 				pr_err("nfnetlink_log: can't even alloc %u bytes\n",
 				       pkt_size);

@@ -696,7 +697,8 @@ nfulnl_log_packet(u_int8_t pf,
 	}
 
 	if (!inst->skb) {
-		inst->skb = nfulnl_alloc_skb(inst->nlbufsiz, size);
+		inst->skb = nfulnl_alloc_skb(inst->peer_portid, inst->nlbufsiz,
+					     size);
 		if (!inst->skb)
 			goto alloc_failure;
 	}

@@ -824,7 +826,7 @@ nfulnl_recv_config(struct sock *ctnl, struct sk_buff *skb,
 		inst = instance_create(net, group_num,
 				       NETLINK_CB(skb).portid,
-				       sk_user_ns(NETLINK_CB(skb).ssk));
+				       sk_user_ns(NETLINK_CB(skb).sk));
 		if (IS_ERR(inst)) {
 			ret = PTR_ERR(inst);
 			goto out;
@@ -339,7 +339,8 @@ nfqnl_build_packet_message(struct nfqnl_instance *queue,
 	if (queue->flags & NFQA_CFG_F_CONNTRACK)
 		ct = nfqnl_ct_get(entskb, &size, &ctinfo);
 
-	skb = alloc_skb(size, GFP_ATOMIC);
+	skb = nfnetlink_alloc_skb(&init_net, size, queue->peer_portid,
+				  GFP_ATOMIC);
 	if (!skb)
 		return NULL;
[diff of one file collapsed in the web view]
@@ -6,6 +6,20 @@
 #define NLGRPSZ(x)	(ALIGN(x, sizeof(unsigned long) * 8) / 8)
 #define NLGRPLONGS(x)	(NLGRPSZ(x)/sizeof(unsigned long))
 
+struct netlink_ring {
+	void			**pg_vec;
+	unsigned int		head;
+	unsigned int		frames_per_block;
+	unsigned int		frame_size;
+	unsigned int		frame_max;
+
+	unsigned int		pg_vec_order;
+	unsigned int		pg_vec_pages;
+	unsigned int		pg_vec_len;
+
+	atomic_t		pending;
+};
+
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;

@@ -24,6 +38,12 @@ struct netlink_sock {
 	void			(*netlink_rcv)(struct sk_buff *skb);
 	void			(*netlink_bind)(int group);
 	struct module		*module;
+#ifdef CONFIG_NETLINK_MMAP
+	struct mutex		pg_vec_lock;
+	struct netlink_ring	rx_ring;
+	struct netlink_ring	tx_ring;
+	atomic_t		mapped;
+#endif /* CONFIG_NETLINK_MMAP */
 };
 
 static inline struct netlink_sock *nlk_sk(struct sock *sk)
@@ -7,6 +7,34 @@
 
 #include "af_netlink.h"
 
+static int sk_diag_put_ring(struct netlink_ring *ring, int nl_type,
+			    struct sk_buff *nlskb)
+{
+	struct netlink_diag_ring ndr;
+
+	ndr.ndr_block_size = ring->pg_vec_pages << PAGE_SHIFT;
+	ndr.ndr_block_nr   = ring->pg_vec_len;
+	ndr.ndr_frame_size = ring->frame_size;
+	ndr.ndr_frame_nr   = ring->frame_max + 1;
+
+	return nla_put(nlskb, nl_type, sizeof(ndr), &ndr);
+}
+
+static int sk_diag_put_rings_cfg(struct sock *sk, struct sk_buff *nlskb)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+	int ret;
+
+	mutex_lock(&nlk->pg_vec_lock);
+	ret = sk_diag_put_ring(&nlk->rx_ring, NETLINK_DIAG_RX_RING, nlskb);
+	if (!ret)
+		ret = sk_diag_put_ring(&nlk->tx_ring, NETLINK_DIAG_TX_RING,
+				       nlskb);
+	mutex_unlock(&nlk->pg_vec_lock);
+
+	return ret;
+}
+
 static int sk_diag_dump_groups(struct sock *sk, struct sk_buff *nlskb)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);

@@ -51,6 +79,10 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
 	    sock_diag_put_meminfo(sk, skb, NETLINK_DIAG_MEMINFO))
 		goto out_nlmsg_trim;
 
+	if ((req->ndiag_show & NDIAG_SHOW_RING_CFG) &&
+	    sk_diag_put_rings_cfg(sk, skb))
+		goto out_nlmsg_trim;
+
 	return nlmsg_end(skb, nlh);
 
 out_nlmsg_trim:
@@ -393,7 +393,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 			return -EOPNOTSUPP;
 
 		if ((keymask & (FLOW_KEY_SKUID|FLOW_KEY_SKGID)) &&
-		    sk_user_ns(NETLINK_CB(in_skb).ssk) != &init_user_ns)
+		    sk_user_ns(NETLINK_CB(in_skb).sk) != &init_user_ns)
 			return -EOPNOTSUPP;
 	}