Commit 3ca6c3b4 authored by David S. Miller's avatar David S. Miller

Merge tag 'rxrpc-next-20221108' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

rxrpc changes

David Howells says:

====================
rxrpc: Increasing SACK size and moving away from softirq, part 1

AF_RXRPC has some issues that need addressing:

 (1) The SACK table has a maximum capacity of 255, but for modern networks
     that isn't sufficient.  This is hard to increase in the upstream code
     because of the way the application thread is coupled to the softirq
     and retransmission side through a ring buffer.  Adjustments to the rx
     protocol allows a capacity of up to 8192, and having a ring
     sufficiently large to accommodate that would use an excessive amount
     of memory as this is per-call.

 (2) Processing ACKs in softirq mode causes the ACKs get conflated, with
     only the most recent being considered.  Whilst this has the upside
     that the retransmission algorithm only needs to deal with the most
     recent ACK, it causes DATA transmission for a call to be very bursty
     because DATA packets cannot be transmitted in softirq mode.  Rather
     transmission must be delegated to either the application thread or a
     workqueue, so there tend to be sudden bursts of traffic for any
     particular call due to scheduling delays.

 (3) All crypto in a single call is done in series; however, each DATA
     packet is individually encrypted so encryption and decryption of large
     calls could be parallelised if spare CPU resources are available.

This is the first of a number of sets of patches that try and address them.
The overall aims of these changes include:

 (1) To get rid of the TxRx ring and instead pass the packets round in
     queues (eg. sk_buff_head).  On the Tx side, each ACK packet comes with
     a SACK table that can be parsed as-is, so there's no particular need
     to maintain our own; we just have to refer to the ACK.

     On the Rx side, we do need to maintain a SACK table with one bit per
     entry - but only if packets go missing - and we don't want to have to
     perform a complex transformation to get the information into an ACK
     packet.

 (2) To try and move almost all processing of received packets out of the
     softirq handler and into a high-priority kernel I/O thread.  Only the
     transferral of packets would be left there.  I would still use the
     encap_rcv hook to receive packets as there's a noticeable performance
     drop from letting the UDP socket put the packets into its own queue
     and then getting them out of there.

 (3) To make the I/O thread also do all the transmission.  The app thread
     would be responsible for packaging the data into packets and then
     buffering them for the I/O thread to transmit.  This would make it
     easier for the app thread to run ahead of the I/O thread, and would
     mean the I/O thread is less likely to have to wait around for a new
     packet to come available for transmission.

 (4) To logically partition the socket/UAPI/KAPI side of things from the
     I/O side of things.  The local endpoint, connection, peer and call
     objects would belong to the I/O side.  The socket side would not then
     touch the private internals of calls and suchlike and would not change
     their states.  It would only look at the send queue, receive queue and
     a way to pass a message to cause an abort.

 (5) To remove as much locking, synchronisation, barriering and atomic ops
     as possible from the I/O side.  Exclusion would be achieved by
     limiting modification of state to the I/O thread only.  Locks would
     still need to be used in communication with the UDP socket and the
     AF_RXRPC socket API.

 (6) To provide crypto offload kernel threads that, when there's slack in
     the system, can see packets that need crypting and provide
     parallelisation in dealing with them.

 (7) To remove the use of system timers.  Since each timer would then send
     a poke to the I/O thread, which would then deal with it when it had
     the opportunity, there seems no point in using system timers if,
     instead, a list of timeouts can be sensibly consulted.  An I/O thread
     only then needs to schedule with a timeout when it is idle.

 (8) To use zero-copy sendmsg to send packets.  This would make use of the
     I/O thread being the sole transmitter on the socket to manage the
     dead-reckoning sequencing of the completion notifications.  There is a
     problem with zero-copy, though: the UDP socket doesn't handle running
     out of option memory very gracefully.

With regard to this first patchset, the changes made include:

 (1) Some fixes, including a fallback for proc_create_net_single_write(),
     setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
     congestion management, which shouldn't be saving the cwnd value
     between calls.

 (2) Improvements in rxrpc tracepoints, including splitting the timer
     tracepoint into a set-timer and a timer-expired trace.

 (3) Addition of a new proc file to display some stats.

 (4) Some code cleanups, including removing some unused bits and
     unnecessary header inclusions.

 (5) A change to the recently added UDP encap_err_rcv hook so that it has
     the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
     point its UDP socket's hook directly at those.

 (6) Definition of a new struct, rxrpc_txbuf, that is used to hold
     transmissible packets of DATA and ACK type in a single 2KiB block
     rather than using an sk_buff.  This allows the buffer to be on a
     number of queues simultaneously more easily, and also guarantees that
     the entire block is in a single unit for zerocopy purposes and that
     the data payload is aligned for in-place crypto purposes.

 (7) ACK txbufs are allocated at proposal and queued for later transmission
     rather than being stored in a single place in the rxrpc_call struct,
     which means only a single ACK can be pending transmission at a time.
     The queue is then drained at various points.  This allows the ACK
     generation code to be simplified.

 (8) The Rx ring buffer is removed.  When a jumbo packet is received (which
     comprises a number of ordinary DATA packets glued together), it used
     to be pointed to by the ring multiple times, with an annotation in a
     side ring indicating which subpacket was in that slot - but this is no
     longer possible.  Instead, the packet is cloned once for each
     subpacket, barring the last, and the range of data is set in the skb
     private area.  This makes it easier for the subpackets in a jumbo
     packet to be decrypted in parallel.

 (9) The Tx ring buffer is removed.  The side annotation ring that held the
     SACK information is also removed.  Instead, in the event of packet
     loss, the SACK data attached an ACK packet is parsed.

(10) Allocate an skcipher request when needed in the rxkad security class
     rather than caching one in the rxrpc_call struct.  This deals with a
     race between externally-driven call disconnection getting rid of the
     skcipher request and sendmsg/recvmsg trying to use it because they
     haven't seen the completion yet.  This is also needed to support
     parallelisation as the skcipher request cannot be used by two or more
     threads simultaneously.

(11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
     through kernel_sendmsg() so that we can provide our own iterator
     (zerocopy explicitly doesn't work with a KVEC iterator).  This also
     lets us avoid the overhead of the security hook.
====================
Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents bd039b5e 30d95efe
......@@ -208,8 +208,10 @@ static inline void proc_remove(struct proc_dir_entry *de) {}
static inline int remove_proc_subtree(const char *name, struct proc_dir_entry *parent) { return 0; }
#define proc_create_net_data(name, mode, parent, ops, state_size, data) ({NULL;})
#define proc_create_net_data_write(name, mode, parent, ops, write, state_size, data) ({NULL;})
#define proc_create_net(name, mode, parent, state_size, ops) ({NULL;})
#define proc_create_net_single(name, mode, parent, show, data) ({NULL;})
#define proc_create_net_single_write(name, mode, parent, show, write, data) ({NULL;})
static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
{
......
......@@ -70,7 +70,8 @@ struct udp_sock {
* For encapsulation sockets.
*/
int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
void (*encap_err_rcv)(struct sock *sk, struct sk_buff *skb, unsigned int udp_offset);
void (*encap_err_rcv)(struct sock *sk, struct sk_buff *skb, int err,
__be16 port, u32 info, u8 *payload);
int (*encap_err_lookup)(struct sock *sk, struct sk_buff *skb);
void (*encap_destroy)(struct sock *sk);
......
......@@ -68,8 +68,8 @@ typedef int (*udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
typedef int (*udp_tunnel_encap_err_lookup_t)(struct sock *sk,
struct sk_buff *skb);
typedef void (*udp_tunnel_encap_err_rcv_t)(struct sock *sk,
struct sk_buff *skb,
unsigned int udp_offset);
struct sk_buff *skb, int err,
__be16 port, u32 info, u8 *payload);
typedef void (*udp_tunnel_encap_destroy_t)(struct sock *sk);
typedef struct sk_buff *(*udp_tunnel_gro_receive_t)(struct sock *sk,
struct list_head *head,
......
This diff is collapsed.
......@@ -6435,6 +6435,7 @@ void skb_condense(struct sk_buff *skb)
*/
skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
}
EXPORT_SYMBOL(skb_condense);
#ifdef CONFIG_SKB_EXTENSIONS
static void *skb_ext_get_ptr(struct skb_ext *ext, enum skb_ext_id id)
......
......@@ -433,6 +433,7 @@ void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
}
kfree_skb(skb);
}
EXPORT_SYMBOL_GPL(ip_icmp_error);
void ip_local_error(struct sock *sk, int err, __be32 daddr, __be16 port, u32 info)
{
......
......@@ -784,7 +784,8 @@ int __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
if (tunnel) {
/* ...not for tunnels though: we don't have a sending socket */
if (udp_sk(sk)->encap_err_rcv)
udp_sk(sk)->encap_err_rcv(sk, skb, iph->ihl << 2);
udp_sk(sk)->encap_err_rcv(sk, skb, err, uh->dest, info,
(u8 *)(uh+1));
goto out;
}
if (!inet->recverr) {
......
......@@ -334,6 +334,7 @@ void ipv6_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
if (sock_queue_err_skb(sk, skb))
kfree_skb(skb);
}
EXPORT_SYMBOL_GPL(ipv6_icmp_error);
void ipv6_local_error(struct sock *sk, int err, struct flowi6 *fl6, u32 info)
{
......
......@@ -632,7 +632,8 @@ int __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
/* Tunnels don't have an application socket: don't pass errors back */
if (tunnel) {
if (udp_sk(sk)->encap_err_rcv)
udp_sk(sk)->encap_err_rcv(sk, skb, offset);
udp_sk(sk)->encap_err_rcv(sk, skb, err, uh->dest,
ntohl(info), (u8 *)(uh+1));
goto out;
}
......@@ -1639,6 +1640,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
err = 0;
goto out;
}
EXPORT_SYMBOL(udpv6_sendmsg);
void udpv6_destroy_sock(struct sock *sk)
{
......
......@@ -30,6 +30,7 @@ rxrpc-y := \
sendmsg.o \
server_key.o \
skbuff.o \
txbuf.o \
utils.o
rxrpc-$(CONFIG_PROC_FS) += proc.o
......
......@@ -39,7 +39,7 @@ atomic_t rxrpc_debug_id;
EXPORT_SYMBOL(rxrpc_debug_id);
/* count of skbs currently in use */
atomic_t rxrpc_n_tx_skbs, rxrpc_n_rx_skbs;
atomic_t rxrpc_n_rx_skbs;
struct workqueue_struct *rxrpc_workqueue;
......@@ -979,7 +979,7 @@ static int __init af_rxrpc_init(void)
goto error_call_jar;
}
rxrpc_workqueue = alloc_workqueue("krxrpcd", 0, 1);
rxrpc_workqueue = alloc_workqueue("krxrpcd", WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
if (!rxrpc_workqueue) {
pr_notice("Failed to allocate work queue\n");
goto error_work_queue;
......@@ -1059,7 +1059,6 @@ static void __exit af_rxrpc_exit(void)
sock_unregister(PF_RXRPC);
proto_unregister(&rxrpc_proto);
unregister_pernet_device(&rxrpc_net_ops);
ASSERTCMP(atomic_read(&rxrpc_n_tx_skbs), ==, 0);
ASSERTCMP(atomic_read(&rxrpc_n_rx_skbs), ==, 0);
/* Make sure the local and peer records pinned by any dying connections
......
This diff is collapsed.
......@@ -248,8 +248,7 @@ static void rxrpc_send_ping(struct rxrpc_call *call, struct sk_buff *skb)
if (call->peer->rtt_count < 3 ||
ktime_before(ktime_add_ms(call->peer->rtt_last_req, 1000), now))
rxrpc_propose_ACK(call, RXRPC_ACK_PING, sp->hdr.serial,
true, true,
rxrpc_send_ACK(call, RXRPC_ACK_PING, sp->hdr.serial,
rxrpc_propose_ack_ping_for_params);
}
......@@ -325,7 +324,8 @@ static struct rxrpc_call *rxrpc_alloc_incoming_call(struct rxrpc_sock *rx,
call->security = conn->security;
call->security_ix = conn->security_ix;
call->peer = rxrpc_get_peer(conn->params.peer);
call->cong_cwnd = call->peer->cong_cwnd;
call->cong_ssthresh = call->peer->cong_ssthresh;
call->tx_last_sent = ktime_get_real();
return call;
}
......
This diff is collapsed.
......@@ -52,7 +52,7 @@ static void rxrpc_call_timer_expired(struct timer_list *t)
_enter("%d", call->debug_id);
if (call->state < RXRPC_CALL_COMPLETE) {
trace_rxrpc_timer(call, rxrpc_timer_expired, jiffies);
trace_rxrpc_timer_expired(call, jiffies);
__rxrpc_queue_call(call);
} else {
rxrpc_put_call(call, rxrpc_call_put);
......@@ -129,16 +129,6 @@ struct rxrpc_call *rxrpc_alloc_call(struct rxrpc_sock *rx, gfp_t gfp,
if (!call)
return NULL;
call->rxtx_buffer = kcalloc(RXRPC_RXTX_BUFF_SIZE,
sizeof(struct sk_buff *),
gfp);
if (!call->rxtx_buffer)
goto nomem;
call->rxtx_annotations = kcalloc(RXRPC_RXTX_BUFF_SIZE, sizeof(u8), gfp);
if (!call->rxtx_annotations)
goto nomem_2;
mutex_init(&call->user_mutex);
/* Prevent lockdep reporting a deadlock false positive between the afs
......@@ -155,37 +145,39 @@ struct rxrpc_call *rxrpc_alloc_call(struct rxrpc_sock *rx, gfp_t gfp,
INIT_LIST_HEAD(&call->accept_link);
INIT_LIST_HEAD(&call->recvmsg_link);
INIT_LIST_HEAD(&call->sock_link);
INIT_LIST_HEAD(&call->tx_buffer);
skb_queue_head_init(&call->recvmsg_queue);
skb_queue_head_init(&call->rx_oos_queue);
init_waitqueue_head(&call->waitq);
spin_lock_init(&call->lock);
spin_lock_init(&call->notify_lock);
spin_lock_init(&call->tx_lock);
spin_lock_init(&call->input_lock);
spin_lock_init(&call->acks_ack_lock);
rwlock_init(&call->state_lock);
refcount_set(&call->ref, 1);
call->debug_id = debug_id;
call->tx_total_len = -1;
call->next_rx_timo = 20 * HZ;
call->next_req_timo = 1 * HZ;
atomic64_set(&call->ackr_window, 0x100000001ULL);
memset(&call->sock_node, 0xed, sizeof(call->sock_node));
/* Leave space in the ring to handle a maxed-out jumbo packet */
call->rx_winsize = rxrpc_rx_window_size;
call->tx_winsize = 16;
call->rx_expect_next = 1;
if (RXRPC_TX_SMSS > 2190)
call->cong_cwnd = 2;
call->cong_ssthresh = RXRPC_RXTX_BUFF_SIZE - 1;
else if (RXRPC_TX_SMSS > 1095)
call->cong_cwnd = 3;
else
call->cong_cwnd = 4;
call->cong_ssthresh = RXRPC_TX_MAX_WINDOW;
call->rxnet = rxnet;
call->rtt_avail = RXRPC_CALL_RTT_AVAIL_MASK;
atomic_inc(&rxnet->nr_calls);
return call;
nomem_2:
kfree(call->rxtx_buffer);
nomem:
kmem_cache_free(rxrpc_call_jar, call);
return NULL;
}
/*
......@@ -206,7 +198,6 @@ static struct rxrpc_call *rxrpc_alloc_client_call(struct rxrpc_sock *rx,
return ERR_PTR(-ENOMEM);
call->state = RXRPC_CALL_CLIENT_AWAIT_CONN;
call->service_id = srx->srx_service;
call->tx_phase = true;
now = ktime_get_real();
call->acks_latest_ts = now;
call->cong_tstamp = now;
......@@ -223,7 +214,7 @@ static void rxrpc_start_call_timer(struct rxrpc_call *call)
unsigned long now = jiffies;
unsigned long j = now + MAX_JIFFY_OFFSET;
call->ack_at = j;
call->delay_ack_at = j;
call->ack_lost_at = j;
call->resend_at = j;
call->ping_at = j;
......@@ -510,16 +501,12 @@ void rxrpc_get_call(struct rxrpc_call *call, enum rxrpc_call_trace op)
}
/*
* Clean up the RxTx skb ring.
* Clean up the Rx skb ring.
*/
static void rxrpc_cleanup_ring(struct rxrpc_call *call)
{
int i;
for (i = 0; i < RXRPC_RXTX_BUFF_SIZE; i++) {
rxrpc_free_skb(call->rxtx_buffer[i], rxrpc_skb_cleaned);
call->rxtx_buffer[i] = NULL;
}
skb_queue_purge(&call->recvmsg_queue);
skb_queue_purge(&call->rx_oos_queue);
}
/*
......@@ -539,10 +526,8 @@ void rxrpc_release_call(struct rxrpc_sock *rx, struct rxrpc_call *call)
ASSERTCMP(call->state, ==, RXRPC_CALL_COMPLETE);
spin_lock_bh(&call->lock);
if (test_and_set_bit(RXRPC_CALL_RELEASED, &call->flags))
BUG();
spin_unlock_bh(&call->lock);
rxrpc_put_call_slot(call);
rxrpc_delete_call_timer(call);
......@@ -656,8 +641,6 @@ static void rxrpc_destroy_call(struct work_struct *work)
rxrpc_put_connection(call->conn);
rxrpc_put_peer(call->peer);
kfree(call->rxtx_buffer);
kfree(call->rxtx_annotations);
kmem_cache_free(rxrpc_call_jar, call);
if (atomic_dec_and_test(&rxnet->nr_calls))
wake_up_var(&rxnet->nr_calls);
......@@ -684,6 +667,8 @@ static void rxrpc_rcu_destroy_call(struct rcu_head *rcu)
*/
void rxrpc_cleanup_call(struct rxrpc_call *call)
{
struct rxrpc_txbuf *txb;
_net("DESTROY CALL %d", call->debug_id);
memset(&call->sock_node, 0xcd, sizeof(call->sock_node));
......@@ -692,7 +677,13 @@ void rxrpc_cleanup_call(struct rxrpc_call *call)
ASSERT(test_bit(RXRPC_CALL_RELEASED, &call->flags));
rxrpc_cleanup_ring(call);
rxrpc_free_skb(call->tx_pending, rxrpc_skb_cleaned);
while ((txb = list_first_entry_or_null(&call->tx_buffer,
struct rxrpc_txbuf, call_link))) {
list_del(&txb->call_link);
rxrpc_put_txbuf(txb, rxrpc_txbuf_put_cleaned);
}
rxrpc_put_txbuf(call->tx_pending, rxrpc_txbuf_put_cleaned);
rxrpc_free_skb(call->acks_soft_tbl, rxrpc_skb_cleaned);
call_rcu(&call->rcu, rxrpc_rcu_destroy_call);
}
......
......@@ -363,7 +363,8 @@ static struct rxrpc_bundle *rxrpc_prep_call(struct rxrpc_sock *rx,
if (!cp->peer)
goto error;
call->cong_cwnd = cp->peer->cong_cwnd;
call->tx_last_sent = ktime_get_real();
call->cong_ssthresh = cp->peer->cong_ssthresh;
if (call->cong_cwnd >= call->cong_ssthresh)
call->cong_mode = RXRPC_CALL_CONGEST_AVOIDANCE;
else
......
......@@ -175,7 +175,7 @@ void __rxrpc_disconnect_call(struct rxrpc_connection *conn,
trace_rxrpc_disconnect_call(call);
switch (call->completion) {
case RXRPC_CALL_SUCCEEDED:
chan->last_seq = call->rx_hard_ack;
chan->last_seq = call->rx_highest_seq;
chan->last_type = RXRPC_PACKET_TYPE_ACK;
break;
case RXRPC_CALL_LOCALLY_ABORTED:
......@@ -207,7 +207,7 @@ void rxrpc_disconnect_call(struct rxrpc_call *call)
{
struct rxrpc_connection *conn = call->conn;
call->peer->cong_cwnd = call->cong_cwnd;
call->peer->cong_ssthresh = call->cong_ssthresh;
if (!hlist_unhashed(&call->error_link)) {
spin_lock_bh(&call->peer->lock);
......
This diff is collapsed.
......@@ -25,16 +25,16 @@ static int none_how_much_data(struct rxrpc_call *call, size_t remain,
return 0;
}
static int none_secure_packet(struct rxrpc_call *call, struct sk_buff *skb,
size_t data_size)
static int none_secure_packet(struct rxrpc_call *call, struct rxrpc_txbuf *txb)
{
return 0;
}
static int none_verify_packet(struct rxrpc_call *call, struct sk_buff *skb,
unsigned int offset, unsigned int len,
rxrpc_seq_t seq, u16 expected_cksum)
static int none_verify_packet(struct rxrpc_call *call, struct sk_buff *skb)
{
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
sp->flags |= RXRPC_RX_VERIFIED;
return 0;
}
......@@ -42,11 +42,6 @@ static void none_free_call_crypto(struct rxrpc_call *call)
{
}
static void none_locate_data(struct rxrpc_call *call, struct sk_buff *skb,
unsigned int *_offset, unsigned int *_len)
{
}
static int none_respond_to_challenge(struct rxrpc_connection *conn,
struct sk_buff *skb,
u32 *_abort_code)
......@@ -95,7 +90,6 @@ const struct rxrpc_security rxrpc_no_security = {
.how_much_data = none_how_much_data,
.secure_packet = none_secure_packet,
.verify_packet = none_verify_packet,
.locate_data = none_locate_data,
.respond_to_challenge = none_respond_to_challenge,
.verify_response = none_verify_response,
.clear = none_clear,
......
......@@ -23,6 +23,19 @@
static void rxrpc_local_processor(struct work_struct *);
static void rxrpc_local_rcu(struct rcu_head *);
/*
* Handle an ICMP/ICMP6 error turning up at the tunnel. Push it through the
* usual mechanism so that it gets parsed and presented through the UDP
* socket's error_report().
*/
static void rxrpc_encap_err_rcv(struct sock *sk, struct sk_buff *skb, int err,
__be16 port, u32 info, u8 *payload)
{
if (ip_hdr(skb)->version == IPVERSION)
return ip_icmp_error(sk, skb, err, port, info, payload);
return ipv6_icmp_error(sk, skb, err, port, info, payload);
}
/*
* Compare a local to an address. Return -ve, 0 or +ve to indicate less than,
* same or greater than.
......@@ -84,6 +97,8 @@ static struct rxrpc_local *rxrpc_alloc_local(struct rxrpc_net *rxnet,
local->rxnet = rxnet;
INIT_HLIST_NODE(&local->link);
INIT_WORK(&local->processor, rxrpc_local_processor);
INIT_LIST_HEAD(&local->ack_tx_queue);
spin_lock_init(&local->ack_tx_lock);
init_rwsem(&local->defrag_sem);
skb_queue_head_init(&local->reject_queue);
skb_queue_head_init(&local->event_queue);
......@@ -419,6 +434,11 @@ static void rxrpc_local_processor(struct work_struct *work)
break;
}
if (!list_empty(&local->ack_tx_queue)) {
rxrpc_transmit_ack_packets(local);
again = true;
}
if (!skb_queue_empty(&local->reject_queue)) {
rxrpc_reject_packets(local);
again = true;
......
......@@ -16,12 +16,6 @@
*/
unsigned int rxrpc_max_backlog __read_mostly = 10;
/*
* How long to wait before scheduling ACK generation after seeing a
* packet with RXRPC_REQUEST_ACK set (in jiffies).
*/
unsigned long rxrpc_requested_ack_delay = 1;
/*
* How long to wait before scheduling an ACK with subtype DELAY (in jiffies).
*
......@@ -46,10 +40,7 @@ unsigned long rxrpc_idle_ack_delay = HZ / 2;
* limit is hit, we should generate an EXCEEDS_WINDOW ACK and discard further
* packets.
*/
unsigned int rxrpc_rx_window_size = RXRPC_INIT_RX_WINDOW_SIZE;
#if (RXRPC_RXTX_BUFF_SIZE - 1) < RXRPC_INIT_RX_WINDOW_SIZE
#error Need to reduce RXRPC_INIT_RX_WINDOW_SIZE
#endif
unsigned int rxrpc_rx_window_size = 255;
/*
* Maximum Rx MTU size. This indicates to the sender the size of jumbo packet
......@@ -62,15 +53,3 @@ unsigned int rxrpc_rx_mtu = 5692;
* sender that we're willing to handle.
*/
unsigned int rxrpc_rx_jumbo_max = 4;
const s8 rxrpc_ack_priority[] = {
[0] = 0,
[RXRPC_ACK_DELAY] = 1,
[RXRPC_ACK_REQUESTED] = 2,
[RXRPC_ACK_IDLE] = 3,
[RXRPC_ACK_DUPLICATE] = 4,
[RXRPC_ACK_OUT_OF_SEQUENCE] = 5,
[RXRPC_ACK_EXCEEDS_WINDOW] = 6,
[RXRPC_ACK_NOSPACE] = 7,
[RXRPC_ACK_PING_RESPONSE] = 8,
};
......@@ -101,6 +101,8 @@ static __net_init int rxrpc_init_net(struct net *net)
proc_create_net("locals", 0444, rxnet->proc_net,
&rxrpc_local_seq_ops,
sizeof(struct seq_net_private));
proc_create_net_single_write("stats", S_IFREG | 0644, rxnet->proc_net,
rxrpc_stats_show, rxrpc_stats_clear, NULL);
return 0;
err_proc:
......
This diff is collapsed.
......@@ -16,257 +16,12 @@
#include <net/sock.h>
#include <net/af_rxrpc.h>
#include <net/ip.h>
#include <net/icmp.h>
#include "ar-internal.h"
static void rxrpc_adjust_mtu(struct rxrpc_peer *, unsigned int);
static void rxrpc_store_error(struct rxrpc_peer *, struct sock_exterr_skb *);
static void rxrpc_distribute_error(struct rxrpc_peer *, int,
enum rxrpc_call_completion);
/*
* Find the peer associated with an ICMPv4 packet.
*/
static struct rxrpc_peer *rxrpc_lookup_peer_icmp_rcu(struct rxrpc_local *local,
struct sk_buff *skb,
unsigned int udp_offset,
unsigned int *info,
struct sockaddr_rxrpc *srx)
{
struct iphdr *ip, *ip0 = ip_hdr(skb);
struct icmphdr *icmp = icmp_hdr(skb);
struct udphdr *udp = (struct udphdr *)(skb->data + udp_offset);
_enter("%u,%u,%u", ip0->protocol, icmp->type, icmp->code);
switch (icmp->type) {
case ICMP_DEST_UNREACH:
*info = ntohs(icmp->un.frag.mtu);
fallthrough;
case ICMP_TIME_EXCEEDED:
case ICMP_PARAMETERPROB:
ip = (struct iphdr *)((void *)icmp + 8);
break;
default:
return NULL;
}
memset(srx, 0, sizeof(*srx));
srx->transport_type = local->srx.transport_type;
srx->transport_len = local->srx.transport_len;
srx->transport.family = local->srx.transport.family;
/* Can we see an ICMP4 packet on an ICMP6 listening socket? and vice
* versa?
*/
switch (srx->transport.family) {
case AF_INET:
srx->transport_len = sizeof(srx->transport.sin);
srx->transport.family = AF_INET;
srx->transport.sin.sin_port = udp->dest;
memcpy(&srx->transport.sin.sin_addr, &ip->daddr,
sizeof(struct in_addr));
break;
#ifdef CONFIG_AF_RXRPC_IPV6
case AF_INET6:
srx->transport_len = sizeof(srx->transport.sin);
srx->transport.family = AF_INET;
srx->transport.sin.sin_port = udp->dest;
memcpy(&srx->transport.sin.sin_addr, &ip->daddr,
sizeof(struct in_addr));
break;
#endif
default:
WARN_ON_ONCE(1);
return NULL;
}
_net("ICMP {%pISp}", &srx->transport);
return rxrpc_lookup_peer_rcu(local, srx);
}
#ifdef CONFIG_AF_RXRPC_IPV6
/*
* Find the peer associated with an ICMPv6 packet.
*/
static struct rxrpc_peer *rxrpc_lookup_peer_icmp6_rcu(struct rxrpc_local *local,
struct sk_buff *skb,
unsigned int udp_offset,
unsigned int *info,
struct sockaddr_rxrpc *srx)
{
struct icmp6hdr *icmp = icmp6_hdr(skb);
struct ipv6hdr *ip, *ip0 = ipv6_hdr(skb);
struct udphdr *udp = (struct udphdr *)(skb->data + udp_offset);
_enter("%u,%u,%u", ip0->nexthdr, icmp->icmp6_type, icmp->icmp6_code);
switch (icmp->icmp6_type) {
case ICMPV6_DEST_UNREACH:
*info = ntohl(icmp->icmp6_mtu);
fallthrough;
case ICMPV6_PKT_TOOBIG:
case ICMPV6_TIME_EXCEED:
case ICMPV6_PARAMPROB:
ip = (struct ipv6hdr *)((void *)icmp + 8);
break;
default:
return NULL;
}
memset(srx, 0, sizeof(*srx));
srx->transport_type = local->srx.transport_type;
srx->transport_len = local->srx.transport_len;
srx->transport.family = local->srx.transport.family;
/* Can we see an ICMP4 packet on an ICMP6 listening socket? and vice
* versa?
*/
switch (srx->transport.family) {
case AF_INET:
_net("Rx ICMP6 on v4 sock");
srx->transport_len = sizeof(srx->transport.sin);
srx->transport.family = AF_INET;
srx->transport.sin.sin_port = udp->dest;
memcpy(&srx->transport.sin.sin_addr,
&ip->daddr.s6_addr32[3], sizeof(struct in_addr));
break;
case AF_INET6:
_net("Rx ICMP6");
srx->transport.sin.sin_port = udp->dest;
memcpy(&srx->transport.sin6.sin6_addr, &ip->daddr,
sizeof(struct in6_addr));
break;
default:
WARN_ON_ONCE(1);
return NULL;
}
_net("ICMP {%pISp}", &srx->transport);
return rxrpc_lookup_peer_rcu(local, srx);
}
#endif /* CONFIG_AF_RXRPC_IPV6 */
/*
* Handle an error received on the local endpoint as a tunnel.
*/
void rxrpc_encap_err_rcv(struct sock *sk, struct sk_buff *skb,
unsigned int udp_offset)
{
struct sock_extended_err ee;
struct sockaddr_rxrpc srx;
struct rxrpc_local *local;
struct rxrpc_peer *peer;
unsigned int info = 0;
int err;
u8 version = ip_hdr(skb)->version;
u8 type = icmp_hdr(skb)->type;
u8 code = icmp_hdr(skb)->code;
rcu_read_lock();
local = rcu_dereference_sk_user_data(sk);
if (unlikely(!local)) {
rcu_read_unlock();
return;
}
rxrpc_new_skb(skb, rxrpc_skb_received);
switch (ip_hdr(skb)->version) {
case IPVERSION:
peer = rxrpc_lookup_peer_icmp_rcu(local, skb, udp_offset,
&info, &srx);
break;
#ifdef CONFIG_AF_RXRPC_IPV6
case 6:
peer = rxrpc_lookup_peer_icmp6_rcu(local, skb, udp_offset,
&info, &srx);
break;
#endif
default:
rcu_read_unlock();
return;
}
if (peer && !rxrpc_get_peer_maybe(peer))
peer = NULL;
if (!peer) {
rcu_read_unlock();
return;
}
memset(&ee, 0, sizeof(ee));
switch (version) {
case IPVERSION:
switch (type) {
case ICMP_DEST_UNREACH:
switch (code) {
case ICMP_FRAG_NEEDED:
rxrpc_adjust_mtu(peer, info);
rcu_read_unlock();
rxrpc_put_peer(peer);
return;
default:
break;
}
err = EHOSTUNREACH;
if (code <= NR_ICMP_UNREACH) {
/* Might want to do something different with
* non-fatal errors
*/
//harderr = icmp_err_convert[code].fatal;
err = icmp_err_convert[code].errno;
}
break;
case ICMP_TIME_EXCEEDED:
err = EHOSTUNREACH;
break;
default:
err = EPROTO;
break;
}
ee.ee_origin = SO_EE_ORIGIN_ICMP;
ee.ee_type = type;
ee.ee_code = code;
ee.ee_errno = err;
break;
#ifdef CONFIG_AF_RXRPC_IPV6
case 6:
switch (type) {
case ICMPV6_PKT_TOOBIG:
rxrpc_adjust_mtu(peer, info);
rcu_read_unlock();
rxrpc_put_peer(peer);
return;
}
icmpv6_err_convert(type, code, &err);
if (err == EACCES)
err = EHOSTUNREACH;
ee.ee_origin = SO_EE_ORIGIN_ICMP6;
ee.ee_type = type;
ee.ee_code = code;
ee.ee_errno = err;
break;
#endif
}
trace_rxrpc_rx_icmp(peer, &ee, &srx);
rxrpc_distribute_error(peer, err, RXRPC_CALL_NETWORK_ERROR);
rcu_read_unlock();
rxrpc_put_peer(peer);
}
/*
* Find the peer associated with a local error.
*/
......@@ -283,6 +38,9 @@ static struct rxrpc_peer *rxrpc_lookup_peer_local_rcu(struct rxrpc_local *local,
srx->transport_len = local->srx.transport_len;
srx->transport.family = local->srx.transport.family;
/* Can we see an ICMP4 packet on an ICMP6 listening socket? and vice
* versa?
*/
switch (srx->transport.family) {
case AF_INET:
srx->transport_len = sizeof(srx->transport.sin);
......@@ -412,20 +170,38 @@ void rxrpc_error_report(struct sock *sk)
}
rxrpc_new_skb(skb, rxrpc_skb_received);
serr = SKB_EXT_ERR(skb);
if (!skb->len && serr->ee.ee_origin == SO_EE_ORIGIN_TIMESTAMPING) {
_leave("UDP empty message");
rcu_read_unlock();
rxrpc_free_skb(skb, rxrpc_skb_freed);
return;
}
if (serr->ee.ee_origin == SO_EE_ORIGIN_LOCAL) {
peer = rxrpc_lookup_peer_local_rcu(local, skb, &srx);
if (peer && !rxrpc_get_peer_maybe(peer))
peer = NULL;
if (peer) {
trace_rxrpc_rx_icmp(peer, &serr->ee, &srx);
rxrpc_store_error(peer, serr);
if (!peer) {
rcu_read_unlock();
rxrpc_free_skb(skb, rxrpc_skb_freed);
_leave(" [no peer]");
return;
}
trace_rxrpc_rx_icmp(peer, &serr->ee, &srx);
if ((serr->ee.ee_origin == SO_EE_ORIGIN_ICMP &&
serr->ee.ee_type == ICMP_DEST_UNREACH &&
serr->ee.ee_code == ICMP_FRAG_NEEDED)) {
rxrpc_adjust_mtu(peer, serr->ee.ee_info);
goto out;
}
rxrpc_store_error(peer, serr);
out:
rcu_read_unlock();
rxrpc_free_skb(skb, rxrpc_skb_freed);
rxrpc_put_peer(peer);
_leave("");
}
......
......@@ -227,12 +227,7 @@ struct rxrpc_peer *rxrpc_alloc_peer(struct rxrpc_local *local, gfp_t gfp)
rxrpc_peer_init_rtt(peer);
if (RXRPC_TX_SMSS > 2190)
peer->cong_cwnd = 2;
else if (RXRPC_TX_SMSS > 1095)
peer->cong_cwnd = 3;
else
peer->cong_cwnd = 4;
peer->cong_ssthresh = RXRPC_TX_MAX_WINDOW;
trace_rxrpc_peer(peer->debug_id, rxrpc_peer_new, 1, here);
}
......
......@@ -54,8 +54,9 @@ static int rxrpc_call_seq_show(struct seq_file *seq, void *v)
struct rxrpc_call *call;
struct rxrpc_net *rxnet = rxrpc_net(seq_file_net(seq));
unsigned long timeout = 0;
rxrpc_seq_t tx_hard_ack, rx_hard_ack;
rxrpc_seq_t acks_hard_ack;
char lbuff[50], rbuff[50];
u64 wtmp;
if (v == &rxnet->calls) {
seq_puts(seq,
......@@ -90,8 +91,8 @@ static int rxrpc_call_seq_show(struct seq_file *seq, void *v)
timeout -= jiffies;
}
tx_hard_ack = READ_ONCE(call->tx_hard_ack);
rx_hard_ack = READ_ONCE(call->rx_hard_ack);
acks_hard_ack = READ_ONCE(call->acks_hard_ack);
wtmp = atomic64_read_acquire(&call->ackr_window);
seq_printf(seq,
"UDP %-47.47s %-47.47s %4x %08x %08x %s %3u"
" %-8.8s %08x %08x %08x %02x %08x %02x %08x %06lx\n",
......@@ -105,8 +106,8 @@ static int rxrpc_call_seq_show(struct seq_file *seq, void *v)
rxrpc_call_states[call->state],
call->abort_code,
call->debug_id,
tx_hard_ack, READ_ONCE(call->tx_top) - tx_hard_ack,
rx_hard_ack, READ_ONCE(call->rx_top) - rx_hard_ack,
acks_hard_ack, READ_ONCE(call->tx_top) - acks_hard_ack,
lower_32_bits(wtmp), upper_32_bits(wtmp) - lower_32_bits(wtmp),
call->rx_serial,
timeout);
......@@ -216,7 +217,7 @@ static int rxrpc_peer_seq_show(struct seq_file *seq, void *v)
seq_puts(seq,
"Proto Local "
" Remote "
" Use CW MTU LastUse RTT RTO\n"
" Use SST MTU LastUse RTT RTO\n"
);
return 0;
}
......@@ -234,7 +235,7 @@ static int rxrpc_peer_seq_show(struct seq_file *seq, void *v)
lbuff,
rbuff,
refcount_read(&peer->ref),
peer->cong_cwnd,
peer->cong_ssthresh,
peer->mtu,
now - peer->last_tx_at,
peer->srtt_us >> 3,
......@@ -397,3 +398,98 @@ const struct seq_operations rxrpc_local_seq_ops = {
.stop = rxrpc_local_seq_stop,
.show = rxrpc_local_seq_show,
};
/*
* Display stats in /proc/net/rxrpc/stats
*/
int rxrpc_stats_show(struct seq_file *seq, void *v)
{
struct rxrpc_net *rxnet = rxrpc_net(seq_file_single_net(seq));
seq_printf(seq,
"Data : send=%u sendf=%u\n",
atomic_read(&rxnet->stat_tx_data_send),
atomic_read(&rxnet->stat_tx_data_send_frag));
seq_printf(seq,
"Data-Tx : nr=%u retrans=%u\n",
atomic_read(&rxnet->stat_tx_data),
atomic_read(&rxnet->stat_tx_data_retrans));
seq_printf(seq,
"Data-Rx : nr=%u reqack=%u jumbo=%u\n",
atomic_read(&rxnet->stat_rx_data),
atomic_read(&rxnet->stat_rx_data_reqack),
atomic_read(&rxnet->stat_rx_data_jumbo));
seq_printf(seq,
"Ack : fill=%u send=%u skip=%u\n",
atomic_read(&rxnet->stat_tx_ack_fill),
atomic_read(&rxnet->stat_tx_ack_send),
atomic_read(&rxnet->stat_tx_ack_skip));
seq_printf(seq,
"Ack-Tx : req=%u dup=%u oos=%u exw=%u nos=%u png=%u prs=%u dly=%u idl=%u\n",
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_REQUESTED]),
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_DUPLICATE]),
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_OUT_OF_SEQUENCE]),
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_EXCEEDS_WINDOW]),
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_NOSPACE]),
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_PING]),
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_PING_RESPONSE]),
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_DELAY]),
atomic_read(&rxnet->stat_tx_acks[RXRPC_ACK_IDLE]));
seq_printf(seq,
"Ack-Rx : req=%u dup=%u oos=%u exw=%u nos=%u png=%u prs=%u dly=%u idl=%u\n",
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_REQUESTED]),
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_DUPLICATE]),
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_OUT_OF_SEQUENCE]),
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_EXCEEDS_WINDOW]),
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_NOSPACE]),
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_PING]),
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_PING_RESPONSE]),
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_DELAY]),
atomic_read(&rxnet->stat_rx_acks[RXRPC_ACK_IDLE]));
seq_printf(seq,
"Why-Req-A: acklost=%u already=%u mrtt=%u ortt=%u\n",
atomic_read(&rxnet->stat_why_req_ack[rxrpc_reqack_ack_lost]),
atomic_read(&rxnet->stat_why_req_ack[rxrpc_reqack_already_on]),
atomic_read(&rxnet->stat_why_req_ack[rxrpc_reqack_more_rtt]),
atomic_read(&rxnet->stat_why_req_ack[rxrpc_reqack_old_rtt]));
seq_printf(seq,
"Why-Req-A: nolast=%u retx=%u slows=%u smtxw=%u\n",
atomic_read(&rxnet->stat_why_req_ack[rxrpc_reqack_no_srv_last]),
atomic_read(&rxnet->stat_why_req_ack[rxrpc_reqack_retrans]),
atomic_read(&rxnet->stat_why_req_ack[rxrpc_reqack_slow_start]),
atomic_read(&rxnet->stat_why_req_ack[rxrpc_reqack_small_txwin]));
seq_printf(seq,
"Buffers : txb=%u rxb=%u\n",
atomic_read(&rxrpc_nr_txbuf),
atomic_read(&rxrpc_n_rx_skbs));
return 0;
}
/*
* Clear stats if /proc/net/rxrpc/stats is written to.
*/
int rxrpc_stats_clear(struct file *file, char *buf, size_t size)
{
struct seq_file *m = file->private_data;
struct rxrpc_net *rxnet = rxrpc_net(seq_file_single_net(m));
if (size > 1 || (size == 1 && buf[0] != '\n'))
return -EINVAL;
atomic_set(&rxnet->stat_tx_data, 0);
atomic_set(&rxnet->stat_tx_data_retrans, 0);
atomic_set(&rxnet->stat_tx_data_send, 0);
atomic_set(&rxnet->stat_tx_data_send_frag, 0);
atomic_set(&rxnet->stat_rx_data, 0);
atomic_set(&rxnet->stat_rx_data_reqack, 0);
atomic_set(&rxnet->stat_rx_data_jumbo, 0);
atomic_set(&rxnet->stat_tx_ack_fill, 0);
atomic_set(&rxnet->stat_tx_ack_send, 0);
atomic_set(&rxnet->stat_tx_ack_skip, 0);
memset(&rxnet->stat_tx_acks, 0, sizeof(rxnet->stat_tx_acks));
memset(&rxnet->stat_rx_acks, 0, sizeof(rxnet->stat_rx_acks));
memset(&rxnet->stat_why_req_ack, 0, sizeof(rxnet->stat_why_req_ack));
return size;
}
......@@ -84,7 +84,7 @@ struct rxrpc_jumbo_header {
__be16 _rsvd; /* reserved */
__be16 cksum; /* kerberos security checksum */
};
};
} __packed;
#define RXRPC_JUMBO_DATALEN 1412 /* non-terminal jumbo packet data length */
#define RXRPC_JUMBO_SUBPKTLEN (RXRPC_JUMBO_DATALEN + sizeof(struct rxrpc_jumbo_header))
......@@ -132,13 +132,6 @@ struct rxrpc_ackpacket {
} __packed;
/* Some ACKs refer to specific packets and some are general and can be updated. */
#define RXRPC_ACK_UPDATEABLE ((1 << RXRPC_ACK_REQUESTED) | \
(1 << RXRPC_ACK_PING_RESPONSE) | \
(1 << RXRPC_ACK_DELAY) | \
(1 << RXRPC_ACK_IDLE))
/*
* ACK packets can have a further piece of information tagged on the end
*/
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
......@@ -14,7 +14,7 @@ static struct ctl_table_header *rxrpc_sysctl_reg_table;
static const unsigned int four = 4;
static const unsigned int max_backlog = RXRPC_BACKLOG_MAX - 1;
static const unsigned int n_65535 = 65535;
static const unsigned int n_max_acks = RXRPC_RXTX_BUFF_SIZE - 1;
static const unsigned int n_max_acks = 255;
static const unsigned long one_jiffy = 1;
static const unsigned long max_jiffies = MAX_JIFFY_OFFSET;
......@@ -26,15 +26,6 @@ static const unsigned long max_jiffies = MAX_JIFFY_OFFSET;
*/
static struct ctl_table rxrpc_sysctl_table[] = {
/* Values measured in milliseconds but used in jiffies */
{
.procname = "req_ack_delay",
.data = &rxrpc_requested_ack_delay,
.maxlen = sizeof(unsigned long),
.mode = 0644,
.proc_handler = proc_doulongvec_ms_jiffies_minmax,
.extra1 = (void *)&one_jiffy,
.extra2 = (void *)&max_jiffies,
},
{
.procname = "soft_ack_delay",
.data = &rxrpc_soft_ack_delay,
......
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment