Merge branch 'xen-netback-next'

Zoltan Kiss says: ==================== xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy A long known problem of the upstream netback implementation that on the TX path (from guest to Dom0) it copies the whole packet from guest memory into Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a huge perfomance penalty. The classic kernel version of netback used grant mapping, and to get notified when the page can be unmapped, it used page destructors. Unfortunately that destructor is not an upstreamable solution. Ian Campbell's skb fragment destructor patch series [1] tried to solve this problem, however it seems to be very invasive on the network stack's code, and therefore haven't progressed very well. This patch series use SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to know when the skb is freed up. That is the way KVM solved the same problem, and based on my initial tests it can do the same for us. Avoiding the extra copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node, running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb switch) Based on my investigations the packet get only copied if it is delivered to Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit unfortunate, but luckily it doesn't cause a major regression for this usecase. In the future we should try to eliminate that copy somehow. There are a few spinoff tasks which will be addressed in separate patches: - grant copy the header directly instead of map and memcpy. This should help us avoiding TLB flushing - use something else than ballooned pages - fix grant map to use page->index properly I've tried to broke it down to smaller patches, with mixed results, so I welcome suggestions on that part as well: 1: Use skb->cb to store pending_idx 2: Some refactoring 3: Change RX path for mapped SKB fragments (moved here to keep bisectability, review it after #4) 4: Introduce TX grant mapping 5: Remove old TX grant copy definitons and fix indentations 6: Add stat counters for zerocopy 7: Handle guests with too many frags 8: Timeout packets in RX path 9: Aggregate TX unmap operations v2: I've fixed some smaller things, see the individual patches. I've added a few new stat counters, and handling the important use case when an older guest sends lots of slots. Instead of delayed copy now we timeout packets on the RX path, based on the assumption that otherwise packets should get stucked anywhere else. Finally some unmap batching to avoid too much TLB flush v3: Apart from fixing a few things mentioned in responses the important change is the use the hypercall directly for grant [un]mapping, therefore we can avoid m2p override. v4: Now we are using a new grant mapping API to avoid m2p_override. The RX queue timeout logic changed also. v5: Only minor fixes based on Wei's comments v6: Important bugfixes for xenvif_poll exit path and zerocopy callback, see first 2 patches. Also rework of handling packets with too many slots, and reorder the series a bit. v7: Small fixes in comments/log messages/error paths, and merging the frag overflow stats patch into its parent. [1] http://lwn.net/Articles/491522/ [2] https://lkml.org/lkml/2012/7/20/363 ==================== Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'xen-netback-next'
Zoltan Kiss says: ==================== xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy A long known problem of the upstream netback implementation that on the TX path (from guest to Dom0) it copies the whole packet from guest memory into Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a huge perfomance penalty. The classic kernel version of netback used grant mapping, and to get notified when the page can be unmapped, it used page destructors. Unfortunately that destructor is not an upstreamable solution. Ian Campbell's skb fragment destructor patch series [1] tried to solve this problem, however it seems to be very invasive on the network stack's code, and therefore haven't progressed very well. This patch series use SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to know when the skb is freed up. That is the way KVM solved the same problem, and based on my initial tests it can do the same for us. Avoiding the extra copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node, running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb switch) Based on my investigations the packet get only copied if it is delivered to Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit unfortunate, but luckily it doesn't cause a major regression for this usecase. In the future we should try to eliminate that copy somehow. There are a few spinoff tasks which will be addressed in separate patches: - grant copy the header directly instead of map and memcpy. This should help us avoiding TLB flushing - use something else than ballooned pages - fix grant map to use page->index properly I've tried to broke it down to smaller patches, with mixed results, so I welcome suggestions on that part as well: 1: Use skb->cb to store pending_idx 2: Some refactoring 3: Change RX path for mapped SKB fragments (moved here to keep bisectability, review it after #4) 4: Introduce TX grant mapping 5: Remove old TX grant copy definitons and fix indentations 6: Add stat counters for zerocopy 7: Handle guests with too many frags 8: Timeout packets in RX path 9: Aggregate TX unmap operations v2: I've fixed some smaller things, see the individual patches. I've added a few new stat counters, and handling the important use case when an older guest sends lots of slots. Instead of delayed copy now we timeout packets on the RX path, based on the assumption that otherwise packets should get stucked anywhere else. Finally some unmap batching to avoid too much TLB flush v3: Apart from fixing a few things mentioned in responses the important change is the use the hypercall directly for grant [un]mapping, therefore we can avoid m2p override. v4: Now we are using a new grant mapping API to avoid m2p_override. The RX queue timeout logic changed also. v5: Only minor fixes based on Wei's comments v6: Important bugfixes for xenvif_poll exit path and zerocopy callback, see first 2 patches. Also rework of handling packets with too many slots, and reorder the series a bit. v7: Small fixes in comments/log messages/error paths, and merging the frag overflow stats patch into its parent. [1] http://lwn.net/Articles/491522/ [2] https://lkml.org/lkml/2012/7/20/363 ==================== Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
4caeccb4 · David S. Miller · 31c70d59 · e9275f5e · 4caeccb4 · 4caeccb4
Commit 4caeccb4 authored Mar 07, 2014 by David S. Miller
3 changed files
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -48,37 +48,19 @@
 typedef unsigned int pending_ring_idx_t;
 #define INVALID_PENDING_RING_IDX (~0U)
-/* For the head field in pending_tx_info: it is used to indicate
- * whether this tx info is the head of one or more coalesced requests.
- *
- * When head != INVALID_PENDING_RING_IDX, it means the start of a new
- * tx requests queue and the end of previous queue.
- *
- * An example sequence of head fields (I = INVALID_PENDING_RING_IDX):
- *
- * ...|0 I I I|5 I|9 I I I|...
- * -->|<-INUSE----------------
- *
- * After consuming the first slot(s) we have:
- *
- * ...|V V V V|5 I|9 I I I|...
- * -----FREE->|<-INUSE--------
- *
- * where V stands for "valid pending ring index". Any number other
- * than INVALID_PENDING_RING_IDX is OK. These entries are considered
- * free and can contain any number other than
- * INVALID_PENDING_RING_IDX. In practice we use 0.
- *
- * The in use non-INVALID_PENDING_RING_IDX (say 0, 5 and 9 in the
- * above example) number is the index into pending_tx_info and
- * mmap_pages arrays.
- */
 struct pending_tx_info {
-	struct xen_netif_tx_request req; /* coalesced tx request */
+	struct xen_netif_tx_request req; /* tx request */
-	pending_ring_idx_t head; /* head != INVALID_PENDING_RING_IDX
+	/* Callback data for released SKBs. The callback is always
-				  * if it is head of one or more tx
+	 * xenvif_zerocopy_callback, desc contains the pending_idx, which is
-				  * reqs
+	 * also an index in pending_tx_info array. It is initialized in
-				  */
+	 * xenvif_alloc and it never changes.
+	 * skb_shinfo(skb)->destructor_arg points to the first mapped slot's
+	 * callback_struct in this array of struct pending_tx_info's, then ctx
+	 * to the next, or NULL if there is no more slot for this skb.
+	 * ubuf_to_vif is a helper which finds the struct xenvif from a pointer
+	 * to this field.
+	 */
+	struct ubuf_info callback_struct;
 };
 #define XEN_NETIF_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE)
@@ -108,6 +90,15 @@ struct xenvif_rx_meta {
 */
 #define MAX_GRANT_COPY_OPS (MAX_SKB_FRAGS * XEN_NETIF_RX_RING_SIZE)
+#define NETBACK_INVALID_HANDLE -1
+/* To avoid confusion, we define XEN_NETBK_LEGACY_SLOTS_MAX indicating
+ * the maximum slots a valid packet can use. Now this value is defined
+ * to be XEN_NETIF_NR_SLOTS_MIN, which is supposed to be supported by
+ * all backend.
+ */
+#define XEN_NETBK_LEGACY_SLOTS_MAX XEN_NETIF_NR_SLOTS_MIN
 struct xenvif {
 	/* Unique identifier for this interface. */
 	domid_t          domid;
@@ -126,13 +117,28 @@ struct xenvif {
 	pending_ring_idx_t pending_cons;
 	u16 pending_ring[MAX_PENDING_REQS];
 	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
+	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
-	/* Coalescing tx requests before copying makes number of grant
-	 * copy ops greater or equal to number of slots required. In
+	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
-	 * worst case a tx request consumes 2 gnttab_copy.
+	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
+	/* passed to gnttab_[un]map_refs with pages under (un)mapping */
+	struct page *pages_to_map[MAX_PENDING_REQS];
+	struct page *pages_to_unmap[MAX_PENDING_REQS];
+	/* This prevents zerocopy callbacks  to race over dealloc_ring */
+	spinlock_t callback_lock;
+	/* This prevents dealloc thread and NAPI instance to race over response
+	 * creation and pending_ring in xenvif_idx_release. In xenvif_tx_err
+	 * it only protect response creation
 	 */
-	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
+	spinlock_t response_lock;
+	pending_ring_idx_t dealloc_prod;
+	pending_ring_idx_t dealloc_cons;
+	u16 dealloc_ring[MAX_PENDING_REQS];
+	struct task_struct *dealloc_task;
+	wait_queue_head_t dealloc_wq;
+	struct timer_list dealloc_delay;
+	bool dealloc_delay_timed_out;
 	/* Use kthread for guest RX */
 	struct task_struct *task;
@@ -144,6 +150,9 @@ struct xenvif {
 	struct xen_netif_rx_back_ring rx;
 	struct sk_buff_head rx_queue;
 	RING_IDX rx_last_skb_slots;
+	bool rx_queue_purge;
+	struct timer_list wake_queue;
 	/* This array is allocated seperately as it is large */
 	struct gnttab_copy *grant_copy_op;
@@ -175,6 +184,10 @@ struct xenvif {
 	/* Statistics */
 	unsigned long rx_gso_checksum_fixup;
+	unsigned long tx_zerocopy_sent;
+	unsigned long tx_zerocopy_success;
+	unsigned long tx_zerocopy_fail;
+	unsigned long tx_frag_overflow;
 	/* Miscellaneous private stuff. */
 	struct net_device *dev;
@@ -216,9 +229,11 @@ void xenvif_carrier_off(struct xenvif *vif);
 int xenvif_tx_action(struct xenvif *vif, int budget);
-int xenvif_kthread(void *data);
+int xenvif_kthread_guest_rx(void *data);
 void xenvif_kick_thread(struct xenvif *vif);
+int xenvif_dealloc_kthread(void *data);
 /* Determine whether the needed number of slots (req) are available,
 * and set req_event if not.
 */
@@ -226,6 +241,30 @@ bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed);
 void xenvif_stop_queue(struct xenvif *vif);
+/* Callback from stack when TX packet can be released */
+void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success);
+/* Unmap a pending page and release it back to the guest */
+void xenvif_idx_unmap(struct xenvif *vif, u16 pending_idx);
+static inline pending_ring_idx_t nr_pending_reqs(struct xenvif *vif)
+{
+	return MAX_PENDING_REQS -
+		vif->pending_prod + vif->pending_cons;
+}
+static inline bool xenvif_tx_pending_slots_available(struct xenvif *vif)
+{
+	return nr_pending_reqs(vif) + XEN_NETBK_LEGACY_SLOTS_MAX
+		< MAX_PENDING_REQS;
+}
+/* Callback from stack when TX packet can be released */
+void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success);
 extern bool separate_tx_rx_irq;
+extern unsigned int rx_drain_timeout_msecs;
+extern unsigned int rx_drain_timeout_jiffies;
 #endif /* __XEN_NETBACK__COMMON_H__ */
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -38,6 +38,7 @@
 #include <xen/events.h>
 #include <asm/xen/hypercall.h>
+#include <xen/balloon.h>
 #define XENVIF_QUEUE_LENGTH 32
 #define XENVIF_NAPI_WEIGHT  64
@@ -87,7 +88,8 @@ static int xenvif_poll(struct napi_struct *napi, int budget)
 		local_irq_save(flags);
 		RING_FINAL_CHECK_FOR_REQUESTS(&vif->tx, more_to_do);
-		if (!more_to_do)
+		if (!(more_to_do &&
+		      xenvif_tx_pending_slots_available(vif)))
 			__napi_complete(napi);
 		local_irq_restore(flags);
@@ -113,6 +115,18 @@ static irqreturn_t xenvif_interrupt(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
+static void xenvif_wake_queue(unsigned long data)
+{
+	struct xenvif *vif = (struct xenvif *)data;
+	if (netif_queue_stopped(vif->dev)) {
+		netdev_err(vif->dev, "draining TX queue\n");
+		vif->rx_queue_purge = true;
+		xenvif_kick_thread(vif);
+		netif_wake_queue(vif->dev);
+	}
+}
 static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct xenvif *vif = netdev_priv(dev);
@@ -121,7 +135,9 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	BUG_ON(skb->dev != dev);
 	/* Drop the packet if vif is not ready */
-	if (vif->task == NULL || !xenvif_schedulable(vif))
+	if (vif->task == NULL ||
+	    vif->dealloc_task == NULL ||
+	    !xenvif_schedulable(vif))
 		goto drop;
 	/* At best we'll need one slot for the header and one for each
@@ -140,8 +156,13 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * then turn off the queue to give the ring a chance to
 	 * drain.
 	 */
-	if (!xenvif_rx_ring_slots_available(vif, min_slots_needed))
+	if (!xenvif_rx_ring_slots_available(vif, min_slots_needed)) {
+		vif->wake_queue.function = xenvif_wake_queue;
+		vif->wake_queue.data = (unsigned long)vif;
 		xenvif_stop_queue(vif);
+		mod_timer(&vif->wake_queue,
+			jiffies + rx_drain_timeout_jiffies);
+	}
 	skb_queue_tail(&vif->rx_queue, skb);
 	xenvif_kick_thread(vif);
@@ -234,6 +255,28 @@ static const struct xenvif_stat {
 		"rx_gso_checksum_fixup",
 		offsetof(struct xenvif, rx_gso_checksum_fixup)
 	},
+	/* If (sent != success + fail), there are probably packets never
+	 * freed up properly!
+	 */
+	{
+		"tx_zerocopy_sent",
+		offsetof(struct xenvif, tx_zerocopy_sent),
+	},
+	{
+		"tx_zerocopy_success",
+		offsetof(struct xenvif, tx_zerocopy_success),
+	},
+	{
+		"tx_zerocopy_fail",
+		offsetof(struct xenvif, tx_zerocopy_fail)
+	},
+	/* Number of packets exceeding MAX_SKB_FRAG slots. You should use
+	 * a guest with the same MAX_SKB_FRAG
+	 */
+	{
+		"tx_frag_overflow",
+		offsetof(struct xenvif, tx_frag_overflow)
+	},
 };
 static int xenvif_get_sset_count(struct net_device *dev, int string_set)
@@ -327,6 +370,8 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	init_timer(&vif->credit_timeout);
 	vif->credit_window_start = get_jiffies_64();
+	init_timer(&vif->wake_queue);
 	dev->netdev_ops	= &xenvif_netdev_ops;
 	dev->hw_features = NETIF_F_SG |
 		NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
@@ -343,8 +388,27 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	vif->pending_prod = MAX_PENDING_REQS;
 	for (i = 0; i < MAX_PENDING_REQS; i++)
 		vif->pending_ring[i] = i;
-	for (i = 0; i < MAX_PENDING_REQS; i++)
+	spin_lock_init(&vif->callback_lock);
-		vif->mmap_pages[i] = NULL;
+	spin_lock_init(&vif->response_lock);
+	/* If ballooning is disabled, this will consume real memory, so you
+	 * better enable it. The long term solution would be to use just a
+	 * bunch of valid page descriptors, without dependency on ballooning
+	 */
+	err = alloc_xenballooned_pages(MAX_PENDING_REQS,
+				       vif->mmap_pages,
+				       false);
+	if (err) {
+		netdev_err(dev, "Could not reserve mmap_pages\n");
+		return ERR_PTR(-ENOMEM);
+	}
+	for (i = 0; i < MAX_PENDING_REQS; i++) {
+		vif->pending_tx_info[i].callback_struct = (struct ubuf_info)
+			{ .callback = xenvif_zerocopy_callback,
+			  .ctx = NULL,
+			  .desc = i };
+		vif->grant_tx_handle[i] = NETBACK_INVALID_HANDLE;
+	}
+	init_timer(&vif->dealloc_delay);
 	/*
 	 * Initialise a dummy MAC address. We choose the numerically
@@ -382,12 +446,14 @@ int xenvif_connect(struct xenvif *vif, unsigned long tx_ring_ref,
 	BUG_ON(vif->tx_irq);
 	BUG_ON(vif->task);
+	BUG_ON(vif->dealloc_task);
 	err = xenvif_map_frontend_rings(vif, tx_ring_ref, rx_ring_ref);
 	if (err < 0)
 		goto err;
 	init_waitqueue_head(&vif->wq);
+	init_waitqueue_head(&vif->dealloc_wq);
 	if (tx_evtchn == rx_evtchn) {
 		/* feature-split-event-channels == 0 */
@@ -421,8 +487,8 @@ int xenvif_connect(struct xenvif *vif, unsigned long tx_ring_ref,
 		disable_irq(vif->rx_irq);
 	}
-	task = kthread_create(xenvif_kthread,
+	task = kthread_create(xenvif_kthread_guest_rx,
-			      (void *)vif, "%s", vif->dev->name);
+			      (void *)vif, "%s-guest-rx", vif->dev->name);
 	if (IS_ERR(task)) {
 		pr_warn("Could not allocate kthread for %s\n", vif->dev->name);
 		err = PTR_ERR(task);
@@ -431,6 +497,16 @@ int xenvif_connect(struct xenvif *vif, unsigned long tx_ring_ref,
 	vif->task = task;
+	task = kthread_create(xenvif_dealloc_kthread,
+			      (void *)vif, "%s-dealloc", vif->dev->name);
+	if (IS_ERR(task)) {
+		pr_warn("Could not allocate kthread for %s\n", vif->dev->name);
+		err = PTR_ERR(task);
+		goto err_rx_unbind;
+	}
+	vif->dealloc_task = task;
 	rtnl_lock();
 	if (!vif->can_sg && vif->dev->mtu > ETH_DATA_LEN)
 		dev_set_mtu(vif->dev, ETH_DATA_LEN);
@@ -441,6 +517,7 @@ int xenvif_connect(struct xenvif *vif, unsigned long tx_ring_ref,
 	rtnl_unlock();
 	wake_up_process(vif->task);
+	wake_up_process(vif->dealloc_task);
 	return 0;
@@ -474,10 +551,17 @@ void xenvif_disconnect(struct xenvif *vif)
 		xenvif_carrier_off(vif);
 	if (vif->task) {
+		del_timer_sync(&vif->wake_queue);
 		kthread_stop(vif->task);
 		vif->task = NULL;
 	}
+	if (vif->dealloc_task) {
+		del_timer_sync(&vif->dealloc_delay);
+		kthread_stop(vif->dealloc_task);
+		vif->dealloc_task = NULL;
+	}
 	if (vif->tx_irq) {
 		if (vif->tx_irq == vif->rx_irq)
 			unbind_from_irqhandler(vif->tx_irq, vif);
@@ -493,6 +577,36 @@ void xenvif_disconnect(struct xenvif *vif)
 void xenvif_free(struct xenvif *vif)
 {
+	int i, unmap_timeout = 0;
+	/* Here we want to avoid timeout messages if an skb can be legitimatly
+	 * stucked somewhere else. Realisticly this could be an another vif's
+	 * internal or QDisc queue. That another vif also has this
+	 * rx_drain_timeout_msecs timeout, but the timer only ditches the
+	 * internal queue. After that, the QDisc queue can put in worst case
+	 * XEN_NETIF_RX_RING_SIZE / MAX_SKB_FRAGS skbs into that another vif's
+	 * internal queue, so we need several rounds of such timeouts until we
+	 * can be sure that no another vif should have skb's from us. We are
+	 * not sending more skb's, so newly stucked packets are not interesting
+	 * for us here.
+	 */
+	unsigned int worst_case_skb_lifetime = (rx_drain_timeout_msecs/1000) *
+		DIV_ROUND_UP(XENVIF_QUEUE_LENGTH, (XEN_NETIF_RX_RING_SIZE / MAX_SKB_FRAGS));
+	for (i = 0; i < MAX_PENDING_REQS; ++i) {
+		if (vif->grant_tx_handle[i] != NETBACK_INVALID_HANDLE) {
+			unmap_timeout++;
+			schedule_timeout(msecs_to_jiffies(1000));
+			if (unmap_timeout > worst_case_skb_lifetime &&
+			    net_ratelimit())
+				netdev_err(vif->dev,
+					   "Page still granted! Index: %x\n",
+					   i);
+			i = -1;
+		}
+	}
+	free_xenballooned_pages(MAX_PENDING_REQS, vif->mmap_pages);
 	netif_napi_del(&vif->napi);
 	unregister_netdev(vif->dev);

--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -37,6 +37,7 @@
 #include <linux/kthread.h>
 #include <linux/if_vlan.h>
 #include <linux/udp.h>
+#include <linux/highmem.h>
 #include <net/tcp.h>
@@ -54,6 +55,13 @@
 bool separate_tx_rx_irq = 1;
 module_param(separate_tx_rx_irq, bool, 0644);
+/* When guest ring is filled up, qdisc queues the packets for us, but we have
+ * to timeout them, otherwise other guests' packets can get stucked there
+ */
+unsigned int rx_drain_timeout_msecs = 10000;
+module_param(rx_drain_timeout_msecs, uint, 0444);
+unsigned int rx_drain_timeout_jiffies;
 /*
 * This is the maximum slots a skb can have. If a guest sends a skb
 * which exceeds this limit it is considered malicious.
@@ -62,24 +70,6 @@ module_param(separate_tx_rx_irq, bool, 0644);
 static unsigned int fatal_skb_slots = FATAL_SKB_SLOTS_DEFAULT;
 module_param(fatal_skb_slots, uint, 0444);
-/*
- * To avoid confusion, we define XEN_NETBK_LEGACY_SLOTS_MAX indicating
- * the maximum slots a valid packet can use. Now this value is defined
- * to be XEN_NETIF_NR_SLOTS_MIN, which is supposed to be supported by
- * all backend.
- */
-#define XEN_NETBK_LEGACY_SLOTS_MAX XEN_NETIF_NR_SLOTS_MIN
-/*
- * If head != INVALID_PENDING_RING_IDX, it means this tx request is head of
- * one or more merged tx requests, otherwise it is the continuation of
- * previous tx request.
- */
-static inline int pending_tx_is_head(struct xenvif *vif, RING_IDX idx)
-{
-	return vif->pending_tx_info[idx].head != INVALID_PENDING_RING_IDX;
-}
 static void xenvif_idx_release(struct xenvif *vif, u16 pending_idx,
 			       u8 status);
@@ -109,6 +99,18 @@ static inline unsigned long idx_to_kaddr(struct xenvif *vif,
 	return (unsigned long)pfn_to_kaddr(idx_to_pfn(vif, idx));
 }
+/* Find the containing VIF's structure from a pointer in pending_tx_info array
+ */
+static inline struct xenvif* ubuf_to_vif(struct ubuf_info *ubuf)
+{
+	u16 pending_idx = ubuf->desc;
+	struct pending_tx_info *temp =
+		container_of(ubuf, struct pending_tx_info, callback_struct);
+	return container_of(temp - pending_idx,
+			    struct xenvif,
+			    pending_tx_info[0]);
+}
 /* This is a miniumum size for the linear area to avoid lots of
 * calls to __pskb_pull_tail() as we set up checksum offsets. The
 * value 128 was chosen as it covers all IPv4 and most likely
@@ -131,10 +133,9 @@ static inline pending_ring_idx_t pending_index(unsigned i)
 	return i & (MAX_PENDING_REQS-1);
 }
-static inline pending_ring_idx_t nr_pending_reqs(struct xenvif *vif)
+static inline pending_ring_idx_t nr_free_slots(struct xen_netif_tx_back_ring *ring)
 {
-	return MAX_PENDING_REQS -
+	return ring->nr_ents -	(ring->sring->req_prod - ring->rsp_prod_pvt);
-		vif->pending_prod + vif->pending_cons;
 }
 bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
@@ -235,7 +236,9 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif *vif,
 static void xenvif_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 				 struct netrx_pending_operations *npo,
 				 struct page *page, unsigned long size,
-				 unsigned long offset, int *head)
+				 unsigned long offset, int *head,
+				 struct xenvif *foreign_vif,
+				 grant_ref_t foreign_gref)
 {
 	struct gnttab_copy *copy_gop;
 	struct xenvif_rx_meta *meta;
@@ -277,8 +280,15 @@ static void xenvif_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 		copy_gop->flags = GNTCOPY_dest_gref;
 		copy_gop->len = bytes;
-		copy_gop->source.domid = DOMID_SELF;
+		if (foreign_vif) {
-		copy_gop->source.u.gmfn = virt_to_mfn(page_address(page));
+			copy_gop->source.domid = foreign_vif->domid;
+			copy_gop->source.u.ref = foreign_gref;
+			copy_gop->flags |= GNTCOPY_source_gref;
+		} else {
+			copy_gop->source.domid = DOMID_SELF;
+			copy_gop->source.u.gmfn =
+				virt_to_mfn(page_address(page));
+		}
 		copy_gop->source.offset = offset;
 		copy_gop->dest.domid = vif->domid;
@@ -339,6 +349,9 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 	int old_meta_prod;
 	int gso_type;
 	int gso_size;
+	struct ubuf_info *ubuf = skb_shinfo(skb)->destructor_arg;
+	grant_ref_t foreign_grefs[MAX_SKB_FRAGS];
+	struct xenvif *foreign_vif = NULL;
 	old_meta_prod = npo->meta_prod;
@@ -379,6 +392,19 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 	npo->copy_off = 0;
 	npo->copy_gref = req->gref;
+	if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) &&
+		 (ubuf->callback == &xenvif_zerocopy_callback)) {
+		int i = 0;
+		foreign_vif = ubuf_to_vif(ubuf);
+		do {
+			u16 pending_idx = ubuf->desc;
+			foreign_grefs[i++] =
+				foreign_vif->pending_tx_info[pending_idx].req.gref;
+			ubuf = (struct ubuf_info *) ubuf->ctx;
+		} while (ubuf);
+	}
 	data = skb->data;
 	while (data < skb_tail_pointer(skb)) {
 		unsigned int offset = offset_in_page(data);
@@ -388,7 +414,9 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 			len = skb_tail_pointer(skb) - data;
 		xenvif_gop_frag_copy(vif, skb, npo,
-				     virt_to_page(data), len, offset, &head);
+				     virt_to_page(data), len, offset, &head,
+				     NULL,
+				     0);
 		data += len;
 	}
@@ -397,7 +425,9 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 				     skb_frag_page(&skb_shinfo(skb)->frags[i]),
 				     skb_frag_size(&skb_shinfo(skb)->frags[i]),
 				     skb_shinfo(skb)->frags[i].page_offset,
-				     &head);
+				     &head,
+				     foreign_vif,
+				     foreign_grefs[i]);
 	}
 	return npo->meta_prod - old_meta_prod;
@@ -455,10 +485,12 @@ static void xenvif_add_frag_responses(struct xenvif *vif, int status,
 	}
 }
-struct skb_cb_overlay {
+struct xenvif_rx_cb {
 	int meta_slots_used;
 };
+#define XENVIF_RX_CB(skb) ((struct xenvif_rx_cb *)(skb)->cb)
 void xenvif_kick_thread(struct xenvif *vif)
 {
 	wake_up(&vif->wq);
@@ -474,7 +506,6 @@ static void xenvif_rx_action(struct xenvif *vif)
 	LIST_HEAD(notify);
 	int ret;
 	unsigned long offset;
-	struct skb_cb_overlay *sco;
 	bool need_to_notify = false;
 	struct netrx_pending_operations npo = {
@@ -513,9 +544,8 @@ static void xenvif_rx_action(struct xenvif *vif)
 		} else
 			vif->rx_last_skb_slots = 0;
-		sco = (struct skb_cb_overlay *)skb->cb;
+		XENVIF_RX_CB(skb)->meta_slots_used = xenvif_gop_skb(skb, &npo);
-		sco->meta_slots_used = xenvif_gop_skb(skb, &npo);
+		BUG_ON(XENVIF_RX_CB(skb)->meta_slots_used > max_slots_needed);
-		BUG_ON(sco->meta_slots_used > max_slots_needed);
 		__skb_queue_tail(&rxq, skb);
 	}
@@ -529,7 +559,6 @@ static void xenvif_rx_action(struct xenvif *vif)
 	gnttab_batch_copy(vif->grant_copy_op, npo.copy_prod);
 	while ((skb = __skb_dequeue(&rxq)) != NULL) {
-		sco = (struct skb_cb_overlay *)skb->cb;
 		if ((1 << vif->meta[npo.meta_cons].gso_type) &
 		    vif->gso_prefix_mask) {
@@ -540,19 +569,21 @@ static void xenvif_rx_action(struct xenvif *vif)
 			resp->offset = vif->meta[npo.meta_cons].gso_size;
 			resp->id = vif->meta[npo.meta_cons].id;
-			resp->status = sco->meta_slots_used;
+			resp->status = XENVIF_RX_CB(skb)->meta_slots_used;
 			npo.meta_cons++;
-			sco->meta_slots_used--;
+			XENVIF_RX_CB(skb)->meta_slots_used--;
 		}
 		vif->dev->stats.tx_bytes += skb->len;
 		vif->dev->stats.tx_packets++;
-		status = xenvif_check_gop(vif, sco->meta_slots_used, &npo);
+		status = xenvif_check_gop(vif,
+					  XENVIF_RX_CB(skb)->meta_slots_used,
+					  &npo);
-		if (sco->meta_slots_used == 1)
+		if (XENVIF_RX_CB(skb)->meta_slots_used == 1)
 			flags = 0;
 		else
 			flags = XEN_NETRXF_more_data;
@@ -589,13 +620,13 @@ static void xenvif_rx_action(struct xenvif *vif)
 		xenvif_add_frag_responses(vif, status,
 					  vif->meta + npo.meta_cons + 1,
-					  sco->meta_slots_used);
+					  XENVIF_RX_CB(skb)->meta_slots_used);
 		RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&vif->rx, ret);
 		need_to_notify |= !!ret;
-		npo.meta_cons += sco->meta_slots_used;
+		npo.meta_cons += XENVIF_RX_CB(skb)->meta_slots_used;
 		dev_kfree_skb(skb);
 	}
@@ -645,9 +676,12 @@ static void xenvif_tx_err(struct xenvif *vif,
 			  struct xen_netif_tx_request *txp, RING_IDX end)
 {
 	RING_IDX cons = vif->tx.req_cons;
+	unsigned long flags;
 	do {
+		spin_lock_irqsave(&vif->response_lock, flags);
 		make_tx_response(vif, txp, XEN_NETIF_RSP_ERROR);
+		spin_unlock_irqrestore(&vif->response_lock, flags);
 		if (cons == end)
 			break;
 		txp = RING_GET_REQUEST(&vif->tx, cons++);
@@ -759,180 +793,168 @@ static int xenvif_count_requests(struct xenvif *vif,
 	return slots;
 }
-static struct page *xenvif_alloc_page(struct xenvif *vif,
-				      u16 pending_idx)
+struct xenvif_tx_cb {
+	u16 pending_idx;
+};
+#define XENVIF_TX_CB(skb) ((struct xenvif_tx_cb *)(skb)->cb)
+static inline void xenvif_tx_create_gop(struct xenvif *vif,
+					u16 pending_idx,
+					struct xen_netif_tx_request *txp,
+					struct gnttab_map_grant_ref *gop)
 {
-	struct page *page;
+	vif->pages_to_map[gop-vif->tx_map_ops] = vif->mmap_pages[pending_idx];
+	gnttab_set_map_op(gop, idx_to_kaddr(vif, pending_idx),
+			  GNTMAP_host_map | GNTMAP_readonly,
+			  txp->gref, vif->domid);
+	memcpy(&vif->pending_tx_info[pending_idx].req, txp,
+	       sizeof(*txp));
+}
-	page = alloc_page(GFP_ATOMIC|__GFP_COLD);
+static inline struct sk_buff *xenvif_alloc_skb(unsigned int size)
-	if (!page)
+{
+	struct sk_buff *skb =
+		alloc_skb(size + NET_SKB_PAD + NET_IP_ALIGN,
+			  GFP_ATOMIC | __GFP_NOWARN);
+	if (unlikely(skb == NULL))
 		return NULL;
-	vif->mmap_pages[pending_idx] = page;
-	return page;
+	/* Packets passed to netif_rx() must have some headroom. */
+	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
+	/* Initialize it here to avoid later surprises */
+	skb_shinfo(skb)->destructor_arg = NULL;
+	return skb;
 }
-static struct gnttab_copy *xenvif_get_requests(struct xenvif *vif,
+static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif *vif,
-					       struct sk_buff *skb,
+							struct sk_buff *skb,
-					       struct xen_netif_tx_request *txp,
+							struct xen_netif_tx_request *txp,
-					       struct gnttab_copy *gop)
+							struct gnttab_map_grant_ref *gop)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	skb_frag_t *frags = shinfo->frags;
-	u16 pending_idx = *((u16 *)skb->data);
+	u16 pending_idx = XENVIF_TX_CB(skb)->pending_idx;
-	u16 head_idx = 0;
+	int start;
-	int slot, start;
+	pending_ring_idx_t index;
-	struct page *page;
+	unsigned int nr_slots, frag_overflow = 0;
-	pending_ring_idx_t index, start_idx = 0;
-	uint16_t dst_offset;
-	unsigned int nr_slots;
-	struct pending_tx_info *first = NULL;
 	/* At this point shinfo->nr_frags is in fact the number of
 	 * slots, which can be as large as XEN_NETBK_LEGACY_SLOTS_MAX.
 	 */
+	if (shinfo->nr_frags > MAX_SKB_FRAGS) {
+		frag_overflow = shinfo->nr_frags - MAX_SKB_FRAGS;
+		BUG_ON(frag_overflow > MAX_SKB_FRAGS);
+		shinfo->nr_frags = MAX_SKB_FRAGS;
+	}
 	nr_slots = shinfo->nr_frags;
 	/* Skip first skb fragment if it is on same page as header fragment. */
 	start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
-	/* Coalesce tx requests, at this point the packet passed in
+	for (shinfo->nr_frags = start; shinfo->nr_frags < nr_slots;
-	 * should be <= 64K. Any packets larger than 64K have been
+	     shinfo->nr_frags++, txp++, gop++) {
-	 * handled in xenvif_count_requests().
+		index = pending_index(vif->pending_cons++);
-	 */
+		pending_idx = vif->pending_ring[index];
-	for (shinfo->nr_frags = slot = start; slot < nr_slots;
+		xenvif_tx_create_gop(vif, pending_idx, txp, gop);
-	     shinfo->nr_frags++) {
+		frag_set_pending_idx(&frags[shinfo->nr_frags], pending_idx);
-		struct pending_tx_info *pending_tx_info =
+	}
-			vif->pending_tx_info;
-		page = alloc_page(GFP_ATOMIC|__GFP_COLD);
+	if (frag_overflow) {
-		if (!page)
+		struct sk_buff *nskb = xenvif_alloc_skb(0);
-			goto err;
+		if (unlikely(nskb == NULL)) {
+			if (net_ratelimit())
-		dst_offset = 0;
+				netdev_err(vif->dev,
-		first = NULL;
+					   "Can't allocate the frag_list skb.\n");
-		while (dst_offset < PAGE_SIZE && slot < nr_slots) {
+			return NULL;
-			gop->flags = GNTCOPY_source_gref;
+		}
-			gop->source.u.ref = txp->gref;
+		shinfo = skb_shinfo(nskb);
-			gop->source.domid = vif->domid;
+		frags = shinfo->frags;
-			gop->source.offset = txp->offset;
-			gop->dest.domid = DOMID_SELF;
-			gop->dest.offset = dst_offset;
-			gop->dest.u.gmfn = virt_to_mfn(page_address(page));
-			if (dst_offset + txp->size > PAGE_SIZE) {
-				/* This page can only merge a portion
-				 * of tx request. Do not increment any
-				 * pointer / counter here. The txp
-				 * will be dealt with in future
-				 * rounds, eventually hitting the
-				 * `else` branch.
-				 */
-				gop->len = PAGE_SIZE - dst_offset;
-				txp->offset += gop->len;
-				txp->size -= gop->len;
-				dst_offset += gop->len; /* quit loop */
-			} else {
-				/* This tx request can be merged in the page */
-				gop->len = txp->size;
-				dst_offset += gop->len;
-				index = pending_index(vif->pending_cons++);
-				pending_idx = vif->pending_ring[index];
-				memcpy(&pending_tx_info[pending_idx].req, txp,
-				       sizeof(*txp));
-				/* Poison these fields, corresponding
-				 * fields for head tx req will be set
-				 * to correct values after the loop.
-				 */
-				vif->mmap_pages[pending_idx] = (void *)(~0UL);
-				pending_tx_info[pending_idx].head =
-					INVALID_PENDING_RING_IDX;
-				if (!first) {
-					first = &pending_tx_info[pending_idx];
-					start_idx = index;
-					head_idx = pending_idx;
-				}
-				txp++;
-				slot++;
-			}
-			gop++;
+		for (shinfo->nr_frags = 0; shinfo->nr_frags < frag_overflow;
+		     shinfo->nr_frags++, txp++, gop++) {
+			index = pending_index(vif->pending_cons++);
+			pending_idx = vif->pending_ring[index];
+			xenvif_tx_create_gop(vif, pending_idx, txp, gop);
+			frag_set_pending_idx(&frags[shinfo->nr_frags],
+					     pending_idx);
 		}
-		first->req.offset = 0;
+		skb_shinfo(skb)->frag_list = nskb;
-		first->req.size = dst_offset;
-		first->head = start_idx;
-		vif->mmap_pages[head_idx] = page;
-		frag_set_pending_idx(&frags[shinfo->nr_frags], head_idx);
 	}
-	BUG_ON(shinfo->nr_frags > MAX_SKB_FRAGS);
 	return gop;
-err:
+}
-	/* Unwind, freeing all pages and sending error responses. */
-	while (shinfo->nr_frags-- > start) {
+static inline void xenvif_grant_handle_set(struct xenvif *vif,
-		xenvif_idx_release(vif,
+					   u16 pending_idx,
-				frag_get_pending_idx(&frags[shinfo->nr_frags]),
+					   grant_handle_t handle)
-				XEN_NETIF_RSP_ERROR);
+{
+	if (unlikely(vif->grant_tx_handle[pending_idx] !=
+		     NETBACK_INVALID_HANDLE)) {
+		netdev_err(vif->dev,
+			   "Trying to overwrite active handle! pending_idx: %x\n",
+			   pending_idx);
+		BUG();
 	}
-	/* The head too, if necessary. */
+	vif->grant_tx_handle[pending_idx] = handle;
-	if (start)
+}
-		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_ERROR);
-	return NULL;
+static inline void xenvif_grant_handle_reset(struct xenvif *vif,
+					     u16 pending_idx)
+{
+	if (unlikely(vif->grant_tx_handle[pending_idx] ==
+		     NETBACK_INVALID_HANDLE)) {
+		netdev_err(vif->dev,
+			   "Trying to unmap invalid handle! pending_idx: %x\n",
+			   pending_idx);
+		BUG();
+	}
+	vif->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
 }
 static int xenvif_tx_check_gop(struct xenvif *vif,
 			       struct sk_buff *skb,
-			       struct gnttab_copy **gopp)
+			       struct gnttab_map_grant_ref **gopp)
 {
-	struct gnttab_copy *gop = *gopp;
+	struct gnttab_map_grant_ref *gop = *gopp;
-	u16 pending_idx = *((u16 *)skb->data);
+	u16 pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct pending_tx_info *tx_info;
 	int nr_frags = shinfo->nr_frags;
 	int i, err, start;
-	u16 peek; /* peek into next tx request */
+	struct sk_buff *first_skb = NULL;
 	/* Check status of header. */
 	err = gop->status;
 	if (unlikely(err))
 		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_ERROR);
+	else
+		xenvif_grant_handle_set(vif, pending_idx , gop->handle);
 	/* Skip first skb fragment if it is on same page as header fragment. */
 	start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
+check_frags:
 	for (i = start; i < nr_frags; i++) {
 		int j, newerr;
-		pending_ring_idx_t head;
 		pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
 		tx_info = &vif->pending_tx_info[pending_idx];
-		head = tx_info->head;
 		/* Check error status: if okay then remember grant handle. */
-		do {
+		newerr = (++gop)->status;
-			newerr = (++gop)->status;
-			if (newerr)
-				break;
-			peek = vif->pending_ring[pending_index(++head)];
-		} while (!pending_tx_is_head(vif, peek));
 		if (likely(!newerr)) {
+			xenvif_grant_handle_set(vif, pending_idx , gop->handle);
 			/* Had a previous error? Invalidate this fragment. */
 			if (unlikely(err))
-				xenvif_idx_release(vif, pending_idx,
+				xenvif_idx_unmap(vif, pending_idx);
-						   XEN_NETIF_RSP_OKAY);
 			continue;
 		}
@@ -942,20 +964,45 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 		/* Not the first error? Preceding frags already invalidated. */
 		if (err)
 			continue;
 		/* First error: invalidate header and preceding fragments. */
-		pending_idx = *((u16 *)skb->data);
+		if (!first_skb)
-		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
+			pending_idx = XENVIF_TX_CB(skb)->pending_idx;
+		else
+			pending_idx = XENVIF_TX_CB(skb)->pending_idx;
+		xenvif_idx_unmap(vif, pending_idx);
 		for (j = start; j < i; j++) {
 			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
-			xenvif_idx_release(vif, pending_idx,
+			xenvif_idx_unmap(vif, pending_idx);
-					   XEN_NETIF_RSP_OKAY);
 		}
 		/* Remember the error: invalidate all subsequent fragments. */
 		err = newerr;
 	}
+	if (skb_has_frag_list(skb)) {
+		first_skb = skb;
+		skb = shinfo->frag_list;
+		shinfo = skb_shinfo(skb);
+		nr_frags = shinfo->nr_frags;
+		start = 0;
+		goto check_frags;
+	}
+	/* There was a mapping error in the frag_list skb. We have to unmap
+	 * the first skb's frags
+	 */
+	if (first_skb && err) {
+		int j;
+		shinfo = skb_shinfo(first_skb);
+		pending_idx = XENVIF_TX_CB(skb)->pending_idx;
+		start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
+		for (j = start; j < shinfo->nr_frags; j++) {
+			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
+			xenvif_idx_unmap(vif, pending_idx);
+		}
+	}
 	*gopp = gop + 1;
 	return err;
 }
@@ -965,6 +1012,10 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	int nr_frags = shinfo->nr_frags;
 	int i;
+	u16 prev_pending_idx = INVALID_PENDING_IDX;
+	if (skb_shinfo(skb)->destructor_arg)
+		prev_pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 	for (i = 0; i < nr_frags; i++) {
 		skb_frag_t *frag = shinfo->frags + i;
@@ -974,6 +1025,17 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
 		pending_idx = frag_get_pending_idx(frag);
+		/* If this is not the first frag, chain it to the previous*/
+		if (unlikely(prev_pending_idx == INVALID_PENDING_IDX))
+			skb_shinfo(skb)->destructor_arg =
+				&vif->pending_tx_info[pending_idx].callback_struct;
+		else if (likely(pending_idx != prev_pending_idx))
+			vif->pending_tx_info[prev_pending_idx].callback_struct.ctx =
+				&(vif->pending_tx_info[pending_idx].callback_struct);
+		vif->pending_tx_info[pending_idx].callback_struct.ctx = NULL;
+		prev_pending_idx = pending_idx;
 		txp = &vif->pending_tx_info[pending_idx].req;
 		page = virt_to_page(idx_to_kaddr(vif, pending_idx));
 		__skb_fill_page_desc(skb, i, page, txp->offset, txp->size);
@@ -981,10 +1043,15 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
 		skb->data_len += txp->size;
 		skb->truesize += txp->size;
-		/* Take an extra reference to offset xenvif_idx_release */
+		/* Take an extra reference to offset network stack's put_page */
 		get_page(vif->mmap_pages[pending_idx]);
-		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
 	}
+	/* FIXME: __skb_fill_page_desc set this to true because page->pfmemalloc
+	 * overlaps with "index", and "mapping" is not set. I think mapping
+	 * should be set. If delivered to local stack, it would drop this
+	 * skb in sk_filter unless the socket has the right to use it.
+	 */
+	skb->pfmemalloc	= false;
 }
 static int xenvif_get_extras(struct xenvif *vif,
@@ -1104,16 +1171,14 @@ static bool tx_credit_exceeded(struct xenvif *vif, unsigned size)
 static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 {
-	struct gnttab_copy *gop = vif->tx_copy_ops, *request_gop;
+	struct gnttab_map_grant_ref *gop = vif->tx_map_ops, *request_gop;
 	struct sk_buff *skb;
 	int ret;
-	while ((nr_pending_reqs(vif) + XEN_NETBK_LEGACY_SLOTS_MAX
+	while (xenvif_tx_pending_slots_available(vif) &&
-		< MAX_PENDING_REQS) &&
 	       (skb_queue_len(&vif->tx_queue) < budget)) {
 		struct xen_netif_tx_request txreq;
 		struct xen_netif_tx_request txfrags[XEN_NETBK_LEGACY_SLOTS_MAX];
-		struct page *page;
 		struct xen_netif_extra_info extras[XEN_NETIF_EXTRA_TYPE_MAX-1];
 		u16 pending_idx;
 		RING_IDX idx;
@@ -1189,8 +1254,7 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 			    ret < XEN_NETBK_LEGACY_SLOTS_MAX) ?
 			PKT_PROT_LEN : txreq.size;
-		skb = alloc_skb(data_len + NET_SKB_PAD + NET_IP_ALIGN,
+		skb = xenvif_alloc_skb(data_len);
-				GFP_ATOMIC | __GFP_NOWARN);
 		if (unlikely(skb == NULL)) {
 			netdev_dbg(vif->dev,
 				   "Can't allocate a skb in start_xmit.\n");
@@ -1198,9 +1262,6 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 			break;
 		}
-		/* Packets passed to netif_rx() must have some headroom. */
-		skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
 		if (extras[XEN_NETIF_EXTRA_TYPE_GSO - 1].type) {
 			struct xen_netif_extra_info *gso;
 			gso = &extras[XEN_NETIF_EXTRA_TYPE_GSO - 1];
@@ -1212,31 +1273,11 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 			}
 		}
-		/* XXX could copy straight to head */
+		xenvif_tx_create_gop(vif, pending_idx, &txreq, gop);
-		page = xenvif_alloc_page(vif, pending_idx);
-		if (!page) {
-			kfree_skb(skb);
-			xenvif_tx_err(vif, &txreq, idx);
-			break;
-		}
-		gop->source.u.ref = txreq.gref;
-		gop->source.domid = vif->domid;
-		gop->source.offset = txreq.offset;
-		gop->dest.u.gmfn = virt_to_mfn(page_address(page));
-		gop->dest.domid = DOMID_SELF;
-		gop->dest.offset = txreq.offset;
-		gop->len = txreq.size;
-		gop->flags = GNTCOPY_source_gref;
 		gop++;
-		memcpy(&vif->pending_tx_info[pending_idx].req,
+		XENVIF_TX_CB(skb)->pending_idx = pending_idx;
-		       &txreq, sizeof(txreq));
-		vif->pending_tx_info[pending_idx].head = index;
-		*((u16 *)skb->data) = pending_idx;
 		__skb_put(skb, data_len);
@@ -1264,17 +1305,82 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 		vif->tx.req_cons = idx;
-		if ((gop-vif->tx_copy_ops) >= ARRAY_SIZE(vif->tx_copy_ops))
+		if ((gop-vif->tx_map_ops) >= ARRAY_SIZE(vif->tx_map_ops))
 			break;
 	}
-	return gop - vif->tx_copy_ops;
+	return gop - vif->tx_map_ops;
 }
+/* Consolidate skb with a frag_list into a brand new one with local pages on
+ * frags. Returns 0 or -ENOMEM if can't allocate new pages.
+ */
+static int xenvif_handle_frag_list(struct xenvif *vif, struct sk_buff *skb)
+{
+	unsigned int offset = skb_headlen(skb);
+	skb_frag_t frags[MAX_SKB_FRAGS];
+	int i;
+	struct ubuf_info *uarg;
+	struct sk_buff *nskb = skb_shinfo(skb)->frag_list;
+	vif->tx_zerocopy_sent += 2;
+	vif->tx_frag_overflow++;
+	xenvif_fill_frags(vif, nskb);
+	/* Subtract frags size, we will correct it later */
+	skb->truesize -= skb->data_len;
+	skb->len += nskb->len;
+	skb->data_len += nskb->len;
+	/* create a brand new frags array and coalesce there */
+	for (i = 0; offset < skb->len; i++) {
+		struct page *page;
+		unsigned int len;
+		BUG_ON(i >= MAX_SKB_FRAGS);
+		page = alloc_page(GFP_ATOMIC|__GFP_COLD);
+		if (!page) {
+			int j;
+			skb->truesize += skb->data_len;
+			for (j = 0; j < i; j++)
+				put_page(frags[j].page.p);
+			return -ENOMEM;
+		}
+		if (offset + PAGE_SIZE < skb->len)
+			len = PAGE_SIZE;
+		else
+			len = skb->len - offset;
+		if (skb_copy_bits(skb, offset, page_address(page), len))
+			BUG();
+		offset += len;
+		frags[i].page.p = page;
+		frags[i].page_offset = 0;
+		skb_frag_size_set(&frags[i], len);
+	}
+	/* swap out with old one */
+	memcpy(skb_shinfo(skb)->frags,
+	       frags,
+	       i * sizeof(skb_frag_t));
+	skb_shinfo(skb)->nr_frags = i;
+	skb->truesize += i * PAGE_SIZE;
+	/* remove traces of mapped pages and frag_list */
+	skb_frag_list_init(skb);
+	uarg = skb_shinfo(skb)->destructor_arg;
+	uarg->callback(uarg, true);
+	skb_shinfo(skb)->destructor_arg = NULL;
+	skb_shinfo(nskb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+	kfree_skb(nskb);
+	return 0;
+}
 static int xenvif_tx_submit(struct xenvif *vif)
 {
-	struct gnttab_copy *gop = vif->tx_copy_ops;
+	struct gnttab_map_grant_ref *gop = vif->tx_map_ops;
 	struct sk_buff *skb;
 	int work_done = 0;
@@ -1283,7 +1389,7 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		u16 pending_idx;
 		unsigned data_len;
-		pending_idx = *((u16 *)skb->data);
+		pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 		txp = &vif->pending_tx_info[pending_idx].req;
 		/* Check the remap error code. */
@@ -1298,14 +1404,16 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		memcpy(skb->data,
 		       (void *)(idx_to_kaddr(vif, pending_idx)|txp->offset),
 		       data_len);
+		vif->pending_tx_info[pending_idx].callback_struct.ctx = NULL;
 		if (data_len < txp->size) {
 			/* Append the packet payload as a fragment. */
 			txp->offset += data_len;
 			txp->size -= data_len;
+			skb_shinfo(skb)->destructor_arg =
+				&vif->pending_tx_info[pending_idx].callback_struct;
 		} else {
 			/* Schedule a response immediately. */
-			xenvif_idx_release(vif, pending_idx,
+			xenvif_idx_unmap(vif, pending_idx);
-					   XEN_NETIF_RSP_OKAY);
 		}
 		if (txp->flags & XEN_NETTXF_csum_blank)
@@ -1315,6 +1423,17 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		xenvif_fill_frags(vif, skb);
+		if (unlikely(skb_has_frag_list(skb))) {
+			if (xenvif_handle_frag_list(vif, skb)) {
+				if (net_ratelimit())
+					netdev_err(vif->dev,
+						   "Not enough memory to consolidate frag_list!\n");
+				skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+				kfree_skb(skb);
+				continue;
+			}
+		}
 		if (skb_is_nonlinear(skb) && skb_headlen(skb) < PKT_PROT_LEN) {
 			int target = min_t(int, skb->len, PKT_PROT_LEN);
 			__pskb_pull_tail(skb, target - skb_headlen(skb));
@@ -1327,6 +1446,9 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		if (checksum_setup(vif, skb)) {
 			netdev_dbg(vif->dev,
 				   "Can't setup checksum in net_tx_action\n");
+			/* We have to set this flag to trigger the callback */
+			if (skb_shinfo(skb)->destructor_arg)
+				skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
 			kfree_skb(skb);
 			continue;
 		}
@@ -1352,17 +1474,134 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		work_done++;
+		/* Set this flag right before netif_receive_skb, otherwise
+		 * someone might think this packet already left netback, and
+		 * do a skb_copy_ubufs while we are still in control of the
+		 * skb. E.g. the __pskb_pull_tail earlier can do such thing.
+		 */
+		if (skb_shinfo(skb)->destructor_arg) {
+			skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+			vif->tx_zerocopy_sent++;
+		}
 		netif_receive_skb(skb);
 	}
 	return work_done;
 }
+void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success)
+{
+	unsigned long flags;
+	pending_ring_idx_t index;
+	struct xenvif *vif = ubuf_to_vif(ubuf);
+	/* This is the only place where we grab this lock, to protect callbacks
+	 * from each other.
+	 */
+	spin_lock_irqsave(&vif->callback_lock, flags);
+	do {
+		u16 pending_idx = ubuf->desc;
+		ubuf = (struct ubuf_info *) ubuf->ctx;
+		BUG_ON(vif->dealloc_prod - vif->dealloc_cons >=
+			MAX_PENDING_REQS);
+		index = pending_index(vif->dealloc_prod);
+		vif->dealloc_ring[index] = pending_idx;
+		/* Sync with xenvif_tx_dealloc_action:
+		 * insert idx then incr producer.
+		 */
+		smp_wmb();
+		vif->dealloc_prod++;
+	} while (ubuf);
+	wake_up(&vif->dealloc_wq);
+	spin_unlock_irqrestore(&vif->callback_lock, flags);
+	if (RING_HAS_UNCONSUMED_REQUESTS(&vif->tx) &&
+	    xenvif_tx_pending_slots_available(vif)) {
+		local_bh_disable();
+		napi_schedule(&vif->napi);
+		local_bh_enable();
+	}
+	if (likely(zerocopy_success))
+		vif->tx_zerocopy_success++;
+	else
+		vif->tx_zerocopy_fail++;
+}
+static inline void xenvif_tx_dealloc_action(struct xenvif *vif)
+{
+	struct gnttab_unmap_grant_ref *gop;
+	pending_ring_idx_t dc, dp;
+	u16 pending_idx, pending_idx_release[MAX_PENDING_REQS];
+	unsigned int i = 0;
+	dc = vif->dealloc_cons;
+	gop = vif->tx_unmap_ops;
+	/* Free up any grants we have finished using */
+	do {
+		dp = vif->dealloc_prod;
+		/* Ensure we see all indices enqueued by all
+		 * xenvif_zerocopy_callback().
+		 */
+		smp_rmb();
+		while (dc != dp) {
+			BUG_ON(gop - vif->tx_unmap_ops > MAX_PENDING_REQS);
+			pending_idx =
+				vif->dealloc_ring[pending_index(dc++)];
+			pending_idx_release[gop-vif->tx_unmap_ops] =
+				pending_idx;
+			vif->pages_to_unmap[gop-vif->tx_unmap_ops] =
+				vif->mmap_pages[pending_idx];
+			gnttab_set_unmap_op(gop,
+					    idx_to_kaddr(vif, pending_idx),
+					    GNTMAP_host_map,
+					    vif->grant_tx_handle[pending_idx]);
+			/* Btw. already unmapped? */
+			xenvif_grant_handle_reset(vif, pending_idx);
+			++gop;
+		}
+	} while (dp != vif->dealloc_prod);
+	vif->dealloc_cons = dc;
+	if (gop - vif->tx_unmap_ops > 0) {
+		int ret;
+		ret = gnttab_unmap_refs(vif->tx_unmap_ops,
+					NULL,
+					vif->pages_to_unmap,
+					gop - vif->tx_unmap_ops);
+		if (ret) {
+			netdev_err(vif->dev, "Unmap fail: nr_ops %x ret %d\n",
+				   gop - vif->tx_unmap_ops, ret);
+			for (i = 0; i < gop - vif->tx_unmap_ops; ++i) {
+				if (gop[i].status != GNTST_okay)
+					netdev_err(vif->dev,
+						   " host_addr: %llx handle: %x status: %d\n",
+						   gop[i].host_addr,
+						   gop[i].handle,
+						   gop[i].status);
+			}
+			BUG();
+		}
+	}
+	for (i = 0; i < gop - vif->tx_unmap_ops; ++i)
+		xenvif_idx_release(vif, pending_idx_release[i],
+				   XEN_NETIF_RSP_OKAY);
+}
 /* Called after netfront has transmitted */
 int xenvif_tx_action(struct xenvif *vif, int budget)
 {
 	unsigned nr_gops;
-	int work_done;
+	int work_done, ret;
 	if (unlikely(!tx_work_todo(vif)))
 		return 0;
@@ -1372,7 +1611,11 @@ int xenvif_tx_action(struct xenvif *vif, int budget)
 	if (nr_gops == 0)
 		return 0;
-	gnttab_batch_copy(vif->tx_copy_ops, nr_gops);
+	ret = gnttab_map_refs(vif->tx_map_ops,
+			      NULL,
+			      vif->pages_to_map,
+			      nr_gops);
+	BUG_ON(ret);
 	work_done = xenvif_tx_submit(vif);
@@ -1383,45 +1626,18 @@ static void xenvif_idx_release(struct xenvif *vif, u16 pending_idx,
 			       u8 status)
 {
 	struct pending_tx_info *pending_tx_info;
-	pending_ring_idx_t head;
+	pending_ring_idx_t index;
-	u16 peek; /* peek into next tx request */
+	unsigned long flags;
-	BUG_ON(vif->mmap_pages[pending_idx] == (void *)(~0UL));
-	/* Already complete? */
-	if (vif->mmap_pages[pending_idx] == NULL)
-		return;
 	pending_tx_info = &vif->pending_tx_info[pending_idx];
+	spin_lock_irqsave(&vif->response_lock, flags);
-	head = pending_tx_info->head;
+	make_tx_response(vif, &pending_tx_info->req, status);
+	index = pending_index(vif->pending_prod);
-	BUG_ON(!pending_tx_is_head(vif, head));
+	vif->pending_ring[index] = pending_idx;
-	BUG_ON(vif->pending_ring[pending_index(head)] != pending_idx);
+	/* TX shouldn't use the index before we give it back here */
+	mb();
-	do {
+	vif->pending_prod++;
-		pending_ring_idx_t index;
+	spin_unlock_irqrestore(&vif->response_lock, flags);
-		pending_ring_idx_t idx = pending_index(head);
-		u16 info_idx = vif->pending_ring[idx];
-		pending_tx_info = &vif->pending_tx_info[info_idx];
-		make_tx_response(vif, &pending_tx_info->req, status);
-		/* Setting any number other than
-		 * INVALID_PENDING_RING_IDX indicates this slot is
-		 * starting a new packet / ending a previous packet.
-		 */
-		pending_tx_info->head = 0;
-		index = pending_index(vif->pending_prod++);
-		vif->pending_ring[index] = vif->pending_ring[info_idx];
-		peek = vif->pending_ring[pending_index(++head)];
-	} while (!pending_tx_is_head(vif, peek));
-	put_page(vif->mmap_pages[pending_idx]);
-	vif->mmap_pages[pending_idx] = NULL;
 }
@@ -1469,23 +1685,74 @@ static struct xen_netif_rx_response *make_rx_response(struct xenvif *vif,
 	return resp;
 }
+void xenvif_idx_unmap(struct xenvif *vif, u16 pending_idx)
+{
+	int ret;
+	struct gnttab_unmap_grant_ref tx_unmap_op;
+	gnttab_set_unmap_op(&tx_unmap_op,
+			    idx_to_kaddr(vif, pending_idx),
+			    GNTMAP_host_map,
+			    vif->grant_tx_handle[pending_idx]);
+	/* Btw. already unmapped? */
+	xenvif_grant_handle_reset(vif, pending_idx);
+	ret = gnttab_unmap_refs(&tx_unmap_op, NULL,
+				&vif->mmap_pages[pending_idx], 1);
+	BUG_ON(ret);
+	xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
+}
 static inline int rx_work_todo(struct xenvif *vif)
 {
-	return !skb_queue_empty(&vif->rx_queue) &&
+	return (!skb_queue_empty(&vif->rx_queue) &&
-	       xenvif_rx_ring_slots_available(vif, vif->rx_last_skb_slots);
+	       xenvif_rx_ring_slots_available(vif, vif->rx_last_skb_slots)) ||
+	       vif->rx_queue_purge;
 }
 static inline int tx_work_todo(struct xenvif *vif)
 {
 	if (likely(RING_HAS_UNCONSUMED_REQUESTS(&vif->tx)) &&
-	    (nr_pending_reqs(vif) + XEN_NETBK_LEGACY_SLOTS_MAX
+	    xenvif_tx_pending_slots_available(vif))
-	     < MAX_PENDING_REQS))
 		return 1;
 	return 0;
 }
+static void xenvif_dealloc_delay(unsigned long data)
+{
+	struct xenvif *vif = (struct xenvif *)data;
+	vif->dealloc_delay_timed_out = true;
+	wake_up(&vif->dealloc_wq);
+}
+static inline bool tx_dealloc_work_todo(struct xenvif *vif)
+{
+	if (vif->dealloc_cons != vif->dealloc_prod) {
+		if ((nr_free_slots(&vif->tx) > 2 * XEN_NETBK_LEGACY_SLOTS_MAX) &&
+		    (vif->dealloc_prod - vif->dealloc_cons < MAX_PENDING_REQS / 4) &&
+		    !vif->dealloc_delay_timed_out) {
+			if (!timer_pending(&vif->dealloc_delay)) {
+				vif->dealloc_delay.function =
+					xenvif_dealloc_delay;
+				vif->dealloc_delay.data = (unsigned long)vif;
+				mod_timer(&vif->dealloc_delay,
+					  jiffies + msecs_to_jiffies(1));
+			}
+			return false;
+		}
+		del_timer_sync(&vif->dealloc_delay);
+		vif->dealloc_delay_timed_out = false;
+		return true;
+	}
+	return false;
+}
 void xenvif_unmap_frontend_rings(struct xenvif *vif)
 {
 	if (vif->tx.sring)
@@ -1543,7 +1810,7 @@ static void xenvif_start_queue(struct xenvif *vif)
 		netif_wake_queue(vif->dev);
 }
-int xenvif_kthread(void *data)
+int xenvif_kthread_guest_rx(void *data)
 {
 	struct xenvif *vif = data;
 	struct sk_buff *skb;
@@ -1555,12 +1822,19 @@ int xenvif_kthread(void *data)
 		if (kthread_should_stop())
 			break;
+		if (vif->rx_queue_purge) {
+			skb_queue_purge(&vif->rx_queue);
+			vif->rx_queue_purge = false;
+		}
 		if (!skb_queue_empty(&vif->rx_queue))
 			xenvif_rx_action(vif);
 		if (skb_queue_empty(&vif->rx_queue) &&
-		    netif_queue_stopped(vif->dev))
+		    netif_queue_stopped(vif->dev)) {
+			del_timer_sync(&vif->wake_queue);
 			xenvif_start_queue(vif);
+		}
 		cond_resched();
 	}
@@ -1572,6 +1846,28 @@ int xenvif_kthread(void *data)
 	return 0;
 }
+int xenvif_dealloc_kthread(void *data)
+{
+	struct xenvif *vif = data;
+	while (!kthread_should_stop()) {
+		wait_event_interruptible(vif->dealloc_wq,
+					 tx_dealloc_work_todo(vif) ||
+					 kthread_should_stop());
+		if (kthread_should_stop())
+			break;
+		xenvif_tx_dealloc_action(vif);
+		cond_resched();
+	}
+	/* Unmap anything remaining*/
+	if (tx_dealloc_work_todo(vif))
+		xenvif_tx_dealloc_action(vif);
+	return 0;
+}
 static int __init netback_init(void)
 {
 	int rc = 0;
@@ -1589,6 +1885,8 @@ static int __init netback_init(void)
 	if (rc)
 		goto failed_init;
+	rx_drain_timeout_jiffies = msecs_to_jiffies(rx_drain_timeout_msecs);
 	return 0;
 failed_init: