- 11 Nov, 2011 40 commits
-
-
Andrea Arcangeli authored
commit 405e44f2 upstream. "page" may have changed to point to the next hugepage after the loop completed, The references have been taken on the head page, so the put_page must happen there too. This is a longstanding issue pre-thp inclusion. It's totally unclear how these page_cache_add_speculative and pte_val(pte) != pte_val(*ptep) checks are necessary across all the powerpc gup_fast code, when x86 doesn't need any of that: there's no way the page can be freed with irq disabled so we're guaranteed the atomic_inc will happen on a page with page_count > 0 (so not needing the speculative check). The pte check is also meaningless on x86: no need to rollback on x86 if the pte changed, because the pte can still change a CPU tick after the check succeeded and it won't be rolled back in that case. The important thing is we got a reference on a valid page that was mapped there a CPU tick ago. So not knowing the soft tlb refill code of ppc64 in great detail I'm not removing the "speculative" page_count increase and the pte checks across all the code, but unless there's a strong reason for it they should be later cleaned up too. If a pte can change from huge to non-huge (like it could happen with THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs only so I'm not altering that. Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by:
David Gibson <david@gibson.dropbear.id.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Andrea Arcangeli authored
commit 2839bdc1 upstream. This part of gup_fast doesn't seem capable of handling hugetlbfs ptes, those should be handled by gup_hugepd only, so these checks are superfluous. Plus if this wasn't a noop, it would have oopsed because, the insistence of using the speculative refcounting would trigger a VM_BUG_ON if a tail page was encountered in the page_cache_get_speculative(). Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by:
David Gibson <david@gibson.dropbear.id.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Oliver Hartkopp authored
commit 12d0d0d3 upstream. The commit aabdcb0b ("can bcm: fix tx_setup off-by-one errors") fixed only a part of the original problem reported by Andre Naujoks. It turned out that the original code needed to be re-ordered to reduce complexity and to finally fix the reported frame counting issues. Signed-off-by:
Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Andiry Xu authored
commit 6fd45621 upstream. When the link state changes, xHC will report a port status change event and set the PORT_PLC bit, for both USB3 and USB2 root hub ports. The PLC will be cleared by usbcore for USB3 root hub ports, but not for USB2 ports, because they do not report USB_PORT_STAT_C_LINK_STATE in wPortChange. Clear it for USB2 root hub ports in handle_port_status(). Signed-off-by:
Andiry Xu <andiry.xu@amd.com> Signed-off-by:
Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Andiry Xu authored
commit d2f52c9e upstream. Introduce xhci_test_and_clear_bit() to clear RWC bit in PORTSC register. Signed-off-by:
Andiry Xu <andiry.xu@amd.com> Signed-off-by:
Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Sarah Sharp authored
commit 2dc37539 upstream. Some alternate interface settings have no endpoints associated with them. This shows up in some USB webcams, particularly the Logitech HD 1080p, which uses the uvcvideo driver. If a driver switches between two alt settings with no endpoints, there is no need to issue a configure endpoint command, because there is no endpoint information to update. The only time a configure endpoint command with just the add slot flag set makes sense is when the driver is updating hub characteristics in the slot context. However, that code never calls xhci_check_bandwidth, so we should be safe not issuing a command if only the slot context add flag is set. Signed-off-by:
Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Seth Forshee authored
commit f02fe890 upstream. Scanning cannot be run during suspend or hibernation, but if usb-stor-scan freezes another thread waiting on scanning to complete may fail to freeze. However, if usb-stor-scan is left freezable without ever actually freezing then the freezer will wait on it to exit, and threads waiting for scanning to finish will no longer be blocked. One problem with this approach is that usb-stor-scan has a delay to wait for devices to settle (which is currently the only point where it can freeze). To work around this we can request that the freezer send a fake signal when freezing, then use interruptible sleep to wake the thread early when freezing happens. To make this happen, the following changes are made to usb-stor-scan: * Use set_freezable_with_signal() instead of set_freezable() to request a fake signal when freezing * Use wait_event_interruptible_timeout() instead of wait_event_freezable_timeout() to avoid freezing Signed-off-by:
Seth Forshee <seth.forshee@canonical.com> Acked-by:
Alan Stern <stern@rowland.harvard.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Oliver Neukum authored
commit c510eae3 upstream. This device declares itself to be vendor specific It therefore needs to be added to the device table to make btusb bind. Signed-off-by:
Oliver Neukum <oneukum@suse.de> Signed-off-by:
Gustavo F. Padovan <padovan@profusion.mobi> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Jurgen Kramer authored
commit f78b6826 upstream. Today I noticed that the usb bluetooth adapter (BCM2046B1) on my 2011 mac mini was not working. I've created a patch to get it going. Signed-off-by:
Jurgen Kramer <gtmkramer@xs4all.nl> Signed-off-by:
Gustavo F. Padovan <padovan@profusion.mobi> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Steven.Li authored
commit 2d25f8b4 upstream. The new Ath3k needs to download patch and radio table, and it keeps same PID/VID even after downloading the patch and radio table. This patch is to use the bcdDevice (Device Release Number) to judge whether the chip has been patched or not. The init bcdDevice value of the chip is 0x0001, this value increases after patch and radio table downloading. Signed-off-by:
Steven.Li <yongli@qca.qualcomm.com> Signed-off-by:
Gustavo F. Padovan <padovan@profusion.mobi> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Ricardo Mendoza authored
commit 8e7c3d2e upstream. Blacklist Toshiba-branded AR3011 based AR5B195 [0930:0215] and add to ath3k.c for firmware loading. Signed-off-by:
Ricardo Mendoza <ricmm@gentoo.org> Signed-off-by:
Gustavo F. Padovan <padovan@profusion.mobi> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Pieter-Augustijn Van Malleghem authored
commit a63b723d upstream. This patch against current git adds the hardware ID for the Apple MacBookAir4,1, released in July 2011. The device features a BCM2046 USB chip. The patch was inspired by the previous modifications adding support for the MacBookAir3,x. Signed-off-by:
Pieter-Augustijn Van Malleghem <p-a@scarlet.be> Signed-off-by:
Gustavo F. Padovan <padovan@profusion.mobi> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Marek Vasut authored
commit bca0beb9 upstream. The AX88772B uses only 11 bits of the header for the actual size. The other bits are used for something else. This causes dmesg full of messages: asix_rx_fixup() Bad Header Length This patch trims the check to only 11 bits. I believe on older chips, the remaining 5 top bits are unused. Signed-off-by:
Marek Vasut <marek.vasut@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Marek Vasut authored
commit bc466e67 upstream. Signed-off-by:
Marek Vasut <marek.vasut@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Andiry Xu authored
commit c2d7b49f upstream. When a xHC host is unable to handle isochronous transfer in the interval, it reports a Missed Service Error event and skips some tds. Currently xhci driver handles MSE event in the following ways: 1. When encounter a MSE event, set ep->skip flag, update event ring dequeue pointer and return. 2. When encounter the next event on this ep, the driver will run the do-while loop, fetch td from ep's td_list to find the td corresponding to this event. All tds missed are marked as short transfer(-EXDEV). The do-while loop will end in two ways: 1. If the td pointed by the event trb is found; 2. If the ep ring's td_list is empty. However, if a buggy HW reports some unpredicted event (for example, an overrun event following a MSE event while the ep ring is actually not empty), the driver will never find the td, and it will loop until the td_list is empty. Unfortunately, the spinlock is dropped when give back a urb in the do-while loop. During the spinlock released period, the class driver may still submit urbs and add tds to the td_list. This may cause disaster, since the td_list will never be empty and the loop never ends, and the system hangs. To fix this, count the number of TDs on the ep ring before skipping TDs, and quit the loop when skipped that number of tds. This guarantees the do-while loop will end after certain number of cycles, and driver will not be trapped in an infinite loop. Signed-off-by:
Andiry Xu <andiry.xu@amd.com> Signed-off-by:
Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Kavan Smith authored
commit 02009afc upstream. Add USB product ID for iPhone 4 CDMA Verizon Tested on at least 2 devices Signed-off-by:
Kavan Smith <kavansmith82@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Sarah Sharp authored
commit 8a9af4fd upstream. usb_ifnum_to_if() can return NULL if the USB device does not have a configuration installed (usb_device->actconfig == NULL), or if we can't find the interface number in the installed configuration. Return an error instead of crashing. Signed-off-by:
Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Josh Boyer authored
commit 75bc8ef5 upstream. The cdc_ncm driver still has a few places where stack variables are passed to the cdc_ncm_do_request function. This triggers a stack trace in lib/dma-debug.c if the CONFIG_DEBUG_DMA_API option is set. Adjust these calls to pass parameters that have been allocated with kzalloc. Signed-off-by:
Josh Boyer <jwboyer@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Artur Zimmer authored
commit ce7e9065 upstream. Here is a patch for a new PID (zeitcontrol-device mifare-reader FT232BL(like FT232BM but lead free)). Signed-off-by:
Artur Zimmer <artur128@3dzimmer.de> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Florian Echtler authored
commit 2f1def26 upstream. A new device ID pair is added for Sierra Wireless MC8305. Signed-off-by:
Florian Echtler <floe@butterbrot.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Arvid Brodin authored
commit 17d3e145 upstream. Signed-off-by:
Arvid Brodin <arvid.brodin@enea.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Boris Todorov authored
commit 77636c86 upstream. The sequence to put port in test mode is not complete. According EHCI specification all enabled ports must be put in suspend. Signed-off-by:
Boris Todorov <boris.st.todorov@gmail.com> Acked-by:
Alan Stern <stern@rowland.harvard.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
huajun li authored
commit c2e2a313 upstream. Executing cmd 'rmmod rtl8150' does not return(if your device connects to host), the root cause is tasklet_disable() causes tasklet_kill() block, remove it from rtl8150_disconnect(). Signed-off-by:
Huajun Li <huajun.li.lee@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Vasanthy Kolluri authored
commit b880a954 upstream. Signed-off-by:
Christian Benvenuti <benve@cisco.com> Signed-off-by:
Danny Guo <dannguo@cisco.com> Signed-off-by:
Vasanthy Kolluri <vkolluri@cisco.com> Signed-off-by:
Roopa Prabhu <roprabhu@cisco.com> Signed-off-by:
David Wang <dwang2@cisco.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Cc: Chun-Yi Lee <jlee@suse.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Eric Sandeen authored
commit 6d6a4351 upstream. Ceph users reported that when using Ceph on ext4, the filesystem would often become corrupted, containing inodes with incorrect i_blocks counters. I managed to reproduce this with a very hacked-up "streamtest" binary from the Ceph tree. Ceph is doing a lot of xattr writes, to out-of-inode blocks. There is also another thread which does sync_file_range and close, of the same files. The problem appears to happen due to this race: sync/flush thread xattr-set thread ----------------- ---------------- do_writepages ext4_xattr_set ext4_da_writepages ext4_xattr_set_handle mpage_da_map_blocks ext4_xattr_block_set set DELALLOC_RESERVE ext4_new_meta_blocks ext4_mb_new_blocks if (!i_delalloc_reserved_flag) vfs_dq_alloc_block ext4_get_blocks down_write(i_data_sem) set i_delalloc_reserved_flag ... up_write(i_data_sem) if (i_delalloc_reserved_flag) vfs_dq_alloc_block_nofail In other words, the sync/flush thread pops in and sets i_delalloc_reserved_flag on the inode, which makes the xattr thread think that it's in a delalloc path in ext4_new_meta_blocks(), and add the block for a second time, after already having added it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks The real problem is that we shouldn't be using the DELALLOC_RESERVED state flag, and instead we should be passing EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of using an inode state flag. We'll fix this for now with using i_data_sem to prevent this race, but this is really not the right way to fix things. Signed-off-by:
Eric Sandeen <sandeen@redhat.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Theodore Ts'o authored
commit 5930ea64 upstream. ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads that point to directory blocks assigned to the directory inode. However, the function calls ext4_handle_dirty_metadata with the inode of the file that's being added to the directory, not the directory inode itself. Therefore, correct the code to dirty the directory buffers with the directory inode, not the file inode. Signed-off-by:
Darrick J. Wong <djwong@us.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Darrick J. Wong authored
commit f9287c1f upstream. ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir". Unfortunately, dir_block belongs to the newly created directory (which is "inode"), not the parent directory (which is "dir"). Fix the incorrect association. Signed-off-by:
Darrick J. Wong <djwong@us.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Darrick J. Wong authored
commit bcaa9929 upstream. When ext4_rename performs a directory rename (move), dir_bh is a buffer that is modified to update the '..' link in the directory being moved (old_inode). However, ext4_handle_dirty_metadata is called with the old parent directory inode (old_dir) and dir_bh, which is incorrect because dir_bh does not belong to the parent inode. Fix this error. Signed-off-by:
Darrick J. Wong <djwong@us.ibm.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Theodore Ts'o authored
commit 1cd9f097 upstream. This doesn't make much sense, and it exposes a bug in the kernel where attempts to create a new file in an append-only directory using O_CREAT will fail (but still leave a zero-length file). This was discovered when xfstests #79 was generalized so it could run on all file systems. Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Clifton Barnes authored
commit 0e053fcb upstream. Fixes the deadlock when inserting and removing the ds2780. Signed-off-by:
Clifton Barnes <cabarnes@indesign-llc.com> Cc: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Clifton Barnes authored
commit 9fe678fa upstream. Adds a nolock function to the w1 interface to avoid locking the mutex if needed. Signed-off-by:
Clifton Barnes <cabarnes@indesign-llc.com> Cc: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Clifton Barnes authored
commit 853eee72 upstream. Simply creates one point to call the w1 interface. Signed-off-by:
Clifton Barnes <cabarnes@indesign-llc.com> Cc: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Juan Gutierrez authored
commit 93b465c2 upstream. Since we're using non-atomic radix tree allocations, we should be protecting the tree using a mutex and not a spinlock. Non-atomic allocations and process context locking is good enough, as the tree is manipulated only when locks are registered/ unregistered/requested/freed. The locks themselves are still protected by spinlocks of course, and mutexes are not involved in the locking/unlocking paths. Signed-off-by:
Juan Gutierrez <jgutierrez@ti.com> [ohad@wizery.com: rewrite the commit log, #include mutex.h, add minor commentary] [ohad@wizery.com: update register/unregister parts in hwspinlock.txt] Signed-off-by:
Ohad Ben-Cohen <ohad@wizery.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Alexandre Bounine authored
commit e0c87bd9 upstream. Modify Ethernet addess macros to be compatible with BE/LE platforms Signed-off-by:
Alexandre Bounine <alexandre.bounine@idt.com> Cc: Chul Kim <chul.kim@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Johannes Berg authored
Upstream commit effd4d9a. Since the dawn of its time, iwlwifi has used interruptible waits to wait for synchronous commands and firmware loading. This leads to "interesting" bugs, because it can't actually handle the interruptions; for example when a command sending is interrupted it will assume the command completed fully, and then leave it pending, which leads to all kinds of trouble when the command finishes later. Since there's no easy way to gracefully deal with interruptions, fix the driver to not use interruptible waits. This at least fixes the error iwlagn 0000:02:00.0: Error: Response NULL in 'REPLY_SCAN_ABORT_CMD' I have seen in P2P testing, but it is likely that there are other errors caused by this. Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com> Signed-off-by:
Wey-Yi Guy <wey-yi.w.guy@intel.com> Signed-off-by:
John W. Linville <linville@tuxdriver.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Linus Torvalds authored
commit 1117f72e upstream. The CLOEXE bit is magical, and for performance (and semantic) reasons we don't actually maintain it in the file descriptor itself, but in a separate bit array. Which means that when we show f_flags, the CLOEXE status is shown incorrectly: we show the status not as it is now, but as it was when the file was opened. Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit array. Uli needs this in order to re-implement the pfiles program: "For normal file descriptors (not sockets) this was the last piece of information which wasn't available. This is all part of my 'give Solaris users no reason to not switch' effort. I intend to offer the code to the util-linux-ng maintainers." Requested-by:
Ulrich Drepper <drepper@akkadia.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Jiri Kosina authored
commit a3defbe5 upstream. The case of address space randomization being disabled in runtime through randomize_va_space sysctl is not treated properly in load_elf_binary(), resulting in SIGKILL coming at exec() time for certain PIE-linked binaries in case the randomization has been disabled at runtime prior to calling exec(). Handle the randomize_va_space == 0 case the same way as if we were not supporting .text randomization at all. Based on original patch by H.J. Lu and Josh Boyer. Signed-off-by:
Jiri Kosina <jkosina@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: H.J. Lu <hongjiu.lu@intel.com> Cc: <stable@kernel.org> Tested-by:
Josh Boyer <jwboyer@redhat.com> Acked-by:
Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Andrea Arcangeli authored
commit 70b50f94 upstream. Michel while working on the working set estimation code, noticed that calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the pfn ended up being a tail page of a transparent hugepage under splitting by __split_huge_page_refcount(). He then found the problem could also theoretically materialize with page_cache_get_speculative() during the speculative radix tree lookups that uses get_page_unless_zero() in SMP if the radix tree page is freed and reallocated and get_user_pages is called on it before page_cache_get_speculative has a chance to call get_page_unless_zero(). So the best way to fix the problem is to keep page_tail->_count zero at all times. This will guarantee that get_page_unless_zero() can never succeed on any tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail pages of a compound page, so we can simply account the tail page references there and transfer them to tail_page->_count in __split_huge_page_refcount() (in addition to the head_page->_mapcount). While debugging this s/_count/_mapcount/ change I also noticed get_page is called by direct-io.c on pages returned by get_user_pages. That wasn't entirely safe because the two atomic_inc in get_page weren't atomic. As opposed to other get_user_page users like secondary-MMU page fault to establish the shadow pagetables would never call any superflous get_page after get_user_page returns. It's safer to make get_page universally safe for tail pages and to use get_page_foll() within follow_page (inside get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages without taking any locks because it is run within PT lock protected critical sections (PT lock for pte and page_table_lock for pmd_trans_huge). The standard get_page() as invoked by direct-io instead will now take the compound_lock but still only for tail pages. The direct-io paths are usually I/O bound and the compound_lock is per THP so very finegrined, so there's no risk of scalability issues with it. A simple direct-io benchmarks with all lockdep prove locking and spinlock debugging infrastructure enabled shows identical performance and no overhead. So it's worth it. Ideally direct-io should stop calling get_page() on pages returned by get_user_pages(). The spinlock in get_page() is already optimized away for no-THP builds but doing get_page() on tail pages returned by GUP is generally a rare operation and usually only run in I/O paths. This new refcounting on page_tail->_mapcount in addition to avoiding new RCU critical sections will also allow the working set estimation code to work without any further complexity associated to the tail page refcounting with THP. Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Reported-by:
Michel Lespinasse <walken@google.com> Reviewed-by:
Michel Lespinasse <walken@google.com> Reviewed-by:
Minchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
David Vrabel authored
[ Upstream commit d0e5d832 ] If a VM is saved and restored (or migrated) the netback driver will no longer process any Tx packets from the frontend. xenvif_up() does not schedule the processing of any pending Tx requests from the front end because the carrier is off. Without this initial kick the frontend just adds Tx requests to the ring without raising an event (until the ring is full). This was caused by 47103041 (net: xen-netback: convert to hw_features) which reordered the calls to xenvif_up() and netif_carrier_on() in xenvif_connect(). Signed-off-by:
David Vrabel <david.vrabel@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by:
Ian Campbell <ian.campbell@citrix.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
Willem de Bruijn authored
[ Upstream commit 7091fbd8 ] This is a minor change. Up until kernel 2.6.32, getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, ...) would return total and dropped packets since its last invocation. The introduction of socket queue overflow reporting [1] changed drop rate calculation in the normal packet socket path, but not when using a packet ring. As a result, the getsockopt now returns different statistics depending on the reception method used. With a ring, it still returns the count since the last call, as counts are incremented in tpacket_rcv and reset in getsockopt. Without a ring, it returns 0 if no drops occurred since the last getsockopt and the total drops over the lifespan of the socket otherwise. The culprit is this line in packet_rcv, executed on a drop: drop_n_acct: po->stats.tp_drops = atomic_inc_return(&sk->sk_drops); As it shows, the new drop number it taken from the socket drop counter, which is not reset at getsockopt. I put together a small example that demonstrates the issue [2]. It runs for 10 seconds and overflows the queue/ring on every odd second. The reported drop rates are: ring: 16, 0, 16, 0, 16, ... non-ring: 0, 15, 0, 30, 0, 46, 0, 60, 0 , 74. Note how the even ring counts monotonically increase. Because the getsockopt adds tp_drops to tp_packets, total counts are similarly reported cumulatively. Long story short, reinstating the original code, as the below patch does, fixes the issue at the cost of additional per-packet cycles. Another solution that does not introduce per-packet overhead is be to keep the current data path, record the value of sk_drops at getsockopt() at call N in a new field in struct packetsock and subtract that when reporting at call N+1. I'll be happy to code that, instead, it's just more messy. [1] http://patchwork.ozlabs.org/patch/35665/ [2] http://kernel.googlecode.com/files/test-packetsock-getstatistics.cSigned-off-by:
Willem de Bruijn <willemb@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-